Be Newsletters - Volume 4: 1999

Issue 4-41, October 13, 1999

Be Engineering Insights: Low-Latency Timing from the User Space

By Jeff Bush

One of the more compelling things that sets BeOS apart from many other operating systems is its ability to get low-latency timing from user applications. There have been a few articles talking about high-resolution timing for certain events, both from kernel and user space, but I would like to talk a little more about general strategies for writing timing-critical apps, as well as giving a little more information about how BeOS works under the hood.

BeOS is probably best described as a "soft real time" operating system. It differs from "hard real time" operating systems—typically small embedded systems—in being more general purpose and having higher-level facilities, such as VM. However, it still allows very precise timing and latencies.

I will be referring specifically to scheduling latency in this article, that being the amount of time it takes an operating system to service hardware or timing requests. This is important, for example, in real time sound applications. Take a multitrack recorder app; you may want to lay down one instrument track, then go back and record another one. The app would play the first track as it recorded the second. You'd hear both tracks mixed together as they played back. But, if the sound takes too long to get from input to output, the user will notice an annoying delay.

In order to reduce this delay you need small audio buffers, which in turn require the app to wake up more often to read and write buffers to and from the audio device. If the app can't wake up fast enough you'll hear glitching and pops in the output and/or recorded data. Even for an amateur home producer, this would be unacceptable. One solution to this problem is to move the app's timing-critical code into a device driver, where it can be closer to the hardware. This is crufty, to say the least.
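
To put rough numbers on that trade-off, here is a small sketch of the arithmetic. The function names are made up for illustration, and the 44.1 kHz rate in the note below is only an assumed example, not something BeOS requires.

```cpp
#include <cassert>
#include <cstdint>

// Time, in microseconds, that one buffer of audio represents.
// frames: sample frames per buffer; rate_hz: frames per second.
// (Hypothetical helper, not part of any BeOS API.)
int64_t buffer_duration_us(int64_t frames, int64_t rate_hz) {
  assert(frames >= 0 && rate_hz > 0);
  return frames * 1000000LL / rate_hz;  // truncates to whole microseconds
}

// How many times per second the app must wake up to keep the device fed
// when every wakeup handles exactly one buffer.
double wakeups_per_second(int64_t frames, int64_t rate_hz) {
  assert(frames > 0 && rate_hz > 0);
  return static_cast<double>(rate_hz) / static_cast<double>(frames);
}
```

At an assumed 44.1 kHz, buffer_duration_us(256, 44100) comes to a little under 6 milliseconds per buffer, which means roughly 172 wakeups every second. Halving the buffer halves the delay and doubles the wakeup rate.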

With BeOS you can get low latencies without resorting to this kind of trickery, as the sample code will demonstrate:

#include <OS.h>
#include <stdio.h>

#define SNOOZE_TIME 150000

int main()
{
  bigtime_t start, elapsed_time;

  set_thread_priority(find_thread(NULL), 120);
  for (;;) {
    start = system_time();
    snooze(SNOOZE_TIME);
    elapsed_time = system_time() - start;
    printf("%Lu microseconds late\n", elapsed_time - SNOOZE_TIME);
  }

  return 0;
}

This app simply measures how accurate snooze() is. You may be surprised to find that it's generally very close, depending on load. There are a number of reasons why BeOS is this accurate. The first is that it dynamically programs a hardware timer to go off exactly when the snooze() expires. Also, the kernel is preemptive, which means the scheduler can be invoked, even while a thread is executing kernel code. It is the convention of a number of operating systems, especially some of the Un*x flavor, to reschedule in the kernel only when a thread explicitly blocks, or when a system call returns. The reason is that these are known points that are not in a kernel critical section. This simplifies the design of the kernel, because intricate locking is not required.
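
If you want to run the same experiment on another system for comparison, a rough portable analogue can be sketched with the standard C++ library. This is not BeOS code: std::this_thread::sleep_for stands in for snooze(), std::chrono::steady_clock stands in for system_time(), and the LatenessStats struct and function name are invented for the example. It does not raise the thread priority, so expect looser numbers.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Summary of how late the sleeps woke up, in microseconds (hypothetical type).
struct LatenessStats {
  int64_t min_us;
  int64_t max_us;
  int64_t mean_us;
};

// Sleep `iterations` times for `sleep_us` microseconds and record how far
// past the requested time each wakeup was.
LatenessStats measure_sleep_lateness(int64_t sleep_us, int iterations) {
  assert(sleep_us >= 0 && iterations > 0);
  using clock = std::chrono::steady_clock;
  using std::chrono::microseconds;

  LatenessStats stats = {0, 0, 0};
  int64_t total = 0;
  for (int i = 0; i < iterations; ++i) {
    clock::time_point start = clock::now();
    std::this_thread::sleep_for(microseconds(sleep_us));
    int64_t elapsed =
        std::chrono::duration_cast<microseconds>(clock::now() - start).count();
    int64_t late = elapsed - sleep_us;

    // Seed min and max from the first sample, then track the extremes.
    if (i == 0 || late < stats.min_us) stats.min_us = late;
    if (i == 0 || late > stats.max_us) stats.max_us = late;
    total += late;
  }
  stats.mean_us = total / iterations;
  return stats;
}
```

Calling measure_sleep_lateness(150000, 20) mirrors the snooze time used above. Tracking the maximum matters more than the mean here, since one late wakeup is enough to cause a glitch.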

For server machines, where latency is not as important, this is adequate. However, this limits the system's ability to respond to interrupts as quickly as you might like. For example, let's say thread A sleeps on a semaphore, waiting for a hardware interrupt handler to release the semaphore. Thread B begins to run, then performs a system call which requires quite a bit of processing in the kernel. While thread B is doing its thing, an interrupt comes in (such as the hardware timer interrupt in the example used above) and sets thread A runnable. On some operating systems, thread A would not actually start until thread B finished whatever it was doing and returned from the system call. BeOS has been a multiprocessor OS from its infancy, requiring fine-grained locking in the kernel, and therefore does not have this restriction. Under BeOS, thread A would start running as soon as the interrupt was handled, leaving thread B ready but still in the kernel.

As a programmer, there are a few things you need to be aware of when trying to get tight latencies. Cyril wrote an informative article detailing how the BeOS scheduler works:

Be Engineering Insights: The Kernel Scheduler and Real-Time Threads

BeOS defines 120 priority levels, with priorities 100 and above being "real time." Being real time has two implications for a thread. First, the thread will not be preempted, except by threads of higher priority. This gives you a guaranteed execution time, but it also adds the responsibility that the thread get its work done in a short amount of time, lest it degrade system performance. Second, the thread will start executing shortly after it becomes ready to run, even when the system is under load. The kernel enforces a rule that a real time thread will not sit in the ready queue unless a real time thread of equal or higher priority is running.

VM is another issue to consider. If a thread page faults, it can easily wait 20 milliseconds for the data to be read off the disk. While this is barely noticeable for many apps, it can be really bad for timing-critical apps. For this reason, you may want to store all your timing-critical data in locked areas (that is, create_area() with the B_FULL_LOCK flag), or use the realtime allocator defined in RealtimeAlloc.h. Also, there are two functions called media_realtime_init_image() and media_realtime_init_thread(), defined in MediaDefs.h, that lock image areas and thread stacks, respectively.

Making VM calls, including create_area(), delete_area(), resize_area(), find_area(), get_area_info(), etc., can cause unnecessary delays, as they perform quite a bit of locking in the kernel, and lower-priority threads may be holding those locks. As BeOS doesn't currently support priority inheritance, there is no guarantee that these calls will return as quickly as you need them to. Try to avoid making them from timing-sensitive code.

In many cases, it's desirable to perform file I/O associated with real time processing. Even if the disk can support the data rate you need (it generally can), disk access is very sporadic, being bound to the movement of a mechanical head, which is orders of magnitude slower than the processor. In this case, you can emulate async I/O by running another thread that reads data into a buffer, then having the real time thread work on the data within this buffer.
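
The reader-thread arrangement can be sketched in portable C++ as follows. This is only one way to do it, under some loud assumptions: std::thread stands in for a BeOS thread, an in-memory vector stands in for the file, and SpscRing, reader_thread, and stream_through_ring are names invented for the example. The ring is sized once up front so the consumer never allocates or takes a lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Single-producer, single-consumer byte ring (hypothetical helper).
class SpscRing {
 public:
  explicit SpscRing(size_t capacity) : buf_(capacity + 1), head_(0), tail_(0) {
    assert(capacity > 0);
  }

  // Producer side: copy in up to n bytes; returns how many fit.
  size_t write(const unsigned char *src, size_t n) {
    size_t head = head_.load(std::memory_order_relaxed);
    size_t tail = tail_.load(std::memory_order_acquire);
    size_t written = 0;
    while (written < n) {
      size_t next = (head + 1) % buf_.size();
      if (next == tail) break;  // ring is full
      buf_[head] = src[written++];
      head = next;
    }
    head_.store(head, std::memory_order_release);
    return written;
  }

  // Consumer side: copy out up to n bytes; never blocks.
  size_t read(unsigned char *dst, size_t n) {
    size_t tail = tail_.load(std::memory_order_relaxed);
    size_t head = head_.load(std::memory_order_acquire);
    size_t got = 0;
    while (got < n && tail != head) {
      dst[got++] = buf_[tail];
      tail = (tail + 1) % buf_.size();
    }
    tail_.store(tail, std::memory_order_release);
    return got;
  }

 private:
  std::vector<unsigned char> buf_;
  std::atomic<size_t> head_;  // next slot the producer will write
  std::atomic<size_t> tail_;  // next slot the consumer will read
};

// Stand-in for the disk reader: pushes `source` into the ring, yielding
// whenever the ring is full.
void reader_thread(const std::vector<unsigned char> &source, SpscRing &ring,
                   std::atomic<bool> &done) {
  size_t sent = 0;
  while (sent < source.size()) {
    size_t n = ring.write(source.data() + sent, source.size() - sent);
    sent += n;
    if (n == 0) std::this_thread::yield();
  }
  done.store(true, std::memory_order_release);
}

// Stand-in for the timing-critical thread: drains the ring in small chunks
// and returns everything it received.
std::vector<unsigned char> stream_through_ring(
    const std::vector<unsigned char> &source, size_t ring_capacity) {
  SpscRing ring(ring_capacity);
  std::atomic<bool> done(false);
  std::thread reader([&] { reader_thread(source, ring, done); });

  std::vector<unsigned char> out;
  out.reserve(source.size());
  unsigned char chunk[64];
  for (;;) {
    // Check the flag before reading so a final empty read is conclusive.
    bool finished = done.load(std::memory_order_acquire);
    size_t got = ring.read(chunk, sizeof(chunk));
    out.insert(out.end(), chunk, chunk + got);
    if (got == 0) {
      if (finished) break;
      std::this_thread::yield();  // underrun: nothing buffered yet
    }
  }
  reader.join();
  return out;
}
```

In a real audio thread you would not spin on an underrun as this demo does; you would more likely output silence for that buffer and carry on. The ring should be large enough to ride out the longest disk stall you expect.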

Note that a bit of restraint is important here. Locked memory is an expensive resource, and doing too much CPU-intensive processing in a real time thread can degrade system performance. Before trying some of these techniques, you should determine what your latency tolerances are and how well your app can handle them without locking memory or bumping up thread priorities.

Developers' Workshop: Serving It Up: Creating a Simple Server Application

By Eric Shepherd

BeOS provides a collection of C++ classes that make it a little easier to write networking software. These classes, BNetAddress, BNetBuffer, BNetEndpoint, and BNetDebug, give you friendly object-oriented access to networking.

In this article, we'll create a very simple server. It waits for connections on a specific port (hardcoded to 4242, but you could expand this to be configurable), and whenever a connection is made, it transmits the contents of a given file over the connection, then automatically closes the connection.

The program, called Responder, is launched from the command line by typing responder filename, where filename is the name of the file that should be sent when connections are made. The file can be any size, and can be binary or text. No attributes are preserved and no compression is done (this is a simple protocol, but you can expand on it if you want to). The assumption is that the connection is being established by a client application that knows what the file name and attributes should be (for example, a program that runs nightly to fetch updated data from a central server for a database solution).

You can build this program by copying the code into a source file and compiling it. Be sure to include the libnetapi.so library.

The main() function's primary purpose is to parse the command line and spawn the listener thread. In a more perfect server application, it might also set up a user interface for configuring the server or monitoring the server's status. Let's have a look:

#include <NetAddress.h>
#include <NetEndpoint.h>
#include <File.h>
#include <stdio.h>
#include <string.h>
#include <socket.h>
#include <OS.h>

static long responder_proc(void *data);
static void send_file(const char *filename, BNetEndpoint *connect);
static void send_error(BNetEndpoint *connect);

int main(int argc, char *argv[]) {
  char *filename;
  thread_id responder_thread;
  status_t err;

  if (argc != 2) {
    printf("usage: responder <filename>\n");
    return 1;
  }

  filename = argv[1];  // Filename of file to send

  responder_thread = spawn_thread(responder_proc,
        "File Responder", B_NORMAL_PRIORITY, filename);
  resume_thread(responder_thread);
  wait_for_thread(responder_thread, &err);
  return 0;
}

This is pretty basic stuff. The responder thread is spawned with B_NORMAL_PRIORITY, the filename specified on the command line is passed to it, and the thread is started by calling resume_thread(). Then main() waits for that thread to terminate before exiting the program.

The thread function, responder_proc(), sets up a network listener on port 4242 and handles incoming requests:

long responder_proc(void *data) {
  BNetEndpoint endpoint;

  if (endpoint.InitCheck() < B_OK) {
    return -1;
  }
  endpoint.Bind(4242);    // Bind to port 4242
  endpoint.Listen();      // Listen for incoming connections

  while (1) {
    BNetEndpoint *connect = NULL;
    connect = endpoint.Accept();    // Wait for a connection
    if (connect) {
      char hostname[256];
      in_addr addr;
      connect->RemoteAddr().GetAddr(hostname);
      connect->RemoteAddr().GetAddr(addr);
      printf("Connection from %s (%08X)\n", hostname, addr.s_addr);
      send_file((const char *) data, connect);
      delete connect;
    }
  }

  endpoint.Close();    // Not reached; the loop above never exits
  return 0;
}

It begins by creating a BNetEndpoint on the stack. The BNetEndpoint class represents one end of a network connection; all interactions on a connection are done through BNetEndpoint functions. BNetEndpoint::InitCheck() is called to ensure that the endpoint was created without incident; if an error occurred during construction, the thread function returns -1 at once.

Then the endpoint is bound to port 4242, and BNetEndpoint::Listen() is called to begin listening for incoming connections. By default, Listen() allows up to five connection requests to be backlogged, but you can optionally specify this value. For our purposes, five is plenty.

A "while" loop, which never terminates, then begins. In real life, you'd have this loop terminate when the user clicked a "Stop server" button, or something similar. As it is, the server can only be terminated by pressing Control+C from the terminal, or by killing it. This loop repeatedly attempts to accept incoming connections, and sends the file over them.

First, the BNetEndpoint::Accept() function is called. By default, Accept() blocks indefinitely until a connection attempt occurs; you can optionally specify a timeout. Once a connection is made, a new BNetEndpoint is returned, which duplicates the original endpoint except that there's now a connection to the remote client. This new endpoint, connect, is used to interact with the newly connected remote system.

We first use the BNetEndpoint function RemoteAddr() and the BNetAddress::GetAddr() functions to get the hostname and IP address of the remote system, and then we print that information to the terminal as a rudimentary log. Then we call send_file(), which we'll see momentarily, to actually transmit the file.

Finally, we delete the connection, which closes it and terminates the interaction. The loop then continues, calling Accept() again to get the next client's request.

The send_file() function does the real work of transmitting the file over the connection:

void send_file(const char *filename, BNetEndpoint *connect) {
  status_t err;
  off_t filesize;
  uint8 buffer[65536];

  // Open the file, abort if an error occurs.

  BFile file(filename, B_READ_ONLY);
  if (file.InitCheck() < B_OK) {
    send_error(connect);
    return;
  }

  // Get the file size, abort on error.

  err = file.GetSize(&filesize);
  if (err < B_OK) {
    send_error(connect);
    return;
  }

  // Put the file size into the buffer and send it.

  char s[64];
  sprintf(s, "%Ld\n", filesize);
  int32 count = strlen(s);
  memcpy(buffer, s, count);
  connect->Send(&buffer, count);

  // Now read the file, in chunks, and stuff them into the buffer,
  // sending the buffer each time.

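  while (1) {
    ssize_t size = file.Read(buffer, sizeof(buffer));
    if (size <= 0) {
      break;        // Read error or end of file; nothing to send
    }
    connect->Send(buffer, size);
    if (size < (ssize_t) sizeof(buffer)) {
      break;        // Short read: we're done
    }
  }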
}

// Report a failure to the client. The error code is just the
// string "ERROR\n".

void send_error(BNetEndpoint *connect) {
  connect->Send("ERROR\n", 6);
}

A 64KB buffer is allocated on the stack, and we then open the file for reading by creating a BFile object referencing the filename specified on launch. If an error occurs (as determined by calling BFile::InitCheck()), we call send_error() to transmit an error code (this is just the string "ERROR\n") and return at once.

Otherwise we get the file's size by calling BFile::GetSize(). If an error occurs here, again, we send the error code and return.

Once we know the file size, we transmit it to the client by stuffing the size (in ASCII) into the buffer and then transmitting it to the client using the BNetEndpoint::Send() function, specifying a pointer to the buffer and the number of bytes in the string, not including the null terminator.

Then we use a loop to read the file in 64KB chunks, transmitting each chunk to the client. Once the entire file is sent, the loop ends and the transfer is complete. After send_file() returns, the connection is closed by the responder thread.
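
Seen from the other end of the wire, the response is either the line "ERROR\n" or a decimal size line followed by that many raw bytes. A client-side parser for that framing might look like the sketch below. It is written in portable C++ with no networking, and the names ParseStatus, Response, and parse_response are invented for the example. Treating any other malformed header as an error is my assumption, since the protocol doesn't say.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

enum ParseStatus { kNeedMore, kError, kOk };  // hypothetical names

struct Response {
  ParseStatus status;
  std::string payload;  // file contents when status == kOk
};

// Examine everything received so far on the connection. Call again with
// more data whenever the result is kNeedMore.
Response parse_response(const std::string &received) {
  Response r;
  r.status = kNeedMore;

  // The header is everything up to the first newline.
  size_t nl = received.find('\n');
  if (nl == std::string::npos) return r;  // header line not complete yet

  std::string header = received.substr(0, nl);
  if (header == "ERROR") {
    r.status = kError;
    return r;
  }

  // Assumption: anything other than plain decimal digits is malformed.
  if (header.empty() ||
      header.find_first_not_of("0123456789") != std::string::npos) {
    r.status = kError;
    return r;
  }

  unsigned long long size = std::strtoull(header.c_str(), NULL, 10);
  size_t available = received.size() - (nl + 1);
  if (available < size) return r;  // payload still arriving

  r.status = kOk;
  r.payload = received.substr(nl + 1, static_cast<size_t>(size));
  return r;
}
```

Because the size comes first, the payload can be binary and can contain newlines of its own; only the first newline is treated as a delimiter.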

Again, this is a very simple server application, but with only a little work, it can be adapted into a useful application. Add a little HTTP protocol handling, and you have a rudimentary web server, for example. The basics are there; all that's needed is some protocol handling and a user interface.

The End Is Near!

By Jean-Louis Gassée

No, this isn't my contribution to The Great Millennium (or Halloween) Scare. It's about something much smaller—the size of transistors and the end of Moore's Law.

We just heard a warning from an Intel scientist: With the next generation of silicon fabrication technology, known as 0.1 micron, the basic storage or switching building block might become too small to be reliable. Too small, in this case, means the state of a building block might rely on fewer than a hundred electrons. Several factors can affect the actual number: manufacturing variances; quantum fluctuations; and various sources, natural or artificial, of radioactivity generating stray electrons.

When a binary state is built on millions of electrons dwelling in a silicon junction, a few stray electrons can hardly make a difference. With the base level down to a hundred or so, the scientist says, the distinction between zeroes and ones becomes unreliable. As a result, Moore's Law stops "working." In other words, we can no longer expect the price/performance ratio of our favorite silicon devices to double every 18 months.

A historical note: When Gordon Moore, an Intel founder, first formulated the observation subsequently elevated and christened into a law bearing his name, the cycle was pegged at 24 months. Lately, meaning in the last 36 months, progress has been arguably swifter. Just compare a 400 MHz Celeron with a 100 MHz Pentium of the Summer of '96 vintage.

Let's make two assumptions here. One, the Intel scientist is right and no amount of human ingenuity, even from Intel's competition, will change the situation. Two, I blush at the mere thought, this isn't a carefully crafted message from our friends, an artful set-up for some upcoming announcement.

Taking the story at face value, does it mean the end of the silicon go-go days? Perhaps, but let's look more closely at the consequences. Today's personal computers are imbalanced. By this I mean the speed of various organs is wildly out of whack, as exhibited by all sorts of caching schemes and memory and bus standards disagreements. I still wait for the leading computer magazines to buy a good logic analyzer and tell their readers how often one part of the system sits idle while another is choking. And, while they're at it, how long does the program counter point to OS code versus user applications?

I know, I know, good OS code does a lot of work so that applications just have to call for the right OS routines... If processors stop making the easy leaps we've learned to expect, generation after generation, perhaps the rest of the beloved PC clone organ bank will receive more attention. As a result, a 1GHz processor will run at full speed, instead of twiddling its thumbs, waiting for memory.

But maturation of silicon technology might lead to other reconsiderations. Moore's Law has been a source of cheap fixes, and we're grateful for those. But with these fixes no longer available in some version of the future, we might look again at our current brute force approach. Today, it doesn't matter much if a silicon architecture is less than perfect—tomorrow's generation will have twice the price/performance. Therefore, backwards compatibility, in spite of the layers of silt it generates, is inexpensive.

Actually, this holds true for software as well. That's why we have bloated, backwards compatible software, sitting atop large, backwards compatible microprocessors. The end of Moore's Law might cause a re-evaluation of the balance between the benefits of backwards compatibility and the cost of silt. Which is to say, new hardware and software architectures might arise from such re-evaluation. This leaves us with two partially interchangeable questions: When and on which life forms might such new architectures arise? On existing computing devices such as PCs, or on emerging ones in the sense that PCs were more than just smaller minicomputers?