Issue 3-36, September 9, 1998

Be Engineering Insights: Changes in the BeOS Driver API

By Cyril Meurillon

Oh no, you think, yet another article about drivers. Are they crazy about drivers at Be, or what? Ouaire iz ze beauty in driverz? The truth is that I would have loved to write about another (hotter) topic, one that has kept me very busy for the past few months, but my boss said I couldn't (flame him at ;-). I guess I'll have to wait until it becomes public information. In the meantime, please be a good audience, and continue reading my article.

Before I get on with the meat of the subject, I'd like to stress that the following information pertains to our next release, BeOS Release 4. Because R4 is still in the making, most of what you read here is subject to change in the details, or even in the big lines. Don't write code today based on the following. It is provided to you mostly as a hint of what R4 will contain, and where we're going after that.

Introduction of Version Control

That's it. We finally realized that our driver API was not perfect, and that there was room for future improvements, or "additions." That's why we'll introduce version control in the driver API for R4. Every driver built from then on will contain a version number that tells which API the driver complies with.

In concrete terms, the version number is a driver global variable that's exported and checked by the device file system at load time. In Drivers.h you'll find the following declaration:

extern  _EXPORT int32  api_version;

In your driver code, you'll need to add the following definition:

#include <Drivers.h>
int32  api_version = B_CUR_DRIVER_API_VERSION;

Driver API version 2 refers to the new (R4) API; version 1 is the R3 API. If the driver API changes again, we would bump the version number to 3. Newly built drivers would have to comply with the new API and declare 3 as their API version number. Old driver binaries would still declare an old version (1 or 2), forcing the device file system to translate them to the newer API (3). This incurs only a negligible overhead in loading drivers.
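
The loader-side logic is easy to picture. Here is a hypothetical sketch of the version lookup and its default for old binaries; the names and structure are illustrative, not actual devfs code:

```cpp
#include <cstdint>

// A pre-R4 binary exports no api_version symbol at all, so a failed symbol
// lookup (modeled here as a null pointer) means "first version of the API".
enum { kApiV1 = 1, kApiV2 = 2 };

int32_t effective_api_version(const int32_t* exported) {
    if (exported == nullptr)
        return kApiV1;       // old binary: assume the R3 API
    return *exported;        // driver declared its version explicitly
}
```

With this default in place, the loader can always dispatch on a version number, whether or not the driver binary predates versioning.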

But, attendez, vous say. What about pre-R4 drivers, which don't declare what driver API they comply with? Well, devfs treats drivers without a version number as complying with the first version of the API—the one documented today in the Be Book. Et voila.

New Entries in the device_hooks Structure

I know you're all dying to learn what's new in the R4 driver API... Here it is, revealed to you exclusively! We'll introduce scatter-gather and (a real) select in R4, and add a few entries in the device_hooks structure to let drivers deal with the new calls.


As discreetly announced by Trey in his article Be Engineering Insights: An Introduction to the Input Server, we've added two new system calls, well known to the community of UNIX programmers:

struct iovec {
  void   *iov_base;
  size_t  iov_len;
};
typedef struct iovec iovec;

extern ssize_t   readv_pos(int fd, off_t pos,
  const iovec *vec, size_t count);
extern ssize_t   writev_pos(int fd, off_t pos,
  const iovec *vec, size_t count);

These calls let you read and write multiple buffers to/from a file or a device. They initiate an IO on the device pointed to by fd, starting at position pos, using the count buffers described in the array vec.

One may think this is equivalent to issuing multiple simple reads and writes to the same file descriptor—and, from a semantic standpoint, it is. But not when you look at performance!

Most devices that use DMA are capable of "scatter-gather." It means that the DMA can be programmed to handle, in one shot, buffers that are scattered throughout memory. Instead of programming N separate IOs, each pointing to a single buffer, only one IO needs to be programmed, with a vector of pointers describing the scattered buffers. The result is higher bandwidth.

At a lower level, we've added two entries in the device_hooks structure:

typedef status_t (*device_readv_hook)
  (void *cookie, off_t position, const iovec *vec,
   size_t count, size_t *numBytes);

typedef status_t (*device_writev_hook)
  (void *cookie, off_t position, const iovec *vec,
   size_t count, size_t *numBytes);

typedef struct {
  device_readv_hook  readv;
    /* scatter-gather read from the device */
  device_writev_hook writev;
    /* scatter-gather write to the device  */
} device_hooks;

Notice that the syntax is very similar to that of the single-buffer read and write hooks:

typedef status_t (*device_read_hook)
  (void *cookie, off_t position, void *data,
   size_t *numBytes);

typedef status_t (*device_write_hook)
  (void *cookie, off_t position, const void *data,
   size_t *numBytes);

Only the descriptions of the buffers differ.

Devices that can take advantage of scatter-gather should implement these hooks. Other drivers can simply declare them NULL. When a readv() or writev() call is issued to a driver that does not handle scatter-gather, the IO is broken down into smaller IOs, one per buffer. Of course, R3 drivers don't know about scatter-gather, and are treated accordingly.
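
That fallback can be sketched in ordinary C++. Everything here is a stand-in, not the real devfs code: a std::vector plays the device, and write_one plays the driver's single-buffer write hook.

```cpp
#include <cstring>
#include <cstdint>
#include <vector>

struct iovec_s { const void* iov_base; size_t iov_len; };

// Stand-in for a driver's single-buffer write hook.
size_t write_one(std::vector<uint8_t>& dev, size_t pos,
                 const void* data, size_t len) {
    if (pos + len > dev.size()) dev.resize(pos + len);
    std::memcpy(dev.data() + pos, data, len);
    return len;
}

// Fallback writev: when the driver's writev hook is NULL, issue one plain
// write per vector element, advancing the position each time.
size_t writev_fallback(std::vector<uint8_t>& dev, size_t pos,
                       const iovec_s* vec, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++)   // one IO per buffer
        total += write_one(dev, pos + total, vec[i].iov_base, vec[i].iov_len);
    return total;
}
```

The semantics match a true scatter-gather write; only the number of programmed IOs (and hence the bandwidth) differs.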


I'm not breaking the news either with this one. Trey announced in his article last week the coming of select(). This is another call that is very familiar to UNIX programmers:

extern int select(int nbits,
      struct fd_set *rbits,
      struct fd_set *wbits,
      struct fd_set *ebits,
      struct timeval *timeout);

rbits, wbits and ebits are bit vectors. Each bit represents a file descriptor to watch for a particular event:

  • rbits: wait for input to be available (read returns something immediately without blocking)

  • wbits: wait for output to drain (write of 1 byte does not block)

  • ebits: wait for exceptions.

select() returns when at least one event has occurred, or when it times out. Upon exit, it reports (in the different bit vectors) the file descriptors that are ready for the corresponding event.

select() is very convenient because it allows a single thread to deal with multiple streams of data. The current alternative is to spawn one thread for every file descriptor you want to control. This might be overkill in certain situations, especially if you deal with a lot of streams.
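
Here is what a typical single-descriptor use looks like with the POSIX flavor of the call, which manipulates its fd_set bit vectors the same way (this assumes a POSIX environment rather than BeOS):

```cpp
#include <sys/select.h>
#include <unistd.h>

// Wait up to timeout_us microseconds for fd to become readable.
bool wait_readable(int fd, long timeout_us) {
    fd_set rbits;
    FD_ZERO(&rbits);
    FD_SET(fd, &rbits);              // watch this descriptor for input
    timeval tv;
    tv.tv_sec  = timeout_us / 1000000;
    tv.tv_usec = timeout_us % 1000000;
    int n = select(fd + 1, &rbits, nullptr, nullptr, &tv);
    return n > 0 && FD_ISSET(fd, &rbits);
}
```

A real multiplexing loop would set several bits before the call and test each one afterwards; the one-descriptor case keeps the mechanics visible.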

select() is broken down into two calls at the driver API level: one hook to ask the driver to start watching a given file descriptor, and another hook to stop watching.

Here are the two hooks we added to the device_hooks structure:

struct selectsync;
typedef struct selectsync selectsync;

typedef status_t (*device_select_hook)
  (void *cookie, uint8 event, uint32 ref, selectsync *sync);

typedef status_t (*device_deselect_hook)
  (void *cookie, uint8 event, selectsync *sync);

#define  B_SELECT_READ       1
#define  B_SELECT_WRITE      2

typedef struct {
  device_select_hook    select;    /* start select */
  device_deselect_hook  deselect;  /* stop select */
} device_hooks;

cookie represents the file descriptor to watch. event tells what kind of event we're waiting on for that file descriptor. If the event happens before the deselect hook is invoked, then the driver has to call:

extern void notify_select_event(selectsync *sync, uint32 ref);

with the sync and ref it was passed in the select hook. This happens typically at interrupt time, when input buffers are filled or when output buffers drain. Another place where notify_select_event() is likely to be called is in your select hook, in case the condition is already met there.

The deselect hook is called to indicate that the file descriptor shouldn't be watched any more, as the result of one or more events on a watched file descriptor, or of a timeout. It is a serious mistake to call notify_select_event() after your deselect hook has been invoked.
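
The driver-side protocol can be simulated with toy stand-ins for selectsync and notify_select_event(); the types below are illustrative, not the real kernel objects. Only the pattern is the point: notify at select time if the condition already holds, otherwise from the "interrupt" handler, and never after deselect.

```cpp
#include <cstdint>

struct selectsync_t { bool notified = false; uint32_t ref = 0; };

void notify_select_event(selectsync_t* sync, uint32_t ref) {
    sync->notified = true;
    sync->ref = ref;
}

struct toy_device {
    bool input_ready = false;
    selectsync_t* watcher = nullptr;
    uint32_t watcher_ref = 0;

    // select hook: fire right away if data is already there
    void select_hook(uint32_t ref, selectsync_t* sync) {
        if (input_ready) { notify_select_event(sync, ref); return; }
        watcher = sync;              // remember for interrupt time
        watcher_ref = ref;
    }
    // deselect hook: after this, sync must never be touched again
    void deselect_hook() { watcher = nullptr; }
    // our pretend interrupt handler: input just arrived
    void on_input() {
        input_ready = true;
        if (watcher) {
            notify_select_event(watcher, watcher_ref);
            watcher = nullptr;
        }
    }
};
```

In a real driver the select/deselect pair and the interrupt handler must of course synchronize against each other; that locking is omitted here.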

Drivers that don't implement select() should declare these hooks NULL. select(), when invoked on such drivers, will return an error.

Introduction of "Bus Managers"

Another big addition to R4 is the notion of "bus managers." Arve wrote a good article on this: Be Engineering Insights: Splitting Device Drivers and Bus Managers.

Bus managers are loadable modules that drivers can use to access a hardware bus. For example, the R3 kernel calls that drivers used to access the PCI bus looked like this:

extern long get_nth_pci_info(long index, pci_info *info);

extern long read_pci_config(uchar bus, uchar device,
  uchar function, long offset, long size);

extern void write_pci_config(uchar bus, uchar device,
  uchar function, long offset, long size, long value);

Now they're encapsulated in the PCI bus manager. The same goes for the ISA, SCSI, and IDE bus-related calls, and more busses will come. This makes the kernel a lot more modular and lightweight, as only the code handling the busses actually present is loaded in memory.

A New Organization for the Drivers Directory

In R3, /boot/beos/system/add-ons/kernel/drivers/ and /boot/home/config/add-ons/kernel/drivers/ contained the drivers. This flat organization worked fine. But it had the unfortunate feature of not scaling very well as you add drivers to the system, because there is no direct relation between the name of a device you open and the name of the driver that serves it. This potentially causes all drivers to be searched when an unknown device is opened.

That's why we've broken down these directories into subdirectories that help the device file system locate drivers when new devices are opened.

  • ../add-ons/kernel/drivers/dev/ mirrors the devfs name space using symlinks and directories

  • ../add-ons/kernel/drivers/bin/ contains the driver binaries

For example, the serial driver lives under ../add-ons/kernel/drivers/bin/ as serial, publishes its devices under /dev/ports/, and has the following symbolic link set up:

../add-ons/kernel/drivers/dev/ports/serial -> ../../bin/serial

If "fred", a driver, wishes to publish a ports/XYZ device, then it should set up this symbolic link:

../add-ons/kernel/drivers/dev/ports/fred -> ../../bin/fred

If a driver publishes devices in more than one directory, then it must set up a symbolic link in every directory it publishes in. For example, if driver "foo" publishes devices under both fred/bar/ and greg/, then it should come with the following symbolic links:

../add-ons/kernel/drivers/dev/fred/bar/foo -> ../../../bin/foo
../add-ons/kernel/drivers/dev/greg/foo -> ../../bin/foo

This new organization speeds up device name resolution a lot. Imagine that we're trying to find the driver that serves the device /dev/fred/bar/machin. In R3, we have to ask all the drivers known to the system, one at a time, until we find the right one. In R4, we only have to ask the drivers pointed to by the links in ../add-ons/kernel/drivers/dev/fred/bar/.
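
The mapping that makes this fast can be sketched with a hypothetical helper (illustrative only, not devfs code): the directory part of the device name directly selects the directory of candidate driver links.

```cpp
#include <string>

// Given a device name relative to /dev (e.g. "fred/bar/machin"), return the
// directory under .../drivers/dev/ where the candidate driver links live.
std::string driver_search_dir(const std::string& device) {
    std::string::size_type slash = device.rfind('/');
    std::string dir = (slash == std::string::npos)
                        ? std::string() : device.substr(0, slash);
    return "../add-ons/kernel/drivers/dev/" + dir;
}
```

Only the drivers linked from that one directory need to be asked, instead of every driver on the system.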

Future Directions

You see that the driver world has undergone many changes in BeOS Release 4. All this is nice, but there are other features that did not make it in, which we'd like to implement in future releases. Perhaps the most important one is asynchronous IO. The asynchronous read() and write() calls don't block—they return immediately instead of waiting for the IO to complete. Like select(), asynchronous IO makes it possible for a single thread to handle several IOs simultaneously, which is sometimes a better option than spawning one thread for each IO you want to do concurrently. This is true especially if there are a lot of them.

Thanks to the driver API versioning, we'll have no problems throwing the necessary hooks into the device_hooks structure while remaining backward compatible with existing drivers.

Be Engineering Insights: Higher-Performance Display

By Jean-Baptiste Quéru

When you write an application, the Interface Kit (and the Application Server, which runs underneath the Kit) is responsible for handling all the display that finally goes on screen. Together they provide a nice, reasonably fast way to develop a good GUI for your application.

Sometimes however, they aren't fast enough, especially for game writing. Using a windowed-mode BDirectWindow sometimes helps (or doesn't slow things down, in any case), but you still have to cooperate with other applications whose windows can suddenly overlap yours or want to use the graphics accelerator exactly when you need it. Switching to a full-screen BDirectWindow improves things a little more, but you may still want even higher performance. What you need is a BWindowScreen.

The BWindowScreen basically allows you to establish an (almost) direct connection to the graphics driver, bypassing (almost) the whole Application Server. Its great advantage over BDirectWindow is that it allows you to manipulate all the memory from the graphics card, instead of just having a simple frame buffer. Welcome to the world of double- (or triple-) buffering, of high-speed blitting, of 60+ fps performance.
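
A rough sketch of the arithmetic behind that double- or triple-buffering, using the 640x2048 8-bit frame buffer from the sample below (the numbers and helper names are illustrative, not an actual BWindowScreen call):

```cpp
// With a 640x2048 frame buffer and a 640x480 display area, card memory
// holds four full frames; "page flipping" is just moving the display
// origin from one frame's first line to another's.
constexpr int kBufferHeight = 2048;
constexpr int kFrameHeight  = 480;

int frames_in_buffer() { return kBufferHeight / kFrameHeight; }

int frame_y_origin(int frame_index) {
    return (frame_index % frames_in_buffer()) * kFrameHeight;
}
```

Rendering into one frame while another is displayed is what eliminates tearing and buys the 60+ fps figure quoted above.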

Looks quite exciting, hey? Unfortunately, all is not perfect. BWindowScreen is a low-level API. This means that you'll have to do many things by hand that you were used to having the Application Server do for you. BWindowScreen is also affected by some hardware and software bugs, which can make things harder than they should be.

BWindowScreen reflects the R3 graphics architecture. That architecture is going away in R4, since it was becoming dated. The architecture that replaces it will allow some really cool things in later releases. BWindowScreen is still the best way to get high-performance full screen display in R4, though it too will be replaced by something even better in a later release.

Here is a code snippet, ready for you to use and customize:

#include <Application.h>
#include <WindowScreen.h>
#include <string.h>

typedef long (*blit_hook)(long,long,long,long,long,long);
typedef long (*sync_hook)();

class NApplication:public BApplication {
public:
  NApplication();
  bool is_quitting;
    // So that the WindowScreen knows what to do
    // when disconnected.
private:
  bool QuitRequested();
  void ReadyToRun();
};

class NWindowScreen:public BWindowScreen {
public:
  NWindowScreen(status_t*);
private:
  void ScreenConnected(bool);
  long MyCode();
  static long Entry(void*);
  thread_id tid;
  sem_id sem;
  area_id area;
  uint8* save_buffer;
  uint8* frame_buffer;
  ulong line_length;
  bool thread_is_locked;
    // small hack to allow to quit the
    // app from ScreenConnected()
  blit_hook blit;
    // hooks to the graphics driver functions
  sync_hook sync;
};

int main() {
  NApplication app;
}

NApplication::NApplication()
  :BApplication("application/x-vnd.Be-sample-jbq1") {
  Run(); // see you in ReadyToRun()
}

void NApplication::ReadyToRun() {
  status_t ret=B_ERROR;
  NWindowScreen* ws=new NWindowScreen(&ret);
  // exit if constructing the WindowScreen failed.
  if ((ws==NULL)||(ret<B_OK))
    PostMessage(B_QUIT_REQUESTED);
}

bool NApplication::QuitRequested() {
  is_quitting = true;
  return true;
}

NWindowScreen::NWindowScreen(status_t* ret)
  :BWindowScreen("Example",B_8_BIT_640x480,ret) {
  if (*ret==B_OK) {
    // this semaphore controls the access to the WindowScreen
    sem=create_sem(0,"WindowScreen Access");

    // this area is used to save the whole framebuffer when
    // switching workspaces. (better than malloc()).
    area=create_area("save",(void**)&save_buffer,B_ANY_ADDRESS,
      640*2048,B_NO_LOCK,B_READ_AREA|B_WRITE_AREA);

    // exit if an error occurred.
    if ((sem<B_OK)||(area<B_OK)) *ret=B_ERROR;
    else Show(); // let's go. See you in ScreenConnected.
  }
}

void NWindowScreen::ScreenConnected(bool connected) {
  if (connected) {
    if ((SetSpace(B_8_BIT_640x480)<B_OK)
         ||(SetFrameBuffer(640,2048)<B_OK)) {
      // properly set the framebuffer.
      // exit if an error occurs.

  // get the hardware acceleration hooks. get them each time
  // the WindowScreen is connected, because of multiple
  // monitor support

  // cannot work with no hardware blitting
  if (blit==NULL) {

  // get the framebuffer-related info, each time the
  // WindowScreen is connected (multiple monitor)
  if (tid==0) {
    // clean the framebuffer
    // spawn the rendering thread. exit if an error occurs.
    // don't use a real-time thread. URGENT_DISPLAY is enough.
    if (((tid=spawn_thread(Entry,"rendering thread",
    } else
      for (int y=0;y<2048;y++)
        // restore the framebuffer when switching back from
        // another workspace.

    // set our color list.
    for (int i=0;i<128;i++) {
      rgb_color c1={i*2,i*2,i*2};
      rgb_color c2={127+i,2*i,254};

    // allow the rendering thread to run.
  } else {
    // block the rendering thread.
    if (!thread_is_locked) {

    // kill the rendering and clean up when quitting
    if ((((NApplication*)be_app)->is_quitting)) {
      status_t ret;
    } else {
      // set the color list black so that the screen doesn't
      // seem to freeze while saving the framebuffer
      rgb_color c={0,0,0};
      for (int i=0;i<256;i++)
      // save the framebuffer
      for (int y=0;y<2048;y++)

long NWindowScreen::Entry(void* p) {
  return ((NWindowScreen*)p)->MyCode();
}

long NWindowScreen::MyCode() {
  // gain access to the framebuffer before writing to it.
  for (int j=1440;j<2048;j++) {
    for (int i=0;i<640;i++) {
      // draw the background ripple pattern
      float val=63.99*(1+cos(2*PI*((i-320)*(i-320)
  ulong numframe=0;
  bigtime_t trgt=0;
  ulong y_origin;
  uint8* current_frame;
  while(true) {
    // the framebuffer coordinates of the next frame

    // and a pointer to it

    // copy the background
    int ytop=numframe%608,ybot=ytop+479;
    if (ybot<608) {
    } else {

    // calculate the circle position. doing such calculations
    // between blit() and sync() can save some time.
    uint32 x=287.99*(1+sin(numframe/72.));
    uint32 y=207.99*(1+sin(numframe/52.));
    if (sync) sync();

    // draw the circle
    for (int j=0;j<64;j++) {
      for (int i=0;i<64;i++) {
        if ((i-31)*(i-32)+(j-31)*(j-32)<=1024)

    // release the semaphore while waiting. gotta release it
    // at some point or nasty things will happen!

    // we're doing some triple buffering. unwanted things would
    // happen if we rendered more pictures than the card can
    // display. we here make sure not to render more than 55.5
    // pictures per second.
    if (system_time()<trgt) snooze(trgt-system_time());

    // acquire the semaphore back before talking to the driver
    // do the page-flipping
    // and go to the next frame!
  return 0;
}

There are some traps to be aware of before you begin playing with the BWindowScreen:

  • BWindowScreen(), SetSpace() and SetFrameBuffer()

  • choosing a good color_space and a good framebuffer size

  • MoveDisplayArea() and hardware scrolling

  • CardHookAt(10) ("sync")

  • ScreenConnected() and multiple monitors

  • MoveDisplayArea() and the R3 Matrox driver

  • 15/16bpp

Developers Workshop: Yet Another Locking Article

By Stephen Beaulieu

It is funny, but somewhat fitting that many times the Newsletter article you intend to write is not really the Newsletter article you end up writing. With the best of intentions, I chose to follow a recent trend in articles and talk about multithreaded programming and locking down critical sections of code and resources. The vehicle for my discussion was to be a Multiple-Reader Single-Writer locking class in the mode of BLocker, complete with Lock(), Unlock(), IsLocked() and an Autolocker-style utility class. Needless to say, the class I was expecting is a far cry from what I will present today.

In the hopes of this being my first short Newsletter article, I will leave the details of the class to the sample code. For once it was carefully prepared ahead of time and is reasonably commented. I will briefly point out two neat features of the class before heading into a short discussion of locking styles. The first function to look at is the IsWriteLocked() function, as it shows a way to cache the index of a thread's stack in memory, and use it to help identify a thread faster than the usual method, find_thread(NULL).

The stack_base method is not infallible, and needs to be backed up by find_thread(NULL) when there is no match, but it is considerably faster when a match is found. This is kind of like the benaphore technique of speeding up semaphores.
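
For readers unfamiliar with benaphores, here is a single-threaded sketch of the counting trick; the kernel semaphore calls are only marked in comments, so no real blocking is modeled, and the class is illustrative rather than the sample's actual code.

```cpp
#include <atomic>

// An atomic counter filters out the uncontended case, so the (expensive)
// kernel semaphore is touched only when there is actual contention.
// sem_ops counts where acquire_sem()/release_sem() would be called on BeOS.
class Benaphore {
    std::atomic<int> count{0};
public:
    int sem_ops = 0;
    void lock() {
        if (count.fetch_add(1) > 0)
            sem_ops++;               // contended: would acquire_sem() here
    }
    void unlock() {
        if (count.fetch_sub(1) > 1)
            sem_ops++;               // waiters exist: would release_sem() here
    }
};
```

The uncontended path costs one atomic add and one atomic subtract, with no kernel call at all, which is where the speedup comes from.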

The other functions to look at are the register_thread() and unregister_thread() functions. These are debug functions that keep state about threads holding a read-lock by creating a state array with room for every possible thread. An individual slot is set aside for each thread, located by computing thread_id % max_possible_threads. Again, the code itself lists these in good detail. I hope you find the class useful. A few of the design decisions I made are detailed in the discussion below.
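
The slot bookkeeping can be sketched as follows; kMaxThreads and the helper names are illustrative, not the sample's actual code.

```cpp
#include <array>
#include <cstdint>

constexpr int kMaxThreads = 4096;             // illustrative capacity
std::array<int, kMaxThreads> read_counts{};   // per-thread read-lock depth

// Thread IDs are unique among live threads, so the modulo gives each
// thread a private slot in the flat state array.
int slot_for(int32_t tid) { return static_cast<int>(tid % kMaxThreads); }

void register_thread(int32_t tid)   { read_counts[slot_for(tid)]++; }
void unregister_thread(int32_t tid) { read_counts[slot_for(tid)]--; }
bool holds_read_lock(int32_t tid)   { return read_counts[slot_for(tid)] > 0; }
```

Because no lock protects the array itself, each slot must only ever be touched by its own thread, which is exactly what the modulo indexing guarantees.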

I want to take a little space to discuss locking philosophies and their trade-offs. The two opposing views can be presented briefly as "Lock Early, Lock Often" and "Lock Only When and Where Necessary." These philosophies sit on opposite ends of the spectrum of ease of use and efficiency, and both have their adherents in the company (understanding that most engineers here fall comfortably in the middle ground.)

The "Lock Early, Lock Often" view rests on the idea that if you are uncertain exactly where you need to lock, it is better to be extra sure that you lock your resources. It advises that all locking classes should support "nested" calls to Lock(); in other words, if a thread holds a lock and calls Lock() again, it should be allowed to continue without deadlocking while waiting for itself to release the lock. This increases the safety of the lock by allowing you to wrap all of your functions in Lock()/Unlock() pairs, letting the lock keep track of whether it actually needs to be acquired. An extension of this idea is the Autolocking class, which acquires a lock in its constructor and releases it in its destructor. By allocating one of these on the stack you can be certain that you will safely hold the lock for the duration of your function.

The main advantage of the "Lock Early, Lock Often" strategy is its simplicity. It is very easy to add locking to your applications: create an Autolock at the top of all your functions and be assured that it will do its magic. The downside of this philosophy is that the lock itself needs to get smarter and to hold onto state information, which can cause some inefficiencies in space and speed.

At the other end of the spectrum is "Lock Only When and Where Necessary." This philosophy asserts that programmers using the "Lock Early, Lock Often" strategy do not understand the locking requirements of their applications, and that this is essentially a bug waiting to happen. In addition, the overhead added to applications by locking when it is unnecessary (say, in a function that is only called from within another function that already holds the lock) and by using an additional class to manage the lock makes the application larger and less efficient. This view instead requires programmers to really design their applications and to fully understand the implications of the locking mechanisms chosen.

So, which is correct? I think it often depends on the tradeoffs you are willing to make. For locks with only a single owner, the state information needed is very small, and usually the lock's system for determining if a thread holds the lock is fairly efficient (see the stack_base trick mentioned above to make it a bit faster.) Another consideration is how important speed and size are when dealing with the lock. In a very crucial area of an important, busy program, like the app_server, increasing efficiency can be paramount. In that case it is much, much better to take the extra time to really understand the locking necessary and to reduce the overhead. Even better would be to design a global application architecture that makes the flow of information clear, and correspondingly makes the locking mechanisms much better (along with everything else.)

The MultiLocker sample code provided leans far to the efficiency side. The class itself allows multiple readers to acquire the lock, but does not allow these readers to make nested ReadLock() calls. The overhead of keeping state for each reader (storage space, and stomping through that storage space every time a ReadLock() or ReadUnlock() call is made) was simply too great. Writers, on the other hand, have complete control over the lock, and may make ReadLock() or additional WriteLock() calls after the lock has been acquired. This allows a little bit of design flexibility so that functions that read information protected by the lock can be safely called by a writer without code duplication.

The class does have a debug mode where state information is kept about readers so you can be sure that you are not performing nested ReadLock()s. The class also has timing functions so that you can see how long each call takes in DEBUG mode and, with slight modifications to the class, measure the benefits of the stack-based caching noted above. I have included some extensive timing information from my computers that you can look at, or you can run your own tests with the test app included. Note that the numbers listed are pretty close to the raw numbers of the locking overhead, as writers only increment a counter, and readers simply access that counter.

The sample code can be found at:

The class should be pretty efficient, and you are free to use it and make adjustments as necessary. My thanks go out to Pierre and George from the app_server team, for the original lock on which this is based, and for their assistance with (and insistence on) the efficiency concerns.

Is the A/V Space a Niche Market?

By Jean-Louis Gassée

And, if it is, are we wrong to focus on it? Can we pace off enough running room to launch the virtuous ascending spiral of developers begetting users begetting developers? Is the A/V space large enough to swing a cat and ignite a platform?

Perhaps there's another way to look at the platform question, one that's brought to mind by the latest turn of Apple's fortunes. Back in 1985, Apple had a bad episode: The founders were gone, the new Mac wasn't taking off and the establishment was dissing Apple as a toy company with a toy computer. The advice kept pouring in: reposition the company, refocus, go back to your roots, find a niche where you have a distinctive advantage. One seer wanted to position Apple as a supplier of Graphics-Based Business Systems, another wanted to make the company the Education Computer Company. Steve Jobs, before taking a twelve year sabbatical, convinced Apple to buy 20% of Adobe, and thus began the era of desktop publishing and the Gang of Four (Apple, Adobe, Aldus and Canon).

Apple focused on publishing, and is still focused on publishing (as evidenced by the other Steve—Ballmer—ardently promoting NT as *the* publishing platform). Does that make Apple a publishing niche player? Not really. iMac buyers are not snapping up the "beetle" Mac for publishing, they just want a nice general-purpose computer. Although Apple is still thrown into the publishing bin, the Mac has always strived to be an everyday personal computer, and the numbers show that this isn't mere delusion: For example, Macs outsell Photoshop ten to one. But let's assume that at the company's zenith, publishing made up as much as 25% of Apple sales. Even then, with higher margin CPUs, Apple couldn't live on publishing alone, hence the importance of a more consumer-oriented product such as the iMac and hence, not so incidentally, the importance of keeping Microsoft Office on the platform.

The question of the viability of an A/V strategy stems from us being thrown into the same sort of bin as our noble hardware predecessor. But at Be we have an entirely different business model. A hardware company such as Apple can't survive on a million units per year. Once upon a time it could, but those were the salad days of expensive computers and 66% gross margins. We, on the other hand, have a software-only business model and will do extremely well with less than a million units per year--and so will our developers. As a result, the virtuous spiral will ignite (grab a cat).

More important—and here we share Apple's "niche-yet-general" duality -- the question may be one that never needs to be answered: While BeOS shows its unique abilities in A/V, we're also starting to see applications for everyday personal computing. I'm writing this column on Gobe Productive and e-mailing it to the prose-thetic surgeon using Mail-It, both purchased using NetPositive and SoftwareValet.

Creative Commons License
Legal Notice
This work is licensed under a Creative Commons Attribution-Non commercial-No Derivative Works 3.0 License.