Be Newsletters - Volume 3: 1998

Issue 3-41, October 14, 1998

Be Engineering Insights: Where Does the Time Go?

By Rico Tudor

The star of this article is system_time(). Unlike time services of yesteryear, which were precise to 1/100th of a second, or even a mere second, BeOS can report wall-time to the nearest microsecond (µs). This allows some interesting uses in the area of benchmarking, CPU usage in particular.

Where I quantify results, note that my test machine is a uniprocessor 200MHz Pentium Pro, a mid-level performer these days. All concepts and results are applicable for BeOS users on PowerPC, although the numbers may vary.

Benchmarking a system from the outside is straightforward: hook up the logic analyzer and read off the numbers. However, having a system benchmark itself presents some real problems. Heisenberg's Uncertainty Principle, which applies to subatomic particles, states that the more you know about a particle's position, the less you know about its momentum (where it's going), and visa versa. For us to observe the particle, it must interact with its surroundings, and that changes the state as observed.

Analogously, the more you know where your program is executing, the less you know how much time it's hogging. For us to observe the program's whereabouts, we must add code, and that introduces overhead that is not part of the program.

In both arenas, we can attempt to gain better measurements by using indirect methods. In the case of benchmarking, we want to benchmark in ways that allow overhead to be minimized, measured or canceled.

Using only system_time(), I will reveal some tricks I use when measuring code, and some pitfalls of interpreting results. Special bonus: this article includes code samples which can be assembled into a complete applet, "cpuload".

Simple Example

First, let's time system_time(), using itself!

bigtime_t t = system_time( );
t = system_time( ) - t;
printf( "%f\n", (double)t);

This measures the time to execute one call of system_time(), and I get either 0µs or 1µs. For more precision, we need to average a bunch of calls:

bigtime_t t = system_time( );
for (int i=0; i<1000-1; ++i)
        system_time( );
t = system_time( ) - t;
printf( "%f\n", t/1000.);

I get about 0.28µs... good enough for government work.

The Grind Routine

Here is the first installment of our applet. The grind() routine is designed to do nothing, expensively, but to an extent controlled by argument n. For the larger purpose, it could waste time in any fashion, like call sqrt() many times. However, this implementation has properties which will allow us to conduct a sanity check.

#include        <Application.h>
#include        <math.h>
#include        <stdio.h>
#include        <string.h>

struct pointer {
        struct pointer  *p;
};
struct pointer  futile  = { &futile };
uint            z;

void
grind( uint n)
{
  struct pointer *p = &futile;
  for (uint i=0; i<n; ++i) {
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
          p = p->p; p = p->p; p = p->p; p = p->p;
  }
  z += p == 0;
}

futile is a linked list, made circular by pointing to itself. grind() traverses this list futilely until its counter is exhausted. In assembler, the inner loop looks like this:

loop:
  movl (%eax),%eax
  movl (%eax),%eax
  .
  .       28 more "movl" instructions
  .
  movl (%eax),%eax
  movl (%eax),%eax
  incl %edx
  cmpl %ecx,%edx
  jb loop

The compiler cannot eliminate this code by optimization, partly because of the dummy variable z (always 0). For each step along the list, the Intel P6 core (Pentium Pro, Pentium II) takes 3 clocks, the L1 cache latency. Loop overhead is eliminated by speculative execution (fancy hardware). Therefore, grind(1) takes 96 clocks plus entry/exit overhead, and grind(100) reduces that overhead by a factor of 100.

The situation is largely the same with the PowerPC, although the numbers are different (e.g. 2 clocks needed by the 604).

Calibration

Next, we need to calibrate grind() to determine just how much time it uses; we can influence the duration with the supplied argument. This is not as simple as calling grind() with bracketing calls to system_time(). We need to eliminate the "hidden costs" of the modern OS, including all hardware interrupts, and all time-sharing with currently running servers and apps.

To do this, we make two assumptions (uh, oh). First, that time is allocated in a fixed quantum by the scheduler. Second, if we run a short enough grind() enough times, it will complete without being interrupted.

Here is the main body of the applet. calibrate() uses a power-of-1.1 search to set gtime and gcount, where grind(gcount) will use "gtime" microseconds, with no OS overhead. For each candidate gcount, we run grind() 100 times and record the BEST performance.

bigtime_t gtime;
uint      gcount;

struct A: BApplication {
  A( ): BApplication( "application/x-cpuload") {
          Run( );
  }
  void ReadyToRun( ) {
          calibrate( );
          new W( BRect( 0, 0, 100, 100));
  }
  void calibrate( ) {
    gcount = 10;
    while (TRUE) {
      gtime = 0;
      for (uint i=0; i<100; ++i) {
        bigtime_t t0 = system_time( );
        grind( gcount);
        bigtime_t t = system_time( ) - t0;

        if (gtime==0 || t<gtime)
          gtime = t;
      }
      if (gtime > 800)
              break;
      gcount *= 1.1;
    }
    printf( "gtime=%Ld gcount=%d MHz=%f\n", gtime, gcount,
    gcount*32.*3/gtime);
  }
  void MessageReceived( BMessage *m) {
    if (m->what == 'q')
      Quit( );
  }
};

main( )
{
  A       a;
  return (z);
}

The search is over when the time quantum exceeds 800µs. Why? Because I claim there is a clock interrupt every 1ms, the first regular source of overhead that calibrate() will encounter. You can verify my claim by nudging the printf() into the while loop, extending the search to 2 seconds, and plotting the results. Here's what I get:

201 +
    |
    |
200 +  #
    |      #
    |    #  #####
    #     #      #
199 +##
    |                #
    |               #
198 +             ##   ##
    |                    # ######
    |                     #      ### ###
    |                                   ##########
197 +                                              ###
    |
    |
196 +
    |                                                   ##
    |                                                  #   #
    |                                                 #   #   #
195 +-------------+------------+-------------+------------+-#---
   100          1000         10000        100000        1e+06

This is a plot of time-slice (in microseconds) against effective CPU clock frequency (in MHz). Measurements are noisy until a few hundred microseconds, after which the CPU is reliably measured at 199.5MHz, close to the manufacturer rating. At 1000µs, it drops suddenly to 198MHz: this is evidence of the 1000Hz clock interrupt, when BeOS must perform housekeeping. The graph shows steadily increasing load between 1ms and 1s. This load includes Pulse events, and other periodic work in the servers. Work of periodicity 1s is popular since another decline is clearly visible.

By writing grind() in a certain way, and tossing in knowledge about the P6 architecture, we can denominate CPU performance in MHz. This is not a necessity, but does confirm we're on the right track. It also shows how code written to calibrate grind() can be turned into a benchmark of its own.

The CPULOAD Applet

Here is the remaining code for "cpuload". If you insert it before struct A, the applet is ready for compilation.

struct L: BLooper {
  L( ) {
    n = 0;
    Run( );
    PostMessage( 'g');
  }

  void MessageReceived( BMessage *m) {
    set_thread_priority( find_thread( 0), B_LOW_PRIORITY/5);
    while (TRUE) {
      grind( gcount);
      ++n;
    }
  };
  uint    n;
};

struct C: BView {
  C( BRect r): BView( r, 0, B_FOLLOW_ALL, B_WILL_DRAW|B_PULSE_NEEDED)
  {
    lastt = system_time( );
    memset( lastn, 0, sizeof lastn);
    system_info si;
    get_system_info( &si);
    for (lcount=0; lcount<si.cpu_count; ++lcount)
      ltab[lcount] = new L( );
  }

  void Pulse( ) {
    bigtime_t t = system_time( );
    float f = 0;

    for (uint i=0; i<lcount; ++i) {
      f += 100. * (ltab[i]->n-lastn[i]) / ((float)(t-lastt)/gtime);
      lastn[i] = ltab[i]->n;
    }

    lastt = t;
    f /= lcount;
    graph( f);
  }

  virtual void graph( float) = 0;
  bigtime_t lastt;
  uint      lastn[20],
            lcount;
  L         *ltab[20];
};

struct V0: C {
  V0( BRect r): C( r) { }

  void graph( float f) {
    printf( "%f\n", f);
  }
};

struct W: BWindow {
  W( BRect r): BWindow( r, "cpuload", B_TITLED_WINDOW, B_NOT_ZOOMABLE)
  {
    AddChild( new V0( r));
    MoveBy( 100, 100);
    SetPulseRate( 1000000);
    Show( );
  }

  bool QuitRequested( ) {
    be_app->PostMessage( 'q');
    return (TRUE);
  }
};

L does the grunt work of calling grind(), at low scheduling priority: any unused CPU cycles are soaked up and the exact number recorded. One L is created for each CPU on your machine.

C starts L and calculates the load on a per second basis. It grabs results directly from L, a different thread, which works because the datum is a single machine word. This eliminates communication overhead, which would otherwise have to be analyzed.

V0 prints the results textually. W requests Pulse events for reporting purposes, 1 per second.

Observations

A simple way to swamp the CPU for a few seconds is the following shell command:

echo '2^99999' | bc

With " cpuload " running, CPU availability drops from 97% to 4%. Why not 100% and 0%? 97% reflects actual BeOS time-sharing overhead, while 4% shows the impact of "cpuload" itself. While its scheduling priority of 1 is as low as possible, BeOS will nonetheless allocate "cpuload" a few time-slices per second. This anomaly still allows accurate measurements in the remaining range. A scheduling priority of 0 would cause scheduling conflicts with the idle thread, yielding peculiar results.

CPULOAD Variations

Text output has some advantages:the applet introduces less overhead, and results can be logged for later analysis. With averaging over a minute, precision of 4 digits can be expected. However, graphical output is compelling, so I provide two possible replacements for V0:

struct V1: C {
  V1( BRect r): C( r) {
    x = r.right;
  }
  void graph( float f) {
    ScrollBy( 1, 0);
    StrokeLine( BPoint( x, 100), BPoint( x, 100-(f+.5)));
    ++x;
  }
  uint x;
};

struct V2: C {
  V2( BRect r): C( r) {
    memset( rec, 0, sizeof rec);
    nrec = 0;
  }

  void graph( float f) {
    if (nrec == 0)
      rec[nrec] = f;
    else
      rec[nrec] = f + rec[nrec-1];

    ++nrec;

    for (uint i=0; ; ++i) {
      uint j = 1 << i;
      if (j >= nrec)
              break;

      float y = (rec[nrec-1]-rec[nrec-1-j]) / j;

      FillRect( BRect( i*8, 100-y, i*8+6, 100));
      SetHighColor( 255, 255, 255);
      FillRect( BRect( i*8, 0, i*8+6, 100-1-y));
      SetHighColor( 0, 0, 0);
    }
  }

  void Draw( BRect) {
    SetHighColor( 255, 0, 0);
    for (uint x=0; x<100; x+=20)
      FillRect( BRect( 0, x, 100, x+10-1));
  }

  float   rec[33000];
  uint    nrec;
};

V1 implements a simple horizontal scrolling plot. V2 shows the load for the last 2^n seconds, for all integers "n" from 0 to 15: each bar averages twice the historical load of the previous bar.

Another variation involves running grind() at the priority of your Be app. Now "cpuload" is measuring the amount of CPU time your app would get under current conditions, rather than measuring idle CPU cycles.

Finally, you can start multiple grinders on one CPU, and measure BeOS time-sharing efficiency. With two grinders per CPU, instead of one, I measure CPU efficiency at 99.92%. Are we having fun yet?

Conclusion

"cpuload" measures CPU usage in its own way, not necessarily a way suitable for your app. As an example, if your app needed the entire L2 cache (imagine "futile" being 60K elements), and another L2 hog was running on the system, the interaction could cause a disastrous performance degradation:I can construct cases where the slowdown is a factor of 10. Yet "cpuload" (or a logic analyzer) would show an equitable sharing of CPU time.

Understanding these interactions is the crux of benchmarking in the modern environment of complex OS and hardware. The answer is attacking the benchmarking challenge with the same custom concentration that you use to write the program. And take industry benchmark figures with more than a grain of salt.

Developers Workshop: But You Can't Make Her Think

By Doug Fulton

In the 1920's, a ranger at Yellowstone National Park found himself in a psychological battle with two of the college students the Park hired for summer work. Less than a battle, actually, it was more a series of artful annoyances played on the ranger by the students. They were careful about covering their tracks, so the ranger couldn't retaliate or even reprimand them—although avoiding punishment wasn't their main concern. They were more interested in increasing the ranger's frustration by their anonymity. Finally, however, the two blew their cover in the interest of art.

Of the hundreds of geysers in Yellowstone, a few erupt with a dependable regularity. Bleachers are set up around these geysers, where the park rangers lecture the tourists for a few minutes before the eruption. Just to the side of a tool shed near their nemesis's geyser, the two students planted a steering column, with the wheel still attached, that they had lifted from an abandoned car, jabbing it column end down into the ground.

They hid behind the shed waiting for the crowd to assemble. Seconds before the geyser erupted—just as the ranger was finishing his speech -- one of the students yelled to the other something along the lines of "Okay, let her go!", and the other fellow jumped out from behind the shed and, feigning great effort, began to spin the wheel, apparently letting loose the jet of steam and hot water.

This is an amusing, but misleading, UI.

It was leaked in these pages a few weeks ago that the UI Guidelines, our Balm of Gilead for the Control what Ails You, was back from the ghost writers and THIS CLOSE to being finished. True-ish, but, unfortunately, this propinquity is affected by daylight: All the words are there (and then some), but we're a bit thin on hours in the day that can be devoted to pushing those words into the correct order and trimming the ones that don't do much more than sit on the page and smile at you as if you yourself were a drooling idiot in need of constant reminding that the user will be less confused if you design your UI such that it is...less confusing.

Out on the documentation plantation (where we grow our own adverbs, and at night you can hear the boasts of the rodomonts as they stuff syllables back into their syncopes), we're also raising tech doc, user doc, asparagus, release notes, and miscellaneous editorialisms and articles. And of all these, the UI Guidelines, whose topic is a religious issue, needs the most scrupulous attention (of a certain type): We don't want to publish an errata sheet for the bible.

So let's resynchronize our watches: We're going to try to get the UI Guidelines—at least the parts with the sex scenes—on the R4 CD. But if they don't quite make it, they'll be posted to the web site soon thereafter. You can lead a horse to water, but don't hold your breath.

A Red Herring

By Jean-Louis Gassée

The "Red Herring" makes interesting reading—an insider view of Silicon Valley wheelers and dealers. It's well-written and opinionated, above cutting-and-pasting quotes from the usual suspects and recycling of press releases, which makes some other business magazines so bland. One doesn't have to agree with every opinion this estimable magazine offers, however. Opinions are made to be disagreed with, and I'd like to oblige.

In the "Red Herring" for November 98, in an article titled DRINKING THE KOOL-AID™, Jason Pontin opines, "Steve Jobs rescued Apple—but the point is moot." He concedes that Apple has a simpler, more successful product line, smaller inventories, and now, for the first time since 1995, a healthy bottom line. But, he contends, it doesn't matter anymore.

"The whole point of the Mac was to be a better alternative to Microsoft operating systems on Intel chips. Yet no one wants such an alternative anymore, except graphic designers who are used to it," he says. Pontin argues that iMac buyers aren't thinking "different," they're just buying a fast Internet terminal that also runs Office, probably implying a PC would do more for less. In his mind, people who rave about Apple's spectacular turnaround have drunk the famous Kool-Aid the founder was accused of serving his followers in the old days. And he concludes: "Steve Jobs has saved Apple. Good for him. It doesn't matter."

As indicated earlier, I beg to differ.

The article notes, but fails to attach much significance to, the fact that almost 10 percent of iMac buyers were PC users. Keeping in mind the iMac quirks and price, the iMac success with PC owners cannot be dismissed. Just as Apple fans were wrong to keep on dissing the growing number of PC buyers for not seeing that the Mac was so much better, PC aficionados would err if they thought the many people who buy iMacs were just drinking the Apple Kool-Aid.

Simplicity and style, exactly what Apple promoted, do matter. And, unlike the "Red Herring," the PC industry agrees and has reacted in its usual pragmatic way. Andy Grove was quoted in Time magazine on the positive effect of the iMac on the thinking of his customers, PC makers. The risk for Apple is to see the PC industry once again adopt and improve upon its ideas. On the bright side, Apple is in a leadership position again, not teetering on the edge of the grave.

I don't believe any company should survive merely because it provides "diversity" to the ecosystem, but when customers vote with their wallets for a variety of solutions, this is good news. Who would have predicted a year ago Linux and the iMac would appear as legitimate choices for enterprises and consumers?

This is ironic at the very time when the trial for the DOJ suit against Microsoft is about to start, especially when one considers the role Microsoft played in supporting Apple's turnaround. In any event, for us, this is a much better world than one where "other" platforms are unthinkable. It's much easier for us to be a specialized platform in an ecumenical world than off-color in a monochromatic one. Imagine what the "Red Herring" would think of a market where customers thought there was no need for more than one business magazine.