Issue 3-42, October 21, 1998

Be Engineering Insights: Optimizing Your Applications for Cache and Memory Access

By Dmitriy Budko

Today's high-performance CPUs (Pentium II, PowerPC 604) are fast, but without a good memory and cache system they are helpless. Just try using any Intel system with the cache disabled in the BIOS setting! The CPU will spend 95% of its time waiting for data from RAM or waiting until data is written to RAM.

Caches minimize this time by keeping the most recently used data in a small but fast memory close to the CPU. Normally, this is all an application programmer needs to know, because the details of cache/memory coherency, bus-mastering devices, and multiple CPUs are transparent to an application.

Nevertheless, if you care about the performance of your code, you have to know much more, or your code may end up running faster on an old Pentium 166 MHz system than on a new, fast Pentium II 300 MHz system with 512K cache, SDRAM, etc.

You don't believe it—or you think that this code must be black magic, written in assembler? Wrong. The example I'm presenting here is very simple and straightforward, but *is* memory intensive. Heavy memory traffic is also characteristic of BeOS multimedia applications.

The example program is a primitive implementation of the classic Sieve of Eratosthenes algorithm for finding prime numbers. The implementation can easily be sped up, but I preferred to focus on cache effects here.

The main loop is terse:

for(i=2; i<=sqrt_max; i++) /* don't search all numbers */
  {
    if(!list[i])   /* i is a prime number */
      {
        /* mark as non-prime all multiples of i */
        for(j=i+i; j<=nmax; j+=i)
          list[j] = 1;
      }
  }

As you see, the code is very straightforward but it runs faster on a Pentium 166 than on a Pentium II 300 MHz! Here's the full text of the test program:

#include <OS.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>

#define NMAX (8*1024*1024)

double get_sec(void)
{
  return (double)system_time()/1.0e6;
}

/* simple implementation */
void sieve1(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max =  (int)(sqrt(nmax)+0.5);

  for(i=2; i<=sqrt_max; i++)
  {
    if(!list[i])
    {
      for(j=i+i; j<=nmax; j+=i)
        list[j] = 1;
    }
  }
}

/* the same but without unnecessary memory writes */
void sieve2(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max =  (int)(sqrt(nmax)+0.5);

  for(i=2; i<=sqrt_max; i++)
  {
    if(!list[i])
    {
      for(j=i+i; j<=nmax; j+=i)
        if(!list[j])     /* it's not already marked */
          list[j] = 1;   /* mark it */
    }
  }
}

/* the same but use only 1 bit not 1 byte per number */
#define TEST_LIST_BIT(n) ( (list[n>>3] &  (1<<(n&7))) != 0)
#define SET_LIST_BIT(n)  ( (list[n>>3] |= (1<<(n&7)))  )

void sieve3(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max =  (int)(sqrt(nmax)+0.5);

  for(i=2; i<=sqrt_max; i++)
  {
    if(!TEST_LIST_BIT(i))
    {
      for(j=i+i; j<=nmax; j+=i)
        SET_LIST_BIT(j);
    }
  }
}

void sieve4(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max =  (int)(sqrt(nmax)+0.5);

  for(i=2; i<=sqrt_max; i++)
  {
    if(!TEST_LIST_BIT(i))
    {
      for(j=i+i; j<=nmax; j+=i)
        if(!TEST_LIST_BIT(j))
          SET_LIST_BIT(j);
    }
  }
}

double profile_sieve(void (*sf)(unsigned char*, int),
  const char* name, unsigned char* list, int nmax)
{
  double    t1, t2, dt;

  memset(list, 0, nmax+1);  /* clear entries 0 through nmax */
  t1 = get_sec();
  sf(list, nmax);
  t2 = get_sec();
  dt = t2 - t1;

  printf("%s: %4.3f sec\n", name, dt);
  return dt;
}

unsigned char primes_list[NMAX+1];

int main()
{
  profile_sieve(sieve1, "sieve1", primes_list, NMAX);
  profile_sieve(sieve2, "sieve2", primes_list, NMAX);
  profile_sieve(sieve3, "sieve3", primes_list, NMAX);
  profile_sieve(sieve4, "sieve4", primes_list, NMAX);

  return 0;
}

The results with gcc -O3 and Metrowerks cc -O3, in seconds (smaller numbers are better), are:

          PPC 604e 132    Pentium 166       PentiumII 300
sieve1       9.36            1.97              2.75
sieve2       6.97            3.83              1.76
sieve3       4.80            3.72              1.48
sieve4       3.53            2.80              1.03

On the sieve1 function, the Pentium 166 is much faster than the Pentium II 300. What's wrong with this version?

Pentium Pro and Pentium II processors have a "write allocate by read-for-ownership" cache, whereas the Pentium processor has a "no-write-allocate; write through on write miss" cache. On Pentium Pro and Pentium II processors, when a write occurs and the write misses the cache, the entire 32-byte cache line is fetched.

On the Pentium processor, when the same write miss occurs, the write is simply sent out to memory. Write allocate is generally advantageous, since sequential stores are merged into burst writes and the data remains in the cache for use by later loads. This is why P6-family processors and PowerPC 603 and 604 adopted this write strategy, and why some Pentium processor system designs implement it for the L2 cache, even though the Pentium processor uses write-through on a write miss.

However, write allocate can be a disadvantage in code like the sieve1 function, where a large number of writes occur, the stride is longer than the 32-byte cache line, and the array is large. Under those conditions, every store on a Pentium Pro or Pentium II processor causes an entire cache line to be fetched. In addition, this fetch will probably replace one (sometimes two) dirty cache lines.

What are the possible optimizations?

Sieve2 checks the value prior to writing, and thus reduces the number of writes of dirty cache lines to memory.

Sieve3 packs the byte array into a bit array (one bit per number instead of one byte), reducing the size of the array by a factor of eight, which reduces memory and cache traffic.

Sieve4 uses both techniques.

So, when you write performance-critical code think about cache behavior. This article illustrates only one specific speed trap—there are many more. For a better understanding of how interactions between hardware and software affect performance, I recommend the excellent textbook

Computer Architecture: A Quantitative Approach,
by David Patterson and John Hennessy, ISBN 1-55860-329-8.

Specific recommendations for Intel CPUs are in the

Intel Architecture Optimizations Manual,
http://developer.intel.com/design/pentiumii/manuals/242816.htm

which was used as a reference for this article.


Be Engineering Insights: Scrolling to Oblivion

By Steven Black

Hello again. After my initial idea for a Newsletter article fell through (alas, it even had functional example code), I had to settle for something quick and straightforward, as QA, along with the rest of engineering, is busy preparing for R4.

So, we'll just talk about scrolling. The sample code for this article is here:

ftp://ftp.be.com/pub/samples/interface_kit/scrollbarapp.zip

In the Be API, there are BScrollBars and BScrollViews. Most applications use BScrollViews. A few applications that want to override specific behavior don't use BScrollViews.

Applications like the Tracker (with the horizontal scroll bar starting after the status area) and NetPositive (with scroll bars that disappear and reappear) do not use BScrollViews—they handle the scroll bars themselves.

Under normal circumstances you use BScrollViews and the scroll bars are handled more or less automatically. And that's the situation this article deals with.

Let's start out with a relatively common example for text. Suppose you have a BTextView, and you decide it should have one or more scroll bars. With the Be API it's completely painless to add them.

BTextView *text;
. . .
BRect fr = frame; // derived from the window frame
fr.right -= B_V_SCROLL_BAR_WIDTH + 1;
fr.bottom -= B_H_SCROLL_BAR_HEIGHT + 1;
. . .

text = new BTextView(fr, "textView", textrect,
  B_FOLLOW_ALL_SIDES, B_WILL_DRAW | B_NAVIGABLE);

AddChild(new BScrollView("TextScroll", text,
  B_FOLLOW_ALL_SIDES, 0, true, true));

The frame must include space around it for the scroll bars. If your frame doesn't take into account the size of the scroll bars, the scroll bars may not be visible, which would make them useless.

You also need to specify your initial text rectangle. (Note that the text rectangle is specific to the text view and differs from the view's frame rectangle.) While the bottom of this rectangle will grow to fit, the left and right coordinates determine things such as where line wrap occurs and where text aligns for right and centered alignment. (If you don't use line wraps, the text disappears when it reaches the bounds of the text rectangle.)
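As an example, here's one way the textrect passed to the BTextView constructor above might be derived from the view frame (just a sketch; the inset values are arbitrary, so adjust to taste):

BRect textrect = fr;           // same size as the view...
textrect.OffsetTo(B_ORIGIN);   // ...but expressed in the view's own coordinates
textrect.InsetBy(3.0, 3.0);    // leave a small margin so text doesn't touch the edges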

When you create the BScrollView, you specify the BView that will be surrounded and controlled by it, followed by your resizing mode and flags as usual, and more importantly, which scroll bars you want to have (both horizontal and vertical).

BTextViews handle the scroll bars automatically. The scroll bars are automatically resized as the content of the BTextView grows and shrinks. If you can use a fixed text frame, you may only need to call SetWordWrap(true), and only if you want word wrapping.

However, many applications use text views for other purposes than simple direct user input. Medium-sized to large scrollable BTextViews are quite common. Almost all of them have some form of word wrap. This is all handled the same way if you're wrapping at a constant location. However, if your view is being resized, you'll probably want to adjust your text rectangle to match.

There are two different, nonexclusive kinds of word wrap: wrapping at a fixed point, and wrapping at the edge of the view. The difference is that wrapping at a fixed point emulates physical (i.e., hardware) limitations, while wrapping at the edge of the view means you no longer need the horizontal scroll bar.

The key to wrapping at the edge of the view is to catch FrameResized() messages and properly adjust your text frame when they occur.
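A minimal sketch of that hook might look like this (TWrapTextView is a hypothetical BTextView subclass; depending on your layout you may also want to subtract an inset from the new width):

void TWrapTextView::FrameResized(float width, float height)
{
  BTextView::FrameResized(width, height);

  // Move the wrapping edge to the new right side of the view.
  BRect textRect = TextRect();
  textRect.right = textRect.left + width;
  SetTextRect(textRect);
}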

If you don't want to wrap at all, you need to make sure your text region is large enough for all of your text. Many programs just allocate a presumably larger-than-needed text rectangle. (Examples of this approach include at least one popular programmer's editor on BeWare.) Most people can get away with just counting the lines and finding the longest line available—though this gets slow with large files.

However, I digress—let's get back to scrolling.

BTextViews pretty much take care of themselves, after they're all configured. All the scroll bar handling is done by the BTextView. Views containing images are much more interesting, as you need to create that view and handle the scroll bar on your own.

Images are typically contained in BBitmaps in memory. Views support a variety of pleasant ways to draw BBitmaps. Another pleasantry: though image storage formats vary dramatically, with the Translation Kit it's easy to take a file and get a BBitmap from it (providing the appropriate Translator exists for that format, of course).

BBitmap *image;
image = BTranslationUtils::GetBitmapFile(name);

That's really all you need to get an image from a filename in BeOS. Just remember to link against the Translation Kit, since it isn't linked in by default; it's all in libtranslation.so.
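One more caution: GetBitmapFile() returns NULL if the file can't be read or no installed Translator understands its format, so check the result before using it. Roughly:

BBitmap *image = BTranslationUtils::GetBitmapFile(name);
if (image == NULL) {
  // no Translator could make sense of the file
  fprintf(stderr, "couldn't load an image from '%s'\n", name);
  return;
}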

Additionally, some scaling is often required. I identify two different types of scaling. There's scaling to fit the window, done by some arbitrary percentage, and there's scaling to fit the screen while keeping the aspect ratio, which is really an adaptation of scaling by a fixed percentage.

The first thing to be aware of with images and scroll views is that not much is handled automatically. You're going to need to make some sort of view for the BBitmap to be in, and it will at least need a Draw method.

We've already mentioned the basic BScrollView call, so I won't rehash that. Instead I'll concentrate on things you need to do for non-BTextViews.

Images have a fixed size. If you're not scaling the image, the part of the view not covered by the image will show either the view's background color or, if you've set the background to a transparent color (either B_TRANSPARENT_32_BIT or B_TRANSPARENT_8_BIT), garbage. You can remedy this problem by properly setting (and resetting) the size of the BWindow and view involved.
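As a rough sketch of that remedy (TImageWindow and SizeToImage() are hypothetical names, and a real version would also need to account for window borders and placement), you might size the window to the unscaled image plus the scroll bars, capped at the screen size:

void TImageWindow::SizeToImage(BBitmap *image)
{
  BRect b = image->Bounds();
  float width  = b.Width()  + B_V_SCROLL_BAR_WIDTH  + 1;
  float height = b.Height() + B_H_SCROLL_BAR_HEIGHT + 1;

  // don't grow beyond the screen
  BScreen screen(this);
  if (width  > screen.Frame().Width())  width  = screen.Frame().Width();
  if (height > screen.Frame().Height()) height = screen.Frame().Height();

  ResizeTo(width, height);
}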

The view containing the image needs to adjust the scroll bars as needed. This includes the proportion, the range, and the steps. Instead of trying to explain what I think are rather straightforward calls to adjust the scroll bar, here's the section of the code that I used.

void TImageView::FixupScrollbars(void)
{
  BRect bounds;
  BScrollBar *sb;

  bounds = Bounds();
  float myPixelWidth = bounds.Width();
  float myPixelHeight = bounds.Height();
  float maxWidth = 1, maxHeight = 1;

  if(image!=NULL){
    // get max size of image
    GetMaxSize(&maxWidth, &maxHeight);
  } else fprintf(stderr, "Image is null\n");

  float propW = myPixelWidth/maxWidth;
  float propH = myPixelHeight/maxHeight;

  float rangeW = maxWidth - myPixelWidth;
  float rangeH = maxHeight - myPixelHeight;

  if(rangeW < 0)
    rangeW = 0;

  if(rangeH < 0)
    rangeH = 0;

  if ((sb=ScrollBar(B_HORIZONTAL))!=NULL) {
    sb->SetProportion(propW);
    sb->SetRange(0, rangeW);

    // Steps are 1/8 visible window for small steps
    // and 1/2 visible window for large steps
    sb->SetSteps(myPixelWidth / 8.0, myPixelWidth / 2.0);
  }

  if ((sb=ScrollBar(B_VERTICAL))!=NULL) {
    sb->SetProportion(propH);
    sb->SetRange(0, rangeH);

    // Steps are 1/8 visible window for small steps
    // and 1/2 visible window for large steps
    sb->SetSteps(myPixelHeight / 8.0, myPixelHeight / 2.0);
  }
}

For the view's FrameResized(), as well as any other place where the relationship between view and image may change, a simple call to FixupScrollbars() adjusts the scroll bars.
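For instance, a minimal override (using the same TImageView class as above) could be:

void TImageView::FrameResized(float width, float height)
{
  BView::FrameResized(width, height);
  FixupScrollbars();  // proportions, ranges, and steps all depend on Bounds()
}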

You might be wondering what's needed to scale the image. Would you believe: nothing! The DrawBitmap() methods allow you to specify source and destination rectangles, and if they're different in size, the source rectangle will be scaled appropriately.
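Here's a hedged sketch of a Draw() method that scales the whole image to the view's current bounds, again using the TImageView names from above:

void TImageView::Draw(BRect updateRect)
{
  if (image == NULL)
    return;

  // Source is the whole bitmap, destination is the whole view;
  // DrawBitmap() scales because the two rectangles differ in size.
  DrawBitmap(image, image->Bounds(), Bounds());
}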

The sample code is readily available. It can handle images and text files. For images it shows them maximized to perspective (scaled to fit while keeping the aspect ratio) and zoomed. For text files, it starts out not wrapping them.

Since too few image programs actually allow you to view files maximized to perspective, the sample code should be mildly useful. (If you view it as an excuse to learn some Tracker scripting, too, it could be highly useful for those who wish to browse image files.)

I lifted some of the image-related stuff from George Hoffman's QuickPaint sample code, so credit for that goes to him.


Developers Workshop: Untangling Threads

By Michael Morrissey

If you've never given much thought to what multithreading is and how it works, chances are that your code may have routines which could greatly benefit if modified to take advantage of threading. Often applications have small, helper functions which contribute significantly to the running time of the application, and threading these functions frequently can lead to enormous speed improvements on multiprocessor machines.

In this article, we'll be looking at one such helper function, the world-famous Quicksort routine. We'll investigate what makes a routine a good candidate for threading, give some general design tips, and offer questions to keep in mind when threading your code.

Before you look at the sample code, you might want to skim through the Be Book's section on Threads, and the one on Semaphores. You can get the sample code from here:

ftp://ftp.be.com/pub/samples/intro/QSort.zip

The sample code version of Quicksort is not meant to be put into a library, but rather to be used as a tool for experimentation. Consequently, there are three command-line arguments: the first specifies the size of the array to be created and filled with random integers, the next argument specifies the maximum number of threads to spawn for sorting, and the third argument specifies a threshold variable, which indicates the size of the smallest array which is to spawn a thread. The program also creates a BStopWatch, which will give you a rough idea of how the arguments affect the runtime of the sort.

First, about the algorithm: Quicksort is the canonical divide-and-conquer algorithm. The general idea is this: given an array of numbers, choose a "pivot value"; ideally, the pivot is the median value of the array. Then reorder the elements of the array so that elements less than the pivot appear on the low end of the array, and elements greater than the pivot appear on the high end. You have now partitioned the array into two sub-arrays: one with values less than the pivot, one with values greater than the pivot. Now recursively apply this algorithm to each of the two sub-arrays, partitioning each of the sub-arrays into two more sub-arrays. In this way, you order the entire array by working on successively smaller and smaller arrays.

Quicksort is good for large arrays, but inefficient on smaller ones. For this reason, one common optimization is to order small sub-arrays using a straight insertion sort, rather than partitioning down to single elements. This results in a great speedup, and the sample code uses it in the Partition() function for sub-arrays with fewer than 20 elements. (The value of 20 is somewhat arbitrary; try changing it to see if there's any effect.)
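As a sketch of that cutoff (this isn't the sample code's actual Partition(), just the general shape of the optimization), the recursion bottoms out in a straight insertion sort:

/* finish off a small sub-array v[left..right] with insertion sort */
static void insertion_sort(int32 *v, int32 left, int32 right)
{
  for (int32 i = left + 1; i <= right; i++) {
    int32 key = v[i];
    int32 j = i - 1;
    while (j >= left && v[j] > key) {
      v[j + 1] = v[j];
      j--;
    }
    v[j + 1] = key;
  }
}

Inside the partitioning code, the call would look something like if (right - left < 20) { insertion_sort(v, left, right); return; }.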

At this point, you might be saying to yourself, "We have the regular, unthreaded version of Quicksort. Great! Now how do we thread it?" If you're wondering about that, back up! The question that you should be wondering about is: "Is threading this algorithm going to speed up the running time of this algorithm?" Threading an algorithm doesn't always improve running time, and it can even hurt running time. There's nothing worse than rewriting a piece of code to make it run faster, only to discover that you've slowed it down. (Trust me.) Unfortunately, there's no simple way to decide whether or not threading is a good idea. You can, however, look for some tell-tale signs, use your best judgement, and experiment.

One of the most important questions to ask is, "Will the data the threads are working on overlap or be mutually exclusive?" If the data is independent, the need for synchronization between the threads is greatly reduced, which saves on both running time and on your effort in implementing the algorithm. If the data overlaps, you'll have to carefully consider how to synchronize your threads so that they don't clobber each other's work, and also so that they aren't spending all of their time competing for control of a variable rather than doing "real" work.

The next question to ask is, "Does the amount of work each thread will perform significantly outweigh the overhead of managing the thread?" As light-weight as they are, threads still have overhead, in the form of kernel calls, memory allocation, and context-switching. If your thread isn't going to spend a fair amount of time computing, the overhead might not be worth it.

I must admit that Quicksort for integers fails this test. Comparing integers is very inexpensive, and does not far outweigh the cost of spawning a new thread. It's easy to imagine, however, sorting an array of a user-defined type which has an expensive comparison operation, such as sorting an array of complex numbers. For simplicity's sake, rather than defining my own type and templatizing the QSort class, I compare the logarithms of an array of integers (rather than the integers themselves). The resulting ordering is the same as if I'd ordered the integers directly, but the logarithm operations are expensive enough to make threading worthwhile.

Your next question might be, "How will I know when my algorithm has finished?" That sounds like a funny question if you're only used to working with single-threaded algorithms, but it's quite important in the multithreaded world. Threads are not guaranteed to execute in a particular order (unless you synchronize them to do so explicitly), so you cannot predicate your termination condition on the order of execution. In Quicksort, when an element is put into its correct place in the overall array, a counter which refers to "elements yet to be sorted" is decremented. When the counter reaches zero, a semaphore is released, indicating to the managing thread that the sorting is finished and the child threads are finished.
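Sketched in isolation (the names here are hypothetical, not necessarily what the sample code uses), that bookkeeping can be quite small:

static vint32 remaining;   // elements not yet in their final position
static sem_id done_sem;    // the managing thread blocks on this with acquire_sem()

static void elements_placed(int32 how_many)
{
  // atomic_add() returns the counter's previous value, so the thread
  // that places the last elements is the one that sees it reach zero.
  if (atomic_add(&remaining, -how_many) == how_many)
    release_sem(done_sem);
}

The managing thread simply waits in acquire_sem(done_sem) until the last worker releases it.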

Another crucial question is, "How many threads should I start?" Surprise -- there's no easy answer to this question either. In the case of Quicksort, it doesn't do much good to have more threads running than processors. That's because a running Partition thread can run straight through, as fast as it can, since it doesn't compete with other threads for shared variables, which could cause contention. Some algorithms may benefit from having more threads than processors, since while one thread is idle, waiting to acquire a shared resource, another thread can be running.

Not having more threads than processors (for Quicksort) is only half of the situation. Ideally, you'd like exactly as many threads as processors running at any point in time, so as to keep all of the processors working. Just spawning threads and giving them an initial amount of work to do is not enough—one thread might finish sorting a sub-array faster than the others. It would be good if this thread could go back and help out the other threads which are still running. So in the QSort constructor, we create a semaphore and give it a thread count of THREADS (from argv):

QSort::QSort(int32 *vector, int32 count)
{
  system_info info;
  get_system_info(&info);
  cpu_count = info.cpu_count;

  // this is our "pool" of worker threads...
  workthread = create_sem(THREADS, "workthread");

  // for real code, you'd have a thread count of cpu_count...
  // workthread = create_sem(cpu_count, "workthread");

  ...
}

The workthread semaphore controls how many threads can be started. Before a worker thread is spawned, it tries to acquire the workthread semaphore. If it is immediately available, a worker thread is spawned and resumed:

if(big &&
   acquire_sem_etc(thisclass->workthread, 1, B_TIMEOUT, 0)
     == B_NO_ERROR)
{
  // there's a worker thread available...

  tid = spawn_thread(Partition, "Partition", B_NORMAL_PRIORITY,
          new WorkUnit(start, j, thisclass, true));

  // if spawn_thread failed, we can still continue...
  if(tid < 0)
  {
    Partition(new WorkUnit(start, j, thisclass, false));
  }
  else
  {
    resume_thread(tid);
  }
}
else // no available worker threads in the pool...
{
  Partition(new WorkUnit(start, j, thisclass, false));
}

If there are no available threads in the pool, that is, if acquire_sem(workthread) would block, then we simply call Partition() directly without starting a new thread and without waiting for one to become available. Naturally, when a thread has finished processing, it must release the workthread semaphore, effectively returning the thread to the pool...

// if we were spawned, then add a "thread" back into the pool...
if(unit->spawned)
{
  release_sem(thisclass->workthread);
}

Typically, rather than creating a thread every time it is taken from the pool, and killing the thread every time it is returned to the pool, you would simply resume and suspend the same threads, and feed the awakened thread fresh data. For this sample code, I chose not to implement that strategy for reasons of clarity. You should take a few minutes and think about how you'd restructure this program to use the "traditional pool" scheme, how you'd get fresh data into the threads, and how you might modify the mechanism which detects when the sort is finished.

One final thought: keep in mind that you want your routine to scale as nicely as possible from a single-processor machine to a multi-processor machine. Compare the single thread version of your routine on a single-processor system with the multithreaded version of your routine on a single-processor system. Is the multithreaded version much slower? If it is, can you rework your threading model to reduce that discrepancy? If not, you may want to consider special-casing your routine on single-processor systems.

I strongly encourage you to play around with the arguments to this program, and see how the size of the array, the number of threads in the pool, and the threshold variable interact and affect running time of this program. Then try looking at your own code for functions which might benefit from threading. If you think you have a good candidate, experiment!


PC Shift

By Jean-Louis Gassée

This is Agenda time. Agenda is a yearly industry conference that always takes place in Phoenix, Arizona, at a resort, The Phoenician, once famous for its prominence in the real estate debacle of an earlier decade. So, in taxpayer-subsidized splendor we spin and schmooze—and hear preachers warning us about the danger of government intervention in our affairs.

Still on symbols, last year's Agenda opening day was made memorable by the "surprise" DOJ suit against Microsoft. During the morning's coffee break, video was piped on the screen as attendees walked in and out of the room. At first, some of us thought this was one of the spoofs we're regularly treated to, but no, this was the Bill and Janet (now Joel) reality show.

A year later, on the very same opening day, the trial opens, and, for the first time since the beginning of Agenda, Bill Gates isn't at the conference. This was deemed such a momentous change by the Agenda organizers that they e-mailed attendees in advance, essentially offering their money back if Bill's absence killed their interest in the conference.

On the one hand, this is laudable truth-in-conferencing on the part of the organizers. On the other hand, there is the implication that so much value lies in the submissive hope of being able to touch a fringe of the emperor's tunic. One is happy to report that the assumption of such a supine position wasn't supported by attendance numbers: there was the usual turnout of industry grandees and smaller entrepreneurial fry, as well as many last-minute pleas to get admitted to this "by invitation only" conference.

Another observable change is what some call the shift from MIPS to Mbps, the feeling that what we need most isn't faster processors but faster connections to the great network in our future. Indeed, one can feel the frustration at PCs with more computing power and storage than an early Cray supercomputer, connected to the Internet via a 56Kbps modem.

Dave Nagel, the head of AT&T Labs, commenting on the impact of broadband home connections, observed the experience was so good customers couldn't bear the thought of going back to the old POTS days and turned into broadband evangelists. "If it's offered where you live, get it, if not, demand it and, if you don't get it, move to a place that has it." In other words, now that we have the $500 PC running Office and Explorer, we need a $50 per month high-speed connection. Call it a "T1 in every pot," at least figuratively speaking.

This, in turn, gives rise to more connected thoughts, home networking and Internet multimedia appliances. AT&T, Philips and Cisco, from their vantage points, all stressed the importance of such technologies and products in creating new user experiences, not just smaller/cheaper versions of what we have today, and in reaching new users, not just relieving current users from their frustrations—and from more of their money.

But, of course, there is the other side of the issue, the side that says we need more MIPS. Or, perhaps better stated, both sides are right, we need more Mbps and more MIPS. If all I want to do is run Microsoft Office and Explorer with a POTS connection, the $500 PC offers an unbeatable price-performance combination. If, on the other hand, I want to capture, edit, produce and polish video, sound, music, 3D animations, or play really good games, the 450MHz "supercomputer" doesn't look so overpowered anymore.

That's the point Andy Grove made in his opening Monday morning talk and demos. But first, we were treated to a live videoconferenced ion-beam healing session. The patient? A processor chip in Santa Clara. Some attendees thought it a little too geeky; personally, I see it as providing a nice balance to Steve (Forbes), who shared his well-pressed, high-level vision of our industry's role in the future of our country.

Staying on the "Why We Need More MIPS" message, Intel's chairman then treated us to demonstrations of Pentium-based systems: Dell servers running Linux, newly converted Silicon Graphics systems running Windows NT, and a sample of Intel's upcoming processors running BeOS audio, video, and 3D applications, drawing nice applause and looking quite smooth when compared to the SGI system. A nice milestone in fulfilling the goal set out by one of our shareholders who, before investing, teasingly called Be "The Poor Man's Silicon Graphics," promptly rephrased as "SGI for the Rest of Us" by others who wanted more than the poor man's goal.

Some found strange symbolism in the demonstration of several OS platforms just as the DOJ-MS anti-trust suit opened on the other side of the country. Personally, I hope this kind of demonstration will soon become unremarkable.
