The SPARC port

The SPARC port targets various machines from the Sun product lineup. The initial effort is on the Ultra 60 and Ultra 5, with plans to later add the Sun T5120 and its newer CPU. This may change depending on hardware donations and developer interest.

Support for 32-bit versions of SPARC is currently not planned.

SPARC ABI

The SPARC architecture has 32 integer registers, divided as follows:

  • global registers (g0-g7)

  • input (i0-i7)

  • local (l0-l7)

  • output (o0-o7)

Parameters and return values are passed in the output registers, which are generally considered scratch registers and can be clobbered by the callee. The caller must take care of preserving them.

The input and local registers are callee-saved, but we have hardware assistance in the form of a register window. There is an instruction to shift the registers so that:

  • o registers become i registers

  • local and output registers are replaced with fresh sets, for use by the current function

  • global registers are not affected

Note that as a side effect, o6 is moved to i6. This is convenient because these are the stack and frame pointers, respectively, so shifting the window sets the frame pointer for free. Likewise, o7, which holds the return address of a call, becomes i7.

Simple enough functions may end up using just the o registers. In that case no window shift is necessary.

When shifting the register window, the extra registers come from the register stack in the CPU. This stack is not infinite, however: most implementations of SPARC only have 8 windows available. When the internal stack is full, an overflow trap is raised, and the handler must free up old windows by storing them on the stack. Likewise, when the internal stack is empty, an underflow trap handler must fill it back from the data saved on the stack.
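As a rough illustration of the overflow and underflow behaviour, here is a toy model in C. It is only a sketch: the names (window_file, window_save, window_restore) are made up for this example, and it ignores how the real CPU tracks window state. It just counts how many windows are held in registers and how many have been spilled to memory.

```c
#include <stdbool.h>

#define NWINDOWS 8  /* typical number of hardware windows */

/* Hypothetical bookkeeping for the toy model. */
struct window_file {
    int resident;   /* windows currently held in CPU registers */
    int spilled;    /* windows saved to the memory stack */
};

/* Function entry: returns true if an overflow trap was needed. */
bool window_save(struct window_file *wf)
{
    bool trapped = false;
    if (wf->resident == NWINDOWS) {
        /* Register stack full: spill the oldest window to memory. */
        wf->resident--;
        wf->spilled++;
        trapped = true;
    }
    wf->resident++;
    return trapped;
}

/* Function return: returns true if an underflow trap was needed. */
bool window_restore(struct window_file *wf)
{
    bool trapped = false;
    wf->resident--;
    if (wf->resident == 0 && wf->spilled > 0) {
        /* Caller's window is not in registers: fill it from memory. */
        wf->spilled--;
        wf->resident++;
        trapped = true;
    }
    return trapped;
}
```

Starting with one resident window, seven nested calls fit in the register file, the eighth call spills a window, and the matching eighth return fills it back.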

Misaligned memory access

The SPARC CPU is not designed to gracefully handle misaligned accesses. You can access a single byte at any address, but 16-bit accesses are only allowed at even addresses, 32-bit accesses at addresses that are multiples of 4, and so on.

On x86, by comparison, such accesses are not a problem: they are allowed and handled directly by the instructions doing the access, so there is no trap and little or no performance cost.

On SPARC, however, such accesses will cause a SIGBUS. This means a trap handler has to catch the misaligned access and do it in software, byte by byte, then give back control to the application. This is, of course, very slow, so we should avoid it when possible.

Fortunately, gcc knows about this, and will normally do the right thing:

  • For usual variables and structures, it will make sure to lay them out so that they are aligned. It relies on stack alignment, as well as malloc returning sufficiently aligned memory (as required by the C standard).

  • On packed structures, gcc knows the data is misaligned, and will automatically use the appropriate way to access it (most likely, byte by byte).

This leaves us with two undesirable cases:

  • Pointer arithmetic and casting. When computing addresses manually, it’s possible to generate a misaligned address and cast it to a type with a stricter alignment requirement. In this case, gcc may access the pointer using a multi-byte instruction and cause a SIGBUS. Solution: make sure the struct is aligned, or declare it as packed so that unaligned accesses are used instead.

  • Access to hardware: it is a common pattern to declare a struct as packed, and map it to hardware registers. If the alignment isn’t known, gcc will use byte-by-byte access. It seems that volatile causes gcc to use the proper full-width access to the struct, on the assumption that a volatile value is necessarily aligned as it should be.
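For the pointer arithmetic case, a common portable pattern is to check the alignment explicitly, or to go through memcpy so the compiler picks an access that is safe for any address. The sketch below shows both. The function names are ours, for illustration only, and are not taken from the Haiku sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if ptr is a multiple of alignment, which must be a power of two. */
bool is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/*
 * Read a 32-bit value from any address, aligned or not.
 * memcpy makes no alignment assumption, so the compiler cannot
 * emit a plain 32-bit load unless it knows that is safe.
 */
uint32_t load_u32_unaligned(const void *ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
```

Code that parses a byte buffer would call load_u32_unaligned(buffer + offset) instead of dereferencing a cast pointer such as *(uint32_t *)(buffer + offset).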

In the end, we just need to be careful about pointer math resulting in unaligned access. -Wcast-align helps with that, but it also raises a lot of false positives (where the alignment is preserved even when casting to other types), so we enable it only as a warning for now. We will need to check the SIGBUS handler to identify places where we do a lot of misaligned accesses that trigger it, and rework the code as needed. But in general, except for these cases, we’re fine.

The Ultrasparc MMUs

First, a word of warning: the MMU was different in SPARCv8 (32-bit) implementations, and it was changed again on newer CPUs.

The Ultrasparc-II we are supporting for now is documented in the Ultrasparc user manual. There were some minor changes in the Ultrasparc-III to accommodate larger physical addresses. This was then standardized as JPS1, and Fujitsu also implemented it.

Later on, the design was changed again. For example, the Ultrasparc T2 (UA2005 architecture) uses a different data structure format to enlarge, once more, the physical and virtual address tags.

For now the implementation is focused on Ultrasparc-II because that’s what I have at hand. Later on we will need support for the more recent systems.

Ultrasparc-II MMU

There are actually two separate units for the instruction and data address spaces, known as I-MMU and D-MMU. They each implement a TLB (translation lookaside buffer) for the recently accessed pages.

This is pretty much all there is to the MMU hardware. No hardware page table walk is provided. However, there is some support for implementing a TSB (Translation Storage Buffer) in the form of providing a way to compute an address into that buffer where the data for a missing page could be.

It is up to software to manage the TSB (globally or per-process) and in general keep track of the mappings. This means we are relatively free to manage things however we want, as long as eventually we can feed the iTLB and dTLB with the relevant data from the MMU trap handler.

To make sure we can handle the fault without recursing, we need to pin a few items in place:

In the TLB:

  • TLB miss handler code

  • TSB and any linked data that the TLB miss handler may need

  • asynchronous trap handlers and data

In the TSB:

  • TSB-miss handling code

  • Interrupt handlers code and data

So, from a given virtual address (assuming we are using only 8K pages and a 512 entry TSB to keep things simple):

  • VA63-44 are unused and must be a sign extension of bit 43

  • VA43-22 are the ‘tag’ used to match a TSB entry with a virtual address

  • VA21-13 are the offset in the TSB at which to find a candidate entry

  • VA12-0 are the offset in the 8K page, and used to form PA12-0 for the access
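The split of a virtual address described above can be written down as a few shifts and masks. This is a minimal sketch assuming 8K pages and a 512-entry TSB as in the text. The macro and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT      13  /* 8K pages: VA12-0 is the page offset */
#define TSB_INDEX_BITS  9   /* 512 entries: VA21-13 is the TSB index */
#define TAG_SHIFT       (PAGE_SHIFT + TSB_INDEX_BITS)   /* 22 */

/* VA12-0: offset inside the 8K page. */
uint64_t va_page_offset(uint64_t va)
{
    return va & ((1ULL << PAGE_SHIFT) - 1);
}

/* VA21-13: index of the candidate entry in the TSB. */
uint64_t va_tsb_index(uint64_t va)
{
    return (va >> PAGE_SHIFT) & ((1ULL << TSB_INDEX_BITS) - 1);
}

/* VA63-22: value compared against the tag of the TSB entry. */
uint64_t va_tag(uint64_t va)
{
    return va >> TAG_SHIFT;
}

/* VA63-44 must be a sign extension of bit 43. */
bool va_is_canonical(uint64_t va)
{
    uint64_t top = va >> 43;    /* bits 63-43, 21 bits in total */
    return top == 0 || top == 0x1FFFFF;
}
```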

Inside the TLBs, VA63-13 is stored, so there can be multiple entries matching the same tag active at the same time, even when there is only one in the TSB. The entries are rotated using a simple LRU scheme, unless they are locked. Be careful not to fill a TLB with only locked entries! Also, never insert a new mapping for a given VA without first removing any previous one. There is no need to worry about this when handling a TLB miss, however, since in that case we know that there was no previous entry.

Entries also have a “context”. This could, for example, be mapped to the process ID, making it easy to clear all entries related to a specific context.

TSB entries format

Each entry is composed of two 64-bit values: “Tag” and “Data”. The data uses the same format as the TLB entries; however, the tag is different.

They are as follows:

Tag

  • Bit 63: ‘G’ indicating a global entry, the context should be ignored.

  • Bits 60-48: context ID (13 bits)

  • Bits 41-0: VA63-22 as the ‘tag’ to identify this entry
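Building a tag word from these fields could look like the following sketch. The function name and its parameters are hypothetical. Only the bit positions come from the layout above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TSB_TAG_GLOBAL      (1ULL << 63)            /* bit 63: 'G' */
#define TSB_TAG_CTX_SHIFT   48                      /* bits 60-48 */
#define TSB_TAG_CTX_MASK    0x1FFFULL               /* 13-bit context ID */
#define TSB_TAG_VA_MASK     ((1ULL << 42) - 1)      /* bits 41-0 */

/* Pack the global flag, context ID and VA63-22 into a TSB tag word. */
uint64_t tsb_make_tag(bool global, uint16_t context, uint64_t va)
{
    uint64_t tag = (va >> 22) & TSB_TAG_VA_MASK;
    tag |= ((uint64_t)context & TSB_TAG_CTX_MASK) << TSB_TAG_CTX_SHIFT;
    if (global)
        tag |= TSB_TAG_GLOBAL;
    return tag;
}
```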

Data

  • Bit 63: ‘V’ indicating a valid entry, if it’s 0 the entry is unused.

  • Bits 62-61: size: 8K, 64K, 512K, 4MB

  • Bit 60: NFO, indicating No Fault Only

  • Bit 59: Invert Endianness of accesses to this page

  • Bits 58-50: reserved for use by software

  • Bits 49-41: reserved for diagnostics

  • Bits 40-13: Physical Address<40-13>

  • Bits 12-7: reserved for use by software

  • Bit 6: Lock in TLB

  • Bit 5: Cachable physical

  • Bit 4: Cachable virtual

  • Bit 3: Access has side effects (HW is mapped here, or DMA shared RAM)

  • Bit 2: Privileged

  • Bit 1: Writable

  • Bit 0: Global
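The data word can be expressed as a set of bit masks plus a small helper. This is a sketch only: the TTE_* names and tte_make_data are made up here, and the size encoding assumes the four sizes are numbered 0 to 3 in the order listed above.

```c
#include <stdint.h>

#define TTE_VALID       (1ULL << 63)
#define TTE_SIZE_SHIFT  61
#define TTE_NFO         (1ULL << 60)    /* No Fault Only */
#define TTE_IE          (1ULL << 59)    /* invert endianness */
#define TTE_PA_MASK     (((1ULL << 41) - 1) & ~((1ULL << 13) - 1))
#define TTE_LOCK        (1ULL << 6)     /* lock in TLB */
#define TTE_CP          (1ULL << 5)     /* cachable physical */
#define TTE_CV          (1ULL << 4)     /* cachable virtual */
#define TTE_E           (1ULL << 3)     /* access has side effects */
#define TTE_P           (1ULL << 2)     /* privileged */
#define TTE_W           (1ULL << 1)     /* writable */
#define TTE_G           (1ULL << 0)     /* global */

/* Assumed encoding: sizes numbered in the order given in the text. */
enum tte_page_size {
    TTE_SIZE_8K = 0,
    TTE_SIZE_64K = 1,
    TTE_SIZE_512K = 2,
    TTE_SIZE_4M = 3
};

/* Build a valid data word for the given physical address and flags. */
uint64_t tte_make_data(uint64_t pa, enum tte_page_size size, uint64_t flags)
{
    return TTE_VALID | ((uint64_t)size << TTE_SIZE_SHIFT)
        | (pa & TTE_PA_MASK) | flags;
}
```

For example, a cached, privileged, writable 8K kernel page would use tte_make_data(pa, TTE_SIZE_8K, TTE_CP | TTE_CV | TTE_P | TTE_W).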

TLB internal tag

  • Bits 63-13: VA<63-13>

  • Bits 12-0: context ID

Conveniently, a 512-entry TSB fits exactly in an 8K page (512 entries of 16 bytes each), so it can be locked in the TLB with a single entry there. However, it may be wiser to instead map 64K (or more) of RAM, locked as a single entry, for everything that needs to be accessed by the TLB miss trap handler, so that we minimize the use of TLB entries.
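To make the size arithmetic and the lookup concrete, here is a sketch of a software TSB lookup such as a miss handler might perform. The struct and function names are assumptions for this example. The real handler would be written in assembly and use the pointer computed by the hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TSB_ENTRIES     512
#define PAGE_SIZE_8K    8192

struct tsb_entry {
    uint64_t tag;   /* G, context ID, VA63-22 */
    uint64_t data;  /* same format as a TLB data entry */
};

/* 512 entries of 16 bytes fill exactly one 8K page. */
_Static_assert(sizeof(struct tsb_entry) * TSB_ENTRIES == PAGE_SIZE_8K,
    "TSB must fit exactly in one 8K page");

/* Return the matching valid entry for va in this context, or NULL. */
const struct tsb_entry *tsb_lookup(const struct tsb_entry *tsb,
    uint64_t va, uint16_t context)
{
    const struct tsb_entry *entry = &tsb[(va >> 13) & (TSB_ENTRIES - 1)];
    bool valid = (entry->data >> 63) & 1;
    bool global = (entry->tag >> 63) & 1;
    uint64_t tagVA = entry->tag & ((1ULL << 42) - 1);
    uint16_t tagContext = (entry->tag >> 48) & 0x1FFF;

    if (!valid || tagVA != (va >> 22))
        return NULL;
    if (!global && tagContext != context)
        return NULL;
    return entry;
}
```

A NULL result corresponds to a TSB miss, where the handler has to fall back to whatever mapping structures the kernel maintains.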

Likewise, it may be useful to use 64K pages instead of 8K whenever possible. The hardware provides some support for mixing the two sizes but it makes things a bit more complex. Let’s start out with simpler things.

Software floating-point support

The SPARC instruction set specifies instructions for handling long double values; however, no hardware implementation actually provides them. They generate a trap, which is expected to be handled by the softfloat library.

Since traps are slow, and gcc knows better, it will never generate those instructions. Instead it directly calls into the C library, to functions specified in the ABI and used to do long double math using softfloats.
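As a small illustration, consider the function below. On sparc64 we would expect gcc to compile the addition into a call to an ABI helper (named _Qp_add in the 64-bit SPARC ABI, to the best of our knowledge) rather than emitting the quad-precision add instruction. The function name quad_add is ours, and the exact helper used should be checked in the disassembly.

```c
/*
 * Plain long double addition. The C code is the same on every
 * platform; only the generated code differs. On sparc64 this is
 * expected to become a call into the softfloat support code.
 */
long double quad_add(long double a, long double b)
{
    return a + b;
}
```

Disassembling the resulting object file with objdump is an easy way to confirm which helper functions the kernel and libroot need to provide.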

The support code for this is, in our case, compiled into both the kernel and libroot. It lives in src/system/libroot/os/arch/sparc/softfloat.c (and other support files). This code was extracted from FreeBSD rather than glibc, because that made it much easier to get it building in the kernel.

Openboot bootloader

Openboot is Sun’s implementation of Open Firmware. So we should be able to share a lot of code with the PowerPC port. There are some differences however.

Executable format

PowerPC uses COFF. SPARC uses a.out, which is a lot simpler. According to the spec, some fields should be zeroed out, but it also says implementations may choose to allow other values, so a standard a.out file works as well.

It used to be possible to generate one with objcopy, but support was removed, so we now use elf2aout (imported from FreeBSD).

The file is first loaded at 4000, then relocated to its load address (we use 202000) and executed there.
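For reference, the sketch below decodes an a.out header, assuming the classic layout of eight 32-bit big-endian words with the magic number in the low 16 bits of the first word. The struct and function names are invented for this example, and the layout should be checked against the Open Firmware SPARC binding before relying on it.

```c
#include <stdint.h>

/* Assumed classic a.out header: eight 32-bit words, 32 bytes total. */
struct aout_header {
    uint32_t info;      /* flags, machine type and magic number */
    uint32_t text;      /* size of the text segment */
    uint32_t data;      /* size of the data segment */
    uint32_t bss;       /* size of the bss segment */
    uint32_t syms;      /* size of the symbol table */
    uint32_t entry;     /* entry point address */
    uint32_t trsize;    /* size of text relocations */
    uint32_t drsize;    /* size of data relocations */
};

/* Read one big-endian 32-bit word, one byte at a time. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
        | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode the first 32 bytes of a file into header fields. */
void aout_parse_header(const unsigned char raw[32], struct aout_header *h)
{
    uint32_t *fields = &h->info;
    for (int i = 0; i < 8; i++)
        fields[i] = read_be32(raw + 4 * i);
}

/* Magic number: low 16 bits of the first word (assumption). */
uint16_t aout_magic(const struct aout_header *h)
{
    return (uint16_t)(h->info & 0xFFFF);
}
```

Reading the words byte by byte keeps the parser independent of host endianness and of buffer alignment.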

Openfirmware prompt

To get the prompt on display, use STOP+A at boot until you get the “ok” prompt. On some machines, if no keyboard is detected, the ROM will assume it is set up in headless mode, and will expect a BREAK+A on the serial port.

STOP+N resets all variables to default values (in case you messed up input or output, for example).

Useful commands

Disable autoboot to get to the openboot prompt and stop there

setenv auto-boot? false

Configuring for keyboard/framebuffer io

setenv screen-#columns 160
setenv screen-#rows 49
setenv output-device screen:r1920x1080x60
setenv input-device keyboard

Configuring openboot for serial port

setenv ttya-mode 38400,8,n,1,-
setenv output-device ttya
setenv input-device ttya
reset

Boot from network

The openboot bootloader supports network booting. See the Network booting guide for general information about the network booting process. This page documents the parts specific to the openboot bootloader configuration.

In openboot, booting from the network is done simply by using the “net:” device alias in the boot command line. This lets openboot load our bootloader, which then uses the openboot ability to send and receive data over the network to load the filesystem (and kernel contained in it) over the network. The two parts are independent: it’s also possible to load the bootloader from the network but boot a local filesystem, or use the local bootloader and load the filesystem from the network.

The bootloader needs to be placed on a TFTP server. I use atftpd on Debian, which serves files from /srv/tftp/ (so “somefile” in the examples below will look for /srv/tftp/somefile).

static ip

This currently works best, because rarp does not let the called binary know the IP address. We need the IP address if we want to mount the root filesystem using remote_disk_server.

boot net:192.168.1.2,somefile,192.168.1.89

The first IP is the server from which to download (using TFTP), the second is the client IP to use. Once the bootloader starts, it will detect that it was booted from the network and look for the remote_disk_server on the same machine.
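To show how the three comma-separated fields fit together, here is a sketch of a parser for such a boot specification. It is purely illustrative: the struct, the function names and the buffer sizes are assumptions, not the code used by the bootloader. Empty fields are accepted, which also covers the shorter forms of the command.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parsed fields. */
struct net_boot_args {
    char server[16];    /* TFTP server IP, may be empty */
    char file[64];      /* file to download */
    char client[16];    /* client IP, may be empty */
};

/* Copy len bytes into dst as a string, failing if it does not fit. */
static bool copy_field(char *dst, size_t dstSize, const char *src, size_t len)
{
    if (len >= dstSize)
        return false;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return true;
}

/* Parse "net:server,file,client" into out. Returns false on error. */
bool parse_net_boot_args(const char *spec, struct net_boot_args *out)
{
    char *fields[3] = { out->server, out->file, out->client };
    size_t sizes[3] = { sizeof(out->server), sizeof(out->file),
        sizeof(out->client) };

    if (strncmp(spec, "net:", 4) != 0)
        return false;
    spec += 4;

    for (int i = 0; i < 3; i++) {
        size_t len = strcspn(spec, ",");
        if (!copy_field(fields[i], sizes[i], spec, len))
            return false;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    /* Anything left over means there were too many fields. */
    return *spec == '\0';
}
```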

rarp

This needs a reverse ARP server (easy to set up on any Linux system). You need to list the MAC address of the SPARC machine in /etc/ethers on the server. The machine will get its IP, and will use TFTP to the server which replied, to get the boot file from there.

boot net:,somefile

(net is an alias to the network card and also sets the load address: /pci@1f,4000/network@1,1)

This currently does not work completely: the server address is not forwarded to the bootloader, and as a result, remote filesystems will not be available. The bootloader needs to be updated to know where to find the address in this case (it is done for PowerPC, I think).

dhcp

This needs a DHCP/BOOTP server configured to send the info about where to find the file to load and boot.

boot net:dhcp

Debugging

The openboot environment provides several useful commands to assist in debugging:

202000 dis (disassemble starting at 202000 until next return instruction)
4000 1000 dump (dump 1000 bytes from address 4000)
.registers (show global registers)
.locals (show local/windowed registers)
%pc dis (disassemble code being executed)
ctrace (backtrace)

The backtrace provides addresses and register values (which lets you recover the function arguments), but no symbols or function names are printed. objdump (on the build machine) can be used to disassemble the kernel or bootloader and find the corresponding code:

./cross-tools-sparc/bin/sparc64-unknown-haiku-objdump -d objects/haiku/sparc/release/system/kernel/kernel_sparc |c++filt|less
./cross-tools-sparc/bin/sparc64-unknown-haiku-objdump -d objects/haiku/sparc/release/system/boot/openfirmware/boot_loader_openfirmware |c++filt|less