The Odd Kid on the Block

Running ARM in BE8 Big Endian Mode

Martin Husemann
martin@NetBSD.org

Introduction

Modern ARM CPUs offer bi-endian support:
the CPU can switch between little and big endian mode at run time.
When I got a CubieTruck board for my test lab, I expected lots of things to break when trying a big endian setup:
"All the world's a VAX" and ARM is little endian.
And lots of things did break, but many in unexpected ways.
In the end we were able to fix all bugs found.

Thanks to

In the end we were able to fix ...
and that "we" includes:

Matt Thomas
Nick Hudson
Jared McNeill
Jörg Sonnenberger
FUKAUMI Naoki

Special thanks to everyone involved (even if I happened to forget to mention you in the list above)!

New Device Fun

The CubieTruck is a tiny board based on the Allwinner A20 SoC with lots of gadgets and useful peripherals on board.

I put it into an old SCSI enclosure, together with a SATA disk.

Missing drivers

Image from cubieboard.org under creative commons license.

Initially quite a few drivers were missing, but as a few other developers received a CubieBoard or CubieTruck at the same time, this got fixed quickly.

Missing network

The most important driver missing was the gigabit ethernet core on the CubieTruck:

Based on a Synopsys DesignWare IP core
Used in slightly different variations in other Allwinner based devices (Banana Pi, Hummingbird A31 and A80)
Later also found on Odroid-C1
No usable documentation available from Synopsys or Sunxi (the Allwinner manufacturer)
But luckily we found some documentation from other SoC vendors using a later version of the same core.

With the help of others, I wrote the "awge" driver for the network part.

Now the only important missing parts are:

NAND (work in progress; Sunxi just released source code for a Linux driver, before that only a binary blob was available)
WiFi (no docs available).

Otherwise the CUBIETRUCK support is pretty complete.

The ARM BE8 ELF Image Format

ARM has supported big endian CPUs for quite a while, but the old "big endian mode" had some quirks built into it that got in the way when ARM moved on to scalar features, SIMD, and eventually the 64-bit AArch64 architecture.

So ARM dropped support for the old big endian mode, renamed the object format used to "BE32" and created a new object format, called "BE8".

New Object Format?

Marketing required compatibility with old object files.

But the new cores required all instructions to be encoded in little endian byte order!

Luckily all ARM provided tools already marked code sections with special symbols.

So the "compatibility magic" was put into the linker, and the ABI extended by a subset of the special symbols always used by ARM tools.

Special Symbols

Simple idea: find all 32bit code parts and swap the instructions accordingly, find all 16bit thumb instruction parts and swap those.

$a, $a.1 .. $a.N
marks the start of a sequence of 32bit "arm" instructions
$t, $t.1 .. $t.N
marks the start of a sequence of 16bit "thumb" instructions
$d, $d.1 .. $d.N
marks the start of other data (like static strings).

Problems Found

After solving basic support issues and having the machine boot up to multiuser, various issues showed up:

Endianness related problems
Other issues

Early console did not work

To allow sending debug output to the serial port before attaching any drivers, a simple polled "early console" is set up.

This did not work: it printed garbage and caused a long delay before the kernel com driver attached and took over console output.

After noticing that the early console worked well for little endian kernels, the bug was easy to spot:

The three minimalistic functions used for the polled console access the hardware directly (no bus_space abstraction involved), and those accesses did not byte swap when needed.

After fixing, the code to wait for the com device to become ready to transmit a character looks like this:

 while ((le32toh(uart_base[com_lsr]) & LSR_TXRDY) == 0 && --timo > 0)
     ;

Adding a few le32toh() and htole32() calls, like in the example above, fixed the issue.
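The effect of the swap can be reproduced in portable C. The following is a minimal sketch, not the kernel code: the register is modelled as four bytes in memory, `ex_le_load32()` stands in for what `le32toh()` achieves after a native load, and the names `ex_le_load32`, `ex_tx_ready` and the bit value `EX_LSR_TXRDY` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit for "transmitter ready"; the real value comes from the
 * com register definitions. */
#define EX_LSR_TXRDY 0x20u

/* Read a 32-bit little-endian value regardless of host byte order.
 * This is the result le32toh() produces after a native 32-bit load. */
static uint32_t
ex_le_load32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Nonzero if the little-endian status register has the TXRDY bit set.
 * Note the bitwise AND: the mask selects a single bit. */
static int
ex_tx_ready(const uint8_t *lsr_reg)
{
    return (ex_le_load32(lsr_reg) & EX_LSR_TXRDY) != 0;
}
```

A big endian CPU doing a plain native load of the bytes `20 00 00 00` would see 0x20000000, so the test for bit 0x20 would never succeed and the loop would run until the timeout expires, which fits the observed long delay.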

MMC driver's DMA

The MMC driver uses DMA to transfer data from the SD card to memory.

The setup code for the DMA descriptors must explicitly swap the data passed to the DMA engine into the endianness expected by the device, which is always little endian on this SoC.

A typical DMA descriptor consists of a few status/command bits and a target address for the operation. Imagine what goes wrong if the address is in opposite byte order: the engine will overwrite arbitrary memory.

Again, adding a few htole32() calls fixed the issue.
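To make the failure mode concrete, here is a small sketch of filling a descriptor in device byte order. The descriptor layout and the names `ex_dma_desc`, `ex_le_store32` and `ex_dma_desc_fill` are hypothetical and not taken from the MMC driver; the byte-wise store gives the same memory image as storing `htole32(v)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one command/status word and one buffer
 * address, both read by the device as little endian. */
struct ex_dma_desc {
    uint8_t cmd[4];
    uint8_t addr[4];
};

/* Store v in little-endian byte order, whatever the host order is. */
static void
ex_le_store32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Fill a descriptor so that the device sees the intended values. */
static void
ex_dma_desc_fill(struct ex_dma_desc *d, uint32_t cmd, uint32_t busaddr)
{
    assert(d != NULL);
    ex_le_store32(d->cmd, cmd);
    ex_le_store32(d->addr, busaddr);
}
```

For a target address of 0x4a001000 the device expects the bytes `00 10 00 4a`. A big endian host storing the value without the swap would write `4a 00 10 00`, which the device reads as address 0x0010004a.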

Kernel Modules

Loadable kernel module files do not go through a final link (with the -be8 option); instead they are handled by the kernel object loader.

The loader had to be taught about the magic $a (and friends) symbols and to do the byte swapping after loading.

Luckily we have a proper machine dependent function called after symbol loading, usually only doing some data/instruction cache consistency flushing. Inserting a fix up call there looks like this:

Swapping Instruction Byte Order

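 int
 kobj_machdep(kobj_t ko, void *base, size_t size, bool load)
 {
      if (load) {
 #if __ARMEB__
           /* BE8: swap the module's instructions to little endian */
           if (CPU_IS_ARMV7_P())
                kobj_be8_fixup(ko);
 #endif
           /* ... cache flushing follows (not shown on the slide) ... */
      }
      return 0;
 }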

Then we need a simple function to categorize symbols:

Categorizing Special Symbols

 static enum be8_magic_sym_type
 be8_sym_type(const char *name, int info)
 {
     if (ELF_ST_BIND(info) != STB_LOCAL)
         return Other;
     if (ELF_ST_TYPE(info) != STT_NOTYPE)
         return Other;
     if (name[0] != '$' || name[1] == '\0' ||
        (name[2] != '\0' && name[2] != '.'))
         return Other;
     switch (name[1]) {
     case 'a':
         return ArmStart;
     case 'd':
         return DataStart;
     case 't':
         return ThumbStart;
     default:
         return Other;
     }
 }

Iterating Symbols

The following code iterates over all symbols in the newly loaded module:

 /*
  * Count all special relocation symbols
  */
 ksyms_mod_foreach(ko->ko_name, be8_ksym_count, &relsym_cnt);

where relsym_cnt is:

 size_t relsym_cnt = 0;

and the callback function be8_ksym_count looks like:

Count Special Symbols

 static int
 be8_ksym_count(const char *name, int symindex, void *value,
     uint32_t size, int info, void *cookie)
 {
     size_t *res = cookie;
     enum be8_magic_sym_type t = be8_sym_type(name, info);

     if (t != Other)
         (*res)++;
     return 0;
 }

Swapping the Instructions

After counting, we allocate storage for all the relevant symbols and run another iteration in which each symbol is stored together with its type and address.

This array is then sorted by address (calling kheapsort()).

Finally we run through the array in ascending address order, and for all sections, depending on type of the symbol describing it, swap 32 bits, swap 16 bits, or do nothing.
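The sort-and-swap pass can be sketched in userland C as follows. This is an illustration under stated assumptions, not the kernel implementation: it uses `qsort()` where the kernel calls `kheapsort()`, it treats each region as running up to the next marker or the end of the image, and all `ex_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum ex_marker_type { EX_ARM, EX_THUMB, EX_DATA };

struct ex_marker {
    size_t offset;              /* start of the region in the image */
    enum ex_marker_type type;   /* derived from $a, $t or $d */
};

/* Order markers by ascending offset. */
static int
ex_marker_cmp(const void *a, const void *b)
{
    const struct ex_marker *ma = a, *mb = b;
    return (ma->offset > mb->offset) - (ma->offset < mb->offset);
}

/* Reverse the bytes of each width-byte unit in [start, end). */
static void
ex_swap_units(uint8_t *img, size_t start, size_t end, size_t width)
{
    assert(start <= end);
    for (size_t off = start; off + width <= end; off += width) {
        for (size_t i = 0; i < width / 2; i++) {
            uint8_t tmp = img[off + i];
            img[off + i] = img[off + width - 1 - i];
            img[off + width - 1 - i] = tmp;
        }
    }
}

/* Sort the markers, then fix up each region up to the next marker. */
static void
ex_be8_fixup(uint8_t *img, size_t size, struct ex_marker *m, size_t n)
{
    qsort(m, n, sizeof(*m), ex_marker_cmp);
    for (size_t i = 0; i < n; i++) {
        size_t end = (i + 1 < n) ? m[i + 1].offset : size;
        if (m[i].type == EX_ARM)
            ex_swap_units(img, m[i].offset, end, 4);   /* 32-bit ARM */
        else if (m[i].type == EX_THUMB)
            ex_swap_units(img, m[i].offset, end, 2);   /* 16-bit Thumb */
        /* EX_DATA: leave the bytes untouched */
    }
}
```

Sorting first is what makes the region boundaries implicit: each marker only records where a region starts, and the next marker in address order tells where it ends.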

kobj_machdep() will flush caches after we are done.

libgcc

In a standard build, NetBSD does not use the gcc provided build infrastructure to build libgcc. Instead, all "configury" is done up front during a step called "mknative", and the resulting makefile fragments and header files are then committed to the NetBSD tree.

To make sure the symbols in libgcc are all created with visibility "hidden", some tricks are played on the intermediate object files that used to include a "strip" and a "ld -r" invocation.

Now the resulting libgcc, while still being a linkable object, had all $a, $t and $d symbols stripped.

But: when finally linking executables, the linker is invoked with the -be8 option and tries to do the magic byte swapping for thumb and arm instructions.

If during this link it pulls in some function from libgcc, the swapping will not work for that function, as we stripped the "unused" local symbols in the libgcc pre-linking step.

The first program affected during a multi-user boot was fc-cache (run after installation or X/font related updates); it uses __popcountsi2 on armv7.

Trivial Fix

The obvious fix: do not strip.

But other symbols had to be removed, as we would otherwise get duplicated symbols (it is a mess, better not to ask).

So strip got replaced by a slightly more magic objcopy invocation that removed all local symbols but left the special ones in place.

Run in GDB

Gdb did work on core files, but not when trying to start programs from within it using the "run" command.

Trying to "run" anything caused weird failures inside ld.elf_so, which led me to make an "educated guess" that was spot-on.

When gdb tries to run a child process, it inserts a breakpoint in a known-to-gdb dummy function in the dynamic loader. On NetBSD this function is (in ARM assembly):

 0x21e8 <_rtld_debug_state>: bx lr

(in C it is just an empty function with some magic to prevent the compiler from optimizing it too much)

ld.elf_so and gdb interaction

This breakpoint is hit whenever a new shared object is loaded by ld.elf_so (and the main program binary is the first one to trigger). Gdb then extracts all necessary information about the new shared library (or main module) and continues from the breakpoint.

On ARM, a gdb breakpoint is done by temporarily replacing the instruction with a special illegal instruction and then trapping the resulting SIGILL via ptrace. The replacement instruction is coded as an array of bytes in gdb, and there is a little endian and a big endian variant.

Out of sync

Code inspection showed that Gdb upstream had added a new member "byte_order_for_code" to the struct "gdbarch_info", but the NetBSD specific breakpoint instruction was still selected by the old "byte_order" member. Upstream had fixed it for all other ARM targets, but somehow the NetBSD specific code was not in sync and did not get updated.

This should not happen at various levels, but...

What happened?

So gdb selected the wrong endian encoding for the illegal breakpoint instruction, and instead of causing a trap, this just did some random arithmetic and continued in whatever function happened to live in ld.elf_so next to it:

 0x21e8 <_rtld_debug_state>: smlattne r0, r6, r0, r0
 0x21ec <_rtld_objlist_clear>: mov r12, sp
 0x21f0 <_rtld_objlist_clear+4>: push {r3, r4, r11, r12, lr, pc}

(smlattne = signed multiply long accumulate, top half * top half, conditional if "not equal")

This corrupted internal ld.elf_so data structures immediately before returning to ld.elf_so internal fixup work.

Obvious fix: select breakpoint according to new member (and better sync with upstream)
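The mismatch can be sketched in a few lines of C. The breakpoint word 0xe6000011 used here is inferred from the disassembly above (its byte-reversed form, 0x110000e6, is the encoding of the "smlattne" instruction shown); the `ex_` names are hypothetical and the code is not taken from gdb.

```c
#include <assert.h>
#include <stdint.h>

enum ex_byte_order { EX_LITTLE, EX_BIG };

/* The same breakpoint word, laid out for each byte order. */
static const uint8_t ex_bp_le[4] = { 0x11, 0x00, 0x00, 0xe6 };
static const uint8_t ex_bp_be[4] = { 0xe6, 0x00, 0x00, 0x11 };

/* Pick the byte sequence to write over the original instruction. */
static const uint8_t *
ex_select_breakpoint(enum ex_byte_order order)
{
    assert(order == EX_LITTLE || order == EX_BIG);
    return order == EX_BIG ? ex_bp_be : ex_bp_le;
}

/* Decode four bytes the way a BE8 CPU fetches instructions:
 * always little endian, independent of the data byte order. */
static uint32_t
ex_fetch_insn_be8(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
        ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

On a BE8 target the data byte order is big endian while the code byte order is little endian. Selecting by the code byte order yields the intended 0xe6000011; selecting by the data byte order makes the CPU fetch 0x110000e6 instead.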

Non Endian Related Problems

We found a few issues that were unrelated to endianness, and especially the latter two would have shown up on similar tests with a little endian system as well:

C++ exceptions did not work "sometimes"
FPU exceptions missing
Unaligned data access unexpectedly works

C++ exceptions

NetBSD offers the standard Itanium interface for unwinding stacks (but not the convoluted HP version of the API).

The implementation is derived from LLVM's Compiler-RT, but it had a bug that got triggered during the ATF (Automated Testing Framework) internal tests.

CFI and unwinding

ATF is written in C++ and heavily relies on exceptions, e.g. to abort a failing test case.

One of the steps in exception unwinding is to identify the call frame from the current %pc value and then use the corresponding Call Frame Information stored by the compiler to unwind the stack properly and find the parent call frame.

Code details on BE8 were just different enough to trigger a bug in the binary search that identifies the relevant CFI entry.
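For readers unfamiliar with this lookup, the following is a generic sketch of a binary search over a table of address ranges sorted by start address. It is not the Compiler-RT derived code and makes no claim about what the actual bug was; the table layout and the names `ex_fde` and `ex_find_fde` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one per function, sorted by pc_start,
 * with non-overlapping ranges. */
struct ex_fde {
    uintptr_t pc_start;
    uintptr_t pc_end;       /* exclusive */
};

/* Return the entry whose [pc_start, pc_end) contains pc, or NULL. */
static const struct ex_fde *
ex_find_fde(const struct ex_fde *tab, size_t n, uintptr_t pc)
{
    size_t lo = 0, hi = n;  /* search the half-open index range [lo, hi) */

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tab[mid].pc_start)
            hi = mid;               /* pc lies before this entry */
        else if (pc >= tab[mid].pc_end)
            lo = mid + 1;           /* pc lies after this entry */
        else
            return &tab[mid];       /* pc is inside this entry */
    }
    return NULL;
}
```

Searches like this are sensitive to boundary handling (first entry, last entry, a pc exactly at a range end), so a small change in code layout can be enough to hit a path that was never exercised before.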

Hardware differences

The remaining issues could be called "false positives" - test failures that got fixed by fixing the tests.

The NEON FPU in the Cortex-A7 does not support raising IEEE exceptions.

Userland can detect this by setting FP_X_INV in the exception mask and reading it back: on Cortex NEON it will not "stick".

Exceptionless FPUs

So tests grew code like this:

 #elif defined(__arm__) && !__SOFTFP__
     /*
      * Some NEON fpus do not implement IEEE exception handling,
      * skip these tests if running on them and compiled for
      * hard float.
      */
     if (0 == fpsetmask(fpsetmask(FP_X_INV)))
          atf_tc_skip("FPU does not implement exception handling");
 #endif

Unaligned access

ARM CPUs post version 5 can do unaligned data access.

Some of the signal tests explicitly try to trigger a SIGBUS for this, and failed on these CPUs.

Other architectures already provide a sysctl, sometimes even writable, to control/detect this behavior, so this was added to ARM as well and tests grew code like:

 #if defined(__alpha__) || defined(__arm__)
     int rv, val;
     size_t len = sizeof(val);
     rv = sysctlbyname("machdep.unaligned_sigbus", &val, &len, NULL, 0);
     ATF_REQUIRE(rv == 0);
     if (val == 0)
          atf_tc_skip("No SIGBUS signal for unaligned accesses");
 #endif

Conclusions

In retrospect, the issues found and fixed were fewer than expected. Most time was spent on typical problems when bringing up new hardware.

Once basic testing worked, the automatic tests (via ATF) proved very valuable again, but some issues did not get noticed, so there is further room for improvement.

Next challenge: Firefox on BE8 ARM.

Questions?

Thanks!