High-level languages offer great advantages in general by hiding many mundane and repetitive details from programmers, allowing them to concentrate on their goals. However, sometimes programmers must use a lower-level language, such as when writing code that deals directly with hardware or that is extremely performance sensitive. Assembly language is the programming language closest to the hardware, which makes it a natural last resort in such situations.
This article assumes a basic understanding of computer design (for example, you should know that a processor has registers and can access memory) and of operating systems (system calls, exceptions, process stacks). This article should be useful to PowerPC programmers unfamiliar with assembly as well as programmers who already know ia32 assembly and want to broaden their horizons.
Introduction to PowerPC
The PowerPC Architecture Specification, released in 1993, is a 64-bit specification with a 32-bit subset. Almost all PowerPCs generally available (with the exception of late-model IBM RS/6000 and all IBM pSeries high-end servers) are 32-bit.
PowerPC processors have a wide range of implementations, from high-end server CPUs such as the Power4 to the embedded CPU market (the Nintendo GameCube uses a PowerPC). PowerPC processors have a strong embedded presence because of good performance, low power consumption, and low heat dissipation. The embedded processors, besides integrating I/O such as serial and Ethernet controllers, can be significantly different from the “desktop” CPUs. For example, the 4xx series PowerPC processors lack floating point, and also use a software-controlled TLB for memory management rather than the inverted page table found in desktop chips.
PowerPC processors have 32 (32- or 64-bit) GPRs (General Purpose Registers) and various others such as the PC (Program Counter, also called the IAR/Instruction Address Register or NIP/Next Instruction Pointer), LR (link register), CR (condition register), etc. Some PowerPC CPUs also have 32 64-bit FPRs (floating point registers).
PowerPC architecture is an example of a RISC (Reduced Instruction Set Computing) architecture. As a result:
- All PowerPCs (including 64-bit implementations) use fixed-length 32-bit instructions.
- The PowerPC processing model is to retrieve data from memory, manipulate it in registers, then store it back to memory. There are very few instructions (other than loads and stores) that manipulate memory directly.
Application binary interfaces
Technically, a developer can use any GPR for anything. For example, there is no “stack pointer register”; a programmer could use any register for that purpose. In practice, it is useful to define a set of conventions so that binary objects can interoperate with different compilers and pre-written assembly code.
Calling conventions are determined by the ABI (Application Binary Interface) used. ppc32 Linux and NetBSD implementations use the SVR4 (System V R4) ABI, but ppc64 Linux follows AIX and uses the PowerOpen ABI. The ABI specifies which registers are considered volatile (caller-save) and non-volatile (callee-save) when calling subroutines, and a lot more.
Some concrete examples of behavior specified by the SVR4 ABI:
- Since the PowerPC has so many GPRs (32 compared to ia32’s 8), arguments are passed in registers starting with gpr3.
- Registers gpr3 through gpr12 are volatile (caller-save) registers that (if necessary) must be saved before calling a subroutine and restored after returning.
- Register gpr1 is used as the stack frame pointer.
Many of the SVR4 features are identical to the PowerOpen ABI, which greatly aids interoperability.
When to use assembly
All the pros and cons listed in the “Assembly HOWTO” (see Related topics for a link) apply to PowerPC.
Sometimes you must touch CPU registers that higher-level languages are completely unaware of. This is especially true in the course of writing an operating system. One simple example is assigning your code its own stack: on a PowerPC, you must set r1. A C compiler will only increment or decrement r1, so if your application is running directly on the hardware, you must set r1 before calling C code. Another example is an operating system’s exception handlers, which must carefully save and restore state one register at a time until it’s safe to call higher-level code.
Nonetheless, when faced with a situation in which you must use low-level hardware features, you should implement as little as possible in assembly:
- C code is portable and understood by a large number of developers; assembly code (especially PowerPC assembly) is not.
- Higher-level code is frequently much easier to debug than assembly.
- Higher-level code is by definition more expressive than assembly; in other words you can do more with less code (and in less time).
If you find yourself writing high-level constructs such as loops or C structures in assembly, take a step back and consider if this could be done more easily in another language. A general rule is to use just enough assembly to allow you to use a higher-level language.
One of the most common reasons people want to use assembly language is to make a slow program run faster. But in these cases, assembly should be the absolute last place you turn.
General advice on optimization is beyond the scope of this document, but here are some places to start:
You must profile your code before starting any optimization work. Not only will this tell you where the hotspots are (they’re frequently not where you expect!), it will also give you proof that you’ve sped anything up once you’re done. Once you find hotspots, you can begin optimizing the high-level code (rather than attempting to rewrite it in assembly).
No matter how tight your assembly is, if you’re using an n^4 algorithm, you’re still going to be incredibly slow. Some other techniques you should try first include using a more appropriate data structure. If you iterate repeatedly over a linked list, think about using a hash table, binary tree, or whatever is appropriate for your application.
Your compiler can almost always do a much better job than you can at writing assembly! Rather than attempting to rewrite high-level code in assembly, make judicious use of optimization options such as -O3 and C directives like __inline__. The compiler is aware of tricks like instruction scheduling, which considers the internals of the processor and tries to keep all pipelines full at all times. That may involve moving loads earlier in the instruction stream than required to keep the pipeline from stalling as the CPU waits for memory accesses to catch up. Unless you’ve been coding assembly for many years, these are tasks that most people cannot correctly perform by hand.
How to learn assembly
gcc is the best place to start learning assembly (for any architecture). gcc -O3 -S file.c will produce file.s in gas-compilable format (gas is the GNU Assembler). Open file.s in your favorite editor and you can see the assembly output from your C code.
You’ll probably see instructions you don’t understand. You can look them up in The PowerPC Architecture: A Specification for a New Family of RISC Processors, 2nd Ed. and PowerPC Microprocessor Family: The Programming Environments for 32-bit Microprocessors (see Related topics for links to these documents). However, as with learning any (spoken) language, there are certain words that are important and that you should know, and others that can be safely ignored until you’ve figured out more important features of the code. A good example of important instructions is the branch family.
Hello World — ia32 assembly
Listing 1 is copied directly from the gas example in the Assembly HOWTO, which unfortunately is completely ia32-specific. It makes two direct system calls: the first writes to stdout; the second exits the application (with a return code of 0). It is very unusual to make system calls directly; normally applications link with a libc library, which wraps all the system calls.
Listing 1. ia32 assembly
.data                           # section declaration
msg:
        .string "Hello, world!\n"
        len = . - msg           # length of our dear string
.text                           # section declaration
                        # we must export the entry point to the ELF linker or
    .global _start      # loader. They conventionally recognize _start as their
                        # entry point. Use ld -e foo to override the default.
_start:
# write our string to stdout
        movl    $len,%edx       # third argument: message length
        movl    $msg,%ecx       # second argument: pointer to message to write
        movl    $1,%ebx         # first argument: file handle (stdout)
        movl    $4,%eax         # system call number (sys_write)
        int     $0x80           # call kernel
# and exit
        movl    $0,%ebx         # first argument: exit code
        movl    $1,%eax         # system call number (sys_exit)
        int     $0x80           # call kernel
Hello World — PPC32 assembly
Listing 2 is a straightforward translation of the same code into PowerPC assembly.
Listing 2. PPC32 assembly
.data                       # section declaration - variables only
msg:
        .string "Hello, world!\n"
        len = . - msg       # length of our dear string
.text                       # section declaration - begin code
        .global _start
_start:
# write our string to stdout
        li      0,4         # syscall number (sys_write)
        li      3,1         # first argument: file descriptor (stdout)
                            # second argument: pointer to message to write
        lis     4,msg@ha    # load top 16 bits of &msg
        addi    4,4,msg@l   # load bottom 16 bits
        li      5,len       # third argument: message length
        sc                  # call kernel
# and exit
        li      0,1         # syscall number (sys_exit)
        li      3,1         # first argument: exit code
        sc                  # call kernel
General notes about Listing 2
PowerPC assembly requires a destination register for all register-to-register operations (because it is a RISC architecture). This register is always the first in the argument list.
Under PPC Linux, system calls are made with the syscall number in gpr0 and arguments beginning with gpr3. The syscall number, order of arguments, and number of arguments may differ under other PowerPC operating systems (NetBSD, Mac OS, etc.), which is one reason programmers typically make system calls through a libc library (which handles the OS-specific details).
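As a rough illustration of that convention, the following Python sketch models which registers hold which values just before each sc in Listing 2. The helper name syscall_registers, the dictionary representation, and the message address are hypothetical and exist only for this example; the syscall numbers 4 and 1 are the ones used in Listing 2.

```python
def syscall_registers(number, *args):
    """Map a syscall number and its arguments onto GPR numbers.

    Models the PPC Linux convention described above: the syscall
    number goes in gpr0 and the arguments go in gpr3, gpr4, gpr5, ...
    """
    regs = {0: number}
    for offset, value in enumerate(args):
        regs[3 + offset] = value
    return regs


MSG_ADDRESS = 0x10010000  # hypothetical address of msg, for illustration only
MSG_LENGTH = len(b"Hello, world!\n\0")  # .string appends a terminating NUL

# sys_write(stdout, msg, len) followed by sys_exit(1), as in Listing 2.
write_regs = syscall_registers(4, 1, MSG_ADDRESS, MSG_LENGTH)
exit_regs = syscall_registers(1, 1)

print(write_regs)
print(exit_regs)
```

Note that gpr1 and gpr2 are skipped: arguments start at gpr3, which matches the calling convention for ordinary subroutines described earlier.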
PowerPC registers have numbers, not names. For the learner, this can sometimes be confusing since literals aren’t easily distinguishable from registers. “3” could mean the value 3 or the register gpr3, or floating point fpr3, or special purpose register spr3. Get used to it. 🙂
li means “load immediate”, which is a way of saying “take this constant value known at compile time and store it in this register”. Another example of an immediate instruction is addi; for example, addi 3,3,1 would increment the contents of gpr3 by 1, then store the result back into gpr3. Contrast this with add 3,3,1, which increments the contents of gpr3 by the contents of gpr1, storing the result back into gpr3. Instructions ending in “i” are usually immediate instructions.
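The difference between an immediate operand and a register operand can be sketched with a toy register file in Python. This is only a model under simplifying assumptions: registers are 32 bits wide, the function names addi and add simply mirror the instruction names, and special cases of the real instructions (such as sign extension of the immediate) are ignored.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def addi(regs, rt, ra, imm):
    """Model of addi rt,ra,imm: add the constant imm to the contents of ra."""
    regs[rt] = (regs[ra] + imm) & MASK32


def add(regs, rt, ra, rb):
    """Model of add rt,ra,rb: add the contents of rb to the contents of ra."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK32


regs = [0] * 32
regs[1] = 100  # arbitrary value in gpr1
regs[3] = 5    # arbitrary value in gpr3

addi(regs, 3, 3, 1)      # addi 3,3,1: the last operand is the value 1
after_addi = regs[3]

regs[3] = 5
add(regs, 3, 3, 1)       # add 3,3,1: the last operand is register gpr1
after_add = regs[3]

print(after_addi, after_add)  # 6 105
```

The operand text "3,3,1" is identical in both calls; only the instruction name tells you whether the final 1 is a literal or a register number.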
li isn’t really an instruction; it’s actually a mnemonic. A mnemonic is a bit like a preprocessor macro: it’s an instruction that the assembler will accept but secretly translate into other instructions. In this case, li 3,1 is really defined as addi 3,0,1.
The sharp-eyed will notice that those instructions aren’t necessarily the same thing: addi is really adding 1 to the contents of gpr0, storing the result into gpr3, right? That would be true, except the PowerPC spec says gpr0 sometimes has a value, and sometimes is read as 0, depending on the context. In this case (and the addi description states this explicitly), the 0 means value 0 rather than register gpr0.
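A minimal Python sketch of that rule follows, assuming 32-bit registers and a signed 16-bit immediate. The function names mirror the instruction names but are otherwise just illustrative; the point is that a 0 in the source-register position of addi is taken as the value 0, so li never depends on what gpr0 happens to hold.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def addi(regs, rt, ra, imm):
    """Model of addi: a source register number of 0 means the value 0."""
    base = 0 if ra == 0 else regs[ra]
    regs[rt] = (base + sign_extend16(imm)) & MASK32


def li(regs, rt, imm):
    """li rt,imm is accepted by the assembler and emitted as addi rt,0,imm."""
    addi(regs, rt, 0, imm)


regs = [0] * 32
regs[0] = 0xDEADBEEF      # whatever gpr0 contains is irrelevant to li
li(regs, 3, 1)
li(regs, 4, -1)

print(hex(regs[3]), hex(regs[4]))  # 0x1 0xffffffff
```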
Mnemonics shouldn’t matter at all to anyone other than assembler developers, but mnemonics can be confusing when you’re looking at disassembly output. However, GNU objdump -d is quite good at displaying the original mnemonic rather than the instruction actually present in the file. For example, objdump will display the mnemonic nop rather than ori 0,0,0 (the actual instruction used).
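To make the nop case concrete, here is a toy disassembler in Python that recognizes only ori (primary opcode 24) and prints the nop mnemonic for the all-zero operand form. It is a sketch of the idea, not a description of how objdump is implemented; the function names and the ".long" fallback for unrecognized words are choices made for this example.

```python
ORI_OPCODE = 24  # primary opcode of ori


def decode_ori(word):
    """Return (rs, ra, ui) if word encodes ori, otherwise None."""
    if (word >> 26) & 0x3F != ORI_OPCODE:
        return None
    rs = (word >> 21) & 0x1F   # source register field
    ra = (word >> 16) & 0x1F   # destination register field
    ui = word & 0xFFFF         # unsigned 16-bit immediate
    return rs, ra, ui


def disassemble(word):
    """Render one 32-bit word, preferring the nop mnemonic when it applies."""
    fields = decode_ori(word)
    if fields is None:
        return ".long 0x%08x" % word
    if fields == (0, 0, 0):
        return "nop"
    rs, ra, ui = fields
    return "ori %d,%d,%d" % (ra, rs, ui)   # assembly order: dest, source, imm


print(disassemble(0x60000000))  # nop
print(disassemble(0x60841234))  # ori 4,4,4660
```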
The most interesting part of our Hello World example is how we load the address of msg. As mentioned earlier, PowerPC uses fixed-length 32-bit instructions (in contrast to ia32, which uses variable-length instructions). That 32-bit instruction is just a 32-bit integer. This integer is divided into fields of different sizes:
Listing 3. addi machine code format
 -------------------------------------------------------------------
 |  opcode  | dest register | src register |   immediate value     |
 |  6 bits  |    5 bits     |    5 bits    |       16 bits         |
 -------------------------------------------------------------------
The number of fields and their sizes will vary by instruction, but the important point here is that these fields take up space in the instruction. In the case of addi, after just those three fields are placed into the instruction, there are only 16 bits left for the immediate value you’re adding!
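The packing can be sketched in a few lines of Python. The sketch assumes the architecture's field order for addi from most to least significant bit (6-bit opcode 14, 5-bit destination register, 5-bit source register, 16-bit immediate); the function name encode_addi is hypothetical.

```python
ADDI_OPCODE = 14  # primary opcode of addi


def encode_addi(rt, ra, imm):
    """Pack addi rt,ra,imm into a single 32-bit instruction word."""
    if not (0 <= rt < 32 and 0 <= ra < 32):
        raise ValueError("register numbers must be in 0..31 (5 bits each)")
    if not -0x8000 <= imm <= 0x7FFF:
        raise ValueError("immediate does not fit in 16 signed bits")
    return (ADDI_OPCODE << 26) | (rt << 21) | (ra << 16) | (imm & 0xFFFF)


# li 3,1 is assembled as addi 3,0,1.
word = encode_addi(3, 0, 1)
print("0x%08x" % word)  # 0x38600001
```

Trying encode_addi(4, 0, 0x10010000) raises ValueError: a full 32-bit address simply has nowhere to go in a 32-bit instruction that also has to hold an opcode and two register numbers.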
Because only 16 bits are left for the immediate, li can load only 16-bit immediates. You cannot load a 32-bit pointer into a GPR with just one instruction. You must use two instructions, loading first the top 16 bits and then the bottom. That is exactly the purpose of the @ha (“high”) and @l (“low”) suffixes. (The “a” part of @ha takes care of sign extension: addi treats its 16-bit immediate as signed, so when the top bit of the low half is set, @ha yields the high half plus one to compensate.) Conveniently, lis (meaning “load immediate shifted”) will load directly into the high 16 bits of the GPR. Then all that’s left to do is add in the lower bits.
This trick must be used whenever you load an absolute address (or any 32-bit immediate value). The most common use is in referencing globals.
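The arithmetic behind the suffixes can be checked with a short Python model. It assumes 32-bit registers; the helper names lo, hi, and ha stand for the @l, @h, and @ha operators, and the address used is an arbitrary example chosen so that the top bit of its low half is set.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def lo(addr):
    """@l: the low 16 bits."""
    return addr & 0xFFFF


def hi(addr):
    """@h: the high 16 bits, unadjusted."""
    return (addr >> 16) & 0xFFFF


def ha(addr):
    """@ha: the high 16 bits, adjusted for a sign-extended low half."""
    return ((addr + 0x8000) >> 16) & 0xFFFF


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def load_with_addi(addr, high_part):
    """Model of lis 4,<high_part> followed by addi 4,4,msg@l."""
    reg = (high_part << 16) & MASK32                  # lis
    return (reg + sign_extend16(lo(addr))) & MASK32   # addi sign-extends


def load_with_ori(addr):
    """Model of lis 4,msg@h followed by ori 4,4,msg@l."""
    reg = (hi(addr) << 16) & MASK32                   # lis
    return reg | lo(addr)                             # ori does not sign-extend


addr = 0x10018004  # arbitrary example address; low half is 0x8004
print(hex(load_with_addi(addr, ha(addr))))  # 0x10018004 (correct)
print(hex(load_with_addi(addr, hi(addr))))  # 0x10008004 (off by 0x10000)
print(hex(load_with_ori(addr)))             # 0x10018004 (correct)
```

The middle line shows what goes wrong if @h is paired with addi: the low half is treated as negative, so the result comes out 0x10000 too small unless the high half was adjusted in advance.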
Hello World — PPC64 assembly
Listing 4 is almost identical to the 32-bit PowerPC example (Listing 2) above. PowerPC was designed as a 64-bit specification with 32-bit implementations, and not only that, PowerPC user-level programs are more or less binary-compatible across those implementations. Under Linux, ppc32 binaries run perfectly well on 64-bit hardware (with a little munging here and there for variable types visible to both 32-bit userland and the 64-bit kernel).
Listing 4. PPC64 assembly
.data                       # section declaration - variables only
msg:
        .string "Hello, world!\n"
        len = . - msg       # length of our dear string
.text                       # section declaration - begin code
        .global _start
        .section        ".opd","aw"
        .align 3
_start:
        .quad   ._start,.TOC.@tocbase,0
        .previous
        .global  ._start
._start:
# write our string to stdout
        li      0,4         # syscall number (sys_write)
        li      3,1         # first argument: file descriptor (stdout)
                            # second argument: pointer to message to write
        # load the address of 'msg':
                            # load high word into the low word of r4:
        lis     4,msg@highest   # load msg bits 48-63 into r4 bits 16-31
        ori     4,4,msg@higher  # load msg bits 32-47 into r4 bits  0-15
        rldicr  4,4,32,31       # rotate r4's low word into r4's high word
                            # load low word into the low word of r4:
        oris    4,4,msg@h       # load msg bits 16-31 into r4 bits 16-31
        ori     4,4,msg@l       # load msg bits  0-15 into r4 bits  0-15
        # done loading the address of 'msg'
        li      5,len       # third argument: message length
        sc                  # call kernel
# and exit
        li      0,1         # syscall number (sys_exit)
        li      3,1         # first argument: exit code
        sc                  # call kernel
There are only two differences between the ppc32 code (Listing 2) and the ppc64 code (Listing 4). The first is the way we load pointers, and the second is those assembler directives about an .opd section. It’s worth pointing out that the ppc32 code works perfectly under ppc64 Linux when compiled as a ppc32 binary.
On ppc32 it took two instructions to load a 32-bit immediate value into a register. On ppc64 it takes five to load a 64-bit one. Why?
We still have 32-bit fixed-length instructions, which can only load 16 bits worth of immediate value at a time. Right there you need a minimum of four instructions (64 bits / 16 bits per instruction = 4 instructions). But there are no instructions that can load directly into the high word of a 64-bit GPR. So we have to load up the low word, shift it to the high word, then load the low word again.
The rotate instructions (like the rldicr seen here) are notoriously complicated, and have jokingly been called Turing-complete. If all you need to do is load 64-bit immediate values, don’t worry about it; just convert these five instructions into a macro and never think about it again.
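For readers who do want to see what the five instructions accomplish, the following Python sketch steps through them on a 64-bit value. It is a model under stated assumptions: registers are 64 bits wide, rldicr is modeled as "rotate left, then keep only the most significant bits up to the given mask end", and the helper names field, rotl64, and load_address64 are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumption: 64-bit registers


def field(addr, shift):
    """Extract one 16-bit slice of addr (@l, @h, @higher, @highest)."""
    return (addr >> shift) & 0xFFFF


def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def rotl64(value, n):
    """Rotate a 64-bit value left by n bits."""
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64


def rldicr(value, sh, me):
    """Rotate left by sh, then clear every bit to the right of bit me.

    Bits are numbered the PowerPC way: bit 0 is the most significant.
    """
    mask = (MASK64 << (63 - me)) & MASK64
    return rotl64(value, sh) & mask


def load_address64(addr):
    r4 = (sign_extend16(field(addr, 48)) << 16) & MASK64  # lis 4,msg@highest
    r4 |= field(addr, 32)                                  # ori 4,4,msg@higher
    r4 = rldicr(r4, 32, 31)                                # rldicr 4,4,32,31
    r4 |= field(addr, 16) << 16                            # oris 4,4,msg@h
    r4 |= field(addr, 0)                                   # ori 4,4,msg@l
    return r4


print(hex(load_address64(0x0123456789ABCDEF)))  # 0x123456789abcdef
```

The rldicr step does two jobs at once: it moves the assembled low word into the high word, and its mask discards whatever the rotation wrapped around into the low word (including any sign-extension bits left over from lis).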
One last note: we used @h here instead of the @ha used in the ppc32 example because we then use ori rather than addi to supply the low 16 bits. Unlike addi, ori does not sign-extend its immediate, so the high half needs no adjustment. On RISC machines it’s frequently possible to accomplish something in many different ways.
Function descriptors — the .opd section
Under ppc64 Linux, when you define and call a C function foo, the symbol foo is not actually the address of the function’s code. In assembly, if you try to bl foo, you will quickly find your program crashing. The label foo is actually the address of foo’s function descriptor. Function descriptors are described in detail in the ppc64 ELF ABI (see Related topics), but very briefly: you must have a function descriptor (which is simply a structure containing three pointers) if your assembly will be called from C code, because the compiler expects it.
We don’t have any C code here, but the ELF ABI also says that the ELF file’s entry point (_start by default) points to a function descriptor. So we must have one, and that is what goes into the .opd section.
Those assembler directives were copied almost directly from the output of gcc -S. This is another excellent candidate for a preprocessor macro in your assembly code.
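To picture what the .quad line in Listing 4 places in the .opd section, the Python sketch below packs three 64-bit values the same way. It assumes a big-endian target, and the two addresses are invented for the example; the field names (entry point, TOC base, and a third pointer that Listing 4 sets to 0) are descriptive labels rather than official identifiers.

```python
import struct

DESCRIPTOR_FORMAT = ">QQQ"  # assumption: three big-endian 64-bit pointers


def pack_function_descriptor(entry, toc_base, environment=0):
    """Build the bytes of a three-pointer function descriptor."""
    return struct.pack(DESCRIPTOR_FORMAT, entry, toc_base, environment)


# Hypothetical addresses standing in for ._start and .TOC.@tocbase.
CODE_ADDRESS = 0x0000000010000100
TOC_ADDRESS = 0x0000000010018000

descriptor = pack_function_descriptor(CODE_ADDRESS, TOC_ADDRESS)
entry, toc_base, environment = struct.unpack(DESCRIPTOR_FORMAT, descriptor)

print(len(descriptor))          # 24
print(hex(entry), hex(toc_base), environment)
```

In this picture the symbol _start names the 24-byte record, while ._start names the code that the first pointer refers to, which is why branching to the descriptor's own address does not work.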
Where to learn more
For those of you interested in learning more about PowerPC, you can start by compiling tiny programs with gcc -S, provided that you have a PowerPC box handy. If you do not, check out the PPC cross-compiling mini-howto, as well as the other sites and documents listed in Related topics. Also try experimenting with gdb’s psim (PowerPC simulator) target. It’s easier than you may think!