Go arm64 Function Call Assembly
I am currently working on implementing frame pointer unwinding for the Go execution tracer. This involves debugging various problems and crashes caught by TestTraceSymbolize on arm64.
One of these crashes seems to be caused by a goroutine overwriting the frame pointer on the stack of another goroutine. To debug this, I'm trying to study the assembly output of the compiler. (Un)fortunately I don't read or write Go assembly every day, so this process usually involves chaotic jumping between the following resources:
- A Quick Guide to Go's Assembler (Go's Assembly Language is quite quirky)
- Go ARM64 Assembly Instructions Reference Manual (arm64 specific quirks)
- Go internal ABI specification (Go's register based calling convention)
- Arm Architecture Reference Manual (an 11,952-page PDF)
- Arm A64 Instruction Set Architecture (HTML subset of the above)
- ChatGPT (with a very healthy amount of mistrust, as it's often wrong)
- Various Google Search Results
To make this process a little easier for my future self, I decided it's time to create some high-quality notes for my own needs. In particular, I'll try to explain every assembly instruction the Go compiler (go1.20 darwin/arm64) emits for the following code in great detail:
//go:noinline
func foo() { bar() }
//go:noinline
func bar() {}
To get the assembly, I use the steps shown below.
$ go build main.go
$ go tool objdump -gnu -s 'main.(foo|bar)' ./main
The -s 'main.(foo|bar)' flag filters the output down to the two functions we are interested in, and the -gnu flag adds GNU assembly comments which make it easier to look up instructions in the official arm documentation. Normally I'd also include the -S flag to see each line of source code above the instructions it generated, but that's not needed here.
After some manual trimming, the objdump output looks like this:
TEXT main.foo(SB) /Users/felixge/Desktop/main.go
MOVD 16(R28), R16 // ldr x16, [x28,#16]
CMP R16, RSP // cmp sp, x16
BLS 8(PC) // b.ls .+0x20
MOVD.W R30, -16(RSP) // str x30, [sp,#-16]!
MOVD R29, -8(RSP) // stur x29, [sp,#-8]
SUB $8, RSP, R29 // sub x29, sp, #0x8
CALL main.bar(SB) // bl .+0x28
LDP -8(RSP), (R29, R30) // ldp x29, x30, [sp,#-8]
ADD $16, RSP, RSP // add sp, sp, #0x10
RET // ret
MOVD R30, R3 // mov x3, x30
CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54
JMP main.foo(SB) // b .+0xffffffffffffffd0
?
?
?
TEXT main.bar(SB) /Users/felixge/Desktop/main.go
RET // ret
?
?
?
Before we dig into this output I want to mention that I'm not using Compiler Explorer for this, because its output seems to be missing important instructions for some reason. I'm also not using go tool compile -S main.go or go build -gcflags=-S ./main.go because those output additional pseudo instructions that are used by the compiler and linker, but don't remain in the final assembly.
Ok, let's dig in!
Prologue 1: Check if the stack needs growing
MOVD 16(R28), R16 // ldr x16, [x28,#16]
CMP R16, RSP // cmp sp, x16
BLS 8(PC) // b.ls .+0x20
- Load the value of g.stackguard0 into the R16 register.
- Compare the value of R16 with the value of the stack pointer RSP.
- If RSP <= R16, jump 8 instructions forward to grow the stack.
In Detail
- MOVD, aka LDR (immediate), loads the data from the memory address [R28 + 16] into R16. The R28 register points to the current goroutine, and the stackguard0 field is found at offset 16 in the g struct because the stack field above it is 16 bytes wide.
- CMP, aka CMP (extended register), compares R16 with RSP and stores the result in the condition flags.
- BLS, aka B.cond, checks if the condition flags match the condition code ls (lower or same). If yes, the CPU is instructed to jump 8 instructions (0x20 = 32 bytes) forward. We'll cover those instructions in the Prologue 3: Growing the stack section. However, usually this is not the case, and execution continues with the instructions below.
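The check above can be modeled in a few lines of Go. This is only a sketch: the stack and g types below are simplified stand-ins that contain just the two leading fields mentioned in the text, and needsMoreStack and instructionsForOffset are hypothetical helper names, not runtime functions.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stack and g are simplified stand-ins for the runtime's types. Only the
// first two fields of g are modeled; the real struct has many more.
type stack struct {
	lo uintptr
	hi uintptr
}

type g struct {
	stack       stack
	stackguard0 uintptr
}

// needsMoreStack mirrors CMP R16, RSP followed by BLS: the branch is taken
// when sp is lower than or equal to the guard, compared as unsigned values.
func needsMoreStack(sp, stackguard0 uintptr) bool {
	return sp <= stackguard0
}

// instructionsForOffset converts a branch offset in bytes into a number of
// instructions. Every A64 instruction is 4 bytes wide.
func instructionsForOffset(byteOffset int64) int64 {
	return byteOffset / 4
}

func main() {
	// Two uintptr fields precede stackguard0, so this is 16 on 64-bit platforms.
	fmt.Println("offset of stackguard0:", unsafe.Offsetof(g{}.stackguard0))
	fmt.Println("branch distance:", instructionsForOffset(0x20), "instructions")
	fmt.Println("grow stack:", needsMoreStack(0x1000, 0x1000))
}
```

The addresses passed to needsMoreStack are made up for illustration; the point is only that an equal stack pointer and guard already takes the branch.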
Prologue 2: Setup foo's frame
MOVD.W R30, -16(RSP) // str x30, [sp,#-16]!
MOVD R29, -8(RSP) // stur x29, [sp,#-8]
SUB $8, RSP, R29 // sub x29, sp, #0x8
- Store the return address from the link register R30 at 16 bytes below the stack pointer register RSP and update RSP to that location.
- Store the frame pointer register R29 of the caller 8 bytes below the stack pointer.
- Set the frame pointer register R29 to point to the previous frame pointer we just pushed onto the stack.
In Detail
- MOVD.W, aka STR (immediate), stores the value of the R30 register at the memory address [RSP - 16]. R30 is the link register which holds the return address of the caller. The operation is pre-indexed, which means that the memory address is computed before the memory is accessed and that the RSP pointer is updated to the computed address afterwards (i.e. RSP = RSP - 16). Technically this frame only uses 8 bytes above RSP (for storing the return address), but the architecture requires the stack pointer to be 16-byte aligned, so 8 bytes of memory are wasted here.
- MOVD, aka STUR, stores the value of the R29 register at the memory address [RSP - 8]. R29 is the frame pointer register that is holding the caller's frame pointer.
- SUB, aka SUB (immediate), subtracts 8 from the stack pointer register RSP and stores the result in the frame pointer register R29. In other words, it points the frame pointer register to the caller's frame pointer that was just pushed onto the stack by the previous operation.
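To double check my reading of these three instructions, here is a toy model of them in Go. It is a sketch under assumptions: the machine type, its field names and the concrete addresses are invented for illustration, and memory is modeled as a map of 8-byte slots rather than real stack memory.

```go
package main

import "fmt"

// machine is a toy model of the three registers touched by the prologue
// plus a sparse view of stack memory (address -> 8-byte value).
type machine struct {
	sp, fp, lr uint64
	mem        map[uint64]uint64
}

// prologue applies the three instructions of foo's frame setup in order.
func (m *machine) prologue() {
	// MOVD.W R30, -16(RSP): store LR at RSP-16 and write that address
	// back to RSP (pre-indexed addressing).
	m.sp -= 16
	m.mem[m.sp] = m.lr
	// MOVD R29, -8(RSP): save the caller's frame pointer just below RSP.
	m.mem[m.sp-8] = m.fp
	// SUB $8, RSP, R29: point the frame pointer at the value saved above.
	m.fp = m.sp - 8
}

func main() {
	// All values are made up: a caller with RSP=0x1000, FP=0x1040 that
	// expects foo to return to 0x4000.
	m := machine{sp: 0x1000, fp: 0x1040, lr: 0x4000, mem: map[uint64]uint64{}}
	m.prologue()
	fmt.Printf("sp=%#x fp=%#x\n", m.sp, m.fp)
	fmt.Printf("[fp]=%#x [fp+8]=%#x\n", m.mem[m.fp], m.mem[m.fp+8])
}
```

After the prologue, the slot at [fp] holds the caller's frame pointer and the slot at [fp+8] holds the return address, which is the layout the rest of the function relies on.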
Body: Call bar
CALL main.bar(SB) // bl .+0x28
- Call main.bar.
In Detail
- CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the main.bar function which is located 10 instructions (0x28 = 40 bytes) below this instruction.
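The arithmetic behind BL is small enough to sketch in Go. The bl function below is a hypothetical helper, and the program counter values are offsets relative to the start of main.foo (counted from the listing at 4 bytes per instruction), not real addresses.

```go
package main

import "fmt"

// bl models the effect of a BL instruction located at pc with the given
// signed byte offset: the link register receives the address of the next
// instruction, and execution continues at pc+offset.
func bl(pc uint64, offset int64) (newPC, lr uint64) {
	return uint64(int64(pc) + offset), pc + 4
}

func main() {
	// CALL main.bar is the 7th instruction of main.foo, so it sits at byte
	// offset 6*4 = 0x18 from the start of the function.
	newPC, lr := bl(0x18, 0x28)
	fmt.Printf("jump to %#x, return to %#x\n", newPC, lr)
	fmt.Println("distance in instructions:", 0x28/4)
}
```

The jump target 0x40 is 64 bytes past the start of main.foo, which matches the 13 instructions plus 3 padding words in the listing, and the return address 0x1c is the LDP that follows the call.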
Epilogue: Return from foo
LDP -8(RSP), (R29, R30) // ldp x29, x30, [sp,#-8]
ADD $16, RSP, RSP // add sp, sp, #0x10
RET // ret
- Restore the frame pointer register R29 and the link register R30 from the values stored 8 bytes below the stack pointer RSP.
- Restore the stack pointer RSP to its original value by adding 16 to it.
- Return to the caller.
In Detail
- LDP, aka LDP, loads 16 bytes from the memory address [RSP - 8] into a pair of registers. The first 8 bytes of the data are loaded into the frame pointer register R29, and the second 8 bytes are loaded into the link register R30. This is needed because these registers are call-preserved, so we need to restore them to their original values. Remember: The link register R30 was overwritten by Body: Call bar and the frame pointer register R29 was overwritten in Prologue 2: Setup foo's frame.
- ADD, aka ADD (immediate), adds 16 to the stack pointer register RSP and stores the result in the stack pointer register RSP. This is needed because RSP is another call-preserved register that we overwrote in Prologue 2: Setup foo's frame.
- RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
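The pair that LDP reads (saved frame pointer first, return address 8 bytes above it) is the same layout that frame pointer unwinding walks. The following is a rough sketch of that walk over a fake stack, not the tracer's actual implementation: unwind is a hypothetical function, memory is a map of 8-byte slots, and all addresses are invented.

```go
package main

import "fmt"

// unwind follows the chain of saved frame pointers. For every frame the
// caller's frame pointer is read from [fp] and the return address from
// [fp+8]. The walk stops at a zero frame pointer or after max frames.
func unwind(mem map[uint64]uint64, fp uint64, max int) []uint64 {
	var pcs []uint64
	for fp != 0 && len(pcs) < max {
		pcs = append(pcs, mem[fp+8])
		fp = mem[fp]
	}
	return pcs
}

func main() {
	// A made-up stack with two frames. The outer frame saved a zero frame
	// pointer, which terminates the walk.
	mem := map[uint64]uint64{
		0xff0: 0,     // outer frame: saved frame pointer
		0xff8: 0x100, // outer frame: return address
		0xfd0: 0xff0, // inner frame: saved frame pointer
		0xfd8: 0x200, // inner frame: return address
	}
	for _, pc := range unwind(mem, 0xfd0, 8) {
		fmt.Printf("%#x\n", pc)
	}
}
```

The max parameter is there so that a corrupted chain, such as the overwritten frame pointer I'm chasing, cannot turn the walk into an endless loop.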
Prologue 3: Growing the stack
MOVD R30, R3 // mov x3, x30
CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54
JMP main.foo(SB) // b .+0xffffffffffffffd0
?
?
?
This is the continuation of Prologue 1: Check if the stack needs growing.
- Store the value of the R30 link register in the R3 register.
- Call the runtime.morestack_noctxt function to grow the stack.
- Jump back to the beginning of the main.foo function.
- ? indicates zero padding.
In Detail
- MOVD, aka MOV (register), copies the value of the link register R30 to the R3 register. This is done to make it available to runtime.morestack_noctxt. As far as I can tell the value is only used to help with printing debug information, especially when a function is trying to split the stack when it's not supposed to.
- CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the runtime.morestack_noctxt function which is located 2219 instructions (0xffffffffffffdd54 = -8876 bytes in two's complement) above this instruction. The called function usually grows the stack of the goroutine before it implicitly returns. Another use case is the preemption of goroutines.
- JMP, aka B, jumps to the first instruction of the current function (main.foo) which is located 12 instructions (0xffffffffffffffd0 = -48 bytes in two's complement) above this instruction. This causes the function to be executed again, now with a big enough stack to hold all of its values.
- The three ? each indicate 4 bytes of zero padding which are emitted by the compiler to 16-byte align all function entry points. I'm unable to find an official reference that mentions this value. But since it was provided by an arm employee, and another arm employee added a similar patch to gcc citing performance benefits, this alignment is probably a good idea.
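The negative offsets and the amount of padding can be verified with a bit of Go. This is just a sketch of the arithmetic; signedOffset and alignUp are hypothetical helper names, and the 52-byte size of main.foo is counted from the listing (13 instructions of 4 bytes each).

```go
package main

import "fmt"

// signedOffset reinterprets the 64-bit pattern printed by objdump as a
// two's complement signed byte offset.
func signedOffset(raw uint64) int64 {
	return int64(raw)
}

// alignUp rounds n up to the next multiple of align, which must be a
// power of two.
func alignUp(n, align uint64) uint64 {
	return (n + align - 1) &^ (align - 1)
}

func main() {
	for _, raw := range []uint64{0xffffffffffffdd54, 0xffffffffffffffd0} {
		off := signedOffset(raw)
		fmt.Printf("%#x = %d bytes = %d instructions\n", raw, off, off/4)
	}

	// main.foo has 13 instructions of 4 bytes each.
	size := uint64(13 * 4)
	padding := alignUp(size, 16) - size
	fmt.Printf("padding: %d bytes = %d words\n", padding, padding/4)
}
```

The same calculation for main.bar (a single 4-byte RET) also yields 12 bytes of padding, which explains why both functions are followed by three ? lines.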
Function bar
RET // ret
?
?
?
- Return to the caller.
- ? indicates zero padding.
In Detail
- RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
- The three ? indicate zero padding as explained above.
Final Thoughts
I only covered arm64 in this article because that's the architecture used by my laptop and perhaps also the future of the cloud. Writing a similar article for amd64 would be nice, but I'm not sure if I'll find the time.
Anyway, I hope that this information will be useful to others as well as to my future self. Please let me know if you spot any mistakes or have any questions!