Go arm64 Function Call Assembly
I am currently working on implementing frame pointer unwinding for the Go execution tracer. This involves debugging various problems and crashes caught by TestTraceSymbolize on arm64.
One of these crashes seems to be caused by a goroutine overwriting the frame pointer on the stack of another goroutine. To debug this, I'm trying to study the assembly output of the compiler. (Un)fortunately I don't read or write Go assembly every day, so this process usually involves chaotic jumping between the following resources:
- A Quick Guide to Go's Assembler (Go's Assembly Language is quite quirky)
- Go ARM64 Assembly Instructions Reference Manual (arm64 specific quirks)
- Go internal ABI specification (Go's register based calling convention)
- Arm Architecture Reference Manual (11,952-page PDF)
- Arm A64 Instruction Set Architecture (HTML subset of the above)
- ChatGPT (with a very healthy amount of mistrust, it's often wrong)
- Various Google Search Results
To make this process a little easier for my future self, I decided it's time to create some high quality notes for my own needs. In particular, I'll try to explain every assembly instruction the Go compiler (go1.20 darwin/arm64) emits for the following code in great detail:
//go:noinline
func foo() { bar() }
//go:noinline
func bar() {}
To get the assembly, I use the steps shown below.
$ go build main.go
$ go tool objdump -gnu -s 'main.(foo|bar)' ./main
The -s 'main.(foo|bar)' flag filters the output down to the two functions we are interested in, and the -gnu flag adds GNU assembly comments which make it easier to look up instructions in the official Arm documentation. Normally I'd also include the -S flag to see each line of source code above the instructions it generated, but that's not needed here.
After some manual trimming, the objdump output looks like this:
TEXT main.foo(SB) /Users/felixge/Desktop/main.go
MOVD 16(R28), R16 // ldr x16, [x28,#16]
CMP R16, RSP // cmp sp, x16
BLS 8(PC) // b.ls .+0x20
MOVD.W R30, -16(RSP) // str x30, [sp,#-16]!
MOVD R29, -8(RSP) // stur x29, [sp,#-8]
SUB $8, RSP, R29 // sub x29, sp, #0x8
CALL main.bar(SB) // bl .+0x28
LDP -8(RSP), (R29, R30) // ldp x29, x30, [sp,#-8]
ADD $16, RSP, RSP // add sp, sp, #0x10
RET // ret
MOVD R30, R3 // mov x3, x30
CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54
JMP main.foo(SB) // b .+0xffffffffffffffd0
?
?
?
TEXT main.bar(SB) /Users/felixge/Desktop/main.go
RET // ret
?
?
?
Before we dig into this output, I want to mention that I'm not using Compiler Explorer for this, because its output seems to be missing important instructions for some reason. I'm also not using go tool compile -S main.go or go build -gcflags=-S ./main.go because those output additional pseudo-instructions that are used by the compiler and linker, but don't remain in the final assembly.
Ok, let's dig in!
Prologue 1: Check if the stack needs growing
MOVD 16(R28), R16 // ldr x16, [x28,#16]
CMP R16, RSP // cmp sp, x16
BLS 8(PC) // b.ls .+0x20
- Load the value of g.stackguard0 into the R16 register.
- Compare the value of R16 with the value of the stack pointer RSP.
- If RSP <= R16, jump 8 instructions forward to grow the stack.
In Detail
- MOVD, aka LDR (immediate), loads the data from the memory address [R28+16] into R16. The R28 register points to the current goroutine, and the stackguard0 field is found at offset 16 in the g struct because the stack field above it is 16 bytes wide.
- CMP, aka CMP (extended register), compares R16 with RSP and stores the results in the condition flags.
- BLS, aka B.cond, checks if the condition flags match the condition code ls (lower or same). If yes, the CPU is instructed to jump 8 instructions (0x20 = 32 bytes) forward. We'll cover those instructions in the Prologue 3: Growing the stack section. However, usually this is not the case, and execution continues with the instructions below.
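To double check the offset and the branch condition, here is a minimal Go sketch. The stack and g types are hypothetical mirrors of only the first two fields of the runtime's g struct (uint64 stands in for the 8-byte uintptr on arm64), and needsMoreStack and stackguard0Offset are illustrative names, not runtime functions.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stack is a hypothetical mirror of the runtime's stack struct:
// two 8-byte bounds, 16 bytes in total.
type stack struct {
	lo uint64
	hi uint64
}

// g mirrors only the first two fields of the runtime's g struct.
type g struct {
	stack       stack
	stackguard0 uint64
}

// stackguard0Offset returns the byte offset of stackguard0 inside g,
// which is the 16 in MOVD 16(R28), R16.
func stackguard0Offset() uintptr {
	return unsafe.Offsetof(g{}.stackguard0)
}

// needsMoreStack models CMP followed by BLS: the branch to the
// stack growing code is taken when sp <= stackguard0 (unsigned).
func needsMoreStack(sp, stackguard0 uint64) bool {
	return sp <= stackguard0
}

func main() {
	fmt.Println("offset of stackguard0:", stackguard0Offset())
	// The addresses below are made up for illustration.
	fmt.Println("sp above guard:", needsMoreStack(0x2000, 0x1000))
	fmt.Println("sp at guard:", needsMoreStack(0x1000, 0x1000))
}
```

The comparison is unsigned, which matches the ls condition code because both operands are addresses.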
Prologue 2: Setup foo's frame
MOVD.W R30, -16(RSP) // str x30, [sp,#-16]!
MOVD R29, -8(RSP) // stur x29, [sp,#-8]
SUB $8, RSP, R29 // sub x29, sp, #0x8
- Store the return address from the link register R30 at 16 bytes below the stack pointer register RSP and update RSP to that location.
- Store the frame pointer register R29 of the caller 8 bytes below the stack pointer.
- Set the frame pointer register R29 to point to the previous frame pointer we just pushed onto the stack.
In Detail
- MOVD.W, aka STR (immediate), stores the value of the R30 register at the memory address of [RSP-16]. R30 is the link register which holds the return address of the caller. The operation is pre-indexed, which means that the memory address is computed before the memory is accessed and that the RSP pointer is updated to the computed address afterwards (i.e. RSP = RSP - 16). Technically this frame only uses 8 bytes above RSP (for storing the return address), but the architecture requires the stack pointer to be 16-byte aligned, so 8 bytes of memory are wasted here.
- MOVD, aka STUR, stores the value of the R29 register at the memory address of [RSP-8]. R29 is the frame pointer register that is holding the caller's frame pointer.
- SUB, aka SUB (immediate), subtracts 8 from the stack pointer register RSP and stores the result in the frame pointer register R29. In other words, it points the frame pointer register to the caller's frame pointer that was just pushed onto the stack by the previous operation.
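The resulting stack layout is easier to see in a toy simulation. The sketch below is not how the hardware or the runtime is implemented: the machine type, the map-based memory, and all addresses are assumptions made up for illustration.

```go
package main

import "fmt"

// machine is a toy model of the three registers and the stack memory
// touched by the prologue. mem maps an address to an 8-byte word.
type machine struct {
	sp, fp, lr uint64
	mem        map[uint64]uint64
}

// prologue mirrors the three instructions that set up foo's frame.
func (m *machine) prologue() {
	// MOVD.W R30, -16(RSP): pre-indexed store. The address RSP-16 is
	// computed, the return address is stored there, and RSP is updated.
	m.sp -= 16
	m.mem[m.sp] = m.lr
	// MOVD R29, -8(RSP): save the caller's frame pointer below RSP.
	m.mem[m.sp-8] = m.fp
	// SUB $8, RSP, R29: point R29 at the saved frame pointer.
	m.fp = m.sp - 8
}

func main() {
	// Made-up register values at the moment foo is entered.
	m := &machine{sp: 0x1000, fp: 0x1100, lr: 0x4242, mem: map[uint64]uint64{}}
	m.prologue()
	fmt.Printf("sp=%#x fp=%#x\n", m.sp, m.fp)
	fmt.Printf("[fp]=%#x (caller's frame pointer)\n", m.mem[m.fp])
	fmt.Printf("[fp+8]=%#x (return address)\n", m.mem[m.fp+8])
}
```

After the prologue, [fp] holds the caller's frame pointer and [fp+8] holds the return address, which is the pair a frame pointer unwinder reads for each frame.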
Body: Call bar
CALL main.bar(SB) // bl .+0x28
- Call main.bar.
In Detail
- CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the main.bar function which is located 10 instructions (0x28 = 40 bytes) below this instruction.
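As a sanity check of the numbers, the effect of BL can be written down as a tiny Go function. The bl helper is hypothetical, and the address 0x1018 is made up: it assumes main.foo starts at 0x1000, which puts the CALL (the seventh instruction) 24 bytes in.

```go
package main

import "fmt"

// instrSize is the fixed size of an A64 instruction in bytes.
const instrSize = 4

// bl models a BL instruction located at address pc with a signed,
// PC-relative byte offset. It returns the new link register value
// and the branch target.
func bl(pc uint64, offset int64) (lr, target uint64) {
	// The return address is the instruction following the BL.
	lr = pc + instrSize
	// The target is relative to the address of the BL itself.
	target = uint64(int64(pc) + offset)
	return lr, target
}

func main() {
	lr, target := bl(0x1018, 0x28)
	fmt.Printf("lr=%#x target=%#x\n", lr, target)
	fmt.Printf("distance in instructions: %d\n", 0x28/instrSize)
}
```

Under these assumptions the target is 0x1040, which is also 16-byte aligned, consistent with the padding that follows main.foo.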
Epilogue: Return from foo
LDP -8(RSP), (R29, R30) // ldp x29, x30, [sp,#-8]
ADD $16, RSP, RSP // add sp, sp, #0x10
RET // ret
- Restore the frame pointer register R29 and the link register R30 from the values stored 8 bytes below the stack pointer RSP.
- Restore the stack pointer RSP to its original value by adding 16 to it.
- Return to the caller.
In Detail
- LDP, aka LDP, loads 16 bytes from the memory address [RSP-8] into a pair of registers. The first 8 bytes of the data are loaded into the frame pointer register R29, and the second 8 bytes are loaded into the link register R30. This is needed because these registers are call-preserved, so we need to restore them to their original values. Remember: The link register R30 was overwritten by Body: Call bar and the frame pointer register R29 was overwritten in Prologue 2: Setup foo's frame.
- ADD, aka ADD (immediate), adds 16 to the stack pointer register RSP and stores the result in the stack pointer register RSP. This is needed because RSP is another call-preserved register that we overwrote in Prologue 2: Setup foo's frame.
- RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
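The pair load and the stack pointer adjustment can be sketched in Go as well. The epilogue function, the map-based memory, and the addresses are assumptions for illustration; the memory contents are simply what a frame setup like the one described earlier would have left behind.

```go
package main

import "fmt"

// epilogue models LDP -8(RSP), (R29, R30) followed by ADD $16, RSP, RSP.
// mem maps an address to an 8-byte word and stands in for stack memory.
func epilogue(sp uint64, mem map[uint64]uint64) (newSP, fp, lr uint64) {
	// The first 8 bytes at [RSP-8] go to the frame pointer R29.
	fp = mem[sp-8]
	// The second 8 bytes, at [RSP], go to the link register R30.
	lr = mem[sp]
	// Adding 16 releases the frame and restores the caller's RSP.
	newSP = sp + 16
	return newSP, fp, lr
}

func main() {
	// Made-up stack contents: the caller's frame pointer at RSP-8
	// and the return address at RSP.
	mem := map[uint64]uint64{0xfe8: 0x1100, 0xff0: 0x4242}
	sp, fp, lr := epilogue(0xff0, mem)
	fmt.Printf("sp=%#x fp=%#x lr=%#x\n", sp, fp, lr)
}
```

RET then branches to the restored lr value.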
Prologue 3: Growing the stack
MOVD R30, R3 // mov x3, x30
CALL runtime.morestack_noctxt.abi0(SB) // bl .+0xffffffffffffdd54
JMP main.foo(SB) // b .+0xffffffffffffffd0
?
?
?
This is the continuation of Prologue 1: Check if the stack needs growing.
- Store the value of the R30 link register in the R3 register.
- Call the runtime.morestack_noctxt function to grow the stack.
- Jump back to the beginning of the main.foo function.
- ? indicates zero padding.
In Detail
- MOVD, aka MOV (register), copies the value of the link register R30 to the R3 register. This is done to make it available to runtime.morestack_noctxt. As far as I can tell the value is only used to help with printing debug information, especially when a function is trying to split the stack when it's not supposed to.
- CALL, aka BL, implicitly adds 4 bytes (the size of an instruction) to the value of the current program counter and stores it in the link register R30. After that it jumps to the first instruction of the runtime.morestack_noctxt function which is located 2219 instructions (0xffffffffffffdd54 = -8876 bytes in two's complement) above this instruction. The called function usually grows the stack of the goroutine before it implicitly returns. Another use case is the preemption of goroutines.
- JMP, aka B, jumps to the first instruction of the current function (main.foo) which is located 12 instructions (0xffffffffffffffd0 = -48 bytes in two's complement) above this instruction. This causes the function to be executed again, now with a big enough stack to hold all of its values.
- The three ? each indicate 4 bytes of zero padding which are emitted by the compiler to 16-byte align all function entry points. I'm unable to find an official reference that mentions this value. But since it was provided by an Arm employee, and another Arm employee added a similar patch to gcc citing performance benefits, this alignment is probably a good idea.
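The large hexadecimal offsets in the GNU comments are negative numbers printed as unsigned 64-bit values. Here is a small Go sketch for converting them; decodeOffset is a hypothetical helper name, and the inputs are the four branch offsets from the objdump output in this article.

```go
package main

import "fmt"

// decodeOffset reinterprets an unsigned 64-bit branch offset, as printed
// by objdump, as a signed byte offset (two's complement) and converts it
// into a number of 4-byte instructions.
func decodeOffset(raw uint64) (bytes, instructions int64) {
	// Converting a uint64 variable to int64 keeps the bit pattern,
	// which is exactly the two's complement reinterpretation.
	bytes = int64(raw)
	return bytes, bytes / 4
}

func main() {
	offsets := []uint64{0x20, 0x28, 0xffffffffffffdd54, 0xffffffffffffffd0}
	for _, raw := range offsets {
		b, n := decodeOffset(raw)
		fmt.Printf("%#x = %d bytes = %d instructions\n", raw, b, n)
	}
}
```

Negative results mean the target lies above the branch instruction, positive results mean it lies below.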
Function bar
RET // ret
?
?
?
- Return to the caller.
- ? indicates zero padding.
In Detail
- RET, aka RET, returns to the caller. The implied jump location is the value of the link register R30.
- The three ? indicate zero padding as explained above.
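That both functions end with exactly three padding words follows from rounding their sizes up to a multiple of 16 bytes. A minimal sketch of that arithmetic, assuming 16-byte function alignment as observed in the output above (paddingWords is a hypothetical helper, not part of the toolchain):

```go
package main

import "fmt"

const (
	instrSize = 4  // bytes per A64 instruction
	funcAlign = 16 // function entry alignment observed in the objdump output
)

// paddingWords returns how many 4-byte padding words follow a function
// with the given number of instructions, so that the next function
// starts at a multiple of funcAlign.
func paddingWords(instructions int) int {
	size := instructions * instrSize
	// Round size up to the next multiple of funcAlign.
	aligned := (size + funcAlign - 1) / funcAlign * funcAlign
	return (aligned - size) / instrSize
}

func main() {
	fmt.Println("main.foo (13 instructions):", paddingWords(13))
	fmt.Println("main.bar (1 instruction):", paddingWords(1))
}
```

main.foo occupies 52 bytes and is padded to 64, while main.bar occupies 4 bytes and is padded to 16, so both need 12 bytes of padding.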
Final Thoughts
I only covered arm64 in this article because that's the architecture used by my laptop and perhaps also the future of the cloud. Writing a similar article for amd64 would be nice, but I'm not sure if I'll find the time.
Anyway, I hope that this information will be useful to others as well as to my future self. Please let me know if you spot any mistakes or have any questions!