runtime: LLGo-owned frame-pointer unwinder (Stage 5) by cpunion · Pull Request #2019 · xgo-dev/llgo

cpunion · 2026-07-03T04:29:20Z

Stage 5 of #2004, based on #2016: replace the shadow-stack instrumentation and libunwind with an LLGo-owned frame-pointer unwinder.

What changes

Every Go function keeps the frame-pointer chain ("frame-pointer"="non-leaf", verified stp x29,x30 / mov x29,sp prologues). Cost is the standard ~1-2% envelope — Go itself always keeps FP on arm64.
runtime.Caller/Callers/CallersFrames/Stack walk real stacks (fpCallers: [fp]/[fp+8] chain with stride/alignment guards), symbolized through #2016's prebuilt ftab + pcline labels. Return addresses resolve at pc-1 (Go's convention); the machine scheduler can place the next statement's label exactly on a return address, so raw-pc lookups mis-attributed frames.
Shadow-stack instrumentation is no longer emitted (LLGO_SHADOW_STACK=1 keeps the legacy emitters for one release). Tracked functions retain only noinline, no-tail-call and the data-only pcline records.
libunwind is gone: the clite stacktrace (unrecovered-panic dump, last-resort Callers fallback) walks the FP chain with dladdr names — same output format, no -lunwind.
Explicit FP pairing flag: the compiler emits a per-binary __llgo_fp_chain byte next to the funcinfo table recording ssa.Program.NeedsFramePointer(); fpUnwindAvailable trusts that declaration plus table presence, so a target that keeps tables but not the FP attribute can never take the physical walk by accident.
One symbolization rule: every path that resolves a function record refines it through a single refinePCSymbolLine helper (same-function statement record at pc, then pc-1) — FuncForPC, FileLine and CallersFrames cannot disagree on line attribution.
Collector fixes along the way: methods and anonymous functions were never trackable (method frames had no pcline labels; closures lost their innermost frame to tail-call optimization).
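
The [fp]/[fp+8] walk with stride and alignment guards described above can be sketched over simulated stack memory. This is a minimal illustration, not the PR's fpCallers: the fakeStack type, the walkFP name and the maxStride bound are assumptions made for the example, and a real unwinder reads the frame-pointer register and raw memory instead of a map.

```go
package main

import "fmt"

// fakeStack models stack memory so the walk can run without touching real
// registers. Keys are byte addresses, values are the 8-byte words stored there.
type fakeStack map[uint64]uint64

const (
	wordSize  = 8       // [fp] holds the caller's fp, [fp+8] the return address
	maxStride = 1 << 20 // hypothetical cap on the distance between two frames
)

// walkFP follows the saved-fp chain starting at fp and collects return
// addresses. It stops when a guard fails: a null or misaligned fp, an
// unreadable slot, a frame that does not move toward higher addresses, or
// one that jumps further than maxStride.
func walkFP(mem fakeStack, fp uint64, max int) []uint64 {
	var pcs []uint64
	for len(pcs) < max {
		if fp == 0 || fp%wordSize != 0 {
			break // alignment guard
		}
		next, okFP := mem[fp]
		ret, okRet := mem[fp+wordSize]
		if !okFP || !okRet || ret == 0 {
			break // unreadable slot or missing return address
		}
		pcs = append(pcs, ret)
		if next <= fp || next-fp > maxStride {
			break // stride guard: the stack grows down, so callers sit higher
		}
		fp = next
	}
	return pcs
}

func main() {
	mem := fakeStack{
		0x1000: 0x1040, 0x1008: 0xA004, // leaf frame
		0x1040: 0x1080, 0x1048: 0xB008, // its caller
		0x1080: 0, 0x1088: 0xC00C, // outermost frame, saved fp is zero
	}
	for _, pc := range walkFP(mem, 0x1000, 32) {
		fmt.Printf("%#x\n", pc)
	}
}
```

Each guard turns a corrupt or foreign frame into an early stop rather than a wild read, which is the behavior the "Foreign frames" scope cut relies on.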

Semantics: now Go-conformant, verified against gc

Physical stacks see every real frame — the shadow stack only ever recorded instrumented ones. Skip counts in the statement-line probes moved to gc ground truth (each verified by running the same chain shapes under go): interface chain MARK at skip 3 (was 2), closure chain at skip 4 (was 3). Adjacent runtime.Stack calls report their own lines. Mid-body pcs, FuncForPC(pc-1) and FileLine(pc-1) agree with Frames.
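
Why pc-1 matters can be shown with a small nearest-below lookup. The sketch below is illustrative only: the site type and both function names are hypothetical, and real records also carry function and file information.

```go
package main

import (
	"fmt"
	"sort"
)

// site is a hypothetical statement record: the first pc of a statement and
// its source line.
type site struct {
	pc   uint64
	line int
}

// lineAt returns the line of the nearest record at or below pc, or 0 when
// pc precedes every record. sites must be sorted by pc.
func lineAt(sites []site, pc uint64) int {
	i := sort.Search(len(sites), func(i int) bool { return sites[i].pc > pc })
	if i == 0 {
		return 0
	}
	return sites[i-1].line
}

// lineForReturnAddr applies the pc-1 convention: a return address points
// just past the call instruction, so the call itself ends at ret-1.
func lineForReturnAddr(sites []site, ret uint64) int {
	return lineAt(sites, ret-1)
}

func main() {
	sites := []site{{0x100, 10}, {0x108, 11}, {0x110, 12}}
	ret := uint64(0x110) // the call on line 11 returns exactly where line 12 starts
	fmt.Println("raw pc:", lineAt(sites, ret))
	fmt.Println("pc-1:  ", lineForReturnAddr(sites, ret))
}
```

With the raw return address the lookup lands on the following statement (line 12); stepping back one byte attributes the frame to the call site (line 11).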

Performance

All numbers best/trimmed-avg, 5 runs (scale rows 3), 24×24=576-target scenarios, ±LTO. s5 = this PR.

Hot paths — macOS / Linux (ns)

metric	go1.26	#2016	s5	s5+lto
hot.Caller0	155 / 241	55 / 66	17 / 37	14 / 29
hot.Caller1	173 / 246	86 / 111	17 / 39	14 / 31
hot.CallersOnly	129 / 174	131 / 153	31 / 56	28 / 47
hot.CallersFramesFirst	270 / 626	455 / 544	145 / 554	121 / 406
hot.FuncForPCEntry	13 / 17	9 / 9	2 / 2	1 / 1
hot.FuncFileLineEntry	10 / 14	10 / 11	3 / 4	1 / 2
stdlib.Caller0	146 / 141	75 / 109	15 / 26	12 / 21

Deep stacks — the Stage 5 headline (macOS / Linux)

metric	go1.26	#2016	s5
deep.Direct32	791ns / 937ns	1.3µs / 1.7µs	44ns / 115ns
deep.CallersFramesAll (32)	2.1µs / 2.4µs	3.2µs / 4.0µs	423ns / ~1µs
deep.Direct512	9.7µs / 9.8µs	87µs / 95µs	2.8µs / 2.9µs
deep.CallersFramesAll (512)	11.5µs / 12.2µs	105µs / 111µs	3.7µs / 11.3µs

The ~160ns/frame shadow-stack tax is gone: 512-deep chains with a Caller at the bottom are now 3.4× faster than Go, and instrumented-graph big bodies dropped 24×:

bigfunc 16×2000 (macOS)	go1.26	#2016	s5	s5+lto
bigfunc.Work	30.0µs	433µs	18.1µs	17.6µs
bigfunc.FuncForPCMid	63ns	149ns	27ns	16ns
bigfunc.FileLineMid	86µs	176ns	57ns	27ns
bigfunc.CallersFramesMid	1.0µs	329ns	310ns	186ns

Cold paths (first use per process, macOS / Linux)

metric	go1.26	#2016	s5
cold.FirstCaller0	3.7µs / 3.2µs	500ns / 667ns	584ns / 1.3µs
cold.FirstFuncForPC	2.8µs / 1.2µs	2.3µs / 2.8µs	2.5µs / 1.7µs
cold.FirstCallersFrames	2.8µs / 1.2µs	1.3µs / 3.1µs	4.6µs / 4.0µs
cold.FirstFileLine	834ns / 292ns	83ns / 166ns	41ns / 41ns

Ordinary code and size

plain.* (fib/json/sort/map) is identical across #2012/#2016/s5 within noise on both platforms: code that never asks for caller info pays nothing. Binary sizes are unchanged (±0.1MiB across every scenario); Linux binaries no longer link -lunwind.

Known follow-ups (single cells, documented)

bigfunc.FirstFileLineMid (32k-site probe): 18.5µs vs 2016's 1.0µs — first-use interaction between the startup pcline warm and site-dense binaries; P4's prebuilt pcline table subsumes it.
stdlib.Work is unchanged (~23µs vs Go 6.2µs): the residual is noinline/no-tail-call on Caller-reachable functions plus the LLGo baseline, no longer unwind bookkeeping. Lifting noinline needs the P4 inline tree.
mac cold.FirstCallersFrames 4.6µs vs 2016's 1.3µs: first-walk memoization allocations; steady-state walks are 5–8× faster than 2016.

Validation

macOS: cl (462s), test/go, internal/build, ssa, LLDB suite 0 failed — all with shadow-stack emission off.
Linux (container): test/go green with -lunwind removed.
IR goldens updated (attributes #0 now carries the FP attribute; out.ll files carry no attribute groups and needed no regeneration).

Known scope cuts (documented)

Inlined frames: uninstrumented functions may still be inlined by LLVM and their frames elided (same as C); tracked functions keep noinline. The P4 inline tree lifts this.
Foreign frames: the walk stops at the first frame that breaks chain discipline (C code without frame pointers, e.g. x86-64 libs built with omission). arm64 C code keeps FP by default.

Depends on #2016 (branched from it; rebases after it merges).

🤖 Generated with Claude Code

Bring over the cross-branch runtime funcinfo benchmark (hot, deep, multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own performance numbers. cold.FirstCallersFrames now walks to the first fully symbolized frame, because synthetic runtime frames (LLGo's runtime.Callers placeholder) carry no file/line and the metric was silently skipped on LLGo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

macOS previously had no entry/stub/pcline site sections, so first-use funcinfo initialization fell back to one dlsym per function and per stub (13ms cold on a small binary, 27ms with LTO), and statement-level pc-line records did not exist at all. Emit the same site records on Mach-O:
- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the live_support attribute: under ld64/lld -dead_strip a live_support atom survives only if the atom it references (the anchor label inside the function body) is live, which matches the records-follow-function semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O subsections-via-symbols treats visible labels as atom boundaries, and a visible label in the middle of a function let the linker split and reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted with the \x01 verbatim-name prefix so LLVM does not prepend the Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the sections (and their boundary symbols) present even when no package contributed records.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First-use initialization:
- Skip the per-stub dlsym loop when the stub-site section provided the frames; each dlsym is a dynamic-loader query and the loop dominated cold latency.
- Materialize per-function strings and entry PCs once per function and packed file strings once per file ID during pcline table construction instead of once per site.
Cold FuncForPC fast path: before the frame table exists, resolve exact function-value PCs with a bounded linear scan of the raw entry-site and stub-site sections (compile-time data, no loader query), then one dladdr as fallback; both require an entry match within the warm path's slack so stripped-local misattribution is impossible. The path is budgeted: after a handful of cold lookups the sorted table amortizes better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to ~35µs on macOS.
Find index: subbucket deltas are now uint16 and the whole-index abandonment on delta overflow is gone. Go stores uint8 deltas because its linker guarantees a 16-byte minimum function size; LLGo indexes call-site records that sit a few bytes apart, and a dense 4KiB bucket silently degraded every lookup in the process to a full binary search. A delta counts deduplicated PCs inside one bucket, so it is bounded by the bucket size and uint16 cannot overflow.
Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built table (frame/bucket counts, index built or fallback, sites vs dlsym sources) so benchmarks can tell which path they measured.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Every Caller/Callers capture used to intern the frame into the synthetic table: a hash probe plus a full frame comparison per stack slot per call. Memoize the interned PC base in the shadow-stack slot and invalidate it when the recorded line changes (for one entry the instrumented name/file operands are constants, so the line is the only thing that varies between call sites). The three static frames emitted around every Callers walk get per-store memo slots, and the emit loop is unrolled so nothing escapes and skipped frames are never captured. macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns. hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…py limit
Frames.Next allocated a fresh *Func per symbolized frame; route it through the FuncForPC 4-way cache so repeated CallersFrames walks over the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns (338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns.
Also document a pre-existing limitation at the entry-site emitter: the body-embedded inline-asm record is duplicated by LTO inlining into every inline site (~4x section growth on multipkg) and registers host-function PCs under the inlinee's symbol ID. Runtime only consults the table when native symbolization fails, which bounds the impact; the fix (data globals with !associated metadata) needs LLVMGlobalSetMetadata in the llvm binding and lands with the link-phase ftab work.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Record the experiment results at the emitter: !associated only guides linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used pins dead functions through the records' address initializers; and noduplicate blocks inlining. Section dedup is link-phase work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Post-link table generation plan: parse the linked binary's metadata sections, dedup LTO inline copies against the symbol table, sort with a sentinel, build Go-layout findfunctab via internal/pclntab, and write back into a reserved section with ASLR-safe anchor offsets. Runtime adopts the prebuilt table when the header validates and keeps first-use construction as fallback. Includes the list of platform facts established in xgo-dev#2012 so implementation does not re-derive them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Only macOS ran tests with -coverprofile, so lines behind OS-specific branches (ELF emission, per-OS runtime shims) always showed as missed in codecov/patch even though the ubuntu job executed them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Covers the ELF and Mach-O directive branches, 32-bit pointer directives, quote-escaped symbol names and empty-table emission from one table-driven test, so single-platform coverage runs stop reporting the other platform's branches as dead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-07-03T04:48:38Z

Codecov Report

❌ Patch coverage is 89.33092% with 236 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
internal/build/funcinfo_table.go	87.73%	50 Missing and 29 partials ⚠️
internal/pclnpost/write.go	80.23%	21 Missing and 12 partials ⚠️
internal/build/funcinfo/funcinfo.go	87.60%	15 Missing and 14 partials ⚠️
internal/pclnpost/binary.go	86.79%	17 Missing and 11 partials ⚠️
internal/pclnpost/fixups.go	78.40%	18 Missing and 9 partials ⚠️
cl/instr.go	95.75%	13 Missing and 8 partials ⚠️
internal/build/build.go	78.57%	5 Missing and 4 partials ⚠️
internal/pclntab/pclntab.go	92.98%	2 Missing and 2 partials ⚠️
cl/compile.go	98.42%	1 Missing and 1 partial ⚠️
internal/pclnpost/pclnpost.go	91.30%	1 Missing and 1 partial ⚠️
... and 1 more

First stage of doc/design/pclntab-linkphase.md: parse a linked binary's funcinfo entry/stub sections (Mach-O and ELF), deduplicate LTO inline copies against the symbol table's text ranges, sort with a Go-style sentinel, and build findfunctab through internal/pclntab, the faithful port that has been waiting for exactly this caller. Read-only: prints what the P2 build integration would write back.
Measured on the 576-target multipkg binaries:
- non-LTO: 9319 records -> ftab 3161 + 207 buckets; lookup self-check 3160/3160; site sections 149KB -> 29KB (5.1x)
- LTO: 15371 entry records -> 13857 inline copies dropped, 4144 kept; self-check 3045/3045; 299KB -> 28.5KB (10.5x)
Findings for P2: on-disk Mach-O pointer slots hold dyld chained-fixup encodings (low 36 bits are the target; decoded here; the write-back design stores anchor-relative offsets and avoids pointers entirely), and some non-LTO stub symbols are absent from the symbol table (records conservatively dropped; needs tightening).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…adoption
pclnpost -write rewrites the entry-site section in place with the prebuilt table (header + ftab {entryOff,funcIndex} + runtime-layout findfunctab buckets), resolving funcinfo indexes through the binary's symbol-index section, and voids the stub section (its records are merged into the table). ASLR is handled by anchoring on the section's own link-time address; entries are normalized to true symbol starts, which retires the entry-PC slack on this path. macOS re-signs with an ad-hoc codesign after rewriting.
The runtime adopts the table zero-copy when the magic header validates: lookups binary-search the on-disk ftab directly through the shared bucket index, nothing is materialized on first use (the funcIndex -> entry map is built lazily and only for the pcline initializer), and the cold scan/dladdr path is skipped since adoption is cheap. First-use construction remains the fallback whenever the header is absent.
Linux end-to-end: entries=prebuilt, FuncForPC/FileLine correct, first-FuncForPC 110µs (materializing) -> 6-8µs (zero-copy); 13ms on the original macOS baseline.
Known gap: on macOS the on-disk rewrite is corrupted at load time because dyld still walks the stale chained-fixup chain over the section; fix (unlinking the section's nodes from the page chains in LC_DYLD_CHAINED_FIXUPS) is identified and next.
Non-prebuilt paths verified regression-free: cl + test/go suites pass, smoke behavior unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
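
Zero-copy adoption can be sketched as "validate a header, then binary-search the bytes in place". Everything about the layout below is an assumption made for illustration (an 8-byte magic, a uint32 row count, then little-endian {entryOff, funcIndex} rows); the real header fields and the findfunctab bucket index are not reproduced here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"sort"
)

const (
	magic      = "DEMOFTAB" // hypothetical magic, not the real header value
	headerSize = 16         // magic (8) + row count (4) + reserved (4)
	rowSize    = 8          // entryOff uint32 + funcIndex uint32
)

// table is a borrowed view over the section bytes; nothing is copied.
type table struct {
	rows []byte
	n    int
}

// adopt validates the header and returns a view, or an error that tells
// the caller to fall back to first-use construction.
func adopt(section []byte) (table, error) {
	if len(section) < headerSize || string(section[:8]) != magic {
		return table{}, errors.New("no prebuilt table")
	}
	n := binary.LittleEndian.Uint32(section[8:12])
	if uint64(len(section)-headerSize) < uint64(n)*rowSize {
		return table{}, errors.New("truncated table")
	}
	end := headerSize + int(n)*rowSize
	return table{rows: section[headerSize:end], n: int(n)}, nil
}

func (t table) entryOff(i int) uint32 {
	return binary.LittleEndian.Uint32(t.rows[i*rowSize:])
}

func (t table) funcIndex(i int) uint32 {
	return binary.LittleEndian.Uint32(t.rows[i*rowSize+4:])
}

// find returns the funcIndex of the nearest row at or below off.
func (t table) find(off uint32) (uint32, bool) {
	i := sort.Search(t.n, func(i int) bool { return t.entryOff(i) > off })
	if i == 0 {
		return 0, false
	}
	return t.funcIndex(i - 1), true
}

// build serializes rows (sorted by entryOff) into the demo layout.
func build(rows [][2]uint32) []byte {
	buf := make([]byte, headerSize+len(rows)*rowSize)
	copy(buf, magic)
	binary.LittleEndian.PutUint32(buf[8:], uint32(len(rows)))
	for i, r := range rows {
		binary.LittleEndian.PutUint32(buf[headerSize+i*rowSize:], r[0])
		binary.LittleEndian.PutUint32(buf[headerSize+i*rowSize+4:], r[1])
	}
	return buf
}

func main() {
	section := build([][2]uint32{{0x00, 7}, {0x40, 3}, {0x90, 9}})
	tab, err := adopt(section)
	if err != nil {
		fmt.Println("fallback:", err)
		return
	}
	fmt.Println(tab.find(0x8f))
}
```

Because lookups decode rows straight from the section, adoption costs only the header check; the fallback error path corresponds to keeping first-use construction when the header is absent.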

Every llgo-linked executable (linux/darwin, sites enabled) now gets the prebuilt ftab/findfunctab automatically: internal/build runs internal/pclnpost.Rewrite after linkMainPkg, and any failure degrades silently to the first-use construction fallback.
Moves the tool core into internal/pclnpost and hardens it:
- Canonical-record detection by FNV: a record survives when its anchor's owning symbol hashes to the record's symbolID (or is the __llgo_stub. wrapper of it). The previous one-per-symbolID rule wrongly collapsed a function with its stub (they share the target's symbolID by design), which broke exact-entry lookups (caught by TestRuntimeLineInfoAndStack on Linux). LTO inline copies are now identified exactly: 8.4k/9.5k copies removed in the LTO probes.
- Mach-O chained-fixups surgery: unlink the rewritten sections' pointer slots from the dyld page chains (repointing predecessors' next links and page_start entries) so dyld neither rebases slots inside the new table nor skips unrelated fixups after the zeroed stub section, then re-sign ad hoc. Without this the table was corrupted at load.
- LTO-safe metadata location: the entry section carries a meta record whose relocations hold the addresses of the symbol-index pointer and count globals; LTO internalization strips those names from the symbol table but relocations always resolve. Runtime skips the meta rows (pc==0 / symbolID==0).
- Idempotence guard (already-rewritten binaries are left alone).
Runtime fixes that surfaced during validation:
- materializePrebuiltEntries is now two-phase so concurrent losers wait for the winner's store instead of reading a nil entries slice.
- pcLineFrameForPC rejects nearest-below sites whose entry is unresolved when the caller knows the function entry, instead of leaking a neighboring function's file/line.
Validation: macOS cl (full) + test/go + LLDB 194/194; Linux test/go TestRuntime suite; probes on both platforms report entries=prebuilt with first-FuncForPC at 7-21µs (Linux) from 13ms on the original baseline, and LTO builds drop 8-9.5k inline copies.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
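
The canonical-record rule can be sketched as follows. The commit names FNV but not the variant, so 64-bit FNV-1a from hash/fnv is an assumption here, as are the symbol and record types and the helper names; only the __llgo_stub. prefix comes from the text.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// symbol is a text range from the symbol table; record is one site record.
type symbol struct {
	name       string
	start, end uint64 // half-open range [start, end)
}

type record struct {
	pc       uint64
	symbolID uint64
}

const stubPrefix = "__llgo_stub."

// symbolID hashes a name with 64-bit FNV-1a (an assumed variant).
func symbolID(name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return h.Sum64()
}

// owner finds the symbol whose range contains pc. syms must be sorted by
// start and must not overlap.
func owner(syms []symbol, pc uint64) (symbol, bool) {
	i := sort.Search(len(syms), func(i int) bool { return syms[i].end > pc })
	if i < len(syms) && syms[i].start <= pc {
		return syms[i], true
	}
	return symbol{}, false
}

// canonical keeps a record only when the symbol that physically contains
// its anchor is the function the record names, or that function's stub.
// An LTO inline copy sits inside some other function and is rejected.
func canonical(syms []symbol, r record) bool {
	s, ok := owner(syms, r.pc)
	if !ok {
		return false
	}
	return symbolID(strings.TrimPrefix(s.name, stubPrefix)) == r.symbolID
}

func main() {
	syms := []symbol{
		{"main.f", 0x100, 0x140},
		{"main.g", 0x140, 0x180},
		{"__llgo_stub.main.f", 0x180, 0x190},
	}
	id := symbolID("main.f")
	recs := []record{
		{0x100, id}, // the function itself
		{0x150, id}, // copy of main.f inlined into main.g
		{0x180, id}, // the stub wrapper, sharing main.f's ID
	}
	for _, r := range recs {
		fmt.Printf("%#x keep=%v\n", r.pc, canonical(syms, r))
	}
}
```

Checking ownership rather than keeping one record per symbolID is what lets a function and its stub both survive while inline copies are dropped.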

…table
On Mach-O, pointer slots that name exported functions (every __llgo_stub.* wrapper and any exported Go function) are emitted as chained-fixup BIND nodes, not rebases. The rewriter only decoded rebase nodes, so all stub records (and some entry records) were dropped as unowned and never reached the prebuilt ftab; FuncForPC on function values silently fell back to dladdr (~6µs per fresh pc on darwin).
- Parse the LC_DYLD_CHAINED_FIXUPS imports table and resolve bind ordinals to their in-image definitions.
- Match canonical owners against the record symbolID with underscore normalization (debug/macho's suffix-shared string table can surface one mangling underscore more or less than the source-level name).
- Splice the prebuilt header's base slot back into the fixup chain as a live rebase node: dyld writes the slid text base at load, so the runtime reads a ready runtime PC with no slide arithmetic (non-PIE ELF link-time values already equal runtime addresses).
- LLGO_PCLNPOST=0 escape hatch keeps first-use construction.
Fresh-pc FuncForPC slow path: darwin 6-8µs -> 1.2-1.7µs, linux 6.8µs -> 0.5µs; first-in-process lookup: darwin ~32µs -> ~14µs, linux ~6.8µs -> ~4µs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Pure-compute probes (recursive fib, JSON round-trip, sort.Ints, map churn) with no runtime introspection, so one harness run covers both the introspection extremes and what the funcinfo machinery costs code that never asks for it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Go's pclntab pages are touched by its own runtime (traceback, GC) long before user code queries it, so its first FuncForPC never pays page-in. Mirror that: when the prebuilt table is present, init adopts it (zero-copy, sub-µs), touches the pages the lookup path reads (blob, funcinfo records, string offsets, strings), runs one synthetic lookup to warm the code paths, and write-warms the FuncForPC cache pages. First-in-process FuncForPC: darwin ~17µs -> ~2.8µs, linux ~6.6µs -> ~1.0µs. Startup cost is page-count-bound (tens of µs on stdlib-sized tables, invisible next to ~3ms process startup; hello-world medians unchanged). Non-prebuilt binaries stay fully lazy: first-use construction allocates, which has no place in init, and programs that never introspect pay nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

-depths generates deep_<N> scenarios at configurable call depths; -bigsizes generates bigfunc scenarios (funcs x statements) whose large bodies stress statement-level pcline density, mid-function pc symbolization, and ordinary performance of big method bodies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Blob overflow: function-value stubs can double the row count, and at ~9k functions the prebuilt blob no longer fit the entry section, so the rewrite silently fell back to first-use construction (cold.FirstFuncForPC 96x96 non-LTO: 2.4ms). On overflow, retry with function entries only: stub pcs degrade to dladdr, real entries keep the prebuilt table.
- FuncForPC cache thrash: the set-associative pc cache holds 4k entries; batch workloads over 9k+ distinct functions evicted constantly and paid the string-materializing slow path on every call (multipkg.FuncForPCMany 96x96: 8-11ms vs Go 172µs). Add a per-ftab-row *Func cache for exact-entry lookups, so batch lookups are O(binary search) after the first pass at any scale.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…erflow
Function-value stubs can push the row count past what the entry section holds (~9k functions with taken addresses). Instead of dropping stub rows, write the full blob into the (larger) stub section and leave a 32-byte redirect header ("LLGOFTB2" + a live-relocation pointer) in the entry section; the runtime follows it and adopts the same zero-copy view. Function-value lookups keep the prebuilt table at any scale instead of degrading to dladdr.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

funcForPCSlow treated any unaligned pc as a shadow-stack synthetic marker. arm64 function entries are always 4-aligned so this never fired, but amd64 function and stub entries need not be: an unaligned function-value pc skipped the prebuilt exact-entry path entirely and fell through to nearest-below symbolization, reporting the previous function's name (test/go TestRuntimeLineInfoAndStack on ubuntu CI, "bad function value func: main.renamedPC"). Hoist the prebuilt exact-entry + per-row-cache lookup ahead of the alignment heuristic; a genuine synthetic pc just misses the cheap search and proceeds as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The overflow fallback dropped stub rows to fit the entry section. That leaves pc ranges the table claims to cover but does not: a function value whose stub falls in a gap resolves nearest-below to the previous function and silently returns the wrong name — exactly what ubuntu CI caught (amd64 --icf=safe layouts overflow by a few hundred bytes, and non-PIE ELF dladdr cannot rescue). If the blob fits neither the entry section nor the (larger) stub section, skip the rewrite entirely: first-use construction is slower but covers every record. Reproduced and verified on linux/amd64 (qemu): the stub pc had no exact row and nearest-below returned the neighbouring function's name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
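
The resulting placement policy (entry section, else stub section with a redirect, else no rewrite at all) can be written as one decision function. This is a sketch of the policy as described in the commit messages, not the rewriter's code: the names are hypothetical, and requiring the entry section to hold the 32-byte redirect header before spilling is an assumption.

```go
package main

import "fmt"

type placement int

const (
	inPlace     placement = iota // blob written into the entry section
	spillToStub                  // blob in the stub section, redirect header in the entry section
	skipRewrite                  // binary left untouched, first-use construction covers every record
)

const redirectHeaderSize = 32 // size of the redirect header named in the commit

// place never returns a partial table: a table with gaps would resolve a
// pc nearest-below to the wrong function, which is worse than being slow.
func place(blobSize, entryCap, stubCap int) placement {
	switch {
	case blobSize <= entryCap:
		return inPlace
	case blobSize <= stubCap && entryCap >= redirectHeaderSize:
		return spillToStub
	default:
		return skipRewrite
	}
}

func (p placement) String() string {
	return [...]string{"in-place", "spill-to-stub", "skip-rewrite"}[p]
}

func main() {
	fmt.Println(place(4096, 8192, 16384))
	fmt.Println(place(12000, 8192, 16384))
	fmt.Println(place(20000, 8192, 16384))
}
```

The key design point is the absence of a "drop some rows" branch: every outcome either covers all records or leaves the binary alone.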

…rgery
Fabricated fixtures make the IO paths testable in-process: a minimal ELF exercises load/Rewrite end to end (in-place, stub-section spill, and the overflow fallback that must leave the binary untouched), and a synthetic Mach-O image drives the chained-fixup chain surgery (remove+splice, empty-page insert, unconsumed-insert error). Package coverage 16% -> 69%.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A fabricated Mach-O (segments, sections, symtab, chained-fixup imports and an empty page chain) drives load, bind-target resolution, record decoding and both Rewrite outcomes (in-place and stub-section spill) end to end. codesign now runs only when the input carries LC_CODE_SIGNATURE: real lld executables always do, unsigned inputs need no signature and codesign rejects them. Also cover asmQuoteELFSymbol, the empty-table initializers and the Rewrite error paths. Package coverage: pclnpost 69% -> 86%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Every Go function on supported targets keeps the frame-pointer chain ("frame-pointer"="non-leaf", gated by Program.NeedsFramePointer to linux/darwin; on embedded targets the unwinder does not exist and the layout change perturbed the conservative GC on ESP32-C3). runtime.Caller, Callers, CallersFrames, Stack and the unrecovered-panic dump walk [fp]/[fp+w] directly and symbolize through the prebuilt ftab and pcline tables:
- Return addresses resolve at pc-1 (Go's convention); statement labels can land exactly on a return address, so raw-pc nearest-below reported the following line. The convention holds with or without the prebuilt table (text bounds fall back to the first-use frame table; link-phase overflow layouts otherwise silently disabled it, the root cause of the amd64 CI failures).
- The walk is bounded to the program's own text: libc frames without FP discipline decode as wild pcs that nearest-below would attribute to arbitrary functions.
- Methods and anonymous functions are now trackable (methods had no pcline labels; closures lost their innermost frame to tail-call optimization), and mid-function aligned pcs merge statement records instead of returning declaration lines.
- frameSymbol results are memoized per pc (deep re-walks paid a dladdr per frame: 32-frame walks 8µs -> 180ns) and the pcline table is built during the startup pre-warm (lazily building it inside the first Caller cost ~200µs at scale).
- Shadow-stack instrumentation is no longer emitted; LLGO_SHADOW_STACK=1 keeps the legacy emitters for one release. Tracked functions retain noinline, no-tail-call and the data-only pcline records.
- libunwind is gone: the clite stacktrace fallback walks the FP chain with dladdr names (same output format), and linux binaries no longer link -lunwind.
Semantics are gc ground truth, verified against go: physical stacks show every real frame; interface-chain Caller marks land at skip 3 and closure chains at skip 4 (the old expectations encoded shadow-stack frame loss).
Perf (best-of, mac/linux): hot.Caller0 17/37ns (Go 155/241), deep.Direct512 2.8µs (Go 9.7µs; was 87-95µs), bigfunc.Work 18µs (Go 30µs; was 433µs), binary size unchanged or smaller.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

IR goldens gain the frame-pointer attribute (out.ll files carry no attribute groups and needed no regeneration); the legacy shadow-stack emitter assertions opt into LLGO_SHADOW_STACK; statement-line probes move to gc ground-truth skip counts; NeedsFramePointer target matrix and pclnpost symbolAddr/decodePtr edges covered. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion and others added 30 commits July 2, 2026 03:27

build: add Go-style pclntab findfunc index

2870ba8

ssa: emit DCE-safe function metadata

f365b73

runtime: add line info for stack frames

e81a007

runtime: add statement line caller frames

1e15db2

runtime: compress funcinfo table

407e0dd

test: cover indirect runtime caller paths

a3115cd

cl: narrow runtime caller tracking

c6c128a

runtime: add compact pc-line funcinfo table

4692e15

runtime: optimize FuncForPC metadata lookup

b29906c

runtime: slim FuncForPC cache hot path

734368f

cl: make pc-line labels clone-safe

85b1c13

runtime: guard funcinfo table initialization

b33a774

runtime: fix funcinfo entry pc line metadata

8e6f33c

runtime: publish funcinfo records for live stubs

8eec8e0

runtime: skip ELF stub-site records during LTO

01913f4

runtime: reduce funcinfo lookup initialization cost

6fa875b

test: cover runtime caller metadata edges

34b274c

runtime: speed up funcinfo entry lookup

3b87c90

test: cover pcline metadata in dev lto coverage

012095b

runtime: speed up funcinfo hot paths

9d90419

runtime: avoid FuncForPC cache thrashing

1b8bc56

runtime: use Go-style funcinfo find index

cf33354

runtime: use static funcinfo symbol index

bb1b1a3

cpunion and others added 3 commits July 3, 2026 10:12

ci: exclude dev tools and test scaffolding from coverage

83edf88

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion mentioned this pull request Jul 3, 2026

Proposal: runtime funcinfo metadata and fast FuncForPC lookup #2004

Open

cpunion and others added 14 commits July 3, 2026 13:40

doc: record P3 findings in pclntab-linkphase design

7ab0a84

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

benchmark: keep full LTO from constant-folding the plain fib probe

b8078d4

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion force-pushed the codex/stage5-unwinder branch from 0ab4e62 to f01f644 Compare July 3, 2026 05:42

cpunion force-pushed the codex/stage5-unwinder branch 3 times, most recently from d70dd76 to db8a078 Compare July 3, 2026 14:31

cpunion mentioned this pull request Jul 3, 2026

runtime,cl,build: stage5 cleanup — explicit FP pairing flag and tidy-ups #2022

Closed

cpunion and others added 3 commits July 4, 2026 00:03

doc: stage5 invariants, diagnostic traps and merge queue

4925b82

cpunion force-pushed the codex/stage5-unwinder branch from db8a078 to 4925b82 Compare July 3, 2026 16:04

cpunion mentioned this pull request Jul 3, 2026

runtime,cl: Go-style panic tracebacks and exact log/slog/testing locations, with gc-verified acceptance suite #2023

Open

cpunion wants to merge 57 commits into xgo-dev:main from cpunion:codex/stage5-unwinder
