Proposal: runtime funcinfo metadata and fast FuncForPC lookup #2004

@cpunion

Proposal: Runtime Funcinfo Metadata and Fast Runtime Lookup

Goal

Give LLGo Go-compatible runtime introspection — runtime.Caller, Callers, CallersFrames, FuncForPC, Func.FileLine, runtime.Stack, panic tracebacks — without DWARF, llvm-symbolizer or libunwind on the primary path, with DCE safety, controlled binary size, and performance at or beyond Go.

Architecture (three independent layers)

  1. Metadata model (runtime: add Go-style funcinfo find index #2012, Stage 1) — Go-1.26-style compact function records + shared string pools + statement-level pcline records, emitted as DCE-safe sections: ELF SHF_LINK_ORDER (honored by --gc-sections), Mach-O live_support + one linker-private atom symbol per record (verified under ld64/lld -dead_strip, incl. LTO). findfunctab geometry follows Go (4KiB buckets × 16 subbuckets) with one deliberate deviation: uint16 subbucket deltas — LLGo has no 16-byte MINFUNC guarantee, so Go's uint8 could overflow; uint16 makes it structurally impossible. Runtime falls back to first-use table construction when the link-phase table is absent.
  2. Link-phase table (runtime: link-phase ftab/findfunctab generation (Stage 2, P1–P3) #2016, Stage 2 P1–P3): after linking, internal/pclnpost deduplicates records against the symbol table (FNV: a record is canonical iff its anchor's owning symbol hashes to its symbolID, or is that symbol's __llgo_stub. wrapper; LTO inline copies fail exactly this test), sorts, and rewrites the entry section in place with a prebuilt ftab+findfunctab that the runtime adopts zero-copy. Mach-O specifics proved out: chained-fixup chain surgery, BIND-node decoding through the imports table (exported-function anchors are binds, not rebases), the header base slot spliced back as a live rebase so dyld writes the slid address, and ad-hoc re-signing only when the input was signed. Overflow spills the blob into the larger stub section behind a LLGOFTB2 redirect; rows are never dropped (a gappy table silently mis-names function-value pcs). Startup pre-warm (adopt + page-touch + one synthetic lookup) mirrors Go, whose pclntab is warm by main.
  3. Physical unwinder (runtime: LLGo-owned frame-pointer unwinder (Stage 5) #2019, Stage 5): every Go function on supported targets keeps the frame-pointer chain ("frame-pointer"="non-leaf"); Caller/Callers/CallersFrames/Stack and the unrecovered-panic dump walk [fp]/[fp+w] directly (w = word size), symbolized through layer 2's tables with Go's pc-1 return-address convention. The walk is bounded to the program's own text (libc frames without FP discipline decode as wild pcs). Shadow-stack instrumentation is no longer emitted (LLGO_SHADOW_STACK=1 keeps the legacy emitters for one release); Linux binaries no longer link -lunwind (the clite fallback walks FP with dladdr names). Semantics follow gc as ground truth: physical stacks show every real frame, and skip counts are verified against Go (interface chain MARK at skip 3, closure at 4). Two hardening rules: the compiler records its FP decision in a per-binary __llgo_fp_chain byte the runtime trusts (no inference from table presence), and every lookup path refines function records through one refinePCSymbolLine helper (statement record at pc, then pc-1) so FuncForPC/FileLine/CallersFrames cannot disagree on line attribution, the class of bug behind all six amd64 debug rounds.
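The findfunctab geometry from layer 1 (4KiB buckets, 16 subbuckets, uint16 deltas) can be sketched as follows. This is a minimal model under stated assumptions, not the code in #2012: the names `findBucket`, `buildFindTab` and `lookup` are hypothetical, function entries are given as sorted text offsets, and the first entry is assumed to sit at offset 0.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	bucketBytes    = 4096 // text bytes covered by one bucket
	subPerBucket   = 16   // subbuckets per bucket
	subbucketBytes = bucketBytes / subPerBucket
)

// findBucket holds the ftab index of the function covering the bucket start
// and, per subbucket, a uint16 delta from that index.
type findBucket struct {
	base uint32
	sub  [subPerBucket]uint16
}

// lastAtOrBefore returns the index of the last entry <= off (0 if none).
func lastAtOrBefore(entries []uint32, off uint32) uint32 {
	i := sort.Search(len(entries), func(i int) bool { return entries[i] > off })
	if i == 0 {
		return 0
	}
	return uint32(i - 1)
}

// buildFindTab precomputes one bucket per 4KiB of text.
func buildFindTab(entries []uint32, textSize uint32) []findBucket {
	tab := make([]findBucket, (textSize+bucketBytes-1)/bucketBytes)
	for b := range tab {
		start := uint32(b) * bucketBytes
		tab[b].base = lastAtOrBefore(entries, start)
		for s := 0; s < subPerBucket; s++ {
			idx := lastAtOrBefore(entries, start+uint32(s)*subbucketBytes)
			tab[b].sub[s] = uint16(idx - tab[b].base)
		}
	}
	return tab
}

// lookup jumps to the subbucket's starting index, then scans forward.
// pcOff must be smaller than the textSize passed to buildFindTab.
func lookup(entries []uint32, tab []findBucket, pcOff uint32) int {
	b := tab[pcOff/bucketBytes]
	idx := int(b.base) + int(b.sub[(pcOff%bucketBytes)/subbucketBytes])
	for idx+1 < len(entries) && entries[idx+1] <= pcOff {
		idx++
	}
	return idx
}

func main() {
	entries := []uint32{0, 16, 300, 4096, 4100, 9000}
	tab := buildFindTab(entries, 3*bucketBytes)
	for _, pc := range []uint32{20, 300, 4099, 8999, 9000} {
		fmt.Printf("pc offset %d -> function index %d\n", pc, lookup(entries, tab, pc))
	}
}
```

In this model, with distinct entry offsets at most 4096 functions can start inside one 4KiB bucket, so a delta always fits in a uint16, whereas a uint8 delta would need a minimum function size to stay in range.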

Stage / PR status

| Stage | Scope | Status |
| --- | --- | --- |
| 1 | Metadata model | #2012: all green (codecov 91.3%), merge-ready |
| 1.5 | ns monotonic clock | #2015: done, carried in #2012; close after it merges |
| 2 | Link-phase table | #2016: all green (codecov 88.9%), merge-ready |
| 3 | LTO policy | data collected; PR pending (main+LTO mac crash worth root-causing there) |
| 4 | Zero-copy names, prebuilt pcline, !pcsections, section shrink, stub removal | staged (ordered below) |
| 5 | FP unwinder | #2019: all green (48 checks incl. codecov), merge-ready; review cleanup folded in (#2022 closed) |

What each PR owns (with final numbers, best-of, mac/linux)

#2012 #2016 #2019
Contribution metadata + first-use tables + shadow stack prebuilt table, cold/scale, LTO integrity, pre-warm physical stacks, instrumentation & libunwind removal
hot.Caller0 54/65ns 55/66ns 17/37ns, +LTO 14/29
hot.CallersOnly 129/154ns 131/153ns 31/56ns, +LTO 28/47
hot.FuncForPC / FileLine (mac) 9/10ns 9/10ns 2/3ns, +LTO 1/1
deep.Direct512 87–95µs 87–95µs 2.8µs (Go 9.7µs)
deep.CallersFramesAll(512) 105–120µs 105–120µs 3.7µs (Go 12µs)
cold.FirstFuncForPC 40.7/13µs 3–5/2.9µs 2.5/1.5µs (Go 2.8/1.2µs)
bigfunc.Work (16×2000, mac) 433µs 433µs 18.1µs (Go 30µs)
bigfunc.FileLineMid 176ns (+LTO 49) 57ns, +LTO 27 (Go 86µs)
Binary size +18–29% vs main = #2012 #2016; mac all scenarios ≤ Go
Semantics instrumented frames only same gc-conformant full physical stacks

Reference bar: Go 1.26 hot.Caller0 155/241ns; per-fresh-pc FuncForPC 40–125ns (zero-copy pclntab views) vs LLGo 0.4–1.7µs (string materialization — the headline P4 item). plain.* (fib/json/sort/map) is identical across all variants: code that never introspects pays nothing. Scale sweeps (48×48/96×96 pkgs×methods, depth 128/512, big methods 32×200/16×2000) hardened #2016 (overflow spill, per-row Func cache — 96×96 batch lookups 8–19ms → 147µs +LTO, beating Go's 179µs) and quantified the shadow-stack tax (~160ns/frame) that #2019 removed.

Platform support (three gates: FP attr → walker → symbolization)

| Target | Status |
| --- | --- |
| darwin arm64/amd64, linux arm64/amd64 | green in CI; AAPCS64 mandates FP, attr forces it on x86-64 |
| embedded (ESP32…) / wasm | attr off by design (NeedsFramePointer): unwinder doesn't exist there (runtime gated !baremetal && !wasm); FP layout change also perturbed the conservative GC on ESP32-C3. Behavior = pre-Stage-5; LLGO_SHADOW_STACK=1 restores caller tracking; on wasm the shadow stack is the only possible mechanism |
| BSDs / android / ios | same ELF/Mach-O + SysV/AAPCS mechanics; expected to be a GOOS-whitelist change only |
| windows | needs COFF equivalents of the DCE-safe sections first (bigger than the unwinder); then either clang-kept RBP chains with the same walker, or SEH RtlVirtualUnwind. Currently attr off → graceful degradation |
| riscv64 / arm32 | known gap: frame layout differs (RISC-V: ra at fp−8, prev fp at fp−16), walker offsets are arm64/x86-64-specific; the FP gate should whitelist arm64/amd64 explicitly until per-arch offsets land |

Obstacle assessment for full support: BSD/android/ios — mechanical (GOOS whitelist). riscv/arm32 — per-arch walker offsets (~10 lines each; arm32 thumb/arm FP registers unify under clang's attr). Windows — bounded, path known: COFF IMAGE_COMDAT_SELECT_ASSOCIATIVE is the direct link-order analog, PE .reloc rewriting is simpler than the Mach-O chained-fixups we already handle, and clang keeps RBP chains on Win x64 so the same walker works (SEH optional); the real gate is LLGo's Windows runtime base, not funcinfo. wasm — physical unwinding is impossible by spec: full support means shadow stack as the permanent wasm backend (escape hatch exists; flip to default there), and wasm-ld section-GC semantics for the metadata need dedicated verification. Embedded — full support should be opt-in metadata (LLGO_FUNCINFO=0 exists) + P4d shrink given flash budgets; the conservative-GC/FP interaction is gated off today and only a precise-scanning project would lift it.
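To make the "per-arch walker offsets" point concrete, here is a simulated frame-pointer walk parameterized by frame layout. It is a sketch over a fake memory map, not the runtime's walker: `frameLayout`, `memory`, `walk` and the demo addresses are all hypothetical, the RISC-V offsets are the ones quoted in the platform table, and the `next <= fp` guard assumes a downward-growing stack.

```go
package main

import "fmt"

// frameLayout gives the offsets, relative to fp, of the saved caller fp
// and of the return address.
type frameLayout struct {
	prevFPOff int64
	retOff    int64
}

var (
	layoutArm64Amd64 = frameLayout{prevFPOff: 0, retOff: 8}    // [fp], [fp+w] with w = 8
	layoutRISCV64    = frameLayout{prevFPOff: -16, retOff: -8} // prev fp at fp-16, ra at fp-8
)

// memory is a stand-in for stack memory: address -> 8-byte word.
type memory map[uint64]uint64

// walk follows the frame-pointer chain and collects return addresses.
// It stops when a return address leaves [textLo, textHi), when the chain
// ends or stops moving toward higher addresses, or after max frames.
// A symbolizer would look up ret-1, following Go's convention.
func walk(mem memory, fp uint64, l frameLayout, textLo, textHi uint64, max int) []uint64 {
	var pcs []uint64
	for fp != 0 && len(pcs) < max {
		ret, ok := mem[uint64(int64(fp)+l.retOff)]
		if !ok || ret < textLo || ret >= textHi {
			break // frame outside the program's own text, e.g. libc
		}
		pcs = append(pcs, ret)
		next, ok := mem[uint64(int64(fp)+l.prevFPOff)]
		if !ok || next <= fp {
			break // corrupt or terminated chain
		}
		fp = next
	}
	return pcs
}

// pushFrame writes one frame record into the fake memory.
func pushFrame(mem memory, l frameLayout, fp, callerFP, ret uint64) {
	mem[uint64(int64(fp)+l.prevFPOff)] = callerFP
	mem[uint64(int64(fp)+l.retOff)] = ret
}

// demoStack builds three frames; the outermost returns outside the text range.
func demoStack(l frameLayout) memory {
	mem := memory{}
	pushFrame(mem, l, 0x1000, 0x1040, 0x401010)
	pushFrame(mem, l, 0x1040, 0x1080, 0x402020)
	pushFrame(mem, l, 0x1080, 0, 0x7fff0000)
	return mem
}

func main() {
	for _, l := range []frameLayout{layoutArm64Amd64, layoutRISCV64} {
		pcs := walk(demoStack(l), 0x1000, l, 0x400000, 0x500000, 16)
		fmt.Printf("layout %+v -> %#x\n", l, pcs)
	}
}
```

Only the two offsets differ between the layouts; the loop, the text-range bound and the termination checks are shared, which is consistent with the estimate of roughly ten lines per architecture.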

Fallback ladder guarantees no platform regresses below Stage 1: tables (FuncForPC/FileLine) work everywhere; Callers falls back shadow-stack → clite FP walk.

Follow-ups (priority order)

  1. P4a zero-copy joined names (unsafe.String views over pre-joined strings): removes the 0.4–1.7µs per-function materialization — the only remaining structural gap vs Go on fresh-pc and >4k-target batch lookups; also enables per-funcIndex Func caching. Size cost ~+150–250KB stdlib (still ≪ Go's 1.41MB pclntab).
  2. P4b prebuilt pcline + pcvalue-style instruction lines: kills the first-use pcline build (~55–400µs once at scale) and the two remaining first-use cells in #2019 (runtime: LLGo-owned frame-pointer unwinder, Stage 5).
  3. P4c !pcsections as the site mechanism: removes the ±% inline/layout perturbation of body-embedded site asm.
  4. P4d section shrink: name pool dominates at extreme function counts (96×96: 15.4 vs Go 9 MiB; typical scenarios already ≤ Go).
  5. Stage 4 stub removal (stubs already merged into the prebuilt ftab; mainly deletes generation), Stage 3 LTO policy.
  6. Unwinder arch ports per the platform table; inline-tree (with P4b) to lift noinline from tracked functions (stdlib.Work's residual ~23µs vs Go 6.2µs is noinline + LLGo baseline, not unwind cost).

PR housekeeping / merge order

Acceptance criteria

  • DCE-safe metadata — holding (both linkers, incl. LTO).
  • API coverage by tests — holding (cl, test/go, LLDB 194/194, ssa, internal/build, both platforms + amd64).
  • Hot lookups ≤ Go — exceeded (2–10× faster across hot/deep/statement-level).
  • First-use ms → µs — met (first-in-process ≈ Go ±; per-fresh-pc gap is P4a).
  • Size ≤ Go — met on mac all scenarios & linux typical; extreme function counts are P4d.
