runtime: add Go-style funcinfo find index #2012

Draft
cpunion wants to merge 39 commits into xgo-dev:main from cpunion:codex/runtime-pclntab-llvm

cpunion commented Jul 1, 2026
Collaborator

Implements the runtime funcinfo path on top of current xgo-dev/main, moving LLGo toward Go's pclntab/findfunctab model while keeping the metadata DCE-safe. Also includes the optimization round that followed review: Mach-O site sections, cold-path elimination, uint16 find index, Caller/Callers memoization, Frames.Next allocation removal, and debug-build gating.

Changes

  • Compact funcinfo + PC-line metadata pipeline used by runtime.Caller, runtime.CallersFrames, runtime.FuncForPC, Func.FileLine.
  • internal/pclntab: LLVM-independent faithful port of Go's findfunctab algorithm (uint8 deltas, overflow error, forward scan, sentinel), kept as the reference for link-phase work.
  • Go-style bucket/subbucket find index in the runtime for function-entry and PC-line tables. Deliberate deviation: uint16 subbucket deltas — Go guarantees a 16-byte minimum function size in its linker; LLGo indexes call-site records a few bytes apart, and a delta counts deduplicated PCs inside one 4KiB bucket (≤4096), so uint16 makes overflow structurally impossible and the index unconditional.
  • Mach-O site sections (__DATA,__llgo_fie/__llgo_stub/__llgo_pcl, live_support, one linker-private atom symbol per record, section$start$ boundaries with the \x01 verbatim-name prefix): under ld64/lld -dead_strip a record survives exactly when its function is live — same semantics ELF gets from SHF_LINK_ORDER with --gc-sections. macOS previously had no sites at all and fell back to one dlsym per function+stub on first use (13ms cold), and had no statement-level line info.
  • Cold fast path: bounded scan of raw entry/stub sections + dladdr exact-entry fallback, budgeted (8 lookups) before building the sorted table; first-use table build no longer does per-stub dlsym and materializes strings once per function/file instead of per site.
  • Caller/Callers memoize interned synthetic PCs in shadow-stack slots (invalidated on line change); Frames.Next reuses the FuncForPC 4-way cache instead of allocating a *Func per frame.
  • Debug builds (LLGO_DEBUG/LLGO_DEBUG_SYMBOLS) keep the funcinfo tables but drop the body-embedded site records — the in-body asm anchors shift instruction/scope layout enough to confuse debuggers (LLDB suite now 194/194). runtime.Caller is unaffected (instrumentation is independent); FuncForPC(entry)/FileLine(entry) work via the dlsym fallback; statement/call-site line granularity is disabled in debug builds.
  • Observability: LLGO_FUNCINFO_DEBUG=1 prints per-table frames/buckets/index=built|fallback/entries=sites|dlsym so benchmarks can prove which path they measured. LLGO_FUNCINFO_SITES=0 keeps tables but drops all site asm (see Known limitations).
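The bucket/subbucket find index with uint16 deltas described in the list above can be sketched roughly as follows. This is a minimal illustration, not the runtime's code: the `bucket` type, `build`, `lookup`, and `lastLE` are hypothetical names, and the 16 subbuckets of 256 bytes per 4KiB bucket are assumed from Go's findfunctab layout.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	bucketSize    = 4096                    // bytes of address space per bucket
	subbuckets    = 16                      // subbuckets per bucket (assumed, as in Go)
	subbucketSize = bucketSize / subbuckets // 256 bytes
)

// bucket follows the shape of Go's findfuncbucket, with uint16 deltas.
type bucket struct {
	base  uint32             // index of the last PC <= bucket start
	delta [subbuckets]uint16 // per-subbucket index offset from base
}

// lastLE returns the index of the last pc <= addr; pcs must be sorted.
func lastLE(pcs []uint64, addr uint64) int {
	return sort.Search(len(pcs), func(i int) bool { return pcs[i] > addr }) - 1
}

// build indexes sorted, deduplicated pcs; bucket 0 starts at pcs[0].
func build(pcs []uint64) []bucket {
	if len(pcs) == 0 {
		return nil
	}
	n := int((pcs[len(pcs)-1]-pcs[0])/bucketSize) + 1
	idx := make([]bucket, n)
	for b := range idx {
		start := pcs[0] + uint64(b)*bucketSize
		base := lastLE(pcs, start)
		idx[b].base = uint32(base)
		for s := 0; s < subbuckets; s++ {
			// At most 3840 distinct byte addresses separate the two starts,
			// so the difference always fits in a uint16.
			idx[b].delta[s] = uint16(lastLE(pcs, start+uint64(s)*subbucketSize) - base)
		}
	}
	return idx
}

// lookup returns the index of the last pc <= target, or -1 if none.
func lookup(pcs []uint64, idx []bucket, target uint64) int {
	if len(pcs) == 0 || target < pcs[0] {
		return -1
	}
	off := target - pcs[0]
	b := off / bucketSize
	if b >= uint64(len(idx)) {
		return len(pcs) - 1 // past the last bucket
	}
	i := int(idx[b].base) + int(idx[b].delta[off%bucketSize/subbucketSize])
	for i+1 < len(pcs) && pcs[i+1] <= target { // short forward scan
		i++
	}
	return i
}

func main() {
	// Call-site records 4 bytes apart: 64 per subbucket, 1024 per bucket.
	var pcs []uint64
	for k := 0; k <= 3000; k++ {
		pcs = append(pcs, 0x1000+uint64(4*k))
	}
	idx := build(pcs)
	var maxDelta uint16
	for _, b := range idx {
		for _, d := range b.delta {
			if d > maxDelta {
				maxDelta = d
			}
		}
	}
	fmt.Println("buckets:", len(idx), "max delta:", maxDelta)
	fmt.Println("lookup(0x1005) =", lookup(pcs, idx, 0x1005))
}
```

With records 4 bytes apart the largest delta in this example is 960, which a uint8 cannot hold but a uint16 holds with room to spare.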

Benchmarks (final HEAD, 5–7 runs, best/trimmed avg)

macOS arm64 (host) and Linux arm64 (container); main = merge base; 576 target functions (24 pkgs × 24 methods). Index proof: LLGO_FUNCINFO_DEBUG=1 reports index=built for func and pcline tables in all measured scenarios.

| metric | mac go1.26 | mac this PR | mac +LTO | linux go1.26 | linux this PR | linux +LTO |
|---|---|---|---|---|---|---|
| hot.Caller0 | 148ns | 56ns | 44ns | 161ns | 67ns | 57ns |
| hot.Caller1 | 166ns | 86ns | 67ns | 174ns | 109ns | 94ns |
| hot.CallersOnly | 124ns | 130ns | 101ns | 123ns | 156ns | 131ns |
| hot.CallersFramesFirst | 419ns | 452ns | 341ns | 437ns | 533ns | 444ns |
| hot.FuncForPCEntry | 12ns | 9ns | 6ns | 12ns | 9ns | 7ns |
| multipkg.FuncForPCMany (576) | 9.9µs | 3.8µs | 879ns | 10.6µs | 4.2µs | 893ns |
| multipkg.FileLineMany (576) | 31.4µs | 5.2µs | 1.3µs | 31.8µs | 5.4µs | 1.4µs |
| cold.FirstCaller0 | 2.4µs | 666ns | 458ns | 3.5µs | 1.7µs | 2.1µs |
| cold.FirstFileLine | 167ns | 167ns | 42ns | 333ns | 83ns | 125ns |
| cold first-use table build¹ | | ~45µs | ~25µs | | ~55µs | ~55µs |
| stdlib.Work² | 5.4µs | 17.7µs | 14.6µs | 6.2µs | 20.3µs | 17.6µs |

¹ One-time per process; shows up in whichever cold.First* window exhausts the cold-lookup budget (platform-dependent). Go needs none because its linker ships the sorted table — this is the link-phase follow-up.
² Shadow-stack instrumentation tax; the main baseline is equally behind Go (18–20µs), not introduced by this PR — unwinder follow-up.

Scope of the comparison: FileLineMany compares LLGo's statement/call-site granularity against Go's dense per-instruction pcvalue tables; "faster than Go" holds for this granularity, not for full pcvalue semantics.

Scaling: per-target batch cost is flat from 144 → 576 targets (6.5 → 6.8ns per FuncForPC target, LTO ~1.9ns), consistent with the bucket index.
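For orientation, the scenarios above measure the standard runtime introspection entry points. The sketch below shows roughly the kind of calls involved, using only documented `runtime` APIs; the helpers `where` and `firstFrame` are hypothetical and are not taken from the benchmark harness.

```go
package main

import (
	"fmt"
	"runtime"
)

// where reports the caller's function name, file, and line using
// runtime.Caller followed by runtime.FuncForPC.
func where() (string, string, int) {
	pc, file, line, ok := runtime.Caller(1)
	if !ok {
		return "", "", 0
	}
	name := ""
	if fn := runtime.FuncForPC(pc); fn != nil {
		name = fn.Name()
	}
	return name, file, line
}

// firstFrame captures the stack with runtime.Callers and symbolizes
// only the first frame through runtime.CallersFrames.
func firstFrame() runtime.Frame {
	pcs := make([]uintptr, 16)
	n := runtime.Callers(1, pcs) // skip the runtime.Callers frame itself
	frames := runtime.CallersFrames(pcs[:n])
	f, _ := frames.Next()
	return f
}

func main() {
	name, file, line := where()
	fmt.Printf("Caller: %s at %s:%d\n", name, file, line)

	f := firstFrame()
	fmt.Printf("CallersFrames: %s at %s:%d\n", f.Function, f.File, f.Line)
}
```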

Binary size and build time (main → this PR)

| scenario | mac | mac LTO | linux | linux LTO | vs go1.26 |
|---|---|---|---|---|---|
| multipkg (576) | 2.4→3.1 MiB (+29%) | 1.8→2.7 (+50%) | 2.4→3.3 (+38%) | 2.2→3.6 (+64%) | above (2.7–2.8) |
| stdlib | 3.4→4.0 MiB (+18%) | 2.7→3.5 (+30%) | 3.3→4.0 (+21%) | 3.1→4.5 (+45%) | below (5.0–5.1) |

Composition (stdlib, +660KB): strings ~270KB (content-deduped, comparable density to Go's funcnametab), stub sites 115KB, entry sites 70KB, records+symbol-index+hash ~156KB, pcline <5KB. Anchor: the same program's Go __gopclntab is 1.41MB — this PR's total metadata is ~47% of Go's equivalent structure. Build time +2–5% (non-LTO), +8–17% (LTO).

Impact on ordinary code (no runtime introspection)

Pure-compute benchmark (json/sort/map/fib), interleaved runs. Narrow A/B via LLGO_FUNCINFO_SITES=0 (tables kept, only in-body site asm removed):

| metric | main | sites-off | sites-on (Δ vs sites-off) | attribution |
|---|---|---|---|---|
| fib30 | 2307µs | 2305µs | 1624µs (−29.5%) | site asm shifts inline/layout decisions |
| json | 47.9ms | 48.8ms | 47.8ms (−2.1%) | noise |
| sort | 120.2ms | 121.4ms | 121.1ms (−0.3%) | noise |
| map | 104.5ms | 102.7ms | 108.5ms (+5.7%) | site asm in runtime-internal hot funcs |

sites-off ≈ main within noise on every metric, so the metadata tables have no measurable runtime impact; all perturbation comes from the body-embedded volatile asm, and it goes in both directions. Go does not have this problem: its pclntab is generated by the linker and never enters the optimizer. Against Go, ordinary-code performance is dominated by the LLGo baseline (json 2.3×, sort 4.9×, map 3.1× slower on both main and this PR; fib 39% faster), so funcinfo's ±5% is second-order there.

Known limitations / follow-ups

  1. First-use table construction (~45–55µs once per process) and ~35% of the size delta (site sections + symbol index + hash) are removed together by link-phase ftab/findfunctab generation. The design doc is in doc/design/pclntab-linkphase.md, and internal/pclntab is the ready algorithm base.
  2. LTO inlining duplicates body-embedded entry records (frames 5726 → 15106 at 576 targets; ELF section 4.1×), registering host-function PCs under the inlinee's symbol ID; runtime only consults the table when native symbolization fails, which bounds the impact. IR-level fixes were tried and ruled out (!associated is linker-GC-only and GlobalDCE deletes the records; llvm.compiler.used pins dead functions; noduplicate blocks inlining) — dedup is link-phase work; !pcsections metadata is the candidate replacement for the asm mechanism (also removes the codegen perturbation above).
  3. Shadow-stack instrumentation tax (stdlib.Work ~3×, deep-stack CallersFrames 1.5–1.8×, non-LTO CallersOnly gap on Linux) — frame-pointer unwinder follow-up.
  4. Observed: main+LTO binaries fail at runtime on macOS in this harness; this PR's LTO passes all scenarios (root cause of the baseline failure not investigated here).

Local validation

macOS arm64 + Linux arm64 container: go test ./cl (full), go test ./internal/build ./internal/pclntab ./ssa, LLGO_ROOT=... go test ./test/go -run TestRuntime..., bash _lldb/runtest.sh (194/194), go fmt/go vet clean. Benchmark harness included under benchmark/runtime_funcinfo (reproduces all numbers above).

codecov Bot commented Jul 1, 2026

cpunion changed the title from "build: add Go-style pclntab findfunc index" to "runtime: add Go-style funcinfo find index" Jul 1, 2026
cpunion and others added 5 commits July 2, 2026 09:52
Bring over the cross-branch runtime funcinfo benchmark (hot, deep,
multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own
performance numbers. cold.FirstCallersFrames now walks to the first
fully symbolized frame, because synthetic runtime frames (LLGo's
runtime.Callers placeholder) carry no file/line and the metric was
silently skipped on LLGo.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
macOS previously had no entry/stub/pcline site sections, so first-use
funcinfo initialization fell back to one dlsym per function and per
stub (13ms cold on a small binary, 27ms with LTO), and statement-level
pc-line records did not exist at all.

Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the
  live_support attribute: under ld64/lld -dead_strip a live_support
  atom survives only if the atom it references (the anchor label inside
  the function body) is live, which matches the records-follow-function
  semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is
  its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O
  subsections-via-symbols treats visible labels as atom boundaries, and
  a visible label in the middle of a function let the linker split and
  reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted
  with the \x01 verbatim-name prefix so LLVM does not prepend the
  Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the
  sections (and their boundary symbols) present even when no package
  contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the
  frames; each dlsym is a dynamic-loader query and the loop dominated
  cold latency.
- Materialize per-function strings and entry PCs once per function and
  packed file strings once per file ID during pcline table construction
  instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact
function-value PCs with a bounded linear scan of the raw entry-site and
stub-site sections (compile-time data, no loader query), then one
dladdr as fallback; both require an entry match within the warm path's
slack so stripped-local misattribution is impossible. The path is
budgeted: after a handful of cold lookups the sorted table amortizes
better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to
~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index
abandonment on delta overflow is gone. Go stores uint8 deltas because
its linker guarantees a 16-byte minimum function size; LLGo indexes
call-site records that sit a few bytes apart, and a dense 4KiB bucket
silently degraded every lookup in the process to a full binary search.
A delta counts deduplicated PCs inside one bucket, so it is bounded by
the bucket size and uint16 cannot overflow.

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built
table (frame/bucket counts, index built or fallback, sites vs dlsym
sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
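
The budgeted cold path in the commit above can be pictured with the following sketch. It is an illustration under stated assumptions: the `site` and `resolver` types are hypothetical, the budget of 8 is the figure quoted in the PR description, and the dladdr fallback on a cold miss is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// site is a hypothetical raw entry-site record.
type site struct {
	entry uint64
	name  string
}

const coldBudget = 8 // cold lookups served before the sorted table is built

type resolver struct {
	raw    []site // records in section order, unsorted
	sorted []site // built lazily on first warm lookup
	built  bool
	cold   int // cold lookups served so far
	builds int // table builds, kept only for demonstration
}

// lookupEntry resolves an exact function-entry PC.
func (r *resolver) lookupEntry(pc uint64) (string, bool) {
	if !r.built && r.cold < coldBudget {
		r.cold++
		for _, s := range r.raw { // linear scan, exact entry match only
			if s.entry == pc {
				return s.name, true
			}
		}
		return "", false // the real path would try one dladdr here
	}
	if !r.built {
		r.build()
	}
	i := sort.Search(len(r.sorted), func(i int) bool { return r.sorted[i].entry >= pc })
	if i < len(r.sorted) && r.sorted[i].entry == pc {
		return r.sorted[i].name, true
	}
	return "", false
}

// build sorts a copy of the raw records by entry PC.
func (r *resolver) build() {
	r.sorted = append([]site(nil), r.raw...)
	sort.Slice(r.sorted, func(i, j int) bool { return r.sorted[i].entry < r.sorted[j].entry })
	r.built = true
	r.builds++
}

func main() {
	r := &resolver{raw: []site{{0x3000, "main.c"}, {0x1000, "main.a"}, {0x2000, "main.b"}}}
	for i := 0; i < 10; i++ {
		name, ok := r.lookupEntry(0x2000)
		fmt.Println(i, name, ok, "builds:", r.builds)
	}
}
```

The first eight lookups never touch the sorted table; the ninth builds it once and every later lookup uses binary search.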
Every Caller/Callers capture used to intern the frame into the
synthetic table: a hash probe plus a full frame comparison per stack
slot per call. Memoize the interned PC base in the shadow-stack slot
and invalidate it when the recorded line changes (for one entry the
instrumented name/file operands are constants, so the line is the only
thing that varies between call sites). The three static frames emitted
around every Callers walk get per-store memo slots, and the emit loop
is unrolled so nothing escapes and skipped frames are never captured.

macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns.
hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
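
A rough sketch of the memoization described in the commit above: the interned synthetic PC is cached in the slot and recomputed only when the recorded line changes. The `slot`, `interner`, and `frameKey` types are hypothetical stand-ins, not the runtime's data structures.

```go
package main

import "fmt"

// frameKey identifies one synthetic frame.
type frameKey struct {
	name, file string
	line       int
}

// interner assigns a stable synthetic PC to each distinct frame.
type interner struct {
	pcs    map[frameKey]uintptr
	probes int // hash probes performed, kept only for demonstration
}

func (in *interner) intern(k frameKey) uintptr {
	in.probes++
	if pc, ok := in.pcs[k]; ok {
		return pc
	}
	pc := uintptr(len(in.pcs) + 1) // 0 is reserved for "no memo"
	in.pcs[k] = pc
	return pc
}

// slot is a hypothetical shadow-stack slot. name and file are constant
// for one entry, so the line is the only field that can invalidate the memo.
type slot struct {
	name, file string
	line       int
	memoPC     uintptr
	memoLine   int
}

// pc returns the interned PC, probing the interner only when the
// recorded line differs from the memoized one.
func (s *slot) pc(in *interner) uintptr {
	if s.memoPC != 0 && s.memoLine == s.line {
		return s.memoPC
	}
	s.memoPC = in.intern(frameKey{s.name, s.file, s.line})
	s.memoLine = s.line
	return s.memoPC
}

func main() {
	in := &interner{pcs: make(map[frameKey]uintptr)}
	s := &slot{name: "main.work", file: "main.go", line: 10}
	for i := 0; i < 1000; i++ {
		s.pc(in)
	}
	fmt.Println("probes after 1000 captures on one line:", in.probes)
	s.line = 12 // execution reached a different call site
	s.pc(in)
	fmt.Println("probes after the line changed:", in.probes)
}
```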
cpunion and others added 2 commits July 2, 2026 16:08
…py limit

Frames.Next allocated a fresh *Func per symbolized frame; route it
through the FuncForPC 4-way cache so repeated CallersFrames walks over
the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns
(338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns.

Also document a pre-existing limitation at the entry-site emitter: the
body-embedded inline-asm record is duplicated by LTO inlining into
every inline site (~4x section growth on multipkg) and registers
host-function PCs under the inlinee's symbol ID. Runtime only consults
the table when native symbolization fails, which bounds the impact;
the fix (data globals with !associated metadata) needs
LLVMGlobalSetMetadata in the llvm binding and lands with the
link-phase ftab work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
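
The allocation saving in the commit above comes from reusing cached `*Func` values across walks. The sketch below shows the idea with a small four-entry cache; the layout, the round-robin replacement, and the names `funcCache` and `Func` are assumptions for illustration, since the commit does not describe how the FuncForPC cache is organized.

```go
package main

import "fmt"

// Func is a stand-in for the runtime's symbolized function value.
type Func struct {
	entry uintptr
	name  string
}

// funcCache is a hypothetical four-entry PC cache with round-robin
// replacement.
type funcCache struct {
	pcs   [4]uintptr
	funcs [4]*Func
	next  int
}

// get returns the cached *Func for pc, calling resolve only on a miss.
func (c *funcCache) get(pc uintptr, resolve func(uintptr) *Func) *Func {
	for i, p := range c.pcs {
		if p == pc && c.funcs[i] != nil {
			return c.funcs[i]
		}
	}
	f := resolve(pc)
	if f == nil {
		return nil // never cache a failed resolution
	}
	c.pcs[c.next] = pc
	c.funcs[c.next] = f
	c.next = (c.next + 1) % len(c.pcs)
	return f
}

func main() {
	allocs := 0
	resolve := func(pc uintptr) *Func {
		allocs++ // each call stands for one *Func allocation
		return &Func{entry: pc, name: fmt.Sprintf("fn@%#x", pc)}
	}
	var c funcCache
	walk := []uintptr{0x1000, 0x2000, 0x3000}
	for i := 0; i < 100; i++ { // repeated walks over the same PCs
		for _, pc := range walk {
			c.get(pc, resolve)
		}
	}
	fmt.Println("allocations for 300 frames:", allocs)
}
```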
Record the experiment results at the emitter: !associated only guides
linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used
pins dead functions through the records' address initializers; and
noduplicate blocks inlining. Section dedup is link-phase work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cpunion and others added 6 commits July 2, 2026 16:58
Post-link table generation plan: parse the linked binary's metadata
sections, dedup LTO inline copies against the symbol table, sort with a
sentinel, build Go-layout findfunctab via internal/pclntab, and write
back into a reserved section with ASLR-safe anchor offsets. Runtime
adopts the prebuilt table when the header validates and keeps first-use
construction as fallback. Includes the list of platform facts
established in xgo-dev#2012 so implementation does not re-derive them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
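
The dedup and sort steps of the plan above might look roughly like this. The dedup criterion used here (keep a record only when its PC equals the symbol-table address of its symbol ID) is an assumption inferred from the description of LTO inline copies, and `record`, `buildFtab`, and `sentinelID` are hypothetical names; the findfunctab generation and write-back steps are omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical entry-site record read from the linked binary.
type record struct {
	pc    uint64
	symID uint32
}

const sentinelID = ^uint32(0) // marks the end-of-text sentinel

// buildFtab drops inline copies and duplicates, sorts by PC, and
// appends a sentinel at textEnd so every function has an upper bound.
func buildFtab(recs []record, symAddr map[uint32]uint64, textEnd uint64) []record {
	seen := make(map[uint64]bool)
	out := make([]record, 0, len(recs)+1)
	for _, r := range recs {
		addr, ok := symAddr[r.symID]
		if !ok || addr != r.pc {
			continue // PC sits inside another function: an inline copy
		}
		if seen[r.pc] {
			continue
		}
		seen[r.pc] = true
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].pc < out[j].pc })
	return append(out, record{pc: textEnd, symID: sentinelID})
}

func main() {
	symAddr := map[uint32]uint64{1: 0x1000, 2: 0x1400}
	recs := []record{
		{pc: 0x1400, symID: 2},
		{pc: 0x1040, symID: 2}, // copy of function 2 inlined into function 1
		{pc: 0x1000, symID: 1},
	}
	for _, r := range buildFtab(recs, symAddr, 0x2000) {
		fmt.Printf("%#x sym=%d\n", r.pc, r.symID)
	}
}
```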
The monotonic time source had two problems:

- On Linux, runtimeNano passed clite's CLOCK_MONOTONIC, whose value is
  Darwin's clock id (6). Linux interprets 6 as CLOCK_MONOTONIC_COARSE,
  a millisecond-granularity clock: consecutive time.Now() readings were
  identical 100% of the time and the smallest nonzero delta was 1ms.
- On Darwin, clock_gettime(CLOCK_MONOTONIC) itself only has microsecond
  granularity (96% identical consecutive readings, 1us minimum delta).

Mirror Go's runtime structure with a per-OS nanotime1 in the runtime
package itself, keeping the hot path free of clite indirection and clite
unchanged: Darwin reads CLOCK_UPTIME_RAW through clock_gettime_nsec_np
(the same clock Go's nanotime uses there), Linux uses clock_gettime with
the OS-correct CLOCK_MONOTONIC id as a local constant, and remaining
platforms keep the previous behavior.

Measured with consecutive time.Now() deltas (min nonzero / zero-frac):
- macOS arm64: 1us / 96.5%  ->  41ns / 26%  (Go 1.26: 41ns / 22%)
- Linux arm64: 1ms / 100%   ->  41ns / 21%

time.Sleep, Timer and Ticker behave identically before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
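
The "min nonzero / zero-frac" measurement quoted in the commit above can be reproduced with a sketch like the following. `deltaStats` is a hypothetical helper; the sample count is arbitrary and results depend on the platform clock.

```go
package main

import (
	"fmt"
	"time"
)

// deltaStats takes consecutive clock readings in nanoseconds and returns
// the smallest nonzero difference between neighbours and the fraction of
// neighbouring pairs that are identical.
func deltaStats(ns []int64) (minNonzero int64, zeroFrac float64) {
	if len(ns) < 2 {
		return 0, 0
	}
	zeros := 0
	for i := 1; i < len(ns); i++ {
		d := ns[i] - ns[i-1]
		switch {
		case d == 0:
			zeros++
		case d > 0 && (minNonzero == 0 || d < minNonzero):
			minNonzero = d
		}
	}
	return minNonzero, float64(zeros) / float64(len(ns)-1)
}

func main() {
	start := time.Now()
	samples := make([]int64, 100000)
	for i := range samples {
		// time.Since reads the monotonic clock carried by start.
		samples[i] = int64(time.Since(start))
	}
	minDelta, zeroFrac := deltaStats(samples)
	fmt.Printf("min nonzero delta: %dns, identical readings: %.1f%%\n", minDelta, zeroFrac*100)
}
```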
The macOS CI LLDB step caught the funcinfo entry/stub site anchors
shifting instruction/scope layout: with the records emitted at function
entry, LLDB reported variables from an inner lexical block (ScopeIf's
b, c) as in scope before the block began. Debug builds carry full
DWARF, so the funcinfo tables are redundant there; gate the metadata
pipeline on !IsDbgEnabled(). Caller-frame instrumentation is
independent of this switch, so runtime.Caller keeps working in debug
builds. _lldb/runtest.sh: 194/194 pass.

This also covers Linux, where the same interference existed since the
sites were introduced but the LLDB suite only runs on the macOS jobs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Refine the previous commit: instead of disabling the whole funcinfo
metadata pipeline under LLGO_DEBUG/LLGO_DEBUG_SYMBOLS, add a separate
Program.EnableFuncInfoSites switch and turn off just the body-embedded
site records (entry/stub anchors and pc-line labels) — they are what
shifts instruction/scope layout and confused LLDB. The funcinfo tables
are plain data globals and stay enabled, so runtime.FuncForPC keeps its
normalized name and Func.FileLine keeps file/line in debug builds (via
the dlsym fallback path); runtime.Caller/Callers were never affected
because caller-frame instrumentation is independent of both switches.

Debug builds lose only the section fast paths (first-use latency) and
statement-level pc-line granularity, both redundant next to full DWARF.
_lldb/runtest.sh: 194/194; cl and test/go suites pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
frameFuncForPC could cache a Func built from a pcline frame whose entry
resolution failed (entry == 0); a later FuncForPC on the same PC would
then observe Entry() == 0 where its own constructor falls back to pc.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LLGO_FUNCINFO_SITES=0 keeps the funcinfo metadata tables but drops the
body-embedded entry/stub/pc-line inline-asm sites. This is the narrow
A/B needed to isolate codegen perturbation caused by the in-body asm
anchors: with sites off, plain-code benchmarks match the no-funcinfo
baseline within noise, while sites on shifts hot runtime-internal
loops by -30%..+6% through inline/layout decisions.

Semantics with sites off: FuncForPC(entry) and Func.FileLine(entry)
keep working through the dlsym fallback path; statement/call-site
granularity PC line lookup is disabled, and first-use table
construction loses the section fast path.

Tests assert the split: tables still materialize while entry/stub
section asm, boundary symbols, and pc-line site labels are all absent.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cpunion and others added 4 commits July 3, 2026 09:47
codecov/patch was failing at 51.77% (target 88.68%), but the shortfall
was almost entirely benchmark/runtime_funcinfo/main.go — a standalone
measurement harness with no unit tests by design (600 of 639 missed
lines). Compiler-side changes were already covered (cl/instr.go 478/493,
cl/compile.go 125/127). Ignore benchmark/** in codecov and cover the
remaining internal/pclntab validation/lookup edges directly (96.2%).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only macOS ran tests with -coverprofile, so lines behind OS-specific
branches (ELF emission, per-OS runtime shims) always showed as missed
in codecov/patch even though the ubuntu job executed them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Covers the ELF and Mach-O directive branches, 32-bit pointer
directives, quote-escaped symbol names and empty-table emission from
one table-driven test, so single-platform coverage runs stop reporting
the other platform's branches as dead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>