runtime: add Go-style funcinfo find index by cpunion · Pull Request #2012 · xgo-dev/llgo

cpunion · 2026-07-01T19:27:53Z

Implements the runtime funcinfo path on top of current xgo-dev/main, moving LLGo toward Go's pclntab/findfunctab model while keeping the metadata DCE-safe, plus the optimization round that followed review: Mach-O site sections, cold-path elimination, uint16 find index, Caller/Callers memoization, Frames.Next allocation removal, and debug-build gating.

Changes

Compact funcinfo + PC-line metadata pipeline used by runtime.Caller, runtime.CallersFrames, runtime.FuncForPC, Func.FileLine.
internal/pclntab: LLVM-independent faithful port of Go's findfunctab algorithm (uint8 deltas, overflow error, forward scan, sentinel), kept as the reference for link-phase work.
Go-style bucket/subbucket find index in the runtime for function-entry and PC-line tables. Deliberate deviation: uint16 subbucket deltas — Go guarantees a 16-byte minimum function size in its linker; LLGo indexes call-site records a few bytes apart, and a delta counts deduplicated PCs inside one 4KiB bucket (≤4096), so uint16 makes overflow structurally impossible and the index unconditional.
Mach-O site sections (__DATA,__llgo_fie/__llgo_stub/__llgo_pcl, live_support, one linker-private atom symbol per record, section$start$ boundaries with the \x01 verbatim-name prefix): under ld64/lld -dead_strip a record survives exactly when its function is live — same semantics ELF gets from SHF_LINK_ORDER with --gc-sections. macOS previously had no sites at all and fell back to one dlsym per function+stub on first use (13ms cold), and had no statement-level line info.
Cold fast path: bounded scan of raw entry/stub sections + dladdr exact-entry fallback, budgeted (8 lookups) before building the sorted table; first-use table build no longer does per-stub dlsym and materializes strings once per function/file instead of per site.
Caller/Callers memoize interned synthetic PCs in shadow-stack slots (invalidated on line change); Frames.Next reuses the FuncForPC 4-way cache instead of allocating a *Func per frame.
Debug builds (LLGO_DEBUG/LLGO_DEBUG_SYMBOLS) keep the funcinfo tables but drop the body-embedded site records — the in-body asm anchors shift instruction/scope layout enough to confuse debuggers (LLDB suite now 194/194). runtime.Caller is unaffected (instrumentation is independent); FuncForPC(entry)/FileLine(entry) work via the dlsym fallback; statement/call-site line granularity is disabled in debug builds.
Observability: LLGO_FUNCINFO_DEBUG=1 prints per-table frames/buckets/index=built|fallback/entries=sites|dlsym so benchmarks can prove which path they measured. LLGO_FUNCINFO_SITES=0 keeps tables but drops all site asm (see Known limitations).
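The bucket/subbucket find index listed above can be sketched as follows. This is a minimal illustration, not the runtime's code: the 4KiB bucket size comes from the description, while the 256-byte subbucket size, the type and function names, and the omission of Go's sentinel entry are assumptions made to keep the example short.

```go
package main

import "fmt"

const (
	bucketSize    = 4096 // bytes of text covered by one bucket (from the PR text)
	subbucketSize = 256  // assumed: 16 subbuckets per bucket, as in Go's findfunctab
	subPerBucket  = bucketSize / subbucketSize
)

// findBucket is a hypothetical layout: one base index per bucket plus a
// uint16 delta per subbucket. A delta counts distinct PCs inside one bucket,
// so it can never exceed 4096 and always fits in uint16.
type findBucket struct {
	base   uint32
	deltas [subPerBucket]uint16
}

// buildIndex expects pcs sorted, deduplicated, and relative to the text start.
func buildIndex(pcs []uint32, textSize uint32) []findBucket {
	buckets := make([]findBucket, (textSize+bucketSize-1)/bucketSize)
	idx := 0 // index of the last PC <= current subbucket start (0 if none)
	for b := range buckets {
		for s := 0; s < subPerBucket; s++ {
			start := uint32(b)*bucketSize + uint32(s)*subbucketSize
			for idx+1 < len(pcs) && pcs[idx+1] <= start {
				idx++
			}
			if s == 0 {
				buckets[b].base = uint32(idx)
			}
			buckets[b].deltas[s] = uint16(uint32(idx) - buckets[b].base)
		}
	}
	return buckets
}

// lookup returns the index of the last PC <= pc, or -1 if pc is out of range.
func lookup(pcs []uint32, buckets []findBucket, pc uint32) int {
	b := pc / bucketSize
	if len(pcs) == 0 || pc < pcs[0] || int(b) >= len(buckets) {
		return -1
	}
	s := (pc % bucketSize) / subbucketSize
	i := int(buckets[b].base) + int(buckets[b].deltas[s])
	// Short forward scan inside the subbucket.
	for i+1 < len(pcs) && pcs[i+1] <= pc {
		i++
	}
	return i
}

func main() {
	pcs := []uint32{0, 4, 12, 300, 4096, 4100, 9000}
	idx := buildIndex(pcs, 3*bucketSize)
	for _, pc := range []uint32{5, 299, 4099, 9000} {
		fmt.Printf("pc=%d -> record %d\n", pc, lookup(pcs, idx, pc))
	}
}
```

The forward scan is bounded by the number of PCs in one 256-byte subbucket, which is why the lookup cost stays flat as the table grows.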

Benchmarks (final HEAD, 5–7 runs, best/trimmed avg)

macOS arm64 (host) and Linux arm64 (container); main = merge base; 576 target functions (24 pkgs × 24 methods). Index proof: LLGO_FUNCINFO_DEBUG=1 reports index=built for func and pcline tables in all measured scenarios.

| metric | mac go1.26 | mac this PR | mac +LTO | linux go1.26 | linux this PR | linux +LTO |
| --- | --- | --- | --- | --- | --- | --- |
| hot.Caller0 | 148ns | 56ns | 44ns | 161ns | 67ns | 57ns |
| hot.Caller1 | 166ns | 86ns | 67ns | 174ns | 109ns | 94ns |
| hot.CallersOnly | 124ns | 130ns | 101ns | 123ns | 156ns | 131ns |
| hot.CallersFramesFirst | 419ns | 452ns | 341ns | 437ns | 533ns | 444ns |
| hot.FuncForPCEntry | 12ns | 9ns | 6ns | 12ns | 9ns | 7ns |
| multipkg.FuncForPCMany (576) | 9.9µs | 3.8µs | 879ns | 10.6µs | 4.2µs | 893ns |
| multipkg.FileLineMany (576) | 31.4µs | 5.2µs | 1.3µs | 31.8µs | 5.4µs | 1.4µs |
| cold.FirstCaller0 | 2.4µs | 666ns | 458ns | 3.5µs | 1.7µs | 2.1µs |
| cold.FirstFileLine | 167ns | 167ns | 42ns | 333ns | 83ns | 125ns |
| cold first-use table build¹ | n/a | ~45µs | ~25µs | n/a | ~55µs | ~55µs |
| stdlib.Work² | 5.4µs | 17.7µs | 14.6µs | 6.2µs | 20.3µs | 17.6µs |

¹ One-time per process; shows up in whichever cold.First* window exhausts the cold-lookup budget (platform-dependent). Go needs none because its linker ships the sorted table — this is the link-phase follow-up.
² Shadow-stack instrumentation tax; the main baseline is equally behind Go (18–20µs), not introduced by this PR — unwinder follow-up.

Scope of the comparison: FileLineMany compares LLGo's statement/call-site granularity against Go's dense per-instruction pcvalue tables; "faster than Go" holds for this granularity, not for full pcvalue semantics.

Scaling: per-target batch cost is flat from 144 → 576 targets (6.5 → 6.8ns per FuncForPC target, LTO ~1.9ns), consistent with the bucket index.

Binary size and build time (`main` → this PR)

| scenario | mac | mac LTO | linux | linux LTO | vs go1.26 |
| --- | --- | --- | --- | --- | --- |
| multipkg (576) | 2.4→3.1 MiB (+29%) | 1.8→2.7 MiB (+50%) | 2.4→3.3 MiB (+38%) | 2.2→3.6 MiB (+64%) | above (2.7–2.8 MiB) |
| stdlib | 3.4→4.0 MiB (+18%) | 2.7→3.5 MiB (+30%) | 3.3→4.0 MiB (+21%) | 3.1→4.5 MiB (+45%) | below (5.0–5.1 MiB) |

Composition (stdlib, +660KB): strings ~270KB (content-deduped, comparable density to Go's funcnametab), stub sites 115KB, entry sites 70KB, records+symbol-index+hash ~156KB, pcline <5KB. Anchor: the same program's Go __gopclntab is 1.41MB — this PR's total metadata is ~47% of Go's equivalent structure. Build time +2–5% (non-LTO), +8–17% (LTO).

Impact on ordinary code (no runtime introspection)

Pure-compute benchmark (json/sort/map/fib), interleaved runs. Narrow A/B via LLGO_FUNCINFO_SITES=0 (tables kept, only in-body site asm removed):

| metric | main | sites-off | sites-on | attribution |
| --- | --- | --- | --- | --- |
| fib30 | 2307µs | 2305µs | 1624µs (−29.5%) | site asm shifts inline/layout decisions |
| json | 47.9ms | 48.8ms | 47.8ms (−2.1%) | noise |
| sort | 120.2ms | 121.4ms | 121.1ms (−0.3%) | noise |
| map | 104.5ms | 102.7ms | 108.5ms (+5.7%) | site asm in runtime-internal hot funcs |

sites-off ≈ main within noise on every metric: the metadata tables have zero runtime impact; all perturbation comes from the body-embedded volatile asm, in both directions. Go does not have this problem — its pclntab is generated by the linker and never enters the optimizer. Against Go, ordinary-code performance is dominated by the LLGo baseline (json 2.3×, sort 4.9×, map 3.1× slower on both main and this PR; fib 39% faster) — funcinfo's ±5% is second-order there.

Known limitations / follow-ups

First-use table construction (~45–55µs once per process) and ~35% of the size delta (site sections + symbol index + hash) are removed together by link-phase ftab/findfunctab generation — design doc in doc/design/pclntab-linkphase.md, internal/pclntab is the ready algorithm base.
LTO inlining duplicates body-embedded entry records (frames 5726 → 15106 at 576 targets; ELF section 4.1×), registering host-function PCs under the inlinee's symbol ID; runtime only consults the table when native symbolization fails, which bounds the impact. IR-level fixes were tried and ruled out (!associated is linker-GC-only and GlobalDCE deletes the records; llvm.compiler.used pins dead functions; noduplicate blocks inlining) — dedup is link-phase work; !pcsections metadata is the candidate replacement for the asm mechanism (also removes the codegen perturbation above).
Shadow-stack instrumentation tax (stdlib.Work ~3×, deep-stack CallersFrames 1.5–1.8×, non-LTO CallersOnly gap on Linux) — frame-pointer unwinder follow-up.
Observed: main+LTO binaries fail at runtime on macOS in this harness; this PR's LTO passes all scenarios (root cause of the baseline failure not investigated here).

Local validation

macOS arm64 + Linux arm64 container: go test ./cl (full), go test ./internal/build ./internal/pclntab ./ssa, LLGO_ROOT=... go test ./test/go -run TestRuntime..., bash _lldb/runtest.sh (194/194), go fmt/go vet clean. Benchmark harness included under benchmark/runtime_funcinfo (reproduces all numbers above).

codecov · 2026-07-01T19:50:17Z

Codecov Report

❌ Patch coverage is 91.33618% with 142 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/build/funcinfo_table.go | 85.39% | 63 Missing and 29 partials ⚠️ |
| internal/build/funcinfo/funcinfo.go | 87.60% | 15 Missing and 14 partials ⚠️ |
| cl/instr.go | 96.95% | 10 Missing and 5 partials ⚠️ |
| internal/pclntab/pclntab.go | 92.98% | 2 Missing and 2 partials ⚠️ |
| cl/compile.go | 98.42% | 1 Missing and 1 partial ⚠️ |


Bring over the cross-branch runtime funcinfo benchmark (hot, deep, multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own performance numbers. cold.FirstCallersFrames now walks to the first fully symbolized frame, because synthetic runtime frames (LLGo's runtime.Callers placeholder) carry no file/line and the metric was silently skipped on LLGo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

macOS previously had no entry/stub/pcline site sections, so first-use funcinfo initialization fell back to one dlsym per function and per stub (13ms cold on a small binary, 27ms with LTO), and statement-level pc-line records did not exist at all. Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the live_support attribute: under ld64/lld -dead_strip a live_support atom survives only if the atom it references (the anchor label inside the function body) is live, which matches the records-follow-function semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O subsections-via-symbols treats visible labels as atom boundaries, and a visible label in the middle of a function let the linker split and reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted with the \x01 verbatim-name prefix so LLVM does not prepend the Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the sections (and their boundary symbols) present even when no package contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the frames; each dlsym is a dynamic-loader query and the loop dominated cold latency.
- Materialize per-function strings and entry PCs once per function and packed file strings once per file ID during pcline table construction instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact function-value PCs with a bounded linear scan of the raw entry-site and stub-site sections (compile-time data, no loader query), then one dladdr as fallback; both require an entry match within the warm path's slack so stripped-local misattribution is impossible. The path is budgeted: after a handful of cold lookups the sorted table amortizes better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to ~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index abandonment on delta overflow is gone. Go stores uint8 deltas because its linker guarantees a 16-byte minimum function size; LLGo indexes call-site records that sit a few bytes apart, and a dense 4KiB bucket silently degraded every lookup in the process to a full binary search. A delta counts deduplicated PCs inside one bucket, so it is bounded by the bucket size and uint16 cannot overflow.

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built table (frame/bucket counts, index built or fallback, sites vs dlsym sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
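The budgeted cold path in this commit message can be illustrated with a small model. It is a sketch under assumptions: the budget of 8 is the figure quoted in the PR description, the `resolver` type and `entryFor` method are hypothetical names, the dladdr fallback is left out, and a cold miss is assumed to fall through to the table build.

```go
package main

import (
	"fmt"
	"sort"
)

const coldBudget = 8 // figure quoted in the PR description

// resolver is a hypothetical stand-in for the runtime's lazy lookup state.
type resolver struct {
	raw      []uintptr // unsorted entry PCs, as they would sit in a raw site section
	sorted   []uintptr // built on demand
	built    bool
	coldUsed int
}

// entryFor returns the function entry covering pc.
func (r *resolver) entryFor(pc uintptr) (uintptr, bool) {
	if !r.built && r.coldUsed < coldBudget {
		r.coldUsed++
		// Cold path: linear scan for an exact entry match, no sorting.
		// The real scan is bounded; here the slice length is the bound.
		for _, e := range r.raw {
			if e == pc {
				return e, true
			}
		}
	}
	if !r.built {
		// Budget exhausted or cold miss: pay the one-time build cost.
		r.sorted = append([]uintptr(nil), r.raw...)
		sort.Slice(r.sorted, func(i, j int) bool { return r.sorted[i] < r.sorted[j] })
		r.built = true
	}
	i := sort.Search(len(r.sorted), func(i int) bool { return r.sorted[i] > pc })
	if i == 0 {
		return 0, false
	}
	return r.sorted[i-1], true
}

func main() {
	r := &resolver{raw: []uintptr{0x3000, 0x1000, 0x2000}}
	for i := 0; i < coldBudget; i++ {
		r.entryFor(0x2000)
	}
	fmt.Println("table built after budgeted lookups:", r.built)
	r.entryFor(0x2000)
	fmt.Println("table built after one more lookup:", r.built)
}
```

A process that performs only a few exact-entry lookups never pays for the sort in this model, while a heavier user pays once and then gets logarithmic lookups.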

Every Caller/Callers capture used to intern the frame into the synthetic table: a hash probe plus a full frame comparison per stack slot per call. Memoize the interned PC base in the shadow-stack slot and invalidate it when the recorded line changes (for one entry the instrumented name/file operands are constants, so the line is the only thing that varies between call sites). The three static frames emitted around every Callers walk get per-store memo slots, and the emit loop is unrolled so nothing escapes and skipped frames are never captured. macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns. hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…py limit Frames.Next allocated a fresh *Func per symbolized frame; route it through the FuncForPC 4-way cache so repeated CallersFrames walks over the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns (338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns. Also document a pre-existing limitation at the entry-site emitter: the body-embedded inline-asm record is duplicated by LTO inlining into every inline site (~4x section growth on multipkg) and registers host-function PCs under the inlinee's symbol ID. Runtime only consults the table when native symbolization fails, which bounds the impact; the fix (data globals with !associated metadata) needs LLVMGlobalSetMetadata in the llvm binding and lands with the link-phase ftab work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
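To show why routing Frames.Next through a small cache removes the per-frame allocation, here is a toy four-entry cache. The layout is an assumption: the commit only says "4-way cache", so the round-robin replacement, the `funcCache` and `funcInfo` names, and the `build` callback are all illustrative.

```go
package main

import "fmt"

// funcInfo stands in for the runtime's *Func payload.
type funcInfo struct {
	entry uintptr
	name  string
}

// funcCache is a hypothetical four-entry cache with round-robin replacement.
type funcCache struct {
	pcs    [4]uintptr
	fns    [4]*funcInfo
	next   int
	builds int // how many times a fresh funcInfo had to be allocated
}

// get returns the cached funcInfo for pc, building one only on a miss.
func (c *funcCache) get(pc uintptr, build func(uintptr) *funcInfo) *funcInfo {
	for i, cached := range c.pcs {
		if cached == pc && c.fns[i] != nil {
			return c.fns[i]
		}
	}
	f := build(pc)
	c.builds++
	c.pcs[c.next], c.fns[c.next] = pc, f
	c.next = (c.next + 1) % len(c.pcs)
	return f
}

func main() {
	build := func(pc uintptr) *funcInfo {
		return &funcInfo{entry: pc, name: fmt.Sprintf("func@%#x", pc)}
	}
	var c funcCache
	// Walk the same three-frame stack repeatedly, as a hot CallersFrames loop would.
	for i := 0; i < 1000; i++ {
		for _, pc := range []uintptr{0x1000, 0x2000, 0x3000} {
			c.get(pc, build)
		}
	}
	fmt.Println("allocations for 3000 frame lookups:", c.builds)
}
```

Repeated walks over the same few PCs hit the cache every time after the first pass, so the number of allocations tracks the number of distinct PCs rather than the number of frames visited.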

Record the experiment results at the emitter: !associated only guides linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used pins dead functions through the records' address initializers; and noduplicate blocks inlining. Section dedup is link-phase work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Post-link table generation plan: parse the linked binary's metadata sections, dedup LTO inline copies against the symbol table, sort with a sentinel, build Go-layout findfunctab via internal/pclntab, and write back into a reserved section with ASLR-safe anchor offsets. Runtime adopts the prebuilt table when the header validates and keeps first-use construction as fallback. Includes the list of platform facts established in xgo-dev#2012 so implementation does not re-derive them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
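The dedup, sort, and sentinel steps of this plan might look roughly like the sketch below. It is illustrative only: the `siteRecord` layout, the `symEntry` map standing in for the linked binary's symbol table, and the rule "keep a record only if its PC equals the symbol's real entry" are assumptions about how LTO inline copies could be filtered, not the design doc's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// siteRecord is a hypothetical parsed entry-site record.
type siteRecord struct {
	PC    uint64
	SymID uint32
}

const sentinelSym = ^uint32(0) // marks the end-of-text sentinel entry

// buildFtab drops records whose PC is not the entry address of their symbol
// (for example copies duplicated into a host function by inlining), removes
// repeats, sorts by PC, and appends a sentinel at textEnd.
func buildFtab(recs []siteRecord, symEntry map[uint32]uint64, textEnd uint64) []siteRecord {
	out := make([]siteRecord, 0, len(recs)+1)
	seen := make(map[uint32]bool)
	for _, r := range recs {
		entry, ok := symEntry[r.SymID]
		if !ok || r.PC != entry || seen[r.SymID] {
			continue
		}
		seen[r.SymID] = true
		out = append(out, r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].PC < out[j].PC })
	return append(out, siteRecord{PC: textEnd, SymID: sentinelSym})
}

func main() {
	symEntry := map[uint32]uint64{1: 0x1000, 2: 0x2000}
	recs := []siteRecord{
		{PC: 0x2000, SymID: 2},
		{PC: 0x1000, SymID: 1},
		{PC: 0x2040, SymID: 1}, // inline copy of symbol 1 inside symbol 2
		{PC: 0x1000, SymID: 1}, // exact duplicate
	}
	for _, r := range buildFtab(recs, symEntry, 0x3000) {
		fmt.Printf("pc=%#x sym=%d\n", r.PC, r.SymID)
	}
}
```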

The monotonic time source had two problems:

- On Linux, runtimeNano passed clite's CLOCK_MONOTONIC, whose value is Darwin's clock id (6). Linux interprets 6 as CLOCK_MONOTONIC_COARSE, a millisecond-granularity clock: consecutive time.Now() readings were identical 100% of the time and the smallest nonzero delta was 1ms.
- On Darwin, clock_gettime(CLOCK_MONOTONIC) itself only has microsecond granularity (96% identical consecutive readings, 1us minimum delta).

Mirror Go's runtime structure with a per-OS nanotime1 in the runtime package itself, keeping the hot path free of clite indirection and clite unchanged: Darwin reads CLOCK_UPTIME_RAW through clock_gettime_nsec_np (the same clock Go's nanotime uses there), Linux uses clock_gettime with the OS-correct CLOCK_MONOTONIC id as a local constant, and remaining platforms keep the previous behavior.

Measured with consecutive time.Now() deltas (min nonzero / zero-frac):

- macOS arm64: 1us / 96.5% -> 41ns / 26% (Go 1.26: 41ns / 22%)
- Linux arm64: 1ms / 100% -> 41ns / 21%

time.Sleep, Timer and Ticker behave identically before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
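A measurement of the kind quoted above (minimum nonzero delta and fraction of identical consecutive readings) can be reproduced with a few lines of Go. This is a sketch of one plausible way to take it; the function name `measure` and the sample count are assumptions, and the harness used for the commit may differ.

```go
package main

import (
	"fmt"
	"time"
)

// measure takes n consecutive time.Now() readings and reports the smallest
// nonzero delta between neighbours and the fraction of zero deltas.
func measure(n int) (minNonzero time.Duration, zeroFrac float64) {
	zeros := 0
	prev := time.Now()
	for i := 0; i < n; i++ {
		now := time.Now()
		d := now.Sub(prev) // uses the monotonic reading carried by time.Time
		prev = now
		if d == 0 {
			zeros++
			continue
		}
		if minNonzero == 0 || d < minNonzero {
			minNonzero = d
		}
	}
	return minNonzero, float64(zeros) / float64(n)
}

func main() {
	minDelta, zeroFrac := measure(1_000_000)
	fmt.Printf("min nonzero delta: %v, identical readings: %.1f%%\n", minDelta, zeroFrac*100)
}
```

A coarse clock shows up as a large minimum delta together with a zero fraction close to 100%, which is the pattern reported for the old Linux path.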

The macOS CI LLDB step caught the funcinfo entry/stub site anchors shifting instruction/scope layout: with the records emitted at function entry, LLDB reported variables from an inner lexical block (ScopeIf's b, c) as in scope before the block began. Debug builds carry full DWARF, so the funcinfo tables are redundant there; gate the metadata pipeline on !IsDbgEnabled(). Caller-frame instrumentation is independent of this switch, so runtime.Caller keeps working in debug builds. _lldb/runtest.sh: 194/194 pass. This also covers Linux, where the same interference existed since the sites were introduced but the LLDB suite only runs on the macOS jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Refine the previous commit: instead of disabling the whole funcinfo metadata pipeline under LLGO_DEBUG/LLGO_DEBUG_SYMBOLS, add a separate Program.EnableFuncInfoSites switch and turn off just the body-embedded site records (entry/stub anchors and pc-line labels) — they are what shifts instruction/scope layout and confused LLDB. The funcinfo tables are plain data globals and stay enabled, so runtime.FuncForPC keeps its normalized name and Func.FileLine keeps file/line in debug builds (via the dlsym fallback path); runtime.Caller/Callers were never affected because caller-frame instrumentation is independent of both switches. Debug builds lose only the section fast paths (first-use latency) and statement-level pc-line granularity, both redundant next to full DWARF. _lldb/runtest.sh: 194/194; cl and test/go suites pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

frameFuncForPC could cache a Func built from a pcline frame whose entry resolution failed (entry == 0); a later FuncForPC on the same PC would then observe Entry() == 0 where its own constructor falls back to pc. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

LLGO_FUNCINFO_SITES=0 keeps the funcinfo metadata tables but drops the body-embedded entry/stub/pc-line inline-asm sites. This is the narrow A/B needed to isolate codegen perturbation caused by the in-body asm anchors: with sites off, plain-code benchmarks match the no-funcinfo baseline within noise, while sites on shifts hot runtime-internal loops by -30%..+6% through inline/layout decisions. Semantics with sites off: FuncForPC(entry) and Func.FileLine(entry) keep working through the dlsym fallback path; statement/call-site granularity PC line lookup is disabled, and first-use table construction loses the section fast path. Tests assert the split: tables still materialize while entry/stub section asm, boundary symbols, and pc-line site labels are all absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
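For readers unfamiliar with the entry-PC lookups mentioned here, the following sketch shows what FuncForPC(entry) and Func.FileLine(entry) look like from user code. It uses only standard `runtime` and `reflect` calls; the helper name `describe` is hypothetical, and obtaining the entry PC through `reflect.ValueOf(fn).Pointer()` is an assumption about how a test might do it, not necessarily what the PR's tests use.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

//go:noinline
func target() int { return 42 }

// describe resolves a function value to its name and the file/line of its entry.
func describe(fn any) (name, file string, line int) {
	entry := reflect.ValueOf(fn).Pointer() // code pointer of the function value
	f := runtime.FuncForPC(entry)
	if f == nil {
		return "", "", 0
	}
	file, line = f.FileLine(entry)
	return f.Name(), file, line
}

func main() {
	name, file, line := describe(target)
	fmt.Printf("%s at %s:%d\n", name, file, line)
}
```

Per the commit message, this kind of entry lookup should keep working with LLGO_FUNCINFO_SITES=0 through the dlsym fallback, while line lookups for PCs in the middle of a function lose statement-level granularity.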

codecov/patch was failing at 51.77% (target 88.68%), but the shortfall was almost entirely benchmark/runtime_funcinfo/main.go — a standalone measurement harness with no unit tests by design (600 of 639 missed lines). Compiler-side changes were already covered (cl/instr.go 478/493, cl/compile.go 125/127). Ignore benchmark/** in codecov and cover the remaining internal/pclntab validation/lookup edges directly (96.2%). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Only macOS ran tests with -coverprofile, so lines behind OS-specific branches (ELF emission, per-OS runtime shims) always showed as missed in codecov/patch even though the ubuntu job executed them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Covers the ELF and Mach-O directive branches, 32-bit pointer directives, quote-escaped symbol names and empty-table emission from one table-driven test, so single-platform coverage runs stop reporting the other platform's branches as dead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

build: add Go-style pclntab findfunc index

2870ba8

cpunion mentioned this pull request Jul 1, 2026

Proposal: runtime funcinfo metadata and fast FuncForPC lookup #2004

Open

cpunion added 21 commits July 2, 2026 07:45

ssa: emit DCE-safe function metadata

f365b73

runtime: add line info for stack frames

e81a007

runtime: add statement line caller frames

1e15db2

runtime: compress funcinfo table

407e0dd

test: cover indirect runtime caller paths

a3115cd

cl: narrow runtime caller tracking

c6c128a

runtime: add compact pc-line funcinfo table

4692e15

runtime: optimize FuncForPC metadata lookup

b29906c

runtime: slim FuncForPC cache hot path

734368f

cl: make pc-line labels clone-safe

85b1c13

runtime: guard funcinfo table initialization

b33a774

runtime: fix funcinfo entry pc line metadata

8e6f33c

runtime: publish funcinfo records for live stubs

8eec8e0

runtime: skip ELF stub-site records during LTO

01913f4

runtime: reduce funcinfo lookup initialization cost

6fa875b

test: cover runtime caller metadata edges

34b274c

runtime: speed up funcinfo entry lookup

3b87c90

test: cover pcline metadata in dev lto coverage

012095b

runtime: speed up funcinfo hot paths

9d90419

runtime: avoid FuncForPC cache thrashing

1b8bc56

runtime: use Go-style funcinfo find index

cf33354

cpunion changed the title ~~build: add Go-style pclntab findfunc index~~ runtime: add Go-style funcinfo find index Jul 1, 2026

cpunion and others added 5 commits July 2, 2026 09:52

runtime: use static funcinfo symbol index

bb1b1a3

cpunion and others added 2 commits July 2, 2026 16:08

cpunion mentioned this pull request Jul 2, 2026

runtime: nanosecond monotonic clock; fix Linux clock IDs #2015

Open

cpunion and others added 6 commits July 2, 2026 16:58

cpunion and others added 4 commits July 3, 2026 09:47

ci: exclude dev tools and test scaffolding from coverage

83edf88

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

This was referenced Jul 3, 2026

runtime,cl,build: stage5 cleanup — explicit FP pairing flag and tidy-ups #2022

Closed

runtime: LLGo-owned frame-pointer unwinder (Stage 5) #2019

Open

cpunion wants to merge 39 commits into xgo-dev:main from cpunion:codex/runtime-pclntab-llvm
