runtime: add Go-style funcinfo find index#2012
Draft
cpunion wants to merge 39 commits into
Bring over the cross-branch runtime funcinfo benchmark (hot, deep, multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own performance numbers. cold.FirstCallersFrames now walks to the first fully symbolized frame, because synthetic runtime frames (LLGo's runtime.Callers placeholder) carry no file/line and the metric was silently skipped on LLGo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
macOS previously had no entry/stub/pcline site sections, so first-use funcinfo initialization fell back to one dlsym per function and per stub (13ms cold on a small binary, 27ms with LTO), and statement-level pc-line records did not exist at all.

Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the live_support attribute: under ld64/lld -dead_strip a live_support atom survives only if the atom it references (the anchor label inside the function body) is live, which matches the records-follow-function semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O subsections-via-symbols treats visible labels as atom boundaries, and a visible label in the middle of a function let the linker split and reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted with the \x01 verbatim-name prefix so LLVM does not prepend the Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the sections (and their boundary symbols) present even when no package contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the frames; each dlsym is a dynamic-loader query and the loop dominated cold latency.
- Materialize per-function strings and entry PCs once per function and packed file strings once per file ID during pcline table construction instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact function-value PCs with a bounded linear scan of the raw entry-site and stub-site sections (compile-time data, no loader query), then one dladdr as fallback; both require an entry match within the warm path's slack so stripped-local misattribution is impossible. The path is budgeted: after a handful of cold lookups the sorted table amortizes better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to ~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index abandonment on delta overflow is gone. Go stores uint8 deltas because its linker guarantees a 16-byte minimum function size; LLGo indexes call-site records that sit a few bytes apart, and a dense 4KiB bucket silently degraded every lookup in the process to a full binary search. A delta counts deduplicated PCs inside one bucket, so it is bounded by the bucket size and uint16 cannot overflow.

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built table (frame/bucket counts, index built or fallback, sites vs dlsym sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
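The find-index change can be illustrated with a minimal sketch. It assumes 4KiB buckets split into 16 subbuckets (Go's findfunctab shape) and PCs stored as sorted, deduplicated offsets; the names `findBucket`, `buildIndex`, and `lookup` are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	bucketSize    = 4096 // bytes of text covered by one bucket
	subPerBucket  = 16
	subbucketSize = bucketSize / subPerBucket // 256 bytes
)

// findBucket is a hypothetical layout: base is the record index for the
// bucket start, deltas[i] is added to base for subbucket i.
type findBucket struct {
	base   uint32
	deltas [subPerBucket]uint16
}

// buildIndex expects pcs sorted ascending and deduplicated. For every
// subbucket start it records the index of the last record at or below that
// address (0 when no record is that low).
func buildIndex(pcs []uint32) []findBucket {
	if len(pcs) == 0 {
		return nil
	}
	buckets := make([]findBucket, int(pcs[len(pcs)-1]/bucketSize)+1)
	idx := 0
	for b := range buckets {
		for s := 0; s < subPerBucket; s++ {
			start := uint32(b*bucketSize + s*subbucketSize)
			for idx+1 < len(pcs) && pcs[idx+1] <= start {
				idx++
			}
			if s == 0 {
				buckets[b].base = uint32(idx)
			}
			// At most bucketSize distinct PCs fit in one bucket, so the
			// delta always fits in uint16.
			buckets[b].deltas[s] = uint16(uint32(idx) - buckets[b].base)
		}
	}
	return buckets
}

// lookup returns the index of the last record with pcs[i] <= pc.
func lookup(pcs []uint32, buckets []findBucket, pc uint32) (int, bool) {
	if len(pcs) == 0 || pc < pcs[0] {
		return 0, false
	}
	b := int(pc / bucketSize)
	if b >= len(buckets) {
		return len(pcs) - 1, true // past the last bucket: last record
	}
	s := (pc % bucketSize) / subbucketSize
	i := int(buckets[b].base) + int(buckets[b].deltas[s])
	for i+1 < len(pcs) && pcs[i+1] <= pc { // short forward scan
		i++
	}
	return i, true
}

func main() {
	// Records 4 bytes apart: 1024 per bucket, more than a uint8 delta can express.
	var pcs []uint32
	for pc := uint32(0); pc < 2*bucketSize; pc += 4 {
		pcs = append(pcs, pc)
	}
	index := buildIndex(pcs)
	for pc := uint32(0); pc < 2*bucketSize; pc++ {
		want := sort.Search(len(pcs), func(i int) bool { return pcs[i] > pc }) - 1
		if got, ok := lookup(pcs, index, pc); !ok || got != want {
			panic(fmt.Sprintf("pc %d: got %d, want %d", pc, got, want))
		}
	}
	fmt.Println("index agrees with binary search; largest delta in bucket 0:",
		index[0].deltas[subPerBucket-1])
}
```

The sketch omits the sentinel record that the real table layout uses, so a PC past the last bucket simply maps to the last record.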
Every Caller/Callers capture used to intern the frame into the synthetic table: a hash probe plus a full frame comparison per stack slot per call. Memoize the interned PC base in the shadow-stack slot and invalidate it when the recorded line changes (for one entry the instrumented name/file operands are constants, so the line is the only thing that varies between call sites). The three static frames emitted around every Callers walk get per-store memo slots, and the emit loop is unrolled so nothing escapes and skipped frames are never captured. macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns. hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…py limit Frames.Next allocated a fresh *Func per symbolized frame; route it through the FuncForPC 4-way cache so repeated CallersFrames walks over the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns (338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns. Also document a pre-existing limitation at the entry-site emitter: the body-embedded inline-asm record is duplicated by LTO inlining into every inline site (~4x section growth on multipkg) and registers host-function PCs under the inlinee's symbol ID. Runtime only consults the table when native symbolization fails, which bounds the impact; the fix (data globals with !associated metadata) needs LLVMGlobalSetMetadata in the llvm binding and lands with the link-phase ftab work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
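The allocation saving can be sketched with a small fixed-size cache. This assumes "4-way" means four entries with round-robin replacement, which the commit message does not confirm; `fn`, `funcCache`, and `get` are hypothetical names, not the runtime's.

```go
package main

import "fmt"

// fn stands in for a symbolized function descriptor.
type fn struct {
	entry uintptr
	name  string
}

// funcCache is a four-entry cache keyed by PC with round-robin replacement.
type funcCache struct {
	pcs   [4]uintptr
	funcs [4]*fn
	next  int
}

// get returns the cached descriptor for pc, calling build only on a miss.
func (c *funcCache) get(pc uintptr, build func(uintptr) *fn) *fn {
	for i, p := range c.pcs {
		if c.funcs[i] != nil && p == pc {
			return c.funcs[i]
		}
	}
	f := build(pc)
	c.pcs[c.next], c.funcs[c.next] = pc, f
	c.next = (c.next + 1) % len(c.pcs)
	return f
}

func main() {
	var cache funcCache
	builds := 0
	build := func(pc uintptr) *fn {
		builds++
		return &fn{entry: pc, name: fmt.Sprintf("fn@%#x", pc)}
	}
	// Walk the same three PCs repeatedly, as a repeated CallersFrames walk would.
	pcs := []uintptr{0x1000, 0x1040, 0x1080}
	for walk := 0; walk < 100; walk++ {
		for _, pc := range pcs {
			cache.get(pc, build)
		}
	}
	fmt.Println("lookups: 300, descriptors built:", builds)
}
```

A working set larger than the cache would evict entries on every walk, so the benefit depends on stack walks revisiting a small set of PCs.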
Record the experiment results at the emitter: !associated only guides linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used pins dead functions through the records' address initializers; and noduplicate blocks inlining. Section dedup is link-phase work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Post-link table generation plan: parse the linked binary's metadata sections, dedup LTO inline copies against the symbol table, sort with a sentinel, build Go-layout findfunctab via internal/pclntab, and write back into a reserved section with ASLR-safe anchor offsets. Runtime adopts the prebuilt table when the header validates and keeps first-use construction as fallback. Includes the list of platform facts established in xgo-dev#2012 so implementation does not re-derive them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
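The dedup-and-sort step of that plan might look like the sketch below. It rests on an assumption the commit message does not spell out: an entry-site record is treated as genuine only when its PC equals the linked symbol's address, so copies that LTO inlined into other functions are dropped. The `record` type, `buildTable`, and `sentinelSym` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// record is a hypothetical entry-site record read from a metadata section.
type record struct {
	PC    uint64
	SymID uint32
}

const sentinelSym = ^uint32(0) // marks the end-of-text sentinel

// buildTable keeps records whose PC matches the symbol table entry for their
// symbol ID, sorts them by PC, removes duplicate PCs, and appends a sentinel
// at textEnd so a lookup can always compare against a following entry.
func buildTable(recs []record, symEntry map[uint32]uint64, textEnd uint64) []record {
	kept := make([]record, 0, len(recs)+1)
	for _, r := range recs {
		if entry, ok := symEntry[r.SymID]; ok && entry == r.PC {
			kept = append(kept, r)
		}
	}
	sort.Slice(kept, func(i, j int) bool {
		if kept[i].PC != kept[j].PC {
			return kept[i].PC < kept[j].PC
		}
		return kept[i].SymID < kept[j].SymID // deterministic tie-break
	})
	out := kept[:0]
	for _, r := range kept {
		if len(out) > 0 && out[len(out)-1].PC == r.PC {
			continue
		}
		out = append(out, r)
	}
	return append(out, record{PC: textEnd, SymID: sentinelSym})
}

func main() {
	recs := []record{
		{PC: 0x2000, SymID: 2},
		{PC: 0x2040, SymID: 1}, // copy of symbol 1 inlined into symbol 2
		{PC: 0x1000, SymID: 1},
		{PC: 0x1000, SymID: 1}, // exact duplicate
	}
	symEntry := map[uint32]uint64{1: 0x1000, 2: 0x2000}
	for _, r := range buildTable(recs, symEntry, 0x3000) {
		fmt.Printf("%#x sym=%d\n", r.PC, r.SymID)
	}
}
```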
The monotonic time source had two problems:

- On Linux, runtimeNano passed clite's CLOCK_MONOTONIC, whose value is Darwin's clock id (6). Linux interprets 6 as CLOCK_MONOTONIC_COARSE, a millisecond-granularity clock: consecutive time.Now() readings were identical 100% of the time and the smallest nonzero delta was 1ms.
- On Darwin, clock_gettime(CLOCK_MONOTONIC) itself only has microsecond granularity (96% identical consecutive readings, 1us minimum delta).

Mirror Go's runtime structure with a per-OS nanotime1 in the runtime package itself, keeping the hot path free of clite indirection and clite unchanged: Darwin reads CLOCK_UPTIME_RAW through clock_gettime_nsec_np (the same clock Go's nanotime uses there), Linux uses clock_gettime with the OS-correct CLOCK_MONOTONIC id as a local constant, and remaining platforms keep the previous behavior.

Measured with consecutive time.Now() deltas (min nonzero / zero-frac):

- macOS arm64: 1us / 96.5% -> 41ns / 26% (Go 1.26: 41ns / 22%)
- Linux arm64: 1ms / 100% -> 41ns / 21%

time.Sleep, Timer and Ticker behave identically before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
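The "min nonzero / zero-frac" metric can be reproduced with a few lines of standard Go. This is a sketch of how such a measurement might be taken, not the harness used for the numbers above; `sampleDeltas`, `summarize`, and `granularity` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// granularity summarizes consecutive time.Now() deltas.
type granularity struct {
	minNonzero time.Duration // smallest delta greater than zero
	zeroFrac   float64       // fraction of deltas that were exactly zero
}

// sampleDeltas takes n back-to-back readings and returns the differences.
// time.Time.Sub uses the monotonic reading carried by time.Now.
func sampleDeltas(n int) []time.Duration {
	deltas := make([]time.Duration, 0, n)
	prev := time.Now()
	for i := 0; i < n; i++ {
		now := time.Now()
		deltas = append(deltas, now.Sub(prev))
		prev = now
	}
	return deltas
}

func summarize(deltas []time.Duration) granularity {
	var g granularity
	zeros := 0
	for _, d := range deltas {
		switch {
		case d == 0:
			zeros++
		case d > 0 && (g.minNonzero == 0 || d < g.minNonzero):
			g.minNonzero = d
		}
	}
	if len(deltas) > 0 {
		g.zeroFrac = float64(zeros) / float64(len(deltas))
	}
	return g
}

func main() {
	g := summarize(sampleDeltas(1_000_000))
	fmt.Printf("min nonzero delta: %v, zero fraction: %.1f%%\n",
		g.minNonzero, g.zeroFrac*100)
}
```

A coarse clock shows up as a high zero fraction together with a large minimum nonzero delta, which is the pattern reported for the pre-fix Linux and Darwin paths.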
The macOS CI LLDB step caught the funcinfo entry/stub site anchors shifting instruction/scope layout: with the records emitted at function entry, LLDB reported variables from an inner lexical block (ScopeIf's b, c) as in scope before the block began. Debug builds carry full DWARF, so the funcinfo tables are redundant there; gate the metadata pipeline on !IsDbgEnabled(). Caller-frame instrumentation is independent of this switch, so runtime.Caller keeps working in debug builds. _lldb/runtest.sh: 194/194 pass. This also covers Linux, where the same interference existed since the sites were introduced but the LLDB suite only runs on the macOS jobs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Refine the previous commit: instead of disabling the whole funcinfo metadata pipeline under LLGO_DEBUG/LLGO_DEBUG_SYMBOLS, add a separate Program.EnableFuncInfoSites switch and turn off just the body-embedded site records (entry/stub anchors and pc-line labels) — they are what shifts instruction/scope layout and confused LLDB. The funcinfo tables are plain data globals and stay enabled, so runtime.FuncForPC keeps its normalized name and Func.FileLine keeps file/line in debug builds (via the dlsym fallback path); runtime.Caller/Callers were never affected because caller-frame instrumentation is independent of both switches. Debug builds lose only the section fast paths (first-use latency) and statement-level pc-line granularity, both redundant next to full DWARF. _lldb/runtest.sh: 194/194; cl and test/go suites pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
frameFuncForPC could cache a Func built from a pcline frame whose entry resolution failed (entry == 0); a later FuncForPC on the same PC would then observe Entry() == 0 where its own constructor falls back to pc. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
LLGO_FUNCINFO_SITES=0 keeps the funcinfo metadata tables but drops the body-embedded entry/stub/pc-line inline-asm sites. This is the narrow A/B needed to isolate codegen perturbation caused by the in-body asm anchors: with sites off, plain-code benchmarks match the no-funcinfo baseline within noise, while sites on shifts hot runtime-internal loops by -30%..+6% through inline/layout decisions. Semantics with sites off: FuncForPC(entry) and Func.FileLine(entry) keep working through the dlsym fallback path; statement/call-site granularity PC line lookup is disabled, and first-use table construction loses the section fast path. Tests assert the split: tables still materialize while entry/stub section asm, boundary symbols, and pc-line site labels are all absent. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jul 2, 2026
codecov/patch was failing at 51.77% (target 88.68%), but the shortfall was almost entirely benchmark/runtime_funcinfo/main.go — a standalone measurement harness with no unit tests by design (600 of 639 missed lines). Compiler-side changes were already covered (cl/instr.go 478/493, cl/compile.go 125/127). Ignore benchmark/** in codecov and cover the remaining internal/pclntab validation/lookup edges directly (96.2%). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Only macOS ran tests with -coverprofile, so lines behind OS-specific branches (ELF emission, per-OS runtime shims) always showed as missed in codecov/patch even though the ubuntu job executed them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Covers the ELF and Mach-O directive branches, 32-bit pointer directives, quote-escaped symbol names and empty-table emission from one table-driven test, so single-platform coverage runs stop reporting the other platform's branches as dead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jul 3, 2026
Implements the runtime funcinfo path on top of current xgo-dev/main, moving LLGo toward Go's pclntab/findfunctab model while keeping the metadata DCE-safe, plus the optimization round that followed review: Mach-O site sections, cold-path elimination, uint16 find index, Caller/Callers memoization, Frames.Next allocation removal, and debug-build gating.

Changes

- runtime.Caller, runtime.CallersFrames, runtime.FuncForPC, Func.FileLine.
- internal/pclntab: LLVM-independent faithful port of Go's findfunctab algorithm (uint8 deltas, overflow error, forward scan, sentinel), kept as the reference for link-phase work.
- Mach-O site sections (__DATA,__llgo_fie/__llgo_stub/__llgo_pcl, live_support, one linker-private atom symbol per record, section$start$ boundaries with the \x01 verbatim-name prefix): under ld64/lld -dead_strip a record survives exactly when its function is live, the same semantics ELF gets from SHF_LINK_ORDER with --gc-sections. macOS previously had no sites at all and fell back to one dlsym per function+stub on first use (13ms cold), and had no statement-level line info.
- Caller/Callers memoize interned synthetic PCs in shadow-stack slots (invalidated on line change); Frames.Next reuses the FuncForPC 4-way cache instead of allocating a *Func per frame.
- Debug builds (LLGO_DEBUG/LLGO_DEBUG_SYMBOLS) keep the funcinfo tables but drop the body-embedded site records: the in-body asm anchors shift instruction/scope layout enough to confuse debuggers (LLDB suite now 194/194). runtime.Caller is unaffected (instrumentation is independent); FuncForPC(entry)/FileLine(entry) work via the dlsym fallback; statement/call-site line granularity is disabled in debug builds.
- LLGO_FUNCINFO_DEBUG=1 prints per-table frames/buckets/index=built|fallback/entries=sites|dlsym so benchmarks can prove which path they measured.
- LLGO_FUNCINFO_SITES=0 keeps tables but drops all site asm (see Known limitations).

Benchmarks (final HEAD, 5–7 runs, best/trimmed avg)
macOS arm64 (host) and Linux arm64 (container); main = merge base; 576 target functions (24 pkgs × 24 methods). Index proof: LLGO_FUNCINFO_DEBUG=1 reports index=built for func and pcline tables in all measured scenarios.

¹ One-time per process; shows up in whichever cold.First* window exhausts the cold-lookup budget (platform-dependent). Go needs none because its linker ships the sorted table; this is the link-phase follow-up.

² Shadow-stack instrumentation tax; the main baseline is equally behind Go (18–20µs), so it is not introduced by this PR; unwinder follow-up.

Scope of the comparison: FileLineMany compares LLGo's statement/call-site granularity against Go's dense per-instruction pcvalue tables; "faster than Go" holds for this granularity, not for full pcvalue semantics.

Scaling: per-target batch cost is flat from 144 → 576 targets (6.5 → 6.8ns per FuncForPC target, LTO ~1.9ns), consistent with the bucket index.
Binary size and build time (main → this PR)

Composition (stdlib, +660KB): strings ~270KB (content-deduped, comparable density to Go's funcnametab), stub sites 115KB, entry sites 70KB, records+symbol-index+hash ~156KB, pcline <5KB. Anchor: the same program's Go __gopclntab is 1.41MB, so this PR's total metadata is ~47% of Go's equivalent structure. Build time +2–5% (non-LTO), +8–17% (LTO).

Impact on ordinary code (no runtime introspection)
Pure-compute benchmark (json/sort/map/fib), interleaved runs. Narrow A/B via LLGO_FUNCINFO_SITES=0 (tables kept, only in-body site asm removed):

sites-off ≈ main within noise on every metric: the metadata tables have zero runtime impact; all perturbation comes from the body-embedded volatile asm, in both directions. Go does not have this problem: its pclntab is generated by the linker and never enters the optimizer. Against Go, ordinary-code performance is dominated by the LLGo baseline (json 2.3×, sort 4.9×, map 3.1× slower on both main and this PR; fib 39% faster), so funcinfo's ±5% is second-order there.

Known limitations / follow-ups
- ftab/findfunctab generation: design doc in doc/design/pclntab-linkphase.md, internal/pclntab is the ready algorithm base.
- LTO inlining duplicates the body-embedded site records (frames 5726 → 15106 at 576 targets; ELF section 4.1×), registering host-function PCs under the inlinee's symbol ID; runtime only consults the table when native symbolization fails, which bounds the impact. IR-level fixes were tried and ruled out (!associated is linker-GC-only and GlobalDCE deletes the records; llvm.compiler.used pins dead functions; noduplicate blocks inlining), so dedup is link-phase work; !pcsections metadata is the candidate replacement for the asm mechanism (also removes the codegen perturbation above).
- main+LTO binaries fail at runtime on macOS in this harness; this PR's LTO passes all scenarios (root cause of the baseline failure not investigated here).

Local validation
macOS arm64 + Linux arm64 container: go test ./cl (full), go test ./internal/build ./internal/pclntab ./ssa, LLGO_ROOT=... go test ./test/go -run TestRuntime..., bash _lldb/runtest.sh (194/194), go fmt / go vet clean. Benchmark harness included under benchmark/runtime_funcinfo (reproduces all numbers above).
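For readers who want a quick sanity check outside the benchmark harness, the four runtime entry points this PR targets can be exercised with the standard Go API alone. The helpers `whereAmI` and `callerFrame` below are hypothetical and are not part of the harness; output will differ between toolchains and build modes.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// whereAmI resolves its own name, file, and line through runtime.Caller,
// runtime.FuncForPC, and Func.FileLine.
//
//go:noinline
func whereAmI() (name, file string, line int, ok bool) {
	pc, _, _, ok := runtime.Caller(0)
	if !ok {
		return "", "", 0, false
	}
	fn := runtime.FuncForPC(pc)
	if fn == nil {
		return "", "", 0, false
	}
	file, line = fn.FileLine(pc)
	return fn.Name(), file, line, true
}

// callerFrame symbolizes the frame of whoever called it, using
// runtime.Callers and runtime.CallersFrames.
func callerFrame() (runtime.Frame, bool) {
	pcs := make([]uintptr, 1)
	// skip=2 drops runtime.Callers itself and callerFrame.
	if runtime.Callers(2, pcs) == 0 {
		return runtime.Frame{}, false
	}
	frame, _ := runtime.CallersFrames(pcs).Next()
	return frame, frame.PC != 0
}

func main() {
	name, file, line, ok := whereAmI()
	fmt.Println("Caller/FuncForPC ok:", ok,
		"name matches:", strings.HasSuffix(name, "whereAmI"), file, line)

	if f, ok := callerFrame(); ok {
		fmt.Println("CallersFrames:", f.Function, f.File, f.Line)
	}
}
```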