runtime: link-phase ftab/findfunctab generation (Stage 2, P1–P3) by cpunion · Pull Request #2016 · xgo-dev/llgo

cpunion · 2026-07-02T11:46:41Z

Follow-up to #2012, implementing Stage 2 of the proposal in #2004: link-phase ftab/findfunctab generation, per doc/design/pclntab-linkphase.md (in this branch). P1 (analysis), P2 (automatic write-back + zero-copy runtime adoption) and P3 (Mach-O bind-record resolution, startup pre-warm, scale hardening) are complete; every llgo-linked executable on linux/darwin now ships a prebuilt function table that is adopted zero-copy and pre-warmed at startup.

How it works

After linkMainPkg, internal/build runs internal/pclnpost.Rewrite on the executable. The rewriter parses the funcinfo entry/stub site sections, identifies canonical records by FNV (a record survives when its anchor's owning symbol hashes to its symbolID, or is that symbol's __llgo_stub. wrapper; LTO inline copies fail this test exactly), sorts them, builds the findfunctab, and rewrites the entry section in place (header + {entryOff, funcIndex} ftab + buckets), voiding the merged stub section. Any failure degrades silently to #2012's first-use construction (LLGO_PCLNPOST=0 forces that path).
The runtime adopts the table zero-copy when the magic header validates: lookups binary-search the on-disk ftab through the shared bucket index; nothing is materialized on first use. Entries are true symbol starts, retiring the entry-PC slack on this path.
Mach-O bind records (P3): pointer slots that name exported functions — every __llgo_stub.* wrapper and any exported Go function — are chained-fixup BIND nodes, not rebases, even though they bind back into the same image. The rewriter resolves them through the LC_DYLD_CHAINED_FIXUPS imports table. Before this, all stub records were dropped as unowned, so FuncForPC on a function value silently fell back to dladdr (~6µs per fresh pc) on darwin — the real cause of the "mac cold gap" previously attributed to re-signed-binary page validation.
The prebuilt header's base slot is spliced back into the fixup chain as a live rebase node: dyld writes the slid text base at load, and the runtime reads a ready runtime PC with no slide arithmetic (on non-PIE ELF the link-time value already equals the runtime address). An insert that joins no chain page fails the rewrite loudly → first-use fallback.
Startup pre-warm: Go's pclntab pages are touched by its own runtime (traceback, GC) long before user code queries them; LLGo now mirrors that. When the prebuilt table is present, init adopts it, touches the pages the lookup path reads, runs one synthetic lookup, and write-warms the FuncForPC cache. First-in-process FuncForPC: darwin ~17µs → ~2.8µs, linux ~6.6µs → ~1.0µs. Hello-world startup medians are unchanged; non-prebuilt binaries stay fully lazy.
Overflow spill: when function-value stubs push the table past the entry section (~9k address-taken functions), the blob is written into the larger stub section behind a 32-byte LLGOFTB2 redirect header (live-relocation pointer); rows are never dropped.
Mach-O: the rewritten sections' remaining pointer slots are surgically unlinked from the dyld chained-fixup page chains before writing, then the binary is re-signed ad hoc.
LTO: metadata is located through a relocation-carried meta record in the entry section, immune to symbol-table internalization.
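The zero-copy lookup described above (binary search over the ftab, narrowed by a bucket index) can be sketched as follows. This is a minimal illustration, not the runtime's code: the field widths, the 4 KiB bucket granularity, and the names `ftabEntry`, `buildBuckets` and `lookup` are assumptions, and the sentinel row and subbucket deltas are omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// ftabEntry mirrors the {entryOff, funcIndex} row named in the description.
// The field widths are an assumption made for this sketch.
type ftabEntry struct {
	entryOff  uint32 // function start, relative to the text base
	funcIndex uint32 // index into the funcinfo records
}

// bucketSize is an assumed granularity (Go's findfunctab uses 4 KiB buckets).
const bucketSize = 4096

// buildBuckets records, for each bucket of text, the first ftab row whose
// entry lies at or after the bucket start. One extra slot closes the last bucket.
func buildBuckets(ftab []ftabEntry, textSize uint32) []uint32 {
	n := int((textSize + bucketSize - 1) / bucketSize)
	buckets := make([]uint32, n+1)
	for b := 0; b <= n; b++ {
		start := uint32(b) * bucketSize
		buckets[b] = uint32(sort.Search(len(ftab), func(i int) bool {
			return ftab[i].entryOff >= start
		}))
	}
	return buckets
}

// lookup returns the row with the greatest entryOff <= pc-base. It only
// searches the rows that start inside pc's bucket, then steps back one row.
func lookup(ftab []ftabEntry, buckets []uint32, base, pc uintptr) (ftabEntry, bool) {
	if pc < base {
		return ftabEntry{}, false
	}
	off := pc - base
	b := int(off / bucketSize)
	if b+1 >= len(buckets) {
		return ftabEntry{}, false
	}
	lo, hi := int(buckets[b]), int(buckets[b+1])
	i := lo + sort.Search(hi-lo, func(k int) bool {
		return uintptr(ftab[lo+k].entryOff) > off
	})
	if i == 0 {
		return ftabEntry{}, false // pc precedes the first function
	}
	return ftab[i-1], true
}

func main() {
	ftab := []ftabEntry{{0x10, 0}, {0x40, 1}, {0x1800, 2}, {0x2010, 3}}
	buckets := buildBuckets(ftab, 0x3000)
	base := uintptr(0x100000)
	e, ok := lookup(ftab, buckets, base, base+0x44)
	fmt.Println(e.funcIndex, ok)
}
```

Because both slices are only read, the same code would work over byte ranges mapped straight from the executable, which is the point of the zero-copy adoption.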

Full benchmark matrix

benchmark/runtime_funcinfo (committed in this branch, now covering hot/deep/multipkg/cold/stdlib and a plain ordinary-code scenario), 576 target functions (24 pkgs × 24 methods), 5 runs; each cell reads best/trimmed average. main = #2012's merge base (its main+lto binaries fail at runtime on macOS: observed, pre-existing). Hot paths are expected to match #2012 (same index once warm); this PR's deltas are the cold.First* rows, function-value lookups, and LTO table integrity.

macOS arm64

metric	go1.26	main	2012	2012+lto	2016	2016+lto
hot.Caller0	152/155ns	9.5µs	54/55ns	42/43ns	55/57ns	42/43ns
hot.Caller1	172/172ns	6.0µs	85/87ns	66/67ns	85/86ns	66/67ns
hot.CallersOnly	127/129ns	13.8µs	130/132ns	97/99ns	127/130ns	98/101ns
hot.CallersFramesFirst	271/273ns	n/a	441/455ns	331/337ns	452/458ns	338/343ns
hot.FuncForPCEntry	12/13ns	n/a	9/9ns	6/6ns	9/9ns	6/6ns
hot.FuncFileLineEntry	10/10ns	n/a	10/10ns	6/8ns	10/10ns	6/6ns
multipkg.FuncForPCMany	12.1/12.2µs	n/a	3.7/3.8µs	881/983ns	3.8/3.9µs	881/896ns
multipkg.FileLineMany¹	33.3/33.4µs	n/a	5.1/5.2µs	1.3/1.3µs	5.0/5.1µs	1.3/1.3µs
cold.FirstCaller0	3.8/4.1µs	FAIL	625/694ns	583/750ns	584/805ns	583/667ns
cold.FirstCallersFrames	3.0/3.4µs	FAIL	2.8/3.0µs	1.3/2.4µs	1.9/2.9µs	2.5/2.7µs
cold.FirstFileLine	750ns/1.1µs	FAIL	42/181ns	41/56ns	83/180ns	83/97ns
cold.FirstFuncForPC³	3.2/3.7µs	FAIL	40.7/41.2µs	24.2/26.3µs	17.3/18.5µs	15.9/17.5µs
stdlib.Work	5.0/5.1µs	14.9/16.6µs	13.7/14.6µs	11.7/13.0µs	13.3/14.1µs	11.8/12.1µs

Linux arm64 (container)

metric	go1.26	main	2012	2012+lto	2016	2016+lto
hot.Caller0	157/158ns	1.3µs	65/67ns	55/57ns	66/66ns	55/56ns
hot.Caller1	167/172ns	4.5µs	107/108ns	91/98ns	106/108ns	92/93ns
hot.CallersOnly	121/124ns	8.5µs	154/156ns	131/137ns	150/153ns	127/129ns
hot.CallersFramesFirst	424/429ns	n/a	531/540ns	437/446ns	531/534ns	432/439ns
hot.FuncForPCEntry	12/12ns	n/a	8/9ns	6/6ns	9/9ns	6/6ns
multipkg.FuncForPCMany	10.5/10.6µs	n/a	4.1/4.1µs	852/905ns	4.6/4.7µs	882/892ns
multipkg.FileLineMany¹	31.8/31.9µs	n/a	5.4/5.4µs	1.3/1.4µs	5.8/5.9µs	1.3/1.4µs
cold.FirstCaller0	3.5/4.2µs	FAIL	1.9/2.2µs	1.8/2.2µs	1.7/2.3µs	1.8/2.2µs
cold.FirstCallersFrames²	1.8/2.5µs	FAIL	55.9/57.4µs	55.8/56.5µs	55.1/57.7µs	53.5/55.4µs
cold.FirstFileLine	250/306ns	FAIL	84/139ns	42/139ns	125/167ns	41/83ns
cold.FirstFuncForPC³	1.6/2.1µs	FAIL	11.2/11.7µs	9.0/10.3µs	6.6/6.8µs	6.5/7.0µs
stdlib.Work	5.8/6.0µs	19.0/20.0µs	19.3/19.7µs	17.1/17.3µs	19.0/21.7µs	15.7/16.6µs

Fresh-pc slow-path lookups (function values, per-pc, split-timing probe)

The headline P3 effect. FuncForPC(reflect.ValueOf(fn).Pointer()) + Name() on a pc not yet cached, same probe built with both toolchains:

	go1.26 first-in-process	go1.26 per fresh pc	2012 per fresh pc	2016 first-in-process (warmed)	2016 per fresh pc
macOS	1.3–1.9µs	40–84ns	5–6µs (dladdr)	~2.8µs	1.2–1.7µs
Linux	~0.2µs	41–125ns	—	~1.0µs	0.4–0.7µs

Go's per-fresh-pc cost is near-zero because FuncForPC/Name() return zero-copy views into pclntab; LLGo still materializes name/file strings per function (three small allocations) — that remaining 0.5–1.7µs is the P4 zero-copy-name work, not lookup cost. History for mac cold.FirstFuncForPC: 13ms → 35–50µs (#2012) → 32.6µs (P2) → 15.9–18.5µs (bind fix) → ~3–5µs (pre-warm).

Scale benchmarks (packages × methods, call depth, big methods)

benchmark/runtime_funcinfo gained three scale dimensions (-scales, -depths, -bigsizes); the sweep below (3 runs, best) found and fixed three scale cliffs, all now covered in this PR:

Prebuilt blob overflow — function-value stubs can double the row count; at ~9k address-taken functions the blob outgrew the entry section and the rewrite silently fell back to first-use construction (cold.FirstFuncForPC 96×96 linux non-LTO: 2.4ms). Fixed by spilling the full blob into the (larger) stub section behind a 32-byte LLGOFTB2 redirect header with a live-relocation pointer; every row keeps the prebuilt table at any scale. (cold.FirstFuncForPC: 2.4ms → 4.3µs.)
FuncForPC cache thrash — the 4k-entry set-associative pc cache evicts constantly once the queried population outgrows it, and every call paid the string-materializing slow path (multipkg.FuncForPCMany 96×96: 8–19ms vs Go 172–180µs). Fixed with a per-ftab-row *Func cache: batch lookups are O(index search) after the first pass. (mac 96×96: 18.6ms → 364µs non-LTO, 147µs +LTO — the LTO cell beats Go's 179µs; mac 48×48: 20µs vs Go 46µs.)
First-lookup page-in — fixed earlier in this PR by the init pre-warm (see above); at every scale tested, cold.FirstFuncForPC is now 3–5µs on both platforms (Go: 2–3µs), and cold.FirstCaller0/FirstFileLine beat Go throughout.
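The per-ftab-row *Func cache from the second item can be pictured as one pointer slot per table row, filled on first use. A minimal sketch under stated assumptions: `rowCache`, `get` and the `materialize` callback are hypothetical names, and the real runtime's `Func` type and synchronization may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Func stands in for the runtime's materialized function descriptor.
type Func struct{ name string }

// rowCache holds one slot per ftab row. Unlike a fixed-size set-associative
// pc cache, it cannot thrash: every row has its own slot.
type rowCache struct {
	rows         []atomic.Pointer[Func]
	materialized atomic.Int64 // counts slow-path runs, for illustration only
}

func newRowCache(n int) *rowCache {
	return &rowCache{rows: make([]atomic.Pointer[Func], n)}
}

// get returns the cached *Func for row, materializing it at most once per
// winner; a racing loser discards its copy and adopts the stored one.
func (c *rowCache) get(row int, materialize func(int) *Func) *Func {
	if f := c.rows[row].Load(); f != nil {
		return f
	}
	f := materialize(row)
	c.materialized.Add(1)
	if c.rows[row].CompareAndSwap(nil, f) {
		return f
	}
	return c.rows[row].Load()
}

func main() {
	c := newRowCache(4)
	mk := func(row int) *Func { return &Func{name: fmt.Sprintf("pkg.fn%d", row)} }
	a := c.get(2, mk)
	b := c.get(2, mk)
	fmt.Println(a == b, c.materialized.Load())
}
```

The trade-off in this sketch is memory proportional to the row count in exchange for a slow path that runs once per function rather than once per eviction.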

Big-method scenario (`bigfunc`, funcs × statements)

Large bodies are where the compact pcline index shines — Go re-decodes its pcvalue tables per query, LLGo does an index lookup:

16 funcs × 2000 stmts (mac / linux)	go1.26	2016	2016+lto
bigfunc.FileLineMid (statement-level, mid-body pc)	84µs / 86µs	175ns / 169ns	49ns / 49ns
bigfunc.CallersFramesMid	1.0µs / 914ns	322ns / 442ns	262ns / 366ns
bigfunc.FirstFileLineMid	6.4µs / 5.2µs	583ns / 1.4µs	1.6µs / 917ns
bigfunc.FuncForPCMid	63ns / 65ns	147ns / 142ns	36ns / 37ns

(main cannot run this probe at all — mid-body pc symbolization did not exist before this line of work.)

Honest remaining gaps at scale (all quantified, none regressions of this PR)

Deep call chains (depth 512): ~87–92µs vs Go 9.7µs — ~160ns/frame shadow-stack instrumentation tax on runtime-importing packages; same on main. This is the Stage 5 unwinder motivation, now quantified. bigfunc.Work's 13.7× gap has the same cause (verified: LLGO_FUNCINFO_SITES=0 is equivalent, so it is not the site records).
Batch lookups over more than 4k distinct pcs, non-LTO: ~40ns/query on mac but ~440ns/query on linux hardware (cache-bound row search + pc-cache thrash writes). The P4 zero-copy name format removes the remaining per-query work.
pcline first-use build at scale (FirstFileLine after spill 96×96: ~400µs once per process; FirstCallersFrames linux ~55µs) — P4 prebuilt pcline table.
Binary size at extreme function counts: multipkg 96×96 (18k+ functions, all address-taken) 15.4–18.3 MiB vs Go 9 MiB — dominated by the name string pool; P4 section-shrinking/joined-name work. Typical scenarios stay comparable (cold_96x96: 5.8 vs Go 6.3 MiB; stdlib: 4.0 vs 5.0 MiB).

Ordinary-code performance (`plain` scenario, no runtime introspection)

Pure compute, no introspection (plain.fib30 reads its depth from the environment so full LTO cannot fold it away). This PR inherits #2012's characteristics unchanged — the rewrite only edits metadata sections:

macOS

metric	go1.26	main	2012	2012+lto	2016	2016+lto
plain.fib30	1.8ms	1.5ms	1.8ms	1.8ms	1.8ms	1.8ms
plain.json	1.2ms	3.2ms	3.2ms	2.3ms	3.2ms	2.2ms
plain.sort	9.8ms	21.2ms	21.2ms	9.6ms	21.1ms	9.6ms
plain.map	3.8ms	12.6ms	12.8ms	11.5ms	12.9ms	11.2ms

Linux

metric	go1.26	main	2012	2012+lto	2016	2016+lto
plain.fib30	2.2ms	1.5ms	1.8ms	1.9ms	1.8ms	1.8ms
plain.json	1.3ms	3.4ms	3.4ms	2.4ms	3.4ms	2.4ms
plain.sort	9.8ms	19.2ms	19.2ms	9.7ms	19.4ms	9.7ms
plain.map	4.2ms	13.1ms	13.2ms	11.8ms	13.3ms	11.9ms

2016 tracks 2012 within noise on every metric (±0–2%); vs main the only consistent delta is fib30 (+20%, both funcinfo branches equally — body-embedded site asm shifting inline/layout decisions in deep recursion).

Narrow A/B via LLGO_FUNCINFO_SITES=0 (documented in #2012) still holds: the tables are free at runtime; the ±% vs main comes from body-embedded site asm shifting inline/layout decisions, and is bidirectional. Against Go, ordinary-code deltas are dominated by the LLGo baseline itself (identical on main).

Binary size (identical between 2012 and 2016 — in-place rewrite does not change file size)

scenario	go1.26	main	2012 / 2016	+lto
mac multipkg	2.7 MiB	2.4 MiB	3.1 MiB	2.7 MiB
mac stdlib	5.0 MiB	3.4 MiB	4.0 MiB	3.5 MiB
mac plain	2.8 MiB	2.3 MiB	2.7 MiB	2.3 MiB
linux multipkg	2.7 MiB	2.4 MiB	3.3 MiB	3.6 MiB
linux stdlib	5.0 MiB	3.3 MiB	4.0 MiB	4.5 MiB
linux plain	2.7 MiB	2.2 MiB	2.7 MiB	3.0 MiB

Build time: llgo stdlib 18–20s (non-LTO) / 20–24s (+LTO) on both platforms; the post-link rewrite itself is well under a second per binary.

LTO table integrity

LTO inlining duplicates body-embedded records into host functions; FNV canonical detection removes them exactly (~8.0k inline copies dropped per LTO probe at this scale) — eliminating both misattribution risk and runtime table pollution. With P3's bind resolution the mac tables also gained all __llgo_stub.* and exported-function records (probe ftab: 1.9k → 3.3k rows).
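The canonical-record test can be sketched in a few lines. This is an illustration only: the source says the ID is a 64-bit FNV hash but does not name the variant or the exact string that is hashed, so FNV-1a over the plain symbol name is an assumption here, as are the `record` type and the `canonical` helper.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// record is a hypothetical view of one site record after the rewriter has
// mapped its anchor address to the symbol that owns it.
type record struct {
	owner    string // symbol whose text range contains the record's anchor
	symbolID uint64 // ID baked into the record at compile time
}

const stubPrefix = "__llgo_stub."

// symbolID hashes a symbol name. FNV-1a is an assumption of this sketch.
func symbolID(name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return h.Sum64()
}

// canonical keeps a record when its owner hashes to the record's ID, or when
// the owner is the __llgo_stub. wrapper of the symbol that does. A copy that
// LTO inlined into another function is owned by the host and fails both tests.
func canonical(r record) bool {
	if symbolID(r.owner) == r.symbolID {
		return true
	}
	if strings.HasPrefix(r.owner, stubPrefix) {
		return symbolID(strings.TrimPrefix(r.owner, stubPrefix)) == r.symbolID
	}
	return false
}

func main() {
	id := symbolID("main.f")
	fmt.Println(canonical(record{"main.f", id}))             // the function itself
	fmt.Println(canonical(record{"__llgo_stub.main.f", id})) // its stub wrapper
	fmt.Println(canonical(record{"main.host", id}))          // inline copy in a host
}
```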

¹ LLGo statement/call-site granularity vs Go's dense per-instruction pcvalue — faster at this granularity, not full pcvalue semantics (P4).
² cold.FirstCallersFrames on Linux measures the pcline table's first-use build, which this PR does not cover (P4); it dominates whichever First* window triggers it.
³ Measured before the startup pre-warm landed; the scale runs (with pre-warm) show 3–5µs at every size on both platforms.

Validation

macOS: go test ./cl (full, 460s) ok, test/go ok, _lldb/runtest.sh 194/194 (0 failed).
Linux container: test/go ok; end-to-end probes (entries=prebuilt stubs=prebuilt, correct name/file/line, ±LTO).
Unit test for the canonical-owner matcher (Mach-O suffix-shared symbol strings can surface one mangling underscore more or less — matching is normalized and checked against the record's 64-bit FNV id).
Re-validated after the pre-warm and scale-hardening commits: macOS cl + test/go + internal/build + internal/pclnpost + LLDB suite; Linux test/go; 96×96 spill probe end-to-end (30.4k-row table adopted through the redirect header, names/lines correct).
Falsified along the way (recorded in the design doc): "sacrificial fixups to pre-touch table pages" — modern dyld uses page-in linking, so fixups are applied lazily at first touch; adding them does not move the cost. The earlier "re-signed-binary page validation" explanation for the mac cold gap was wrong; the actual cause was the bind-encoded stub records above.

Staging status (per #2004 / design doc)

P1 analysis + P2 automatic rewrite + P3 bind resolution / pre-warm / overflow spill / scale caches: done (this PR).
P4: prebuilt pcline table (statement sites) + pcvalue-style instruction-level lines; zero-copy joined-name strings (removes the remaining 0.5–1.7µs per-fresh-pc materialization and the batch-lookup gap vs Go); evaluate !pcsections as the site-record mechanism; section shrinking for the size delta.
Stage 5 (unwinder) now has hard numbers: the shadow-stack tax is ~160ns/frame (deep_512: 87–92µs vs Go 9.7µs) and dominates bigfunc.Work/stdlib.Work.

Depends on #2012 (branched from it; will rebase once it merges).

🤖 Generated with Claude Code

Bring over the cross-branch runtime funcinfo benchmark (hot, deep, multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own performance numbers. cold.FirstCallersFrames now walks to the first fully symbolized frame, because synthetic runtime frames (LLGo's runtime.Callers placeholder) carry no file/line and the metric was silently skipped on LLGo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

macOS previously had no entry/stub/pcline site sections, so first-use funcinfo initialization fell back to one dlsym per function and per stub (13ms cold on a small binary, 27ms with LTO), and statement-level pc-line records did not exist at all. Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the live_support attribute: under ld64/lld -dead_strip a live_support atom survives only if the atom it references (the anchor label inside the function body) is live, which matches the records-follow-function semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O subsections-via-symbols treats visible labels as atom boundaries, and a visible label in the middle of a function let the linker split and reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted with the \x01 verbatim-name prefix so LLVM does not prepend the Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the sections (and their boundary symbols) present even when no package contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the frames; each dlsym is a dynamic-loader query and the loop dominated cold latency.
- Materialize per-function strings and entry PCs once per function and packed file strings once per file ID during pcline table construction instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact function-value PCs with a bounded linear scan of the raw entry-site and stub-site sections (compile-time data, no loader query), then one dladdr as fallback; both require an entry match within the warm path's slack so stripped-local misattribution is impossible. The path is budgeted: after a handful of cold lookups the sorted table amortizes better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to ~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index abandonment on delta overflow is gone. Go stores uint8 deltas because its linker guarantees a 16-byte minimum function size; LLGo indexes call-site records that sit a few bytes apart, and a dense 4KiB bucket silently degraded every lookup in the process to a full binary search. A delta counts deduplicated PCs inside one bucket, so it is bounded by the bucket size and uint16 cannot overflow.

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built table (frame/bucket counts, index built or fallback, sites vs dlsym sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
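The uint8-versus-uint16 argument for subbucket deltas can be checked with a little arithmetic. The sketch below assumes Go's findfunctab geometry (4 KiB buckets split into 16 subbuckets of 256 bytes); `subbucketDeltas` and `spaced` are illustrative helpers, not functions from this branch.

```go
package main

import "fmt"

const (
	bucketSize    = 4096
	subbuckets    = 16
	subbucketSize = bucketSize / subbuckets // 256 bytes
)

// subbucketDeltas returns, for each subbucket of the bucket that starts at
// bucketStart, how many of the given distinct pcs inside that bucket lie
// before the subbucket's start.
func subbucketDeltas(pcs []uint32, bucketStart uint32) [subbuckets]uint16 {
	var deltas [subbuckets]uint16
	for _, pc := range pcs {
		if pc < bucketStart || pc >= bucketStart+bucketSize {
			continue
		}
		sub := (pc - bucketStart) / subbucketSize
		for s := sub + 1; s < subbuckets; s++ {
			deltas[s]++
		}
	}
	return deltas
}

// spaced fills one bucket with pcs placed every step bytes.
func spaced(start, step uint32) []uint32 {
	var pcs []uint32
	for pc := start; pc < start+bucketSize; pc += step {
		pcs = append(pcs, pc)
	}
	return pcs
}

func main() {
	// One entry per 16 bytes (Go's minimum function size): the largest
	// delta stays within a uint8.
	fmt.Println(subbucketDeltas(spaced(0x4000, 16), 0x4000)[subbuckets-1])
	// Call-site records 4 bytes apart: the largest delta no longer fits
	// in a uint8 but is far below the uint16 limit.
	fmt.Println(subbucketDeltas(spaced(0x4000, 4), 0x4000)[subbuckets-1])
}
```

Even at one record per byte the last delta would be 3840, which is why a uint16 cannot overflow for a 4 KiB bucket.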

Every Caller/Callers capture used to intern the frame into the synthetic table: a hash probe plus a full frame comparison per stack slot per call. Memoize the interned PC base in the shadow-stack slot and invalidate it when the recorded line changes (for one entry the instrumented name/file operands are constants, so the line is the only thing that varies between call sites). The three static frames emitted around every Callers walk get per-store memo slots, and the emit loop is unrolled so nothing escapes and skipped frames are never captured. macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns. hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…py limit Frames.Next allocated a fresh *Func per symbolized frame; route it through the FuncForPC 4-way cache so repeated CallersFrames walks over the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns (338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns. Also document a pre-existing limitation at the entry-site emitter: the body-embedded inline-asm record is duplicated by LTO inlining into every inline site (~4x section growth on multipkg) and registers host-function PCs under the inlinee's symbol ID. Runtime only consults the table when native symbolization fails, which bounds the impact; the fix (data globals with !associated metadata) needs LLVMGlobalSetMetadata in the llvm binding and lands with the link-phase ftab work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Record the experiment results at the emitter: !associated only guides linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used pins dead functions through the records' address initializers; and noduplicate blocks inlining. Section dedup is link-phase work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Post-link table generation plan: parse the linked binary's metadata sections, dedup LTO inline copies against the symbol table, sort with a sentinel, build Go-layout findfunctab via internal/pclntab, and write back into a reserved section with ASLR-safe anchor offsets. Runtime adopts the prebuilt table when the header validates and keeps first-use construction as fallback. Includes the list of platform facts established in xgo-dev#2012 so implementation does not re-derive them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov/patch was failing at 51.77% (target 88.68%), but the shortfall was almost entirely benchmark/runtime_funcinfo/main.go — a standalone measurement harness with no unit tests by design (600 of 639 missed lines). Compiler-side changes were already covered (cl/instr.go 478/493, cl/compile.go 125/127). Ignore benchmark/** in codecov and cover the remaining internal/pclntab validation/lookup edges directly (96.2%). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Only macOS ran tests with -coverprofile, so lines behind OS-specific branches (ELF emission, per-OS runtime shims) always showed as missed in codecov/patch even though the ubuntu job executed them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Covers the ELF and Mach-O directive branches, 32-bit pointer directives, quote-escaped symbol names and empty-table emission from one table-driven test, so single-platform coverage runs stop reporting the other platform's branches as dead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First stage of doc/design/pclntab-linkphase.md: parse a linked binary's funcinfo entry/stub sections (Mach-O and ELF), deduplicate LTO inline copies against the symbol table's text ranges, sort with a Go-style sentinel, and build findfunctab through internal/pclntab, the faithful port that has been waiting for exactly this caller. Read-only: prints what the P2 build integration would write back.

Measured on the 576-target multipkg binaries:

- non-LTO: 9319 records -> ftab 3161 + 207 buckets; lookup self-check 3160/3160; site sections 149KB -> 29KB (5.1x)
- LTO: 15371 entry records -> 13857 inline copies dropped, 4144 kept; self-check 3045/3045; 299KB -> 28.5KB (10.5x)

Findings for P2: on-disk Mach-O pointer slots hold dyld chained-fixup encodings (low 36 bits are the target; decoded here; the write-back design stores anchor-relative offsets and avoids pointers entirely), and some non-LTO stub symbols are absent from the symbol table (records conservatively dropped; needs tightening).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…adoption pclnpost -write rewrites the entry-site section in place with the prebuilt table (header + ftab {entryOff,funcIndex} + runtime-layout findfunctab buckets), resolving funcinfo indexes through the binary's symbol-index section, and voids the stub section (its records are merged into the table). ASLR is handled by anchoring on the section's own link-time address; entries are normalized to true symbol starts, which retires the entry-PC slack on this path. macOS re-signs with an ad-hoc codesign after rewriting. The runtime adopts the table zero-copy when the magic header validates: lookups binary-search the on-disk ftab directly through the shared bucket index, nothing is materialized on first use (the funcIndex -> entry map is built lazily and only for the pcline initializer), and the cold scan/dladdr path is skipped since adoption is cheap. First-use construction remains the fallback whenever the header is absent. Linux end-to-end: entries=prebuilt, FuncForPC/FileLine correct, first-FuncForPC 110µs (materializing) -> 6-8µs (zero-copy); 13ms on the original macOS baseline. Known gap: on macOS the on-disk rewrite is corrupted at load time because dyld still walks the stale chained-fixup chain over the section; fix (unlinking the section's nodes from the page chains in LC_DYLD_CHAINED_FIXUPS) is identified and next. Non-prebuilt paths verified regression-free: cl + test/go suites pass, smoke behavior unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Every llgo-linked executable (linux/darwin, sites enabled) now gets the prebuilt ftab/findfunctab automatically: internal/build runs internal/pclnpost.Rewrite after linkMainPkg, and any failure degrades silently to the first-use construction fallback.

Moves the tool core into internal/pclnpost and hardens it:

- Canonical-record detection by FNV: a record survives when its anchor's owning symbol hashes to the record's symbolID (or is the __llgo_stub. wrapper of it). The previous one-per-symbolID rule wrongly collapsed a function with its stub (they share the target's symbolID by design), which broke exact-entry lookups (caught by TestRuntimeLineInfoAndStack on Linux). LTO inline copies are now identified exactly: 8.4k/9.5k copies removed in the LTO probes.
- Mach-O chained-fixups surgery: unlink the rewritten sections' pointer slots from the dyld page chains (repointing predecessors' next links and page_start entries) so dyld neither rebases slots inside the new table nor skips unrelated fixups after the zeroed stub section, then re-sign ad hoc. Without this the table was corrupted at load.
- LTO-safe metadata location: the entry section carries a meta record whose relocations hold the addresses of the symbol-index pointer and count globals; LTO internalization strips those names from the symbol table but relocations always resolve. Runtime skips the meta rows (pc==0 / symbolID==0).
- Idempotence guard (already-rewritten binaries are left alone).

Runtime fixes that surfaced during validation:

- materializePrebuiltEntries is now two-phase so concurrent losers wait for the winner's store instead of reading a nil entries slice.
- pcLineFrameForPC rejects nearest-below sites whose entry is unresolved when the caller knows the function entry, instead of leaking a neighboring function's file/line.

Validation: macOS cl (full) + test/go + LLDB 194/194; Linux test/go TestRuntime suite; probes on both platforms report entries=prebuilt with first-FuncForPC at 7-21µs (Linux) from 13ms on the original baseline, and LTO builds drop 8-9.5k inline copies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…table

On Mach-O, pointer slots that name exported functions (every __llgo_stub.* wrapper and any exported Go function) are emitted as chained-fixup BIND nodes, not rebases. The rewriter only decoded rebase nodes, so all stub records (and some entry records) were dropped as unowned and never reached the prebuilt ftab; FuncForPC on function values silently fell back to dladdr (~6µs per fresh pc on darwin).

- Parse the LC_DYLD_CHAINED_FIXUPS imports table and resolve bind ordinals to their in-image definitions.
- Match canonical owners against the record symbolID with underscore normalization (debug/macho's suffix-shared string table can surface one mangling underscore more or less than the source-level name).
- Splice the prebuilt header's base slot back into the fixup chain as a live rebase node: dyld writes the slid text base at load, so the runtime reads a ready runtime PC with no slide arithmetic (non-PIE ELF link-time values already equal runtime addresses).
- LLGO_PCLNPOST=0 escape hatch keeps first-use construction.

Fresh-pc FuncForPC slow path: darwin 6-8µs -> 1.2-1.7µs, linux 6.8µs -> 0.5µs; first-in-process lookup: darwin ~32µs -> ~14µs, linux ~6.8µs -> ~4µs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
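To make the rebase/bind distinction concrete, here is a small decoder for one on-disk pointer slot. It assumes the 64-bit chained-pointer layout from Apple's fixup-chains header (target in the low 36 bits for rebases, ordinal in the low 24 bits for binds, a 12-bit next field at bit 51, the bind flag in bit 63); other pointer formats exist and are not handled. The `imports` slice and `defined` map are hypothetical stand-ins for the parsed imports table and the image's own symbol table.

```go
package main

import "fmt"

// chainedPtr is the decoded form of one 64-bit chained-fixup slot.
type chainedPtr struct {
	bind    bool
	target  uint64 // rebase only: low 36 bits
	ordinal uint32 // bind only: index into the imports table
	next    uint16 // distance to the next fixup, in 4-byte strides
}

func decode(raw uint64) chainedPtr {
	p := chainedPtr{
		bind: raw>>63 == 1,
		next: uint16((raw >> 51) & 0xFFF),
	}
	if p.bind {
		p.ordinal = uint32(raw & 0xFFFFFF)
	} else {
		p.target = raw & 0xFFFFFFFFF
	}
	return p
}

// resolve maps a slot to an in-image address. Rebases carry the target
// directly; binds name an import that is looked up among the image's own
// definitions, since these binds point back into the same image.
func resolve(p chainedPtr, imports []string, defined map[string]uint64) (uint64, bool) {
	if !p.bind {
		return p.target, true
	}
	if int(p.ordinal) >= len(imports) {
		return 0, false
	}
	addr, ok := defined[imports[p.ordinal]]
	return addr, ok
}

func main() {
	imports := []string{"_puts", "___llgo_stub.main.f"}
	defined := map[string]uint64{"___llgo_stub.main.f": 0x5000}

	rebase := decode(uint64(0x4000) | uint64(2)<<51)
	bind := decode(uint64(1)<<63 | uint64(3)<<51 | 1)

	fmt.Println(resolve(rebase, imports, defined))
	fmt.Println(resolve(bind, imports, defined))
}
```

A decoder that handles only the rebase branch would treat every bind slot as unowned, which is the failure this commit describes.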

Pure-compute probes (recursive fib, JSON round-trip, sort.Ints, map churn) with no runtime introspection, so one harness run covers both the introspection extremes and what the funcinfo machinery costs code that never asks for it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Go's pclntab pages are touched by its own runtime (traceback, GC) long before user code queries it, so its first FuncForPC never pays page-in. Mirror that: when the prebuilt table is present, init adopts it (zero-copy, sub-µs), touches the pages the lookup path reads (blob, funcinfo records, string offsets, strings), runs one synthetic lookup to warm the code paths, and write-warms the FuncForPC cache pages. First-in-process FuncForPC: darwin ~17µs -> ~2.8µs, linux ~6.6µs -> ~1.0µs. Startup cost is page-count-bound (tens of µs on stdlib-sized tables, invisible next to ~3ms process startup; hello-world medians unchanged). Non-prebuilt binaries stay fully lazy: first-use construction allocates, which has no place in init, and programs that never introspect pay nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
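The page-touching step amounts to reading one byte per page of each region the lookup path will later read. A minimal sketch, assuming the table is visible as a byte slice; `touchPages` is a hypothetical helper and the real init code covers several regions (blob, funcinfo records, string offsets, strings) rather than one slice.

```go
package main

import (
	"fmt"
	"os"
)

// touchPages reads one byte from every page that b spans, so that later
// lookups do not pay the first-touch page-in. The XOR-ed result is returned
// to keep the loads from being optimized away.
func touchPages(b []byte, pageSize int) (pages int, sink byte) {
	for off := 0; off < len(b); off += pageSize {
		sink ^= b[off]
		pages++
	}
	return pages, sink
}

func main() {
	table := make([]byte, 64<<10) // stand-in for the mapped prebuilt table
	pages, _ := touchPages(table, os.Getpagesize())
	fmt.Println("pages touched:", pages)
}
```

The cost scales with the page count, which matches the commit's remark that startup cost is page-count-bound.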

-depths generates deep_<N> scenarios at configurable call depths; -bigsizes generates bigfunc scenarios (funcs x statements) whose large bodies stress statement-level pcline density, mid-function pc symbolization, and ordinary performance of big method bodies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Blob overflow: function-value stubs can double the row count, and at ~9k functions the prebuilt blob no longer fit the entry section, so the rewrite silently fell back to first-use construction (cold.FirstFuncForPC 96x96 non-LTO: 2.4ms). On overflow, retry with function entries only: stub pcs degrade to dladdr, real entries keep the prebuilt table.
- FuncForPC cache thrash: the set-associative pc cache holds 4k entries; batch workloads over 9k+ distinct functions evicted constantly and paid the string-materializing slow path on every call (multipkg.FuncForPCMany 96x96: 8-11ms vs Go 172µs). Add a per-ftab-row *Func cache for exact-entry lookups, so batch lookups are O(binary search) after the first pass at any scale.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…erflow Function-value stubs can push the row count past what the entry section holds (~9k functions with taken addresses). Instead of dropping stub rows, write the full blob into the (larger) stub section and leave a 32-byte redirect header ("LLGOFTB2" + a live-relocation pointer) in the entry section; the runtime follows it and adopts the same zero-copy view. Function-value lookups keep the prebuilt table at any scale instead of degrading to dladdr. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

funcForPCSlow treated any unaligned pc as a shadow-stack synthetic marker. arm64 function entries are always 4-aligned so this never fired, but amd64 function and stub entries need not be: an unaligned function-value pc skipped the prebuilt exact-entry path entirely and fell through to nearest-below symbolization, reporting the previous function's name (test/go TestRuntimeLineInfoAndStack on ubuntu CI, "bad function value func: main.renamedPC"). Hoist the prebuilt exact-entry + per-row-cache lookup ahead of the alignment heuristic; a genuine synthetic pc just misses the cheap search and proceeds as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The overflow fallback dropped stub rows to fit the entry section. That leaves pc ranges the table claims to cover but does not: a function value whose stub falls in a gap resolves nearest-below to the previous function and silently returns the wrong name — exactly what ubuntu CI caught (amd64 --icf=safe layouts overflow by a few hundred bytes, and non-PIE ELF dladdr cannot rescue). If the blob fits neither the entry section nor the (larger) stub section, skip the rewrite entirely: first-use construction is slower but covers every record. Reproduced and verified on linux/amd64 (qemu): the stub pc had no exact row and nearest-below returned the neighbouring function's name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
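The resulting placement policy has three outcomes and never a partial table. A sketch of that decision, with hypothetical names and with the requirement that the entry section can hold the 32-byte redirect header added as an assumption:

```go
package main

import "fmt"

type placement int

const (
	inPlace placement = iota // blob rewritten into the entry section
	spill                    // blob in the stub section, redirect in the entry section
	skip                     // binary left untouched; first-use construction covers every record
)

const redirectSize = 32

// choose picks where the prebuilt blob goes. It never trims rows to make the
// blob fit: a table with gaps would answer nearest-below with the wrong name.
func choose(blobSize, entrySize, stubSize int) placement {
	switch {
	case blobSize <= entrySize:
		return inPlace
	case blobSize <= stubSize && redirectSize <= entrySize:
		return spill
	default:
		return skip
	}
}

func main() {
	fmt.Println(choose(100, 200, 400) == inPlace)
	fmt.Println(choose(300, 200, 400) == spill)
	fmt.Println(choose(500, 200, 400) == skip)
}
```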

…rgery Fabricated fixtures make the IO paths testable in-process: a minimal ELF exercises load/Rewrite end to end (in-place, stub-section spill, and the overflow fallback that must leave the binary untouched), and a synthetic Mach-O image drives the chained-fixup chain surgery (remove+splice, empty-page insert, unconsumed-insert error). Package coverage 16% -> 69%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A fabricated Mach-O (segments, sections, symtab, chained-fixup imports and an empty page chain) drives load, bind-target resolution, record decoding and both Rewrite outcomes (in-place and stub-section spill) end to end. codesign now runs only when the input carries LC_CODE_SIGNATURE: real lld executables always do, unsigned inputs need no signature and codesign rejects them. Also cover asmQuoteELFSymbol, the empty-table initializers and the Rewrite error paths. Package coverage: pclnpost 69% -> 86%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion and others added 30 commits July 2, 2026 03:27

build: add Go-style pclntab findfunc index

2870ba8

ssa: emit DCE-safe function metadata

f365b73

runtime: add line info for stack frames

e81a007

runtime: add statement line caller frames

1e15db2

runtime: compress funcinfo table

407e0dd

test: cover indirect runtime caller paths

a3115cd

cl: narrow runtime caller tracking

c6c128a

runtime: add compact pc-line funcinfo table

4692e15

runtime: optimize FuncForPC metadata lookup

b29906c

runtime: slim FuncForPC cache hot path

734368f

cl: make pc-line labels clone-safe

85b1c13

runtime: guard funcinfo table initialization

b33a774

runtime: fix funcinfo entry pc line metadata

8e6f33c

runtime: publish funcinfo records for live stubs

8eec8e0

runtime: skip ELF stub-site records during LTO

01913f4

runtime: reduce funcinfo lookup initialization cost

6fa875b

test: cover runtime caller metadata edges

34b274c

runtime: speed up funcinfo entry lookup

3b87c90

test: cover pcline metadata in dev lto coverage

012095b

runtime: speed up funcinfo hot paths

9d90419

runtime: avoid FuncForPC cache thrashing

1b8bc56

runtime: use Go-style funcinfo find index

cf33354

runtime: use static funcinfo symbol index

bb1b1a3

cpunion force-pushed the codex/pclntab-linkphase-p1 branch from f91edf9 to 89a1b4e Compare July 3, 2026 01:50

cpunion force-pushed the codex/pclntab-linkphase-p1 branch 2 times, most recently from 201b7bb to 1a9ca68 Compare July 3, 2026 02:33

cpunion and others added 2 commits July 3, 2026 11:00

ci: exclude dev tools and test scaffolding from coverage

83edf88

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion mentioned this pull request Jul 3, 2026

runtime: LLGo-owned frame-pointer unwinder (Stage 5) #2019

Open

cpunion and others added 14 commits July 3, 2026 13:40

doc: record P3 findings in pclntab-linkphase design

7ab0a84

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

benchmark: keep full LTO from constant-folding the plain fib probe

b8078d4

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion force-pushed the codex/pclntab-linkphase-p1 branch from 1a9ca68 to aa4ef20 Compare July 3, 2026 05:42

cpunion mentioned this pull request Jul 3, 2026

runtime,cl,build: stage5 cleanup — explicit FP pairing flag and tidy-ups #2022

Closed

cpunion wants to merge 54 commits into xgo-dev:main from cpunion:codex/pclntab-linkphase-p1
