runtime,cl: stack-keyed sampled memory profiling with gc semantics by cpunion · Pull Request #2027 · xgo-dev/llgo

cpunion · 2026-07-04T07:24:09Z

Third follow-up planned in #2004: gc-shaped heap profiling (reimplements the goal of #1905 on the unwinder base; #1902's records were size-class counters without stacks).

What changes

Stack-keyed sampled records: sampled allocations attribute to physical call stacks at exact statement lines. Records hold raw sampled counts — consumers (pprof's scaleHeapSample, goroot heapsampling.go) apply the Poisson correction themselves, exactly as with gc. Getting this wrong is invisible in single-size microbenchmarks and only shows up as a uniform skew under real workloads.
Sampling mirrors mcache.nextSample: bytes count down to an exponentially distributed threshold (mean MemProfileRate), one sample per crossing, redraw. The memoryless distribution is load-bearing — with any bounded-support threshold, a near-periodic allocation pattern phase-locks the sampling points onto the large sites (measured: 1.6× per-site skew on heapsampling's interleaved sizes before switching).
Capture: FP walk at sample time through a hook the public runtime registers; allocator plumbing (including the hook's __llgo_stub. wrapper frame) trimmed at read time; a reentrancy flag spans the whole decision path (threshold drawing and bucket nodes allocate — a recursive sample overflows the stack, found the hard way).
Attribution support in cl: heap allocations get statement anchors in tracked functions, and a package that reads the memory profile pins all its trackable functions — per-site attribution loses sites to inlining otherwise (profiling packages are rare; gc gets both via its inline tree, which is the P4 path to lifting this).
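
The countdown scheme described above can be sketched as follows. This is a minimal illustration and not the PR's code: the `sampler` type, its fields and `newSampler` are hypothetical names, and the sketch draws thresholds with `math/rand`'s `ExpFloat64`, which the runtime core itself cannot import.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampler is a hypothetical stand-in for per-thread sampling state.
type sampler struct {
	rate      int64 // mean bytes between samples; plays the role of MemProfileRate
	countdown int64 // bytes remaining until the next sample
	rng       *rand.Rand
	busy      bool // reentrancy flag: set while a sample decision is in progress
}

func newSampler(rate, seed int64) *sampler {
	s := &sampler{rate: rate, rng: rand.New(rand.NewSource(seed))}
	s.countdown = s.nextThreshold()
	return s
}

// nextThreshold draws an exponentially distributed byte distance with mean rate.
func (s *sampler) nextThreshold() int64 {
	return int64(s.rng.ExpFloat64()*float64(s.rate)) + 1
}

// alloc reports whether an allocation of size bytes should be sampled.
func (s *sampler) alloc(size int64) bool {
	if s.rate <= 0 || s.busy {
		return false // profiling off, or already inside a sample decision
	}
	if s.rate == 1 {
		return true // rate 1 records every allocation
	}
	s.countdown -= size
	if s.countdown > 0 {
		return false
	}
	// One sample per crossing, then redraw. In a real allocator the redraw and
	// the bucket bookkeeping may allocate, so the flag covers all of it.
	s.busy = true
	s.countdown = s.nextThreshold()
	s.busy = false
	return true
}

func main() {
	s := newSampler(512<<10, 1)
	samples := 0
	for i := 0; i < 1<<20; i++ {
		if s.alloc(512) {
			samples++
		}
	}
	fmt.Println("sampled allocations:", samples)
}
```

Because the exponential distribution is memoryless, the distance to the next sample does not depend on where the previous one landed, so a periodic allocation pattern cannot lock onto the sampling points.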

Conformance

goroot heapsampling.go passes on darwin/arm64 and linux/arm64 (the latter was a pre-existing gap on main); xfail entries removed for all platforms.
New acceptance regression asserts exact per-line attribution at rate=1; a cl unit test covers the package-pinning criterion under both the "runtime" and patched lib-runtime spellings (the two diverge between unit builds and the real pipeline — the first implementation only matched one and silently pinned nothing).

Old PR #1905 can close when this lands (its shadow-stack-era implementation is superseded).

Validation (both platforms before push)

macOS: acceptance + statement/line-info suites, full test/go, heapsampling green.
linux/arm64 container: memprofile + C-fault + statement regressions, heapsampling green.
Allocation fast path gains one flag check + one subtraction when profiling is active (default rate 512KiB); plain.* unaffected.

Based on #2024 (needs its --icf=none: heapsampling's three identical wrapper functions must keep distinct pcs); #2024 in turn is based on #2023.

🤖 Generated with Claude Code

Bring over the cross-branch runtime funcinfo benchmark (hot, deep, multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own performance numbers. cold.FirstCallersFrames now walks to the first fully symbolized frame, because synthetic runtime frames (LLGo's runtime.Callers placeholder) carry no file/line and the metric was silently skipped on LLGo. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

macOS previously had no entry/stub/pcline site sections, so first-use funcinfo initialization fell back to one dlsym per function and per stub (13ms cold on a small binary, 27ms with LTO), and statement-level pc-line records did not exist at all. Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the live_support attribute: under ld64/lld -dead_strip a live_support atom survives only if the atom it references (the anchor label inside the function body) is live, which matches the records-follow-function semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O subsections-via-symbols treats visible labels as atom boundaries, and a visible label in the middle of a function let the linker split and reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted with the \x01 verbatim-name prefix so LLVM does not prepend the Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the sections (and their boundary symbols) present even when no package contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the frames; each dlsym is a dynamic-loader query and the loop dominated cold latency.
- Materialize per-function strings and entry PCs once per function and packed file strings once per file ID during pcline table construction instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact function-value PCs with a bounded linear scan of the raw entry-site and stub-site sections (compile-time data, no loader query), then one dladdr as fallback; both require an entry match within the warm path's slack so stripped-local misattribution is impossible. The path is budgeted: after a handful of cold lookups the sorted table amortizes better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to ~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index abandonment on delta overflow is gone. Go stores uint8 deltas because its linker guarantees a 16-byte minimum function size; LLGo indexes call-site records that sit a few bytes apart, and a dense 4KiB bucket silently degraded every lookup in the process to a full binary search. A delta counts deduplicated PCs inside one bucket, so it is bounded by the bucket size and uint16 cannot overflow.

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built table (frame/bucket counts, index built or fallback, sites vs dlsym sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
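
The delta-width argument can be made concrete with a small find index. This is a sketch under assumptions, not the runtime's data structure: the 4KiB bucket comes from the commit message, while the 16 subbuckets of 256 bytes mirror Go's findfunctab layout, and `findIndex`, `build` and `lookup` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	bucketSize    = 4096
	subBuckets    = 16
	subBucketSize = bucketSize / subBuckets
)

// bucket holds the table index for the bucket start plus one delta per subbucket.
type bucket struct {
	base   uint32
	deltas [subBuckets]uint16 // uint8 would overflow once a bucket holds >255 PCs
}

type findIndex struct {
	pcs     []uint32 // sorted, deduplicated offsets from the text start
	buckets []bucket
}

// floor returns the index of the last pc <= off, or 0 when there is none.
func floor(pcs []uint32, off uint32) int {
	i := sort.Search(len(pcs), func(i int) bool { return pcs[i] > off })
	if i == 0 {
		return 0
	}
	return i - 1
}

func build(pcs []uint32) *findIndex {
	x := &findIndex{pcs: pcs}
	if len(pcs) == 0 {
		return x
	}
	x.buckets = make([]bucket, int(pcs[len(pcs)-1])/bucketSize+1)
	for b := range x.buckets {
		start := uint32(b * bucketSize)
		base := floor(pcs, start)
		x.buckets[b].base = uint32(base)
		for s := 0; s < subBuckets; s++ {
			// The delta counts PCs between the bucket start and this subbucket
			// start, so it can never exceed bucketSize.
			x.buckets[b].deltas[s] = uint16(floor(pcs, start+uint32(s*subBucketSize)) - base)
		}
	}
	return x
}

// lookup returns the index of the last pc <= off, or -1 if off precedes the table.
func (x *findIndex) lookup(off uint32) int {
	if len(x.pcs) == 0 || off < x.pcs[0] {
		return -1
	}
	b := int(off) / bucketSize
	if b >= len(x.buckets) {
		return len(x.pcs) - 1
	}
	bk := &x.buckets[b]
	i := int(bk.base) + int(bk.deltas[(off%bucketSize)/subBucketSize])
	for i+1 < len(x.pcs) && x.pcs[i+1] <= off {
		i++ // short linear scan inside one subbucket
	}
	return i
}

func main() {
	var pcs []uint32
	for off := uint32(0); off <= 8192; off += 4 {
		pcs = append(pcs, off) // call-site records a few bytes apart
	}
	x := build(pcs)
	fmt.Println("index for offset 4102:", x.lookup(4102))
	fmt.Println("largest delta in bucket 0:", x.buckets[0].deltas[subBuckets-1])
}
```

With one record every 4 bytes, the last subbucket of a bucket starts 960 records past the bucket base, which does not fit in a uint8 but is far below the uint16 limit.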

Every Caller/Callers capture used to intern the frame into the synthetic table: a hash probe plus a full frame comparison per stack slot per call. Memoize the interned PC base in the shadow-stack slot and invalidate it when the recorded line changes (for one entry the instrumented name/file operands are constants, so the line is the only thing that varies between call sites). The three static frames emitted around every Callers walk get per-store memo slots, and the emit loop is unrolled so nothing escapes and skipped frames are never captured. macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns. hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…py limit Frames.Next allocated a fresh *Func per symbolized frame; route it through the FuncForPC 4-way cache so repeated CallersFrames walks over the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns (338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns. Also document a pre-existing limitation at the entry-site emitter: the body-embedded inline-asm record is duplicated by LTO inlining into every inline site (~4x section growth on multipkg) and registers host-function PCs under the inlinee's symbol ID. Runtime only consults the table when native symbolization fails, which bounds the impact; the fix (data globals with !associated metadata) needs LLVMGlobalSetMetadata in the llvm binding and lands with the link-phase ftab work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Record the experiment results at the emitter: !associated only guides linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used pins dead functions through the records' address initializers; and noduplicate blocks inlining. Section dedup is link-phase work. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Post-link table generation plan: parse the linked binary's metadata sections, dedup LTO inline copies against the symbol table, sort with a sentinel, build Go-layout findfunctab via internal/pclntab, and write back into a reserved section with ASLR-safe anchor offsets. Runtime adopts the prebuilt table when the header validates and keeps first-use construction as fallback. Includes the list of platform facts established in xgo-dev#2012 so implementation does not re-derive them. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

First stage of doc/design/pclntab-linkphase.md: parse a linked binary's funcinfo entry/stub sections (Mach-O and ELF), deduplicate LTO inline copies against the symbol table's text ranges, sort with a Go-style sentinel, and build findfunctab through internal/pclntab, the faithful port that has been waiting for exactly this caller. Read-only: prints what the P2 build integration would write back.

Measured on the 576-target multipkg binaries:

- non-LTO: 9319 records -> ftab 3161 + 207 buckets; lookup self-check 3160/3160; site sections 149KB -> 29KB (5.1x)
- LTO: 15371 entry records -> 13857 inline copies dropped, 4144 kept; self-check 3045/3045; 299KB -> 28.5KB (10.5x)

Findings for P2: on-disk Mach-O pointer slots hold dyld chained-fixup encodings (low 36 bits are the target; decoded here; the write-back design stores anchor-relative offsets and avoids pointers entirely), and some non-LTO stub symbols are absent from the symbol table (records conservatively dropped; needs tightening).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
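
The chained-fixup finding can be illustrated with a decoder for one 64-bit pointer slot. This is a sketch, not pclnpost's code: the bit layout follows the `dyld_chained_ptr_64_rebase` and `dyld_chained_ptr_64_bind` structures in Apple's `mach-o/fixup-chains.h`, the commit message itself only states that the low 36 bits are the target, and which pointer format a given binary uses is an assumption here. `chainedPtr` and `decodeChainedPtr` are hypothetical names.

```go
package main

import "fmt"

// chainedPtr is the decoded form of one 64-bit chained-fixup slot.
type chainedPtr struct {
	bind    bool   // true: import binding; false: rebase
	next    uint16 // distance to the next fixup in 4-byte strides; 0 ends the chain
	target  uint64 // rebase only: low 36 bits of the slot
	high8   uint8  // rebase only: top byte to restore into the pointer
	ordinal uint32 // bind only: index into the imports table
	addend  uint8  // bind only
}

// decodeChainedPtr splits a raw on-disk slot into its bit fields.
func decodeChainedPtr(raw uint64) chainedPtr {
	p := chainedPtr{
		bind: raw>>63 == 1,
		next: uint16((raw >> 51) & 0xfff),
	}
	if p.bind {
		p.ordinal = uint32(raw & 0xffffff)
		p.addend = uint8((raw >> 24) & 0xff)
	} else {
		p.target = raw & (1<<36 - 1)
		p.high8 = uint8((raw >> 36) & 0xff)
	}
	return p
}

func main() {
	rebase := uint64(0x4000) | uint64(2)<<51
	bind := uint64(1)<<63 | uint64(5)<<51 | uint64(3)<<24 | 7
	fmt.Printf("%+v\n", decodeChainedPtr(rebase))
	fmt.Printf("%+v\n", decodeChainedPtr(bind))
}
```

Reading a slot as a plain pointer would mix the `next` and `bind` bits into the address, which is why a parser of the on-disk image has to decode before comparing against symbol ranges.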

…adoption

pclnpost -write rewrites the entry-site section in place with the prebuilt table (header + ftab {entryOff,funcIndex} + runtime-layout findfunctab buckets), resolving funcinfo indexes through the binary's symbol-index section, and voids the stub section (its records are merged into the table). ASLR is handled by anchoring on the section's own link-time address; entries are normalized to true symbol starts, which retires the entry-PC slack on this path. macOS re-signs with an ad-hoc codesign after rewriting.

The runtime adopts the table zero-copy when the magic header validates: lookups binary-search the on-disk ftab directly through the shared bucket index, nothing is materialized on first use (the funcIndex -> entry map is built lazily and only for the pcline initializer), and the cold scan/dladdr path is skipped since adoption is cheap. First-use construction remains the fallback whenever the header is absent.

Linux end-to-end: entries=prebuilt, FuncForPC/FileLine correct, first-FuncForPC 110µs (materializing) -> 6-8µs (zero-copy); 13ms on the original macOS baseline.

Known gap: on macOS the on-disk rewrite is corrupted at load time because dyld still walks the stale chained-fixup chain over the section; fix (unlinking the section's nodes from the page chains in LC_DYLD_CHAINED_FIXUPS) is identified and next.

Non-prebuilt paths verified regression-free: cl + test/go suites pass, smoke behavior unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Every llgo-linked executable (linux/darwin, sites enabled) now gets the prebuilt ftab/findfunctab automatically: internal/build runs internal/pclnpost.Rewrite after linkMainPkg, and any failure degrades silently to the first-use construction fallback.

Moves the tool core into internal/pclnpost and hardens it:

- Canonical-record detection by FNV: a record survives when its anchor's owning symbol hashes to the record's symbolID (or is the __llgo_stub. wrapper of it). The previous one-per-symbolID rule wrongly collapsed a function with its stub (they share the target's symbolID by design), which broke exact-entry lookups (caught by TestRuntimeLineInfoAndStack on Linux). LTO inline copies are now identified exactly: 8.4k/9.5k copies removed in the LTO probes.
- Mach-O chained-fixups surgery: unlink the rewritten sections' pointer slots from the dyld page chains (repointing predecessors' next links and page_start entries) so dyld neither rebases slots inside the new table nor skips unrelated fixups after the zeroed stub section, then re-sign ad hoc. Without this the table was corrupted at load.
- LTO-safe metadata location: the entry section carries a meta record whose relocations hold the addresses of the symbol-index pointer and count globals; LTO internalization strips those names from the symbol table but relocations always resolve. Runtime skips the meta rows (pc==0 / symbolID==0).
- Idempotence guard (already-rewritten binaries are left alone).

Runtime fixes that surfaced during validation:

- materializePrebuiltEntries is now two-phase so concurrent losers wait for the winner's store instead of reading a nil entries slice.
- pcLineFrameForPC rejects nearest-below sites whose entry is unresolved when the caller knows the function entry, instead of leaking a neighboring function's file/line.

Validation: macOS cl (full) + test/go + LLDB 194/194; Linux test/go TestRuntime suite; probes on both platforms report entries=prebuilt with first-FuncForPC at 7-21µs (Linux) from 13ms on the original baseline, and LTO builds drop 8-9.5k inline copies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…table

On Mach-O, pointer slots that name exported functions (every __llgo_stub.* wrapper and any exported Go function) are emitted as chained-fixup BIND nodes, not rebases. The rewriter only decoded rebase nodes, so all stub records (and some entry records) were dropped as unowned and never reached the prebuilt ftab; FuncForPC on function values silently fell back to dladdr (~6µs per fresh pc on darwin).

- Parse the LC_DYLD_CHAINED_FIXUPS imports table and resolve bind ordinals to their in-image definitions.
- Match canonical owners against the record symbolID with underscore normalization (debug/macho's suffix-shared string table can surface one mangling underscore more or less than the source-level name).
- Splice the prebuilt header's base slot back into the fixup chain as a live rebase node: dyld writes the slid text base at load, so the runtime reads a ready runtime PC with no slide arithmetic (non-PIE ELF link-time values already equal runtime addresses).
- LLGO_PCLNPOST=0 escape hatch keeps first-use construction.

Fresh-pc FuncForPC slow path: darwin 6-8µs -> 1.2-1.7µs, linux 6.8µs -> 0.5µs; first-in-process lookup: darwin ~32µs -> ~14µs, linux ~6.8µs -> ~4µs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Pure-compute probes (recursive fib, JSON round-trip, sort.Ints, map churn) with no runtime introspection, so one harness run covers both the introspection extremes and what the funcinfo machinery costs code that never asks for it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Go's pclntab pages are touched by its own runtime (traceback, GC) long before user code queries it, so its first FuncForPC never pays page-in. Mirror that: when the prebuilt table is present, init adopts it (zero-copy, sub-µs), touches the pages the lookup path reads (blob, funcinfo records, string offsets, strings), runs one synthetic lookup to warm the code paths, and write-warms the FuncForPC cache pages. First-in-process FuncForPC: darwin ~17µs -> ~2.8µs, linux ~6.6µs -> ~1.0µs. Startup cost is page-count-bound (tens of µs on stdlib-sized tables, invisible next to ~3ms process startup; hello-world medians unchanged). Non-prebuilt binaries stay fully lazy: first-use construction allocates, which has no place in init, and programs that never introspect pay nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

-depths generates deep_<N> scenarios at configurable call depths; -bigsizes generates bigfunc scenarios (funcs x statements) whose large bodies stress statement-level pcline density, mid-function pc symbolization, and ordinary performance of big method bodies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Blob overflow: function-value stubs can double the row count, and at ~9k functions the prebuilt blob no longer fit the entry section, so the rewrite silently fell back to first-use construction (cold.FirstFuncForPC 96x96 non-LTO: 2.4ms). On overflow, retry with function entries only: stub pcs degrade to dladdr, real entries keep the prebuilt table.
- FuncForPC cache thrash: the set-associative pc cache holds 4k entries; batch workloads over 9k+ distinct functions evicted constantly and paid the string-materializing slow path on every call (multipkg.FuncForPCMany 96x96: 8-11ms vs Go 172µs). Add a per-ftab-row *Func cache for exact-entry lookups, so batch lookups are O(binary search) after the first pass at any scale.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
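
The per-row cache idea can be sketched as a slice of lazily filled slots, one per ftab row. This is an illustration only: `Func`, `rowCache`, `newRowCache` and the `materialize` callback are hypothetical names, and the use of `sync/atomic` is an assumption made for the sketch, not a statement about how the runtime synchronizes.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Func is a hypothetical stand-in for the materialized function record.
type Func struct{ Name string }

// rowCache keeps at most one *Func per table row, so its size is fixed by the
// table and batch lookups over many functions never evict each other.
type rowCache struct {
	rows        []atomic.Pointer[Func]
	materialize func(row int) *Func // the expensive, string-building slow path
}

func newRowCache(n int, materialize func(int) *Func) *rowCache {
	return &rowCache{rows: make([]atomic.Pointer[Func], n), materialize: materialize}
}

// get returns the cached record for row, building it on first use.
func (c *rowCache) get(row int) *Func {
	if f := c.rows[row].Load(); f != nil {
		return f
	}
	f := c.materialize(row)
	if c.rows[row].CompareAndSwap(nil, f) {
		return f
	}
	return c.rows[row].Load() // another goroutine stored first; use its value
}

func main() {
	builds := 0
	c := newRowCache(4, func(row int) *Func {
		builds++
		return &Func{Name: fmt.Sprintf("main.f%d", row)}
	})
	for pass := 0; pass < 3; pass++ {
		for row := 0; row < 4; row++ {
			c.get(row)
		}
	}
	fmt.Println("slow-path builds:", builds)
}
```

Unlike a fixed-size set-associative cache, the slot for a row is never reused for another row, so the second and later passes of a batch workload only pay the search that finds the row.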

…erflow Function-value stubs can push the row count past what the entry section holds (~9k functions with taken addresses). Instead of dropping stub rows, write the full blob into the (larger) stub section and leave a 32-byte redirect header ("LLGOFTB2" + a live-relocation pointer) in the entry section; the runtime follows it and adopts the same zero-copy view. Function-value lookups keep the prebuilt table at any scale instead of degrading to dladdr. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

funcForPCSlow treated any unaligned pc as a shadow-stack synthetic marker. arm64 function entries are always 4-aligned so this never fired, but amd64 function and stub entries need not be: an unaligned function-value pc skipped the prebuilt exact-entry path entirely and fell through to nearest-below symbolization, reporting the previous function's name (test/go TestRuntimeLineInfoAndStack on ubuntu CI, "bad function value func: main.renamedPC"). Hoist the prebuilt exact-entry + per-row-cache lookup ahead of the alignment heuristic; a genuine synthetic pc just misses the cheap search and proceeds as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The overflow fallback dropped stub rows to fit the entry section. That leaves pc ranges the table claims to cover but does not: a function value whose stub falls in a gap resolves nearest-below to the previous function and silently returns the wrong name — exactly what ubuntu CI caught (amd64 --icf=safe layouts overflow by a few hundred bytes, and non-PIE ELF dladdr cannot rescue). If the blob fits neither the entry section nor the (larger) stub section, skip the rewrite entirely: first-use construction is slower but covers every record. Reproduced and verified on linux/amd64 (qemu): the stub pc had no exact row and nearest-below returned the neighbouring function's name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…rgery Fabricated fixtures make the IO paths testable in-process: a minimal ELF exercises load/Rewrite end to end (in-place, stub-section spill, and the overflow fallback that must leave the binary untouched), and a synthetic Mach-O image drives the chained-fixup chain surgery (remove+splice, empty-page insert, unconsumed-insert error). Package coverage 16% -> 69%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A fabricated Mach-O (segments, sections, symtab, chained-fixup imports and an empty page chain) drives load, bind-target resolution, record decoding and both Rewrite outcomes (in-place and stub-section spill) end to end. codesign now runs only when the input carries LC_CODE_SIGNATURE: real lld executables always do, unsigned inputs need no signature and codesign rejects them. Also cover asmQuoteELFSymbol, the empty-table initializers and the Rewrite error paths. Package coverage: pclnpost 69% -> 86%. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Every Go function on supported targets keeps the frame-pointer chain ("frame-pointer"="non-leaf", gated by Program.NeedsFramePointer to linux/darwin; on embedded targets the unwinder does not exist and the layout change perturbed the conservative GC on ESP32-C3). runtime.Caller, Callers, CallersFrames, Stack and the unrecovered-panic dump walk [fp]/[fp+w] directly and symbolize through the prebuilt ftab and pcline tables:

- Return addresses resolve at pc-1 (Go's convention); statement labels can land exactly on a return address, so raw-pc nearest-below reported the following line. The convention holds with or without the prebuilt table (text bounds fall back to the first-use frame table; link-phase overflow layouts otherwise silently disabled it, the root cause of the amd64 CI failures).
- The walk is bounded to the program's own text: libc frames without FP discipline decode as wild pcs that nearest-below would attribute to arbitrary functions.
- Methods and anonymous functions are now trackable (methods had no pcline labels; closures lost their innermost frame to tail-call optimization), and mid-function aligned pcs merge statement records instead of returning declaration lines.
- frameSymbol results are memoized per pc (deep re-walks paid a dladdr per frame: 32-frame walks 8µs -> 180ns) and the pcline table is built during the startup pre-warm (lazily building it inside the first Caller cost ~200µs at scale).
- Shadow-stack instrumentation is no longer emitted; LLGO_SHADOW_STACK=1 keeps the legacy emitters for one release. Tracked functions retain noinline, no-tail-call and the data-only pcline records.
- libunwind is gone: the clite stacktrace fallback walks the FP chain with dladdr names (same output format), and linux binaries no longer link -lunwind.

Semantics are gc ground truth, verified against go: physical stacks show every real frame; interface-chain Caller marks land at skip 3 and closure chains at skip 4 (the old expectations encoded shadow-stack frame loss).

Perf (best-of, mac/linux): hot.Caller0 17/37ns (Go 155/241), deep.Direct512 2.8µs (Go 9.7µs; was 87-95µs), bigfunc.Work 18µs (Go 30µs; was 433µs), binary size unchanged or smaller.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
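
The pc-1 convention for return addresses can be shown with a toy statement table. This is a minimal sketch with made-up addresses and line numbers; `stmtRecord`, `lineAt` and `lineForReturnAddr` are hypothetical names, not the runtime's.

```go
package main

import (
	"fmt"
	"sort"
)

// stmtRecord maps the first pc of a statement to its source line.
type stmtRecord struct {
	pc   uintptr
	line int
}

// lineAt returns the line of the last record with pc <= the query (nearest
// below), or 0 when the query precedes the table.
func lineAt(table []stmtRecord, pc uintptr) int {
	i := sort.Search(len(table), func(i int) bool { return table[i].pc > pc })
	if i == 0 {
		return 0
	}
	return table[i-1].line
}

// lineForReturnAddr symbolizes a return address. The address points at the
// instruction after the call, which may already be the first instruction of
// the next statement, so the lookup uses ret-1 to stay inside the call.
func lineForReturnAddr(table []stmtRecord, ret uintptr) int {
	if ret == 0 {
		return 0
	}
	return lineAt(table, ret-1)
}

func main() {
	// Hypothetical layout: the call on line 10 occupies 0x100..0x107 and
	// returns to 0x108, which is also where line 11 starts.
	table := []stmtRecord{{0x100, 10}, {0x108, 11}, {0x110, 12}}
	ret := uintptr(0x108)
	fmt.Println("raw pc lookup:", lineAt(table, ret))
	fmt.Println("pc-1 lookup:  ", lineForReturnAddr(table, ret))
}
```

In this layout the raw lookup attributes the frame to the statement after the call, while the pc-1 lookup lands on the call's own line.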

IR goldens gain the frame-pointer attribute (out.ll files carry no attribute groups and needed no regeneration); the legacy shadow-stack emitter assertions opt into LLGO_SHADOW_STACK; statement-line probes move to gc ground-truth skip counts; NeedsFramePointer target matrix and pclnpost symbolAddr/decodePtr edges covered. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ct log/slog/testing locations An unrecovered panic now prints a Go-style traceback (function names plus file:line per physical frame) through a PanicTraceback hook the public runtime registers; the clite dladdr dump remains the fallback when the FP walk or the tables are unavailable. Caller-frame tracking now applies uniformly: the blanket stdlib exclusion is gone, so the same per-package reaches-runtime.Caller analysis that already covered third-party code tracks log.Output, slog's Logger.log and testing's decorate chains (their thin wrappers were inlined, making fixed Caller depths count past them — log.Lshortfile printed "???:1"). Call sites into caller-pc-consuming functions of other packages get a statement anchor so the attributed frame reports the exact line. The collector also picks up named-type methods declared by the package itself — a type used only concretely never enters RuntimeTypes, which is exactly how slog.(*Logger).Info escaped tracking. hello-world size cost: +368 bytes (the traceback printer). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Four scenarios, every expectation verified against gc: unrecovered panic tracebacks; log.Lshortfile and slog AddSource (text+JSON, package funcs and logger methods); a failing t.Errorf under llgo test; and an introspection grab-bag (goroutine/init/defer callers, FuncForPC names for methods, closures and generics, the errors-with-stack capture idiom). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov · 2026-07-04T07:41:13Z

Codecov Report

❌ Patch coverage is 75.19380% with 192 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
internal/build/funcinfo_table.go	76.24%	103 Missing and 50 partials ⚠️
internal/build/pclntab_llvm.go	0.00%	22 Missing ⚠️
internal/build/build.go	73.80%	7 Missing and 4 partials ⚠️
ssa/funcinfo.go	90.90%	3 Missing and 2 partials ⚠️
internal/crosscompile/crosscompile.go	50.00%	1 Missing ⚠️


Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…root xfails

Three conformance residuals left by the earlier shadow-stack-era PRs (xgo-dev#1903, xgo-dev#1924), reimplemented on the unwinder base:

- //line directive filenames follow gc's spelling: relative directives stay relative (the package loader expands them to absolute paths), empty directives report "??"; an empty filename must still emit its statement anchor or attribution falls through to a neighboring record (fixedbugs/issue18149, issue22662).
- Linking uses --icf=none: Go semantics require distinct functions to keep distinct pcs (FuncForPC names, function-value identity; fixedbugs/issue58300 printed main.f for both f and g). lld's safe mode folded llgo-emitted same-body functions anyway; gc never folds. Hello-world cost: +1.6%.
- Remove 12 goroot xfails now passing: the two //line cases, issue58300, and nine that the stage5+acceptance line fixed outright (inline_caller, inline_callers, inline_literal, issue7690, issue17381, issue21879, issue22083, issue29735, issue58300b).

Still xfailed, tracked for the panic-snapshot follow-up: bug347, bug348, issue14646, issue27201, issue29504, issue33724, issue4562, issue5856 (deferred runtime.Caller during panic unwinding); heapsampling (memprofile attribution).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Replaces the size-class counters with gc-shaped heap profiling: sampled allocations are attributed to physical call stacks at exact statement lines, and records hold RAW sampled counts; consumers (pprof, goroot heapsampling.go) apply the Poisson correction themselves, exactly as with gc.

- Sampling mirrors gc's mcache.nextSample: bytes count down to an exponentially distributed threshold (mean MemProfileRate), sample once on crossing, redraw. The memoryless distribution is load-bearing: with any bounded-support threshold a near-periodic allocation pattern phase-locks the sample points onto the large sites (observed 1.6x per-site skew on heapsampling's interleaved sizes). ln() is a small local approximation, since the runtime core cannot import math.
- Stacks come from the FP walk at sample time (fpCallers via a hook the public runtime registers), bucketed by stack hash; allocator plumbing (including __llgo_stub. wrapper frames of the hook) is trimmed at read time. A reentrancy flag spans the whole decision path: threshold drawing and bucket allocation themselves allocate, and a recursive sample overflows the stack.
- Heap allocations get statement anchors in tracked functions, and a package that reads the memory profile (runtime.MemProfile / MemProfileRate under either the "runtime" or the patched lib-runtime spelling) pins all its trackable functions: per-site attribution loses sites to inlining otherwise. Profiling packages are rare and accuracy beats inlining there; gc gets both via its inline tree (P4).

goroot heapsampling.go passes on darwin/arm64 and linux/arm64 (the latter was a pre-existing platform gap). Depends on --icf=none from the line-directive PR: heapsampling's three identical wrapper functions must keep distinct pcs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

An acceptance regression asserts exact per-line attribution at rate=1 (raw counts are exact there), and a cl unit test covers the memprofile-package pinning criterion under both runtime spellings. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
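
Why raw counts are exact at rate=1 follows from the consumer-side correction. The sketch below is modeled on the scaling that pprof's scaleHeapSample applies; `scaleSample` is a hypothetical name and the code is a re-derivation for illustration, not a copy of either project's source.

```go
package main

import (
	"fmt"
	"math"
)

// scaleSample estimates true allocation totals from raw sampled totals.
// With exponentially distributed sampling distances of mean rate, an object
// of size s is sampled with probability 1 - exp(-s/rate), so each raw sample
// stands for 1 / (1 - exp(-s/rate)) real allocations.
func scaleSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// Every allocation was recorded (or the rate is unknown): no scaling.
		return count, size
	}
	avg := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avg/float64(rate)))
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	const rate = 512 << 10
	// 100 raw samples of 16-byte objects stand for far more real allocations.
	c, s := scaleSample(100, 1600, rate)
	fmt.Println("small objects:", c, "allocations,", s, "bytes")
	// At rate=1 the raw record is returned unchanged.
	c, s = scaleSample(100, 1600, 1)
	fmt.Println("rate=1:       ", c, "allocations,", s, "bytes")
}
```

Since the scale factor depends on the average object size per record, applying it inside the runtime as well as in the consumer would double-correct, which is the reason the records keep raw counts.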

cpunion and others added 30 commits July 2, 2026 03:27

build: add Go-style pclntab findfunc index

2870ba8

ssa: emit DCE-safe function metadata

f365b73

runtime: add line info for stack frames

e81a007

runtime: add statement line caller frames

1e15db2

runtime: compress funcinfo table

407e0dd

test: cover indirect runtime caller paths

a3115cd

cl: narrow runtime caller tracking

c6c128a

runtime: add compact pc-line funcinfo table

4692e15

runtime: optimize FuncForPC metadata lookup

b29906c

runtime: slim FuncForPC cache hot path

734368f

cl: make pc-line labels clone-safe

85b1c13

runtime: guard funcinfo table initialization

b33a774

runtime: fix funcinfo entry pc line metadata

8e6f33c

runtime: publish funcinfo records for live stubs

8eec8e0

runtime: skip ELF stub-site records during LTO

01913f4

runtime: reduce funcinfo lookup initialization cost

6fa875b

test: cover runtime caller metadata edges

34b274c

runtime: speed up funcinfo entry lookup

3b87c90

test: cover pcline metadata in dev lto coverage

012095b

runtime: speed up funcinfo hot paths

9d90419

runtime: avoid FuncForPC cache thrashing

1b8bc56

runtime: use Go-style funcinfo find index

cf33354

runtime: use static funcinfo symbol index

bb1b1a3

cpunion and others added 20 commits July 3, 2026 13:40

doc: record P3 findings in pclntab-linkphase design

7ab0a84

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

benchmark: keep full LTO from constant-folding the plain fib probe

b8078d4

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

doc: stage5 invariants, diagnostic traps and merge queue

4925b82

cpunion mentioned this pull request Jul 4, 2026

Proposal: runtime funcinfo metadata and fast FuncForPC lookup #2004

Open

cpunion force-pushed the codex/stage5-memprofile branch from cd907ef to f4dea56 Compare July 4, 2026 07:43

cpunion and others added 2 commits July 4, 2026 15:49

ci: raise covered-test timeout to 45m for the acceptance suite

a5fa2fc

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cpunion force-pushed the codex/stage5-memprofile branch 2 times, most recently from 31e99a5 to c647b14 Compare July 4, 2026 10:57

cpunion and others added 2 commits July 4, 2026 20:50

cpunion force-pushed the codex/stage5-memprofile branch from c647b14 to 2b22d07 Compare July 4, 2026 12:52

cpunion wants to merge 63 commits into xgo-dev:main from cpunion:codex/stage5-memprofile
