runtime,cl: stack-keyed sampled memory profiling with gc semantics#2027

Open
cpunion wants to merge 63 commits into xgo-dev:main from cpunion:codex/stage5-memprofile

Conversation

@cpunion cpunion commented Jul 4, 2026

Collaborator

Third follow-up planned in #2004: gc-shaped heap profiling (reimplements the goal of #1905 on the unwinder base; #1902's records were size-class counters without stacks).

What changes

  • Stack-keyed sampled records: sampled allocations attribute to physical call stacks at exact statement lines. Records hold raw sampled counts — consumers (pprof's scaleHeapSample, goroot heapsampling.go) apply the Poisson correction themselves, exactly as with gc. Getting this wrong is invisible in single-size microbenchmarks and only shows up as a uniform skew under real workloads.
  • Sampling mirrors mcache.nextSample: bytes count down to an exponentially distributed threshold (mean MemProfileRate), one sample per crossing, redraw. The memoryless distribution is load-bearing — with any bounded-support threshold, a near-periodic allocation pattern phase-locks the sampling points onto the large sites (measured: 1.6× per-site skew on heapsampling's interleaved sizes before switching).
  • Capture: an FP walk at sample time, through a hook the public runtime registers. Allocator plumbing (including the hook's __llgo_stub. wrapper frame) is trimmed at read time. A reentrancy flag spans the whole decision path, because threshold drawing and bucket-node creation themselves allocate and a recursive sample overflows the stack (found the hard way).
  • Attribution support in cl: heap allocations get statement anchors in tracked functions, and a package that reads the memory profile pins all its trackable functions — per-site attribution loses sites to inlining otherwise (profiling packages are rare; gc gets both via its inline tree, which is the P4 path to lifting this).

Conformance

  • goroot heapsampling.go passes on darwin/arm64 and linux/arm64 (the latter was a pre-existing gap on main); xfail entries removed for all platforms.
  • New acceptance regression asserts exact per-line attribution at rate=1; a cl unit test covers the package-pinning criterion under both the "runtime" and patched lib-runtime spellings (the two diverge between unit builds and the real pipeline — the first implementation only matched one and silently pinned nothing).
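
For readers unfamiliar with the consumer-side correction, the sketch below shows the general shape of the Poisson scaling applied to raw sampled counts, and why rate=1 makes raw counts exact. It is modeled on the correction pprof applies to gc heap profiles, but the function name scaleSample and its signature are illustrative assumptions, not an API from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// scaleSample estimates true allocation totals from raw sampled counts.
// Illustrative only: name and signature are assumptions.
func scaleSample(count, size, rate int64) (int64, int64) {
	if count == 0 || size == 0 {
		return 0, 0
	}
	if rate <= 1 {
		// Every allocation is sampled, so raw counts are already exact.
		return count, size
	}
	// An object of the average size is sampled with probability
	// 1 - exp(-avg/rate); dividing by that probability undoes the bias.
	avg := float64(size) / float64(count)
	scale := 1 / (1 - math.Exp(-avg/float64(rate)))
	return int64(float64(count) * scale), int64(float64(size) * scale)
}

func main() {
	// 10 sampled objects totalling 10240 bytes at a 512 KiB rate.
	c, b := scaleSample(10, 10240, 512*1024)
	fmt.Println(c, b)
	// At rate 1 the record is returned unchanged.
	fmt.Println(scaleSample(10, 10240, 1))
}
```

If the producer applied this scaling as well, consumers would scale twice, which is why the records in this PR stay raw.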

Old PR #1905 can close when this lands (its shadow-stack-era implementation is superseded).

Validation (both platforms before push)

  • macOS: acceptance + statement/line-info suites, full test/go, heapsampling green.
  • linux/arm64 container: memprofile + C-fault + statement regressions, heapsampling green.
  • Performance: with profiling active, the allocation fast path gains one flag check and one subtraction (default rate 512KiB); plain.* is unaffected.

Based on #2024 (needs its --icf=none: heapsampling's three identical wrapper functions must keep distinct pcs); #2024 in turn is based on #2023.

🤖 Generated with Claude Code

cpunion and others added 30 commits July 2, 2026 03:27
Bring over the cross-branch runtime funcinfo benchmark (hot, deep,
multipkg, cold, stdlib scenarios) so xgo-dev#2012 can reproduce its own
performance numbers. cold.FirstCallersFrames now walks to the first
fully symbolized frame, because synthetic runtime frames (LLGo's
runtime.Callers placeholder) carry no file/line and the metric was
silently skipped on LLGo.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
macOS previously had no entry/stub/pcline site sections, so first-use
funcinfo initialization fell back to one dlsym per function and per
stub (13ms cold on a small binary, 27ms with LTO), and statement-level
pc-line records did not exist at all.

Emit the same site records on Mach-O:

- __DATA,__llgo_fie / __llgo_stub / __llgo_pcl sections with the
  live_support attribute: under ld64/lld -dead_strip a live_support
  atom survives only if the atom it references (the anchor label inside
  the function body) is live, which matches the records-follow-function
  semantics ELF gets from SHF_LINK_ORDER with --gc-sections.
- One lowercase-l linker-private symbol per record so each record is
  its own atom and dead functions drop exactly their own records.
- Assembler-local (L-prefixed) pc-site labels: Mach-O
  subsections-via-symbols treats visible labels as atom boundaries, and
  a visible label in the middle of a function let the linker split and
  reorder function bodies.
- Boundary symbols via ld64's section$start$/section$end$, emitted
  with the \x01 verbatim-name prefix so LLVM does not prepend the
  Mach-O underscore.
- A no_dead_strip zero record per section in the main module keeps the
  sections (and their boundary symbols) present even when no package
  contributed records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
First-use initialization:

- Skip the per-stub dlsym loop when the stub-site section provided the
  frames; each dlsym is a dynamic-loader query and the loop dominated
  cold latency.
- Materialize per-function strings and entry PCs once per function and
  packed file strings once per file ID during pcline table construction
  instead of once per site.

Cold FuncForPC fast path: before the frame table exists, resolve exact
function-value PCs with a bounded linear scan of the raw entry-site and
stub-site sections (compile-time data, no loader query), then one
dladdr as fallback; both require an entry match within the warm path's
slack so stripped-local misattribution is impossible. The path is
budgeted: after a handful of cold lookups the sorted table amortizes
better, so it is built as usual. cold.FirstFuncForPC drops from 13ms to
~35us on macOS.

Find index: subbucket deltas are now uint16 and the whole-index
abandonment on delta overflow is gone. Go stores uint8 deltas because
its linker guarantees a 16-byte minimum function size; LLGo indexes
call-site records that sit a few bytes apart, and a dense 4KiB bucket
silently degraded every lookup in the process to a full binary search.
A delta counts deduplicated PCs inside one bucket, so it is bounded by
the bucket size and uint16 cannot overflow.
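
The sketch below illustrates the bucket/subbucket idea with uint16 deltas over a sorted, deduplicated pc table. It assumes 4 KiB buckets split into 16 subbuckets, as in Go's findfunctab layout; the types and helper names (bucket, index, build, lookup) are hypothetical and the real table layout in this PR may differ.

```go
package main

import "fmt"

const (
	bucketSize = 4096
	subCount   = 16
	subSize    = bucketSize / subCount
)

type bucket struct {
	base  uint32           // table pcs below the bucket start
	delta [subCount]uint16 // further pcs below each subbucket start
}

type index struct {
	min     uint64
	pcs     []uint64 // sorted, deduplicated
	buckets []bucket
}

// build indexes a sorted, deduplicated pc slice.
func build(pcs []uint64) *index {
	if len(pcs) == 0 {
		return &index{}
	}
	ix := &index{min: pcs[0], pcs: pcs}
	span := pcs[len(pcs)-1] - ix.min
	ix.buckets = make([]bucket, int(span/bucketSize)+1)
	k := 0 // number of pcs below the current subbucket start
	for b := range ix.buckets {
		for s := 0; s < subCount; s++ {
			start := ix.min + uint64(b)*bucketSize + uint64(s)*subSize
			for k < len(pcs) && pcs[k] < start {
				k++
			}
			if s == 0 {
				ix.buckets[b].base = uint32(k)
			}
			// At most bucketSize distinct pcs fit in a bucket, so the
			// delta always fits in uint16.
			ix.buckets[b].delta[s] = uint16(uint32(k) - ix.buckets[b].base)
		}
	}
	return ix
}

// lookup returns the index of the greatest pc <= target, or -1.
func (ix *index) lookup(target uint64) int {
	if len(ix.pcs) == 0 || target < ix.min {
		return -1
	}
	off := target - ix.min
	b := off / bucketSize
	if b >= uint64(len(ix.buckets)) {
		return len(ix.pcs) - 1
	}
	s := (off % bucketSize) / subSize
	i := int(ix.buckets[b].base) + int(ix.buckets[b].delta[s])
	for i < len(ix.pcs) && ix.pcs[i] <= target {
		i++ // short scan inside one 256-byte subbucket
	}
	return i - 1
}

func main() {
	// Call-site style records 4 bytes apart: 1024 pcs per bucket,
	// far more than a uint8 delta could count.
	pcs := make([]uint64, 3000)
	for i := range pcs {
		pcs[i] = 0x1000 + uint64(i)*4
	}
	ix := build(pcs)
	fmt.Println(ix.lookup(0x1000+4*1500+2), ix.buckets[0].delta[15])
}
```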

Observability: LLGO_FUNCINFO_DEBUG=1 prints one line per lazily built
table (frame/bucket counts, index built or fallback, sites vs dlsym
sources) so benchmarks can tell which path they measured.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every Caller/Callers capture used to intern the frame into the
synthetic table: a hash probe plus a full frame comparison per stack
slot per call. Memoize the interned PC base in the shadow-stack slot
and invalidate it when the recorded line changes (for one entry the
instrumented name/file operands are constants, so the line is the only
thing that varies between call sites). The three static frames emitted
around every Callers walk get per-store memo slots, and the emit loop
is unrolled so nothing escapes and skipped frames are never captured.

macOS: hot.CallersOnly 182ns -> 125ns (Go 1.26: 118ns); with LTO 96ns.
hot.CallersFramesFirst 528ns -> 471ns, 354ns with LTO (Go: 401ns).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…py limit

Frames.Next allocated a fresh *Func per symbolized frame; route it
through the FuncForPC 4-way cache so repeated CallersFrames walks over
the same PCs stop allocating. hot.CallersFramesFirst: macOS 471->456ns
(338ns with LTO, Go 1.26: 406ns); Linux LTO reaches parity at 433ns.

Also document a pre-existing limitation at the entry-site emitter: the
body-embedded inline-asm record is duplicated by LTO inlining into
every inline site (~4x section growth on multipkg) and registers
host-function PCs under the inlinee's symbol ID. Runtime only consults
the table when native symbolization fails, which bounds the impact;
the fix (data globals with !associated metadata) needs
LLVMGlobalSetMetadata in the llvm binding and lands with the
link-phase ftab work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Record the experiment results at the emitter: !associated only guides
linker GC and IR-level GlobalDCE deletes the records; llvm.compiler.used
pins dead functions through the records' address initializers; and
noduplicate blocks inlining. Section dedup is link-phase work.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Post-link table generation plan: parse the linked binary's metadata
sections, dedup LTO inline copies against the symbol table, sort with a
sentinel, build Go-layout findfunctab via internal/pclntab, and write
back into a reserved section with ASLR-safe anchor offsets. Runtime
adopts the prebuilt table when the header validates and keeps first-use
construction as fallback. Includes the list of platform facts
established in xgo-dev#2012 so implementation does not re-derive them.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
cpunion and others added 20 commits July 3, 2026 13:40
First stage of doc/design/pclntab-linkphase.md: parse a linked binary's
funcinfo entry/stub sections (Mach-O and ELF), deduplicate LTO inline
copies against the symbol table's text ranges, sort with a Go-style
sentinel, and build findfunctab through internal/pclntab — the faithful
port that has been waiting for exactly this caller. Read-only: prints
what the P2 build integration would write back.

Measured on the 576-target multipkg binaries:
- non-LTO: 9319 records -> ftab 3161 + 207 buckets; lookup self-check
  3160/3160; site sections 149KB -> 29KB (5.1x)
- LTO: 15371 entry records -> 13857 inline copies dropped, 4144 kept;
  self-check 3045/3045; 299KB -> 28.5KB (10.5x)

Findings for P2: on-disk Mach-O pointer slots hold dyld chained-fixup
encodings (low 36 bits are the target; decoded here; the write-back
design stores anchor-relative offsets and avoids pointers entirely),
and some non-LTO stub symbols are absent from the symbol table
(records conservatively dropped; needs tightening).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…adoption

pclnpost -write rewrites the entry-site section in place with the
prebuilt table (header + ftab {entryOff,funcIndex} + runtime-layout
findfunctab buckets), resolving funcinfo indexes through the binary's
symbol-index section, and voids the stub section (its records are
merged into the table). ASLR is handled by anchoring on the section's
own link-time address; entries are normalized to true symbol starts,
which retires the entry-PC slack on this path. macOS re-signs with an
ad-hoc codesign after rewriting.

The runtime adopts the table zero-copy when the magic header validates:
lookups binary-search the on-disk ftab directly through the shared
bucket index, nothing is materialized on first use (the funcIndex ->
entry map is built lazily and only for the pcline initializer), and the
cold scan/dladdr path is skipped since adoption is cheap. First-use
construction remains the fallback whenever the header is absent.

Linux end-to-end: entries=prebuilt, FuncForPC/FileLine correct,
first-FuncForPC 110µs (materializing) -> 6-8µs (zero-copy), compared
with 13ms on the original macOS baseline. Known gap: on macOS the on-disk rewrite is
corrupted at load time because dyld still walks the stale chained-fixup
chain over the section; fix (unlinking the section's nodes from the
page chains in LC_DYLD_CHAINED_FIXUPS) is identified and next.
Non-prebuilt paths verified regression-free: cl + test/go suites pass,
smoke behavior unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every llgo-linked executable (linux/darwin, sites enabled) now gets the
prebuilt ftab/findfunctab automatically: internal/build runs
internal/pclnpost.Rewrite after linkMainPkg, and any failure degrades
silently to the first-use construction fallback.

Moves the tool core into internal/pclnpost and hardens it:

- Canonical-record detection by FNV: a record survives when its anchor's
  owning symbol hashes to the record's symbolID (or is the __llgo_stub.
  wrapper of it). The previous one-per-symbolID rule wrongly collapsed a
  function with its stub — they share the target's symbolID by design —
  which broke exact-entry lookups (caught by TestRuntimeLineInfoAndStack
  on Linux). LTO inline copies are now identified exactly: 8.4k/9.5k
  copies removed in the LTO probes.
- Mach-O chained-fixups surgery: unlink the rewritten sections' pointer
  slots from the dyld page chains (repointing predecessors' next links
  and page_start entries) so dyld neither rebases slots inside the new
  table nor skips unrelated fixups after the zeroed stub section, then
  re-sign ad hoc. Without this the table was corrupted at load.
- LTO-safe metadata location: the entry section carries a meta record
  whose relocations hold the addresses of the symbol-index pointer and
  count globals; LTO internalization strips those names from the symbol
  table but relocations always resolve. Runtime skips the meta rows
  (pc==0 / symbolID==0).
- Idempotence guard (already-rewritten binaries are left alone).

Runtime fixes that surfaced during validation:

- materializePrebuiltEntries is now two-phase so concurrent losers wait
  for the winner's store instead of reading a nil entries slice.
- pcLineFrameForPC rejects nearest-below sites whose entry is
  unresolved when the caller knows the function entry, instead of
  leaking a neighboring function's file/line.

Validation: macOS cl (full) + test/go + LLDB 194/194; Linux test/go
TestRuntime suite; probes on both platforms report entries=prebuilt
with first-FuncForPC at 7-21µs (Linux) from 13ms on the original
baseline, and LTO builds drop 8-9.5k inline copies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…table

On Mach-O, pointer slots that name exported functions — every
__llgo_stub.* wrapper and any exported Go function — are emitted as
chained-fixup BIND nodes, not rebases. The rewriter only decoded rebase
nodes, so all stub records (and some entry records) were dropped as
unowned and never reached the prebuilt ftab; FuncForPC on function
values silently fell back to dladdr (~6µs per fresh pc on darwin).

- Parse the LC_DYLD_CHAINED_FIXUPS imports table and resolve bind
  ordinals to their in-image definitions.
- Match canonical owners against the record symbolID with underscore
  normalization (debug/macho's suffix-shared string table can surface
  one mangling underscore more or less than the source-level name).
- Splice the prebuilt header's base slot back into the fixup chain as a
  live rebase node: dyld writes the slid text base at load, so the
  runtime reads a ready runtime PC with no slide arithmetic (non-PIE
  ELF link-time values already equal runtime addresses).
- LLGO_PCLNPOST=0 escape hatch keeps first-use construction.

Fresh-pc FuncForPC slow path: darwin 6-8µs -> 1.2-1.7µs, linux
6.8µs -> 0.5µs; first-in-process lookup: darwin ~32µs -> ~14µs,
linux ~6.8µs -> ~4µs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pure-compute probes (recursive fib, JSON round-trip, sort.Ints, map
churn) with no runtime introspection, so one harness run covers both
the introspection extremes and what the funcinfo machinery costs code
that never asks for it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Go's pclntab pages are touched by its own runtime (traceback, GC) long
before user code queries it, so its first FuncForPC never pays page-in.
Mirror that: when the prebuilt table is present, init adopts it
(zero-copy, sub-µs), touches the pages the lookup path reads (blob,
funcinfo records, string offsets, strings), runs one synthetic lookup
to warm the code paths, and write-warms the FuncForPC cache pages.

First-in-process FuncForPC: darwin ~17µs -> ~2.8µs, linux ~6.6µs ->
~1.0µs. Startup cost is page-count-bound (tens of µs on stdlib-sized
tables, invisible next to ~3ms process startup; hello-world medians
unchanged). Non-prebuilt binaries stay fully lazy: first-use
construction allocates, which has no place in init, and programs that
never introspect pay nothing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
-depths generates deep_<N> scenarios at configurable call depths;
-bigsizes generates bigfunc scenarios (funcs x statements) whose large
bodies stress statement-level pcline density, mid-function pc
symbolization, and ordinary performance of big method bodies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Blob overflow: function-value stubs can double the row count, and at
  ~9k functions the prebuilt blob no longer fit the entry section, so
  the rewrite silently fell back to first-use construction
  (cold.FirstFuncForPC 96x96 non-LTO: 2.4ms). On overflow, retry with
  function entries only — stub pcs degrade to dladdr, real entries keep
  the prebuilt table.

- FuncForPC cache thrash: the set-associative pc cache holds 4k
  entries; batch workloads over 9k+ distinct functions evicted
  constantly and paid the string-materializing slow path on every call
  (multipkg.FuncForPCMany 96x96: 8-11ms vs Go 172µs). Add a per-ftab-row
  *Func cache for exact-entry lookups, so batch lookups are
  O(binary search) after the first pass at any scale.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erflow

Function-value stubs can push the row count past what the entry section
holds (~9k functions with taken addresses). Instead of dropping stub
rows, write the full blob into the (larger) stub section and leave a
32-byte redirect header ("LLGOFTB2" + a live-relocation pointer) in the
entry section; the runtime follows it and adopts the same zero-copy
view. Function-value lookups keep the prebuilt table at any scale
instead of degrading to dladdr.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
funcForPCSlow treated any unaligned pc as a shadow-stack synthetic
marker. arm64 function entries are always 4-aligned so this never
fired, but amd64 function and stub entries need not be: an unaligned
function-value pc skipped the prebuilt exact-entry path entirely and
fell through to nearest-below symbolization, reporting the previous
function's name (test/go TestRuntimeLineInfoAndStack on ubuntu CI,
"bad function value func: main.renamedPC").

Hoist the prebuilt exact-entry + per-row-cache lookup ahead of the
alignment heuristic; a genuine synthetic pc just misses the cheap
search and proceeds as before.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The overflow fallback dropped stub rows to fit the entry section. That
leaves pc ranges the table claims to cover but does not: a function
value whose stub falls in a gap resolves nearest-below to the previous
function and silently returns the wrong name — exactly what ubuntu CI
caught (amd64 --icf=safe layouts overflow by a few hundred bytes, and
non-PIE ELF dladdr cannot rescue). If the blob fits neither the entry
section nor the (larger) stub section, skip the rewrite entirely:
first-use construction is slower but covers every record.

Reproduced and verified on linux/amd64 (qemu): the stub pc had no exact
row and nearest-below returned the neighbouring function's name.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rgery

Fabricated fixtures make the IO paths testable in-process: a minimal
ELF exercises load/Rewrite end to end (in-place, stub-section spill,
and the overflow fallback that must leave the binary untouched), and a
synthetic Mach-O image drives the chained-fixup chain surgery
(remove+splice, empty-page insert, unconsumed-insert error). Package
coverage 16% -> 69%.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A fabricated Mach-O (segments, sections, symtab, chained-fixup imports
and an empty page chain) drives load, bind-target resolution, record
decoding and both Rewrite outcomes (in-place and stub-section spill)
end to end. codesign now runs only when the input carries
LC_CODE_SIGNATURE: real lld executables always do, unsigned inputs need
no signature and codesign rejects them. Also cover asmQuoteELFSymbol,
the empty-table initializers and the Rewrite error paths. Package
coverage: pclnpost 69% -> 86%.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Every Go function on supported targets keeps the frame-pointer chain
("frame-pointer"="non-leaf", gated by Program.NeedsFramePointer to
linux/darwin — on embedded targets the unwinder does not exist and the
layout change perturbed the conservative GC on ESP32-C3). runtime.Caller,
Callers, CallersFrames, Stack and the unrecovered-panic dump walk
[fp]/[fp+w] directly and symbolize through the prebuilt ftab and pcline
tables:

- Return addresses resolve at pc-1 (Go's convention); statement labels
  can land exactly on a return address, so raw-pc nearest-below reported
  the following line. The convention holds with or without the prebuilt
  table (text bounds fall back to the first-use frame table — link-phase
  overflow layouts otherwise silently disabled it, the root cause of the
  amd64 CI failures).
- The walk is bounded to the program's own text: libc frames without FP
  discipline decode as wild pcs that nearest-below would attribute to
  arbitrary functions.
- Methods and anonymous functions are now trackable (methods had no
  pcline labels; closures lost their innermost frame to tail-call
  optimization), and mid-function aligned pcs merge statement records
  instead of returning declaration lines.
- frameSymbol results are memoized per pc (deep re-walks paid a dladdr
  per frame: 32-frame walks 8µs -> 180ns) and the pcline table is built
  during the startup pre-warm (lazily building it inside the first
  Caller cost ~200µs at scale).
- Shadow-stack instrumentation is no longer emitted; LLGO_SHADOW_STACK=1
  keeps the legacy emitters for one release. Tracked functions retain
  noinline, no-tail-call and the data-only pcline records.
- libunwind is gone: the clite stacktrace fallback walks the FP chain
  with dladdr names (same output format), and linux binaries no longer
  link -lunwind.
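
A small simulation of two points from the list above: the bounded frame-pointer walk and the pc-1 convention for return addresses. Stack memory is modeled as a map so the example runs anywhere; the frame layout ([fp] holds the caller's fp, [fp+8] the return address), the addresses, and all names are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

const wordSize = 8

// memory stands in for stack memory: address -> word.
type memory map[uint64]uint64

// walk follows the frame-pointer chain and collects return addresses,
// stopping at the first one outside [textLo, textHi).
func walk(mem memory, fp, textLo, textHi uint64, max int) []uint64 {
	var pcs []uint64
	for fp != 0 && len(pcs) < max {
		ret, ok := mem[fp+wordSize]
		if !ok || ret < textLo || ret >= textHi {
			break // not the program's own text: stop rather than guess
		}
		pcs = append(pcs, ret)
		next := mem[fp]
		if next <= fp {
			break // the chain must move toward older frames
		}
		fp = next
	}
	return pcs
}

// site is a statement label: the pc where a source line begins.
type site struct {
	pc   uint64
	line int
}

// lineForReturn resolves a return address at ret-1, so a statement label
// that sits exactly on the return address is not picked up.
func lineForReturn(sites []site, ret uint64) int {
	pc := ret - 1
	i := sort.Search(len(sites), func(i int) bool { return sites[i].pc > pc })
	if i == 0 {
		return 0
	}
	return sites[i-1].line
}

func main() {
	mem := memory{
		0x7000: 0x7100, 0x7008: 0x108, // frame 0
		0x7100: 0x7200, 0x7108: 0x208, // frame 1
		0x7200: 0, 0x7208: 0x9000, // frame 2 returns into foreign code
	}
	pcs := walk(mem, 0x7000, 0x100, 0x1000, 32)
	fmt.Printf("%#x\n", pcs)

	// A 4-byte call at 0x104 on line 10 returns to 0x108, where the
	// label for line 11 starts.
	sites := []site{{0x100, 10}, {0x108, 11}}
	fmt.Println(lineForReturn(sites, 0x108))
}
```

Looking up the raw return address 0x108 would land on the label for line 11; subtracting one keeps the lookup inside the call instruction on line 10.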

Semantics are gc ground truth, verified against go: physical stacks show
every real frame; interface-chain Caller marks land at skip 3 and
closure chains at skip 4 (the old expectations encoded shadow-stack
frame loss). Perf (best-of, mac/linux): hot.Caller0 17/37ns (Go
155/241), deep.Direct512 2.8µs (Go 9.7µs; was 87-95µs), bigfunc.Work
18µs (Go 30µs; was 433µs), binary size unchanged or smaller.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
IR goldens gain the frame-pointer attribute (out.ll files carry no
attribute groups and needed no regeneration); the legacy shadow-stack
emitter assertions opt into LLGO_SHADOW_STACK; statement-line probes
move to gc ground-truth skip counts; NeedsFramePointer target matrix and
pclnpost symbolAddr/decodePtr edges covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ct log/slog/testing locations

An unrecovered panic now prints a Go-style traceback (function names plus
file:line per physical frame) through a PanicTraceback hook the public
runtime registers; the clite dladdr dump remains the fallback when the FP
walk or the tables are unavailable.

Caller-frame tracking now applies uniformly: the blanket stdlib exclusion
is gone, so the same per-package reaches-runtime.Caller analysis that
already covered third-party code tracks log.Output, slog's Logger.log and
testing's decorate chains (their thin wrappers were inlined, making fixed
Caller depths count past them — log.Lshortfile printed "???:1"). Call
sites into caller-pc-consuming functions of other packages get a statement
anchor so the attributed frame reports the exact line. The collector also
picks up named-type methods declared by the package itself — a type used
only concretely never enters RuntimeTypes, which is exactly how
slog.(*Logger).Info escaped tracking.

hello-world size cost: +368 bytes (the traceback printer).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Four scenarios, every expectation verified against gc: unrecovered panic
tracebacks; log.Lshortfile and slog AddSource (text+JSON, package funcs and
logger methods); a failing t.Errorf under llgo test; and an introspection
grab-bag (goroutine/init/defer callers, FuncForPC names for methods,
closures and generics, the errors-with-stack capture idiom).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
codecov Bot commented Jul 4, 2026


@cpunion cpunion force-pushed the codex/stage5-memprofile branch from cd907ef to f4dea56 Compare July 4, 2026 07:43
cpunion and others added 2 commits July 4, 2026 15:49
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…root xfails

Three conformance residuals left by the earlier shadow-stack-era PRs
(xgo-dev#1903, xgo-dev#1924), reimplemented on the unwinder base:

- //line directive filenames follow gc's spelling: relative directives
  stay relative (the package loader expands them to absolute paths),
  empty directives report "??" — and an empty filename must still emit
  its statement anchor or attribution falls through to a neighboring
  record (fixedbugs/issue18149, issue22662).
- Linking uses --icf=none: Go semantics require distinct functions to
  keep distinct pcs (FuncForPC names, function-value identity —
  fixedbugs/issue58300 printed main.f for both f and g). lld's safe mode
  folded llgo-emitted same-body functions anyway; gc never folds.
  Hello-world cost: +1.6%.
- Remove 12 goroot xfails now passing: the two //line cases, issue58300,
  and nine that the stage5+acceptance line fixed outright (inline_caller,
  inline_callers, inline_literal, issue7690, issue17381, issue21879,
  issue22083, issue29735, issue58300b).
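
The distinct-pc requirement behind the --icf=none item can be observed with a small program in the spirit of the issue58300 case mentioned above (this is not the goroot test itself; f, g and nameOf are made up for the example). Two functions with identical bodies must still report their own names through FuncForPC.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

//go:noinline
func f() int { return 1 }

//go:noinline
func g() int { return 1 }

// nameOf returns the symbol name registered for a function value's entry pc.
func nameOf(fn interface{}) string {
	return runtime.FuncForPC(reflect.ValueOf(fn).Pointer()).Name()
}

func main() {
	// If the linker folded f and g into one body, both lookups would
	// resolve to the same pc and report the same name.
	fmt.Println(nameOf(f), nameOf(g))
}
```

Under gc the two names are expected to differ, since gc never folds functions.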

Still xfailed, tracked for the panic-snapshot follow-up: bug347, bug348,
issue14646, issue27201, issue29504, issue33724, issue4562, issue5856
(deferred runtime.Caller during panic unwinding); heapsampling
(memprofile attribution).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cpunion cpunion force-pushed the codex/stage5-memprofile branch 2 times, most recently from 31e99a5 to c647b14 Compare July 4, 2026 10:57
cpunion and others added 2 commits July 4, 2026 20:50
Replaces the size-class counters with gc-shaped heap profiling: sampled
allocations are attributed to physical call stacks at exact statement
lines, and records hold RAW sampled counts — consumers (pprof, goroot
heapsampling.go) apply the Poisson correction themselves, exactly as
with gc.

- Sampling mirrors gc's mcache.nextSample: bytes count down to an
  exponentially distributed threshold (mean MemProfileRate), sample once
  on crossing, redraw. The memoryless distribution is load-bearing: with
  any bounded-support threshold a near-periodic allocation pattern
  phase-locks the sample points onto the large sites (observed 1.6x
  per-site skew on heapsampling's interleaved sizes). ln() is a small
  local approximation — the runtime core cannot import math.
- Stacks come from the FP walk at sample time (fpCallers via a hook the
  public runtime registers), bucketed by stack hash; allocator plumbing
  (including __llgo_stub. wrapper frames of the hook) is trimmed at read
  time. A reentrancy flag spans the whole decision path: threshold
  drawing and bucket allocation themselves allocate, and a recursive
  sample overflows the stack.
- Heap allocations get statement anchors in tracked functions, and a
  package that reads the memory profile (runtime.MemProfile /
  MemProfileRate under either the "runtime" or the patched
  lib-runtime spelling) pins all its trackable functions: per-site
  attribution loses sites to inlining otherwise. Profiling packages are
  rare and accuracy beats inlining there; gc gets both via its inline
  tree (P4).
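
The stack-hash bucketing and the reentrancy flag from the list above can be sketched as follows. This is an assumption-laden illustration: the hash (FNV-1a via hash/fnv), the map-based bucket table, the onInsert hook that stands in for the allocation bucket creation performs, and every name are hypothetical, and the runtime's actual data structures are not shown in this PR text.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// record holds RAW sampled counts for one call stack.
type record struct {
	stack   []uint64
	objects int64
	bytes   int64
}

type profile struct {
	busy     bool                 // reentrancy flag for the whole decision path
	buckets  map[uint64][]*record // stack hash -> records (collisions chained)
	onInsert func()               // hypothetical: models bucket creation allocating
}

func newProfile() *profile {
	return &profile{buckets: make(map[uint64][]*record)}
}

func hashStack(stk []uint64) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, pc := range stk {
		binary.LittleEndian.PutUint64(buf[:], pc)
		h.Write(buf[:])
	}
	return h.Sum64()
}

func sameStack(a, b []uint64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// lookup returns the record for stk, or nil if none exists yet.
func (p *profile) lookup(stk []uint64) *record {
	for _, r := range p.buckets[hashStack(stk)] {
		if sameStack(r.stack, stk) {
			return r
		}
	}
	return nil
}

// sample records one sampled allocation; nested calls are dropped.
func (p *profile) sample(stk []uint64, size int64) bool {
	if p.busy {
		return false // already inside the profiler: do not recurse
	}
	p.busy = true
	defer func() { p.busy = false }()
	r := p.lookup(stk)
	if r == nil {
		r = &record{stack: append([]uint64(nil), stk...)}
		h := hashStack(stk)
		p.buckets[h] = append(p.buckets[h], r)
		if p.onInsert != nil {
			p.onInsert()
		}
	}
	r.objects++
	r.bytes += size
	return true
}

func main() {
	p := newProfile()
	p.onInsert = func() {
		fmt.Println("nested sample accepted:", p.sample([]uint64{0x999}, 8))
	}
	p.sample([]uint64{0x100, 0x200}, 32)
	p.sample([]uint64{0x100, 0x200}, 32)
	r := p.lookup([]uint64{0x100, 0x200})
	fmt.Println(r.objects, r.bytes)
}
```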

goroot heapsampling.go passes on darwin/arm64 and linux/arm64 (the
latter was a pre-existing platform gap). Depends on --icf=none from the
line-directive PR: heapsampling's three identical wrapper functions must
keep distinct pcs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
An acceptance regression asserts exact per-line attribution at rate=1
(raw counts are exact there), and a cl unit test covers the
memprofile-package pinning criterion under both runtime spellings.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cpunion cpunion force-pushed the codex/stage5-memprofile branch from c647b14 to 2b22d07 Compare July 4, 2026 12:52