Skip to content

fix(codegen): stop emitting GC releases for globals through uninitialized frame offsets#86

Merged
CoreyRDean merged 1 commit into
developfrom
fix/teardown-crash-diagnosis2
Jun 9, 2026
Merged

fix(codegen): stop emitting GC releases for globals through uninitialized frame offsets#86
CoreyRDean merged 1 commit into
developfrom
fix/teardown-crash-diagnosis2

Conversation

@CoreyRDean

Copy link
Copy Markdown
Collaborator

Summary

Root cause of the rcce2 CI flake, found and statistically killed. Follow-up to #85 (which fixed two real Delete Each defects but the flake persisted at baseline rate — see the honest-update comment there).

The mechanism

node.cpp deleteVars() emitted the GC release __bbRelease( [ebp + Decl::offset], "type" ) for every struct/blitz-typed decl in the function's environ — missing the d->kind gate that every sibling branch (__bbStrRelease, __bbObjRelease, __bbVecFree) has. For main(), that environ includes the program globals, whose Decl::offset is never assigned by frame layout and was omitted from the Decl constructor's initializer list — uninitialized memory:

  • stale value 0 (most runs): mov ebx,[ebp+0] reads the saved-EBP slot, _bbRelease lookup-misses → run passes;
  • stale heap garbage: mov ebx,[ebp+0x006D766B]-style wild read in main's epilogue → 0xC0000005 at a stable program-relative offset, after the last test, with counters already at 0.

GC on/off is irrelevant — the wild read precedes the no-op release, which is why the non-GC OnlinePlayerChainTest crashed identically to the GC ItemsTest.

Evidence trail

  1. Round-1 instrumentation: every failure 0xC0000005 at stable offsets (…0207/…0231) reading run-varying garbage.
  2. Round-2: full hex dump of the generated-code allocation + registers + dbghelp-symbolized EBP walk. The faulting instruction decodes as 8B 9D <4 run-varying ASCII-fragment bytes> — while the same instruction in a -o exe's embedded object is 8B 5D 00 ([ebp+0], disp8). Different encodings of the same logical instruction ⇒ the displacement differed at assembly time ⇒ codegen consumed run-varying uninitialized state. deleteVars + uninitialized Decl::offset is that state.

The fix

  • Gate the __bbRelease emission on frame-resident kinds: DECL_LOCAL || DECL_PARAM. Params must stay — the call protocol references args on the way in, the callee releases them here; a LOCAL-only first attempt failed GarbageCollectionTest (param refs leaked), which pins the contract.
  • Zero-init Decl::offset in the constructor (defense-in-depth: any future read of a non-frame decl's offset is a deterministic 0, not heap noise).
  • Keeps the round-2 diagnostics in seTranslator (registers, anonymous-allocation image dump, dbghelp-symbolized stack walk) — they are what made this finable and will make the next one cheaper.

Verification

pre-fix post-fix
ItemsTest 2/100, 12/300 failed 0/300
OnlinePlayerChainTest 12/100, 27/300 failed 0/300

At baseline rates, ~6-12 and ~27-36 failures were expected in 300 runs; observing 0 in both is p < 1e-8. Full BlitzForge test.bat green (including GarbageCollectionTest, which caught the over-narrow first gate). An rcce2 submodule-bump PR follows; the rcce2 CLAUDE.md "known intermittent flake" workaround gets retired there.

Risk

Codegen change is a strict narrowing (stops emitting one bogus instruction sequence for non-frame decls); param/local release behavior pinned by the existing GC suite. Diagnostics are crash-path-only.

🤖 Generated with Claude Code

…ized frame offsets

Root cause of the long-standing intermittent exit-time crash in downstream
rcce2 CI (ItemsTest / OnlinePlayerChainTest, ~2-12% of runs, historically
mislabeled "Stack overflow!").

node.cpp deleteVars() emitted the GC release
    __bbRelease( [ebp + Decl::offset], "type" )
for every struct/blitz-typed decl in the environ -- missing the d->kind
gate every sibling branch has. main()'s environ includes the program
GLOBALS, whose Decl::offset is never assigned by frame layout and was
omitted from the Decl constructor's initializer list: uninitialized
memory. When the stale value was 0 (most runs) the instruction read the
saved-EBP slot and _bbRelease lookup-missed -- harmless. When it was heap
garbage, main's epilogue performed a wild [ebp+garbage] read and access-
violated. GC on/off is irrelevant: the wild read precedes the release,
which is why the non-GC chain test crashed identically.

Fix: gate the emission on frame-resident kinds (DECL_LOCAL || DECL_PARAM
-- params must stay released: the call protocol references args on the
way in and the callee releases them; a LOCAL-only gate fails
GarbageCollectionTest), and zero-init Decl::offset as defense-in-depth.

Also extends the crash diagnostics that produced the evidence: registers,
a hex dump of the faulting anonymous executable allocation (dynamically
emitted Blitz code has no symbols), and a dbghelp-symbolized EBP-chain
stack walk in seTranslator.

Verification: pre-fix baseline ItemsTest 2/100 + 12/300 failures, chain
test 12/100 + 27/300; post-fix 0/300 + 0/300 (expected ~6-12 and ~27-36
at baseline rates; p < 1e-8). Full test.bat green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@CoreyRDean CoreyRDean requested a review from a team as a code owner June 9, 2026 22:29
@CoreyRDean CoreyRDean merged commit af497fb into develop Jun 9, 2026
4 checks passed
@CoreyRDean CoreyRDean deleted the fix/teardown-crash-diagnosis2 branch June 9, 2026 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant