Skip to content

fix(runtime): make Delete Each walk-safe + erase GC refs on object delete#85

Merged
CoreyRDean merged 1 commit into
developfrom
fix/exit-teardown-crash
Jun 9, 2026
Merged

fix(runtime): make Delete Each walk-safe + erase GC refs on object delete#85
CoreyRDean merged 1 commit into
developfrom
fix/exit-teardown-crash

Conversation

@CoreyRDean

Copy link
Copy Markdown
Collaborator

Summary

Root-causes the long-standing downstream rcce2 CI flake — intermittent exit-time crashes in ItemsTest.bb ("Stack overflow!") and OnlinePlayerChainTest.bb ("Memory access violation") that two rcce2-side band-aid rounds (rcce2 #313/#322) reduced from ~30-40% to ~5-7% without ever identifying the mechanism. The historical "Stack overflow!" label was misdirection from the (since-fixed) seTranslator fall-through; with new fault-address instrumentation, every baseline failure is 0xC0000005 EXCEPTION_ACCESS_VIOLATION at a stable program-relative offset.

Two defects, both evidence-confirmed

1. _bbObjDeleteEach captured-next invalidation (bbruntime/basic.cpp). The walk captured next=obj->next before _bbObjDelete(obj), but the delete's field-release cascade can drop another node's ref count to zero and relocate it from used to free — including the captured next (chain shape: a zombie kept alive only by its predecessor's field). The walk then steps onto free-list linkage and terminates at the free sentinel, silently leaving the rest of the list undeleted — or wanders invalid memory. Deterministic repro: a 1000-node chain with every other node zombified left 499 survivors under Delete Each (now tests/DeleteEachChainTest.bb). Fix: restart-on-delete walk — re-read used.next after every delete, step over zombies in place. Each delete removes at least one fielded node from used, so the walk terminates.

2. GC reference_map desync on pool delete. Delete / Delete Each freed objects via _bbObjDelete without erasing the pointer from the GC reference_map; a later _bbRelease on the stale entry hit count==0 and re-ran _bbObjDelete against a freed — possibly recycled — slot, releasing garbage "fields" (heap corruption surfacing at process exit with counters already at 0). Fix: _bbObjDelete erases its pointer from reference_map (the GC's own delete path already erases first, so that path double-erases as a no-op). Pinned by tests/DeleteEachGCRefMapTest.bb via RefCount() — pre-fix the stale count survives the sweep; post-fix it reads 0.

Diagnostics upgrade (permanent): seTranslator now appends exception code, module-relative faulting address (VirtualQuery/GetModuleFileName), and AV access kind/target to panic messages. Stack-overflow stays minimal on purpose (guard page is blown; no stack-hungry formatting).

Evidence

  • Pre-fix baseline (instrumented build, rcce2 suite): ItemsTest 2/100 failed, OnlinePlayerChainTest 12/100 failed, all 0xC0000005, faulting instruction at a stable per-test program-relative offset (…0207 / …0231 across ASLR'd runs) — one compiled statement dereferencing corrupted object linkage.
  • Deterministic probe: zombie-chain Delete Each → 499/1000 survivors pre-fix; 0 post-fix (and 0 active strings/objects/unreleased at exit, vs 499/499/998).
  • Full BlitzForge test.bat green including the two new regression files.
  • Post-fix 300-run statistical verification on the rcce2 flaky tests is running; results posted on the rcce2 submodule-bump PR (expected ~36 failures at baseline rate if the fix were ineffective).

Risk

Exit/delete-path-only changes to the runtime shared by every shipped executable. Mitigations: deterministic regression tests, full suite, and the 300-run statistical gate downstream. The restart walk is O(zombie-prefix × deletes) worst case — trivial at realistic counts (250k pointer hops for the 1000-node repro), and Delete Each is not a per-frame hot path.

🤖 Generated with Claude Code

…lete

Root-causes the long-standing downstream (rcce2) CI flake: intermittent
exit-time access violations in ItemsTest / OnlinePlayerChainTest,
historically mislabeled "Stack overflow!" by the pre-fix seTranslator.

Two defects, both confirmed by evidence:

1. _bbObjDeleteEach captured next=obj->next before _bbObjDelete(obj), but
   the delete's field-release cascade can drop another node's ref count
   to zero and relocate it from used to free -- including the captured
   next (chain shape: a zombie kept alive only by its predecessor's
   field). The walk then terminates at the free sentinel, silently
   leaving the rest of the list undeleted, or wanders invalid linkage.
   Deterministic repro: a 1000-node chain with every other node
   zombified left 499 survivors. Fixed with a restart-on-delete walk
   (re-read used.next after every delete; step over zombies in place).

2. Delete / Delete Each never erased the pointer from the GC
   reference_map, so a later _bbRelease on the stale entry re-ran
   _bbObjDelete against a freed -- possibly recycled -- slot, releasing
   garbage fields (heap corruption surfacing at exit). _bbObjDelete now
   erases its pointer from reference_map; the GC's own delete path
   already erased before calling it, so that path double-erases as a
   no-op.

Also upgrades seTranslator to append the exception code, module-relative
faulting address, and AV access kind/target to panic messages -- the
instrumentation that produced the evidence (pre-fix baseline: 2/100
ItemsTest, 12/100 OnlinePlayerChainTest failures, all 0xC0000005 at a
stable program-relative offset).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@CoreyRDean CoreyRDean requested a review from a team as a code owner June 9, 2026 21:45
@CoreyRDean

Copy link
Copy Markdown
Collaborator Author

Statistical verification result (honest update): the 300-run post-fix loop on the downstream rcce2 flaky tests shows the CI flake persists at the baseline rate (ItemsTest 12/300, OnlinePlayerChainTest 27/300 — vs 2/100 and 12/100 pre-fix; statistically unchanged). The two defects fixed here are real and deterministically pinned by the new regression tests (499/1000 chain nodes survived Delete Each pre-fix; 0 post-fix; stale GC refcounts observable via RefCount pre-fix), but they are not the mechanism behind the intermittent exit-time access violation. The fault signature is unchanged post-fix: 0xC0000005 at a stable program-relative offset in generated code (…0207 for ItemsTest, …0231 for the chain test) reading a varying garbage address. Diagnosis continues in a follow-up: stack-walk + code-byte dump in the panic path to symbolize the faulting site against runtime.dll/blitzcc.exe PDBs. This PR stands on its own merits: two confirmed runtime bugs fixed + permanent fault-address diagnostics.

@CoreyRDean CoreyRDean merged commit 939f936 into develop Jun 9, 2026
4 checks passed
@CoreyRDean CoreyRDean deleted the fix/exit-teardown-crash branch June 9, 2026 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant