fix(runtime): make Delete Each walk-safe + erase GC refs on object delete#85
Conversation
…lete Root-causes the long-standing downstream (rcce2) CI flake: intermittent exit-time access violations in ItemsTest / OnlinePlayerChainTest, historically mislabeled "Stack overflow!" by the pre-fix seTranslator. Two defects, both confirmed by evidence: 1. _bbObjDeleteEach captured next=obj->next before _bbObjDelete(obj), but the delete's field-release cascade can drop another node's ref count to zero and relocate it from used to free -- including the captured next (chain shape: a zombie kept alive only by its predecessor's field). The walk then terminates at the free sentinel, silently leaving the rest of the list undeleted, or wanders invalid linkage. Deterministic repro: a 1000-node chain with every other node zombified left 499 survivors. Fixed with a restart-on-delete walk (re-read used.next after every delete; step over zombies in place). 2. Delete / Delete Each never erased the pointer from the GC reference_map, so a later _bbRelease on the stale entry re-ran _bbObjDelete against a freed -- possibly recycled -- slot, releasing garbage fields (heap corruption surfacing at exit). _bbObjDelete now erases its pointer from reference_map; the GC's own delete path already erased before calling it, so that path double-erases as a no-op. Also upgrades seTranslator to append the exception code, module-relative faulting address, and AV access kind/target to panic messages -- the instrumentation that produced the evidence (pre-fix baseline: 2/100 ItemsTest, 12/100 OnlinePlayerChainTest failures, all 0xC0000005 at a stable program-relative offset). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Statistical verification result (honest update): the 300-run post-fix loop on the downstream rcce2 flaky tests shows the CI flake persists at the baseline rate (ItemsTest 12/300, OnlinePlayerChainTest 27/300 — vs 2/100 and 12/100 pre-fix; statistically unchanged). The two defects fixed here are real and deterministically pinned by the new regression tests (499/1000 chain nodes survived |
Summary
Root-causes the long-standing downstream rcce2 CI flake — intermittent exit-time crashes in
ItemsTest.bb("Stack overflow!") andOnlinePlayerChainTest.bb("Memory access violation") that two rcce2-side band-aid rounds (rcce2 #313/#322) reduced from ~30-40% to ~5-7% without ever identifying the mechanism. The historical "Stack overflow!" label was misdirection from the (since-fixed) seTranslator fall-through; with new fault-address instrumentation, every baseline failure is0xC0000005 EXCEPTION_ACCESS_VIOLATIONat a stable program-relative offset.Two defects, both evidence-confirmed
1.
_bbObjDeleteEachcaptured-next invalidation (bbruntime/basic.cpp). The walk capturednext=obj->nextbefore_bbObjDelete(obj), but the delete's field-release cascade can drop another node's ref count to zero and relocate it fromusedtofree— including the capturednext(chain shape: a zombie kept alive only by its predecessor's field). The walk then steps onto free-list linkage and terminates at the free sentinel, silently leaving the rest of the list undeleted — or wanders invalid memory. Deterministic repro: a 1000-node chain with every other node zombified left 499 survivors underDelete Each(nowtests/DeleteEachChainTest.bb). Fix: restart-on-delete walk — re-readused.nextafter every delete, step over zombies in place. Each delete removes at least one fielded node fromused, so the walk terminates.2. GC
reference_mapdesync on pool delete.Delete/Delete Eachfreed objects via_bbObjDeletewithout erasing the pointer from the GCreference_map; a later_bbReleaseon the stale entry hit count==0 and re-ran_bbObjDeleteagainst a freed — possibly recycled — slot, releasing garbage "fields" (heap corruption surfacing at process exit with counters already at 0). Fix:_bbObjDeleteerases its pointer fromreference_map(the GC's own delete path already erases first, so that path double-erases as a no-op). Pinned bytests/DeleteEachGCRefMapTest.bbviaRefCount()— pre-fix the stale count survives the sweep; post-fix it reads 0.Diagnostics upgrade (permanent):
seTranslatornow appends exception code, module-relative faulting address (VirtualQuery/GetModuleFileName), and AV access kind/target to panic messages. Stack-overflow stays minimal on purpose (guard page is blown; no stack-hungry formatting).Evidence
0xC0000005, faulting instruction at a stable per-test program-relative offset (…0207/…0231across ASLR'd runs) — one compiled statement dereferencing corrupted object linkage.Delete Each→ 499/1000 survivors pre-fix; 0 post-fix (and 0 active strings/objects/unreleased at exit, vs 499/499/998).test.batgreen including the two new regression files.Risk
Exit/delete-path-only changes to the runtime shared by every shipped executable. Mitigations: deterministic regression tests, full suite, and the 300-run statistical gate downstream. The restart walk is O(zombie-prefix × deletes) worst case — trivial at realistic counts (250k pointer hops for the 1000-node repro), and
Delete Eachis not a per-frame hot path.🤖 Generated with Claude Code