fix: add death-handling to the shm writer spinlock (#38) by toloco · Pull Request #83 · toloco/warp_cache

toloco · 2026-06-18T11:10:14Z

Closes #38

What & why

The cross-process writer spinlock in src/shm/lock.rs was a bare TTAS lock with no robustness. If a process was killed (SIGKILL / OOM / crash) between write_lock() and write_unlock(), the lock stayed taken and seq stayed odd forever — every other process then spun indefinitely: write_lock() waiting for the flag to clear, read_begin() waiting for seq to go even. The whole shared cache was permanently wedged with no recovery short of deleting the shm file. insert()/clear() weren't panic-safe either: any panic between lock/unlock leaked the lock.

How

Owner PID in the lock word. The writer slot now stores the owner's PID (0 = free) instead of a 0/1 flag. write_lock() stamps getpid() via CAS(0 → pid) (called per-acquire — caching across fork() would be unsafe).
Dead-owner recovery. A waiter that spins past RECOVER_SPINS on a continuously-held lock probes the owner with kill(pid, 0). On ESRCH it recovers by stamping its own PID — this keeps the lock held (no writer can enter against a stale odd seq) and, crucially, keeps it probeable: a recoverer that itself dies mid-recovery leaves a dead PID that the same path recovers, rather than a terminal sentinel that wedges forever. It then forces seq even and releases. read_begin() runs the same probe so readers can't wedge on a dead writer either.
Panic-safety via RAII. write_lock() returns a WriteGuard whose Drop restores seq parity and releases the lock during unwinding. The crate is panic = unwind, so a panic in the critical section (e.g. a serde/pointer-math bug) across the PyO3 boundary can no longer leak the lock. Callers (insert/clear/get-expired) updated to let _guard = lock.write_lock();.
kill safety. owner_is_dead rejects 0 and any value > i32::MAX, so a corrupt lock word can never make kill() target a process group / every process (kill(0/-1)).

Scope / accepted limitation

Recovery restores liveness only, not the consistency of data a writer was mid-mutating — a process killed mid-insert can leave a half-written entry/free-list (no write-ahead log). A recoverable cache beats a permanent global deadlock; full crash-consistency would be a separate, larger change. Layout is unchanged (the owner PID reuses the existing AtomicU32 slot). No public Python API change.

Tests

New cargo tests in lock.rs: dead-owner recovery on both the write and read paths, recovery for either seq parity (covers a recoverer dying mid-recovery), panic-in-critical-section releases the lock, and the kill-argument guard. Each wedges a lock with a real reaped-child PID and asserts recovery completes under a watchdog (so a regression fails rather than hangs the suite).

Gates run

make fmt / make lint — clean (ruff, ty, clippy -D warnings)
make test — cargo (16, incl. 5 new [medium] TTAS spinlock has no death-handling: a process killed while holding write_lock deadlocks all others forever #38 tests) + pytest (107) ✅
make test-matrix — Python 3.10 / 3.11 / 3.12 / 3.13 ✅; 3.14 skipped (toolchain resolves a prerelease 3.14.0a6 whose ABI breaks PyO3)
make bench (3.13) — no regression (change is off the hot read path; adds one getpid() to the cold miss/write path only)
Loom model RUSTFLAGS="--cfg loom" cargo test --lib seqlock_ordering — [medium] Seqlock writer entry increment uses Release-only, allowing data writes to be reordered before the odd-seq publish #40 ordering still holds ✅

The fix was reviewed by an adversarial multi-agent pass; it caught a terminal RECOVERING-sentinel wedge in the first draft, which is why recovery now stamps the recoverer's own PID instead.

🤖 Generated with Claude Code

The cross-process writer spinlock was a bare TTAS lock with no robustness: a process killed (SIGKILL/OOM/crash) while holding it left the lock taken and seq odd forever, wedging every other process — both write_lock() and read_begin() spin indefinitely with no recovery short of deleting the shm file. Store the owner PID in the lock word (0 = free) instead of a 0/1 flag. A waiter that spins past RECOVER_SPINS on a continuously-held lock probes the owner with kill(pid, 0); on ESRCH it recovers by stamping its own PID (keeps the lock held so no writer enters against a stale odd seq, and stays probeable so a recoverer that itself dies is recovered in turn rather than leaving a terminal sentinel), forces seq even, then releases. read_begin runs the same probe, so readers can't wedge on a dead writer either. write_lock now returns an RAII WriteGuard whose Drop restores seq parity and releases the lock during unwinding, so a panic in the critical section (e.g. a serde/pointer-math bug) across the PyO3 boundary can't leak the lock. Recovery restores liveness only — a writer killed mid-insert can leave a half-written entry/free-list (no WAL); a recoverable cache beats a permanent global deadlock. owner_is_dead rejects 0 and any value > i32::MAX so a corrupt word can never make kill() target a process group. Adds cargo tests for dead-owner recovery (write + read paths, both seq parities), panic-safety, and the kill-argument guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add death-handling to the shm writer spinlock (#38)#83

fix: add death-handling to the shm writer spinlock (#38)#83
toloco wants to merge 1 commit into
masterfrom
fix/38-spinlock-death-handling

toloco commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

toloco commented Jun 18, 2026

What & why

How

Scope / accepted limitation

Tests

Gates run

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant