fix(security): cascade-guard redesign (#29, #36)#40
Conversation
Resolves two FR-PROTO-10 cascade-guard findings, both rooted in the per-pthid error budget living on the dispatcher's general thread store. A PoC-backed adversarial red-team pass (AGENTS.md §2) caught a HIGH concurrency bug in the first cut and drove the final design. Full suite 630 green; warnings-as-errors clean (incl. cookbook). - #36 Budget moved to a dedicated CascadeBudgetStore, separate from the general thread store. A flood of cheap fresh thids (general store) can no longer evict and reset a victim pthid's budget. The store OWNS its concurrency: the increment + threshold + trip decision is atomic under a single pthid-keyed lock with eviction inside it — fixing a double-emit the first cut had (locking on the evictable per-thread state object broke under LRU eviction). Its LRU prefers evicting not-yet-tripped entries so a tripped thread stays silenced through a fresh-pthid flood (still bounded — not exempted). Registered as its own DI singleton so the budget survives a non-singleton handler. Residual (distinct-pthid error-report pressure) documented. - #29 Sustained unrepliable (from-less) over-threshold streams no longer grow the counter or flood the Info log forever: the count clamps at threshold+1 and logging stops after the breach, while the unrepliable-deferral is preserved so a later repliable report still fires the cascade-stop. - ThreadState.ErrorCount/MaxErrorsNoticeSent retained as general per-thread fields, no longer used by the built-in guard; cookbook narrates the cascade-stop notice (which carries the count) instead of internal state. Tests: + sustained-unrepliable clamp (#29), + general-store-flood-survives (#36), + no-double-emit-under-eviction, + tripped-survives-flood (red-team). Closes #29, closes #36 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
moisesja
left a comment
There was a problem hiding this comment.
Overall: solid work. The core design is correct, the adversarial pass caught and drove a real bug fix (the evictable-lock-seam double-emit), and the test suite covers all the stated properties. A few things to address or carry forward:
Bug — maxEntries = 1 silently breaks the capacity bound
_lowWaterMark is computed as:
_lowWaterMark = Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10));
For maxEntries = 1: 1 / 10 = 0 (integer division) → Math.Max(1, 0) = 1 → 1 - 1 = 0 → Math.Max(1, 0) = 1. So _lowWaterMark = 1 = _maxEntries.
Inside Evict():
.Take(_entries.Count - _lowWaterMark) // 1 - 1 = 0 victims removed
Nothing is removed. Then GetOrCreate adds the new entry anyway — the store exceeds its cap by 1 and keeps growing by 1 on every eviction trigger. The ctor only rejects <= 0, so maxEntries = 1 is silently accepted and the bound never holds.
The same math makes eviction remove exactly 1 entry for any maxEntries < 10 (since maxEntries / 10 = 0). That's not catastrophically wrong, but maxEntries = 1 is a guaranteed no-op eviction.
Suggested fix — clamp _lowWaterMark strictly below _maxEntries:
_lowWaterMark = Math.Min(_maxEntries - 1, Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10)));
or tighten the ctor guard to maxEntries < 2.
Test quality — Warning_report_does_not_increment_error_count becomes vacuously true
Before this PR the test confirmed the handler didn't write ErrorCount on the general store for warning codes. After this PR the handler writes nothing to the general store for any code, so the existing assertion is trivially true and no longer distinguishes warning from error codes on the budget side.
Worth adding budget.Peek("thread-1").ErrorCount.Should().Be(0) to confirm warning codes don't create a budget entry at all.
Minor — single global lock is a known constraint
RecordErrorReport serialises all callers through _gate. Correct and safe; problem reports are rare events. Worth documenting as a deliberate trade-off (striped locks → high throughput; distributed store → multi-host). The PR description already covers the distributed residual — just noting the in-process analogue is the same frontier.
…rigor - Bug: maxEntries=1 silently broke the CascadeBudgetStore bound. _lowWaterMark computed as Math.Max(1, …) equalled the cap for maxEntries=1 (1/10==0), so Evict() removed 0 entries and the store grew by one per distinct pthid. Clamp _lowWaterMark to maxEntries-1 so eviction always frees a slot. + regression test: 200 distinct pthids at maxEntries=1 ⇒ Count ≤ 1. - Test rigor: Warning_report_does_not_increment_error_count was vacuously true (the handler no longer writes the general store for any code). Now asserts the warning creates NO budget entry (budget.Peek(...).ErrorCount == 0, Count == 0). - Doc: note the single _gate lock is a deliberate trade-off (striped locks for high in-process throughput; distributed store for multi-host). Full suite 631 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks — all three addressed in 78657bb. Full suite 631 green. 1. 🔴
|
Summary
Resolves two FR-PROTO-10 problem-report cascade-guard findings (#29, #36), addressed together since both stem from the per-
pthiderror budget living on the dispatcher's general thread store. Full suite: 630 tests green; warnings-as-errors clean (incl. cookbook).Fixes
CascadeBudgetStore, separate from the general thread store (which is flooded with one entry per inboundthid). A flood of cheap freshthids can no longer evict and reset a victimpthid's budget.threshold + 1and logging stops after the breach, while the unrepliable-deferral is preserved (a later repliable report still fires the cascade-stop).The adversarial pass earned its keep (AGENTS.md §2)
Both red-team subagents converged on a HIGH-severity bug in my first cut: I made the dedicated store an LRU store but kept locking on the evictable per-
pthidstate object. Under eviction, two concurrent callers lock different instances for the same pthid → the "emit exactly once" guard double-emits (PoC-proven, deterministic at small caps). Remediated:RecordErrorReportdoes the whole increment + threshold + trip decision atomically under a singlepthid-keyed lock, with eviction inside that lock, returning a decision (no shared mutable object to lock). Closes the double-emit.pthidflood, without exempting entries from eviction (still bounded, per the Problem-report cascade guard (FR-PROTO-10) defeatable by forcing thread-state eviction (residual of #21) #36 issue).CascadeBudgetStoreis registered as its own singleton, so the budget persists even if the handler is mis-registered non-singleton.pthiderror reports (far costlier than the original cheap-thidreset; each does real work). A distributed/persistent budget is the path to a hard guarantee. Captured as lesson L-028.Integration impact handled
Section_U_ProblemReport) inspectedThreadState.ErrorCounton a store it controlled; the budget moved away from that store, so it now narrates the cascade-stop notice (which carries the count) — the consumer-facing signal.ThreadState.ErrorCount/MaxErrorsNoticeSentare retained as general per-thread fields (no public-API removal), re-documented as no longer used by the built-in guard.Test plan
threshold+1(not N), trip deferred; a later repliable report fires the cascade-stop.thid-flood does not reset the budget (would fail onmain, where they share one store).pthidflood.Closes #29, closes #36
🤖 Generated with Claude Code