Skip to content

fix(security): cascade-guard redesign (#29, #36)#40

Merged
moisesja merged 2 commits into
mainfrom
feat/security-cascade-guard
Jun 17, 2026
Merged

fix(security): cascade-guard redesign (#29, #36)#40
moisesja merged 2 commits into
mainfrom
feat/security-cascade-guard

Conversation

@moisesja

Copy link
Copy Markdown
Owner

Summary

Resolves two FR-PROTO-10 problem-report cascade-guard findings (#29, #36), addressed together since both stem from the per-pthid error budget living on the dispatcher's general thread store. Full suite: 630 tests green; warnings-as-errors clean (incl. cookbook).

Fixes

# Change
#36 The cascade budget moved to a dedicated CascadeBudgetStore, separate from the general thread store (which is flooded with one entry per inbound thid). A flood of cheap fresh thids can no longer evict and reset a victim pthid's budget.
#29 A sustained unrepliable (anoncrypt/from-less) over-threshold stream no longer grows the counter or floods the Info log forever — the count clamps at threshold + 1 and logging stops after the breach, while the unrepliable-deferral is preserved (a later repliable report still fires the cascade-stop).

The adversarial pass earned its keep (AGENTS.md §2)

Both red-team subagents converged on a HIGH-severity bug in my first cut: I made the dedicated store an LRU store but kept locking on the evictable per-pthid state object. Under eviction, two concurrent callers lock different instances for the same pthid → the "emit exactly once" guard double-emits (PoC-proven, deterministic at small caps). Remediated:

  • The store now owns its concurrencyRecordErrorReport does the whole increment + threshold + trip decision atomically under a single pthid-keyed lock, with eviction inside that lock, returning a decision (no shared mutable object to lock). Closes the double-emit.
  • LRU prefers evicting not-yet-tripped entries — a tripped thread's silenced decision survives a fresh-pthid flood, without exempting entries from eviction (still bounded, per the Problem-report cascade guard (FR-PROTO-10) defeatable by forcing thread-state eviction (residual of #21) #36 issue).
  • DI decouplingCascadeBudgetStore is registered as its own singleton, so the budget persists even if the handler is mis-registered non-singleton.
  • Residual documented — a determined attacker can still pressure the bound with distinct-pthid error reports (far costlier than the original cheap-thid reset; each does real work). A distributed/persistent budget is the path to a hard guarantee. Captured as lesson L-028.

Integration impact handled

  • The cookbook (Section_U_ProblemReport) inspected ThreadState.ErrorCount on a store it controlled; the budget moved away from that store, so it now narrates the cascade-stop notice (which carries the count) — the consumer-facing signal.
  • ThreadState.ErrorCount/MaxErrorsNoticeSent are retained as general per-thread fields (no public-API removal), re-documented as no longer used by the built-in guard.

Test plan

Closes #29, closes #36

🤖 Generated with Claude Code

Resolves two FR-PROTO-10 cascade-guard findings, both rooted in the per-pthid
error budget living on the dispatcher's general thread store. A PoC-backed
adversarial red-team pass (AGENTS.md §2) caught a HIGH concurrency bug in the
first cut and drove the final design. Full suite 630 green; warnings-as-errors
clean (incl. cookbook).

- #36 Budget moved to a dedicated CascadeBudgetStore, separate from the general
  thread store. A flood of cheap fresh thids (general store) can no longer evict
  and reset a victim pthid's budget. The store OWNS its concurrency: the
  increment + threshold + trip decision is atomic under a single pthid-keyed
  lock with eviction inside it — fixing a double-emit the first cut had
  (locking on the evictable per-thread state object broke under LRU eviction).
  Its LRU prefers evicting not-yet-tripped entries so a tripped thread stays
  silenced through a fresh-pthid flood (still bounded — not exempted).
  Registered as its own DI singleton so the budget survives a non-singleton
  handler. Residual (distinct-pthid error-report pressure) documented.
- #29 Sustained unrepliable (from-less) over-threshold streams no longer grow
  the counter or flood the Info log forever: the count clamps at threshold+1 and
  logging stops after the breach, while the unrepliable-deferral is preserved so
  a later repliable report still fires the cascade-stop.
- ThreadState.ErrorCount/MaxErrorsNoticeSent retained as general per-thread
  fields, no longer used by the built-in guard; cookbook narrates the
  cascade-stop notice (which carries the count) instead of internal state.

Tests: + sustained-unrepliable clamp (#29), + general-store-flood-survives (#36),
+ no-double-emit-under-eviction, + tripped-survives-flood (red-team).

Closes #29, closes #36

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@moisesja moisesja self-assigned this Jun 17, 2026
@moisesja moisesja added this to the 1.0.0 GA milestone Jun 17, 2026

@moisesja moisesja left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: solid work. The core design is correct, the adversarial pass caught and drove a real bug fix (the evictable-lock-seam double-emit), and the test suite covers all the stated properties. A few things to address or carry forward:

Bug — maxEntries = 1 silently breaks the capacity bound
_lowWaterMark is computed as:

_lowWaterMark = Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10));
For maxEntries = 1: 1 / 10 = 0 (integer division) → Math.Max(1, 0) = 1 → 1 - 1 = 0 → Math.Max(1, 0) = 1. So _lowWaterMark = 1 = _maxEntries.

Inside Evict():

.Take(_entries.Count - _lowWaterMark) // 1 - 1 = 0 victims removed
Nothing is removed. Then GetOrCreate adds the new entry anyway — the store exceeds its cap by 1 and keeps growing by 1 on every eviction trigger. The ctor only rejects <= 0, so maxEntries = 1 is silently accepted and the bound never holds.

The same math makes eviction remove exactly 1 entry for any maxEntries < 10 (since maxEntries / 10 = 0). That's not catastrophically wrong, but maxEntries = 1 is a guaranteed no-op eviction.

Suggested fix — clamp _lowWaterMark strictly below _maxEntries:

_lowWaterMark = Math.Min(_maxEntries - 1, Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10)));
or tighten the ctor guard to maxEntries < 2.

Test quality — Warning_report_does_not_increment_error_count becomes vacuously true
Before this PR the test confirmed the handler didn't write ErrorCount on the general store for warning codes. After this PR the handler writes nothing to the general store for any code, so the existing assertion is trivially true and no longer distinguishes warning from error codes on the budget side.

Worth adding budget.Peek("thread-1").ErrorCount.Should().Be(0) to confirm warning codes don't create a budget entry at all.

Minor — single global lock is a known constraint
RecordErrorReport serialises all callers through _gate. Correct and safe; problem reports are rare events. Worth documenting as a deliberate trade-off (striped locks → high throughput; distributed store → multi-host). The PR description already covers the distributed residual — just noting the in-process analogue is the same frontier.

…rigor

- Bug: maxEntries=1 silently broke the CascadeBudgetStore bound. _lowWaterMark
  computed as Math.Max(1, …) equalled the cap for maxEntries=1 (1/10==0), so
  Evict() removed 0 entries and the store grew by one per distinct pthid.
  Clamp _lowWaterMark to maxEntries-1 so eviction always frees a slot.
  + regression test: 200 distinct pthids at maxEntries=1 ⇒ Count ≤ 1.
- Test rigor: Warning_report_does_not_increment_error_count was vacuously true
  (the handler no longer writes the general store for any code). Now asserts the
  warning creates NO budget entry (budget.Peek(...).ErrorCount == 0, Count == 0).
- Doc: note the single _gate lock is a deliberate trade-off (striped locks for
  high in-process throughput; distributed store for multi-host).

Full suite 631 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@moisesja

Copy link
Copy Markdown
Owner Author

Thanks — all three addressed in 78657bb. Full suite 631 green.

1. 🔴 maxEntries = 1 bound (real bug) — fixed

Exactly your analysis: Math.Max(1, …) left _lowWaterMark == _maxEntries for maxEntries = 1, so Evict() took Count − lowWater == 0 and the store grew by one per distinct pthid. Clamped to strictly below the cap:

_lowWaterMark = Math.Min(maxEntries - 1, Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10)));
  • maxEntries = 1 → lowWater 0 (evict all, add one → Count 1 — bound holds).
  • maxEntries = 21; 109; 10_0009_000 (unchanged for normal caps).

Added Budget_store_stays_bounded_even_at_a_cap_of_one: 200 distinct pthids at maxEntries = 1Count ≤ 1 (grows unbounded on the old code).

2. Warning test was vacuously true — fixed

Right — the handler no longer writes the general store for any code, so the old store.Get(...).Should().BeNull() proved nothing. Warning_report_does_not_increment_error_count now asserts against the budget: budget.Peek("thread-1").ErrorCount == 0 and budget.Count == 0 (a warning creates no budget entry at all).

3. Single global lock — documented

Added a remark on CascadeBudgetStore that the single _gate lock is a deliberate trade-off (problem reports are rare; striped-by-pthid locks for high in-process throughput, a distributed store for multi-host — same frontier as the eviction residual).

Ready for re-review.

@moisesja moisesja merged commit 85ce2c8 into main Jun 17, 2026
2 checks passed
@moisesja moisesja deleted the feat/security-cascade-guard branch June 17, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant