Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,33 @@ All notable changes to didcomm-dotnet are documented here. Format follows

## [Unreleased]

### Security — problem-report cascade-guard redesign (`feat/security-cascade-guard`)

Resolves two FR-PROTO-10 cascade-guard findings (GitHub issues #29, #36), addressed together since both
stem from the per-`pthid` error budget living on the dispatcher's general thread store. Passed a
PoC-backed adversarial red-team pass that caught — and drove the fix for — a concurrency bug in the
first cut. Full suite (631 tests) green.

- **Cascade budget moved to a dedicated, bounded store (#36).** The FR-PROTO-10 budget now lives in a
new `CascadeBudgetStore`, separate from the dispatcher's general thread store (which is flooded with
one entry per inbound `thid`). Previously a flood of cheap fresh `thid`s could evict — and reset to
zero — a victim `pthid`'s error budget, defeating the cascade-stop. The dedicated store **owns its
concurrency**: the increment + threshold check + trip decision is atomic under a single `pthid`-keyed
lock (the earlier "lock on the per-thread state object" seam broke under LRU eviction and could
double-emit the cascade-stop — fixed). Its LRU **prefers evicting not-yet-tripped entries**, so a
tripped thread's silenced decision survives a fresh-`pthid` flood without exempting entries from
eviction (still bounded). Registered as its own DI singleton so the budget persists regardless of the
handler's lifetime. Residual: a determined attacker can still pressure the bound with distinct-`pthid`
*error reports* (far costlier than the original cheap-`thid` reset); a distributed/persistent budget
is the path to a hard guarantee for horizontally-scaled hosts.
- **Unrepliable-stream counter/log flood capped (#29).** A sustained stream of unrepliable
(anoncrypt/from-less) over-threshold reports previously incremented the counter and emitted an Info
log line forever (the trip permanently deferred). The handler now clamps the count at `threshold + 1`
and stops per-message logging after the breach, while preserving the deferral so a later repliable
report still fires the cascade-stop.
- `ThreadState.ErrorCount`/`MaxErrorsNoticeSent` are retained as general-purpose per-thread fields but
are no longer used by the built-in cascade guard (which uses `CascadeBudgetStore`).

### Fixed — rotation JWT compliance

Resolves two Low-severity audit findings on the `from_prior` rotation path (GitHub issues #25, #26).
Expand Down
4 changes: 3 additions & 1 deletion samples/02-Cookbook/Sections/Section_U_ProblemReport.cs
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,9 @@ public static async Task RunAsync(CookbookContext ctx)

ctx.Narrator.Value("Cascade-stop code", ProblemReportApi.ReadCode(trip!.Reply!));
ctx.Narrator.Value("Cascade-stop pthid", trip.Reply!.Pthid);
ctx.Narrator.Value("Failing-thread errorCount", localStore.GetOrCreate(failingThread).ErrorCount);
// The breach count rides on the cascade-stop notice itself (the count is communicated to the
// peer, not read from internal state — the FR-PROTO-10 budget lives in a dedicated store, #36).
ctx.Narrator.Value("Cascade-stop comment", ProblemReportApi.RenderComment(trip.Reply!));
ctx.Narrator.Note("Beyond the trip the handler returns null for further reports on the same pthid (FR-PROTO-10).");
}
}
168 changes: 168 additions & 0 deletions src/DidComm.Core/Protocols/ProblemReport/CascadeBudgetStore.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
using System.Linq;

namespace DidComm.Protocols.ProblemReport;

/// <summary>
/// Bounded, thread-safe, process-local tracker for the FR-PROTO-10 cascade budget, kept
/// <strong>separate</strong> from the dispatcher's general thread-state store.
/// </summary>
/// <remarks>
/// <para>
/// The dispatcher writes one entry per inbound message's <c>thid</c> into the general store. Routing
/// the cascade budget through that same store let a flood of cheap fresh <c>thid</c>s evict — and
/// reset to zero — a victim <c>pthid</c>'s error budget, defeating the cascade guard (#36). This
/// tracker holds the budget separately, keyed only by the failing thread's <c>pthid</c> (so only
/// error reports create entries).
/// </para>
/// <para>
/// It <strong>owns its concurrency</strong>: <see cref="RecordErrorReport"/> does the whole
/// increment + threshold check + trip decision atomically under a single lock keyed by the pthid
/// string. (An earlier design locked on the per-pthid state object borrowed from an LRU store; under
/// eviction two concurrent callers could lock <em>different</em> instances for the same pthid and
/// double-emit the cascade-stop — the lock seam must not be an evictable value.) The single lock
/// serializes all callers; that is a deliberate trade-off — problem reports are rare events, so a
/// global lock is simpler and safe. A high-throughput in-process host would stripe locks by pthid; a
/// multi-host deployment supplies a distributed budget (same frontier as the eviction residual below).
/// </para>
/// <para>
/// Bounded at <see cref="DefaultMaxEntries"/> with approximate-LRU eviction that <em>prefers</em>
/// evicting not-yet-tripped entries, so a tripped thread's silenced decision survives a flood of
/// fresh pthids — without exempting tripped entries from eviction (if every entry is tripped the
/// oldest are still removed, so the store never grows without limit, per #36). Residual: a determined
/// attacker can still pressure the bound with distinct-pthid <em>error reports</em> (each does real
/// per-message work), but that is far more expensive than the original cheap-thid reset; a
/// distributed/persistent budget is the path to a hard guarantee for horizontally-scaled hosts.
/// </para>
/// </remarks>
public sealed class CascadeBudgetStore
{
/// <summary>Default cap on tracked failing-thread budgets.</summary>
public const int DefaultMaxEntries = 10_000;

private sealed class Entry
{
public int ErrorCount;
public bool NoticeSent;
public long LastTouched;
}

private readonly object _gate = new();
private readonly Dictionary<string, Entry> _entries = new(StringComparer.Ordinal);
private readonly int _maxEntries;
private readonly int _lowWaterMark;
private long _tick;

/// <summary>Create a cascade-budget tracker bounded at <paramref name="maxEntries"/> failing threads.</summary>
/// <param name="maxEntries">Cap on tracked budgets; must be positive. Defaults to <see cref="DefaultMaxEntries"/>.</param>
/// <exception cref="ArgumentOutOfRangeException"><paramref name="maxEntries"/> is not positive.</exception>
public CascadeBudgetStore(int maxEntries = DefaultMaxEntries)
{
if (maxEntries <= 0)
{
throw new ArgumentOutOfRangeException(
nameof(maxEntries), maxEntries, "Cascade-budget capacity must be > 0.");
}

_maxEntries = maxEntries;
// Trim toward ~90%, but ALWAYS strictly below the cap so Evict() actually frees a slot. A naive
// Math.Max(1, …) leaves lowWater == cap for maxEntries == 1 (1/10 == 0): Evict() removes
// Count − lowWater == 0 entries, the new entry is added anyway, and the store grows past its cap
// without bound. Clamp to maxEntries − 1 (PR #40 review).
_lowWaterMark = Math.Min(maxEntries - 1, Math.Max(1, maxEntries - Math.Max(1, maxEntries / 10)));
}

/// <summary>
/// Atomically record one inbound error problem-report on the failing thread named by
/// <paramref name="pthid"/> and return the FR-PROTO-10 cascade-guard decision.
/// </summary>
/// <param name="pthid">The failing thread's <c>pthid</c> (FR-PROTO-07, non-empty).</param>
/// <param name="threshold">The per-thread error budget (<see cref="ProblemReportOptions.CascadeThreshold"/>).</param>
/// <param name="repliable">Whether this report has a usable reply target (a cascade-stop can be addressed).</param>
/// <returns>
/// <see cref="CascadeStep.Emit"/> = build and send the cascade-stop; <see cref="CascadeStep.Log"/> =
/// the count advanced and should be logged; <see cref="CascadeStep.Count"/> = the (clamped) count.
/// </returns>
internal CascadeStep RecordErrorReport(string pthid, int threshold, bool repliable)
{
lock (_gate)
{
var entry = GetOrCreate(pthid);

// Notice already emitted → fully silent: no increment, no log, no work.
if (entry.NoticeSent)
return new CascadeStep(Emit: false, Log: false, Count: entry.ErrorCount);

// Clamp at threshold+1: once a thread has breached, stop incrementing AND stop logging per
// message — otherwise a sustained unrepliable (from-less) stream grows the counter and
// floods the log forever, the trip permanently deferred (#29). The breach is counted and
// logged exactly once.
bool atCap = entry.ErrorCount > threshold;
if (!atCap)
entry.ErrorCount++;
int count = entry.ErrorCount;

if (count > threshold)
{
// Breach. A cascade-stop needs a repliable target; an anoncrypt/addressless report
// DEFERS emission (NoticeSent stays false) so a later repliable report on the same
// pthid can fire it, while the counter stays clamped so the unrepliable stream can't flood.
if (!repliable)
return new CascadeStep(Emit: false, Log: !atCap, Count: count);

entry.NoticeSent = true;
return new CascadeStep(Emit: true, Log: !atCap, Count: count);
}

return new CascadeStep(Emit: false, Log: !atCap, Count: count);
}
}

private Entry GetOrCreate(string pthid) // caller holds _gate
{
if (_entries.TryGetValue(pthid, out var entry))
{
entry.LastTouched = ++_tick;
return entry;
}

if (_entries.Count >= _maxEntries)
Evict();

entry = new Entry { LastTouched = ++_tick };
_entries[pthid] = entry;
return entry;
}

private void Evict() // caller holds _gate
{
// Remove down to the low-water mark, preferring not-yet-tripped entries (NoticeSent=false
// sorts first) and then oldest-touched. This keeps a tripped thread's silenced decision sticky
// against a flood of fresh, non-tripped pthids — without exempting tripped entries (if all are
// tripped the oldest are still removed, so the store can never grow without limit, #36).
var victims = _entries
.OrderBy(kv => kv.Value.NoticeSent)
.ThenBy(kv => kv.Value.LastTouched)
.Take(_entries.Count - _lowWaterMark)
.Select(kv => kv.Key)
.ToArray();

foreach (var key in victims)
_entries.Remove(key);
}

/// <summary>Current number of tracked failing-thread budgets. Exposed for tests.</summary>
internal int Count
{
get { lock (_gate) return _entries.Count; }
}

/// <summary>Read a thread's current budget without mutating it. Exposed for tests.</summary>
internal (int ErrorCount, bool NoticeSent) Peek(string pthid)
{
lock (_gate)
return _entries.TryGetValue(pthid, out var e) ? (e.ErrorCount, e.NoticeSent) : (0, false);
}
}

/// <summary>The cascade-guard decision returned by <see cref="CascadeBudgetStore.RecordErrorReport"/>.</summary>
internal readonly record struct CascadeStep(bool Emit, bool Log, int Count);
67 changes: 29 additions & 38 deletions src/DidComm.Core/Protocols/ProblemReport/ProblemReportHandler.cs
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,8 @@ namespace DidComm.Protocols.ProblemReport;
/// <para>
/// Problem-reports are informational by design — they tell the peer something went wrong on
/// a thread, but they do not themselves require an application-level reply. So this handler's
/// "happy path" is to read the code, increment <see cref="DidComm.Threading.ThreadState.ErrorCount"/>
/// for error-sorter codes, and return <c>null</c>.
/// "happy path" is to read the code, record it against the failing thread's FR-PROTO-10 budget in
/// the dedicated <see cref="CascadeBudgetStore"/> (#36) for error-sorter codes, and return <c>null</c>.
/// </para>
/// <para>
/// The one exception is the FR-PROTO-10 cascade guard: once a thread accumulates more than
Expand All @@ -25,16 +25,27 @@ namespace DidComm.Protocols.ProblemReport;
public sealed class ProblemReportHandler : IProtocolHandler
{
private readonly ProblemReportOptions _options;
private readonly CascadeBudgetStore _cascadeBudget;
private readonly ILogger<ProblemReportHandler>? _logger;

/// <summary>Construct the handler with explicit options + optional logger.</summary>
/// <summary>Construct the handler with explicit options, an optional dedicated cascade-budget store, and an optional logger.</summary>
/// <param name="options">Handler options (cascade threshold, etc.). Validated eagerly at ctor so a misconfig surfaces at DI resolution rather than as silently-degraded cascade-guard behaviour.</param>
/// <param name="cascadeBudget">
/// The dedicated FR-PROTO-10 cascade-budget store (#36), kept separate from the dispatcher's general
/// thread store. Defaults to a fresh process-local instance; the DI setup registers one as a
/// singleton so the budget's lifetime is decoupled from the handler's (a handler accidentally
/// registered non-singleton would otherwise lose its budget per request).
/// </param>
/// <param name="logger">Optional structured logger.</param>
public ProblemReportHandler(IOptions<ProblemReportOptions> options, ILogger<ProblemReportHandler>? logger = null)
public ProblemReportHandler(
IOptions<ProblemReportOptions> options,
CascadeBudgetStore? cascadeBudget = null,
ILogger<ProblemReportHandler>? logger = null)
{
ArgumentNullException.ThrowIfNull(options);
_options = options.Value;
_options.Validate();
_cascadeBudget = cascadeBudget ?? new CascadeBudgetStore();
_logger = logger;
}

Expand Down Expand Up @@ -70,46 +81,26 @@ public ProblemReportHandler(IOptions<ProblemReportOptions> options, ILogger<Prob
if (string.IsNullOrEmpty(message.Pthid))
return Task.FromResult<Message?>(null);

var failingThread = context.Threads.GetOrCreate(message.Pthid);
// FR-PROTO-10 cascade guard. The whole increment + threshold check + trip decision is done
// ATOMICALLY inside the dedicated, bounded cascade-budget store (#36) — separate from the
// dispatcher's general thread store (which a cheap-thid flood evicts) and self-synchronizing on
// a stable pthid-keyed lock (so concurrent reports on the same pthid can't double-emit, even
// under the store's LRU eviction). A report is repliable iff it has a usable reply target.
bool repliable = !string.IsNullOrEmpty(message.From)
&& message.To is { Count: > 0 } && !string.IsNullOrEmpty(message.To[0]);
var step = _cascadeBudget.RecordErrorReport(message.Pthid, _options.CascadeThreshold, repliable);

// FR-PROTO-10 cascade guard. The increment + threshold check + trip-emission decision
// must be atomic per-thread, otherwise concurrent inbound reports on the same pthid
// race on `ErrorCount` and either double-emit the cascade-stop or skip the trip entirely.
// Lock on the ThreadState instance — it is shared between callers of the same pthid via
// IThreadStateStore.GetOrCreate, so it is the natural lock seam.
bool shouldEmit = false;
int errorCountSnapshot = 0;
lock (failingThread)
if (step.Log)
{
// Truly silent after the trip has fired: no increment, no log, no work. (Previously
// every post-trip report still mutated state and emitted an Information log line,
// turning "silently ignored" into a DoS-shaped log-flood + unbounded counter growth.)
if (failingThread.MaxErrorsNoticeSent)
return Task.FromResult<Message?>(null);

failingThread.ErrorCount++;
errorCountSnapshot = failingThread.ErrorCount;
_logger?.LogInformation(
"Inbound problem-report on thread {Pthid}: code={Code}, errorCount={Count}/{Threshold}",
message.Pthid, code.Value, errorCountSnapshot, _options.CascadeThreshold);

if (errorCountSnapshot > _options.CascadeThreshold)
{
// We need a repliable target to emit the cascade-stop. If this report is anoncrypt
// (no `from`) or addressless, DEFER the trip — leave MaxErrorsNoticeSent=false so
// the next repliable report on the same pthid can fire the trip. Without this
// deferral, an unrepliable report sets the counter past the threshold; the next
// report falls into the silent-ignore branch above and the cascade-stop is never
// emitted (FR-PROTO-10 silently broken).
if (string.IsNullOrEmpty(message.From) || message.To is not { Count: > 0 } || string.IsNullOrEmpty(message.To[0]))
return Task.FromResult<Message?>(null);

failingThread.MaxErrorsNoticeSent = true;
shouldEmit = true;
}
message.Pthid, code.Value, step.Count, _options.CascadeThreshold);
}

if (!shouldEmit) return Task.FromResult<Message?>(null);
if (!step.Emit)
return Task.FromResult<Message?>(null);

var errorCountSnapshot = step.Count;

// Build + log the cascade-stop OUTSIDE the per-thread lock — MessageBuilder allocations
// and logging shouldn't contend on the lock.
Expand Down
13 changes: 6 additions & 7 deletions src/DidComm.Core/Threading/ThreadState.cs
Original file line number Diff line number Diff line change
Expand Up @@ -24,17 +24,16 @@ public sealed class ThreadState
public IReadOnlyList<string>? AcceptLang { get; set; }

/// <summary>
/// Count of problem-reports produced on this thread. Used by the FR-PROTO-10 cascade
/// guard (Phase 6.2c) to emit <c>e.p.req.max-errors-exceeded</c> once the per-thread
/// threshold trips and then stop responding on the thread. Implementations / tests MAY
/// increment this directly; concurrent callers MUST hold the per-instance lock
/// (<see cref="ThreadState"/> exposes itself as the lock seam — see remarks).
/// General-purpose mutable per-thread error counter for consumer use. <b>Note:</b> the built-in
/// FR-PROTO-10 cascade guard no longer uses this field — it keeps its budget in a dedicated,
/// separate store (<c>CascadeBudgetStore</c>) so a flood of cheap thids in the general store can't
/// evict and reset it (#36). Retained for consumers who track their own per-thread counts.
/// </summary>
public int ErrorCount { get; set; }

/// <summary>
/// Set to <c>true</c> the moment the FR-PROTO-10 cascade-stop notice has been emitted on
/// this thread, so the handler can short-circuit subsequent reports without further work.
/// General-purpose per-thread flag for consumer use. <b>Note:</b> as with <see cref="ErrorCount"/>,
/// the built-in cascade guard no longer reads or writes this — see <c>CascadeBudgetStore</c> (#36).
/// </summary>
public bool MaxErrorsNoticeSent { get; set; }

Expand Down
3 changes: 3 additions & 0 deletions src/DidComm.Extensions.DependencyInjection/DidCommBuilder.cs
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,9 @@ public DidCommBuilder AddBuiltInProtocols()
AddProtocol<EmptyHandler>();
AddProtocol<DiscoverFeaturesHandler>();
AddProtocol<ProblemReportHandler>();
// The FR-PROTO-10 cascade budget (#36) lives in a dedicated store registered as its OWN
// singleton, so the budget persists even if the handler is (mis)registered non-singleton.
Services.TryAddSingleton<CascadeBudgetStore>();
// ProblemReportOptions is bound via the standard Options pattern; default ctor's
// CascadeThreshold = 5 is the SICPA-python-matching default per locked decision.
Services.AddOptions<ProblemReportOptions>();
Expand Down
Loading
Loading