Skip to content

fix(security): remediate Medium-severity audit cluster (#17–#21)#37

Merged
moisesja merged 4 commits into
mainfrom
feat/security-medium-cluster
Jun 17, 2026
Merged

fix(security): remediate Medium-severity audit cluster (#17–#21)#37
moisesja merged 4 commits into
mainfrom
feat/security-medium-cluster

Conversation

@moisesja

@moisesja moisesja commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

Remediates the five Medium-severity findings (#17#21) from the multi-agent security & compliance audit, plus two lower-severity defects on the same receive code path (#28, #33). Each fix ships with negative + regression tests, and each was then put through an adversarial red-team pass (one break-it agent per fix, remote-attacker threat model) which confirmed #17/#18/#19 hold and surfaced two further hardening items (folded in) and two deeper residuals (filed as #35/#36).

Full suite: 600 tests green (492 Core + 108 Interop); dotnet build clean under warnings-as-errors.

Fixes

# Fix
#17 EnvelopeReader rejects illegal envelope compositions — enforces the DIDComm v2.1 receive grammar anoncrypt? authcrypt? sign? plaintext (auth/anon-flag aware) before any content/consistency processing. Adds PRD FR-ENV-04a documenting the receive-acceptance set (broader than the FR-ENV-02 emit set; admits Appendix C.3 anoncrypt(authcrypt(sign))).
#18 NetDidKeyService authorizes a kid only when its id-subject and controller resolve to the asserted DID (FR-CONSIST-06 / DD-01), walking raw verification methods so controller is no longer dropped. did:key/did:peer keep authorizing.
#19 FromPriorValidator normalizes every malformed-JWT parse failure (FormatException/InvalidOperationException/ArgumentException/JsonException/KeyNotFoundException) to a generic ProtocolException out of UnpackAsync.
#20 / #28 / #33 ASP.NET Core HTTP receive collapses every failure to one opaque 400 with empty body (reason logged server-side) — closes the decryption/recipient-kid oracle, the 500 + dev-stack-trace escape, and the echoed handler-bug detail. 202/415/413 unchanged.
#21 InMemoryThreadStateStore bounded (DefaultMaxEntries = 10_000, approximate-LRU) — converts an unbounded-growth memory-exhaustion DoS into a few-MB bound. Secure-by-default.

Red-team hardening (folded into this PR)

Residuals filed (NOT closed here)

Test plan

Closes #17, closes #18, closes #19, closes #20, closes #21, closes #28, closes #33

🤖 Generated with Claude Code

moisesja and others added 3 commits June 16, 2026 18:57
Fixes the five Medium-severity findings from the multi-agent security &
compliance audit, plus two lower-severity defects on the shared receive
code path (#28, #33). Full suite (598 tests) green; warnings-as-errors clean.

- #17 EnvelopeReader rejects illegal envelope compositions: enforces the
  DIDComm v2.1 receive grammar `anoncrypt? authcrypt? sign? plaintext`
  (auth/anon-flag aware) before content/consistency processing. Adds PRD
  FR-ENV-04a documenting the receive-acceptance set (broader than the
  FR-ENV-02 emit set; admits Appendix C.3 anoncrypt(authcrypt(sign))).
- #18 NetDidKeyService authorizes a kid only when its id-subject AND
  controller resolve to the asserted DID (FR-CONSIST-06 / DD-01), walking
  raw verification methods so `controller` is no longer dropped.
- #19 FromPriorValidator normalizes all malformed-JWT parse failures
  (FormatException/InvalidOperationException/ArgumentException/JsonException/
  KeyNotFoundException) to a generic ProtocolException out of UnpackAsync.
- #20/#28/#33 ASP.NET Core HTTP receive collapses every failure to one
  opaque 400 with empty body (reason logged server-side), closing the
  decryption/recipient-kid oracle, the 500 + dev-stack-trace escape, and
  the echoed handler-bug detail. 202/415/413 paths unchanged.
- #21 InMemoryThreadStateStore is bounded (DefaultMaxEntries=10_000,
  approximate-LRU), converting an unbounded-growth memory-exhaustion DoS
  into a few-MB bound. Secure-by-default; no DI wiring change.

Closes #17, #18, #19, #20, #21, #28, #33

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
An adversarial red-team pass (one break-it agent per fix, remote-attacker
threat model) confirmed #17/#18/#19 hold and surfaced two exploitable flaws
in the new #20/#21 code, fixed here, plus two deeper residuals filed as
follow-ups (#35 timing oracle, #36 cascade-guard eviction bypass).

- #20 callback-overload 500 oracle: the inline-callback overload ran
  `onReceive` outside the unpack try, so a throwing callback escaped as a 500
  — distinguishing "unpacked successfully" from "rejected" and re-exposing an
  unpack-success oracle. Now wrapped: a post-unpack callback fault is logged
  and the receive still answers 202 (one-way per FR-TRN-10).
- #21 concurrent eviction-storm: GetOrCreate had no mutual exclusion around
  the over-cap eviction, so N concurrent inserters each ran a full O(n log n)
  snapshot-sort (measured ~11x redundant work at 16 threads) — a CPU-
  amplification DoS. Eviction is now single-flight (Interlocked guard); only
  one pass runs at a time, restoring the intended amortized cost.

Tests: + concurrent-eviction-storm regression (bounds eviction passes under
8-thread flood) and + throwing-callback-still-202. Full suite 600 green.

Residuals filed, not closed by this PR: #35 (HTTP receive timing side-channel,
held-vs-unheld recipient kid — needs a response-time floor), #36 (FR-PROTO-10
cascade guard defeatable by forcing thread-state eviction; see also #29).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@moisesja moisesja self-assigned this Jun 17, 2026
@moisesja moisesja added this to the 1.0.0 GA milestone Jun 17, 2026
@moisesja

Copy link
Copy Markdown
Owner Author

Review — strong PR, one issue re-opens the oracle it sets out to close

This is high-quality work: each fix is narrow, fail-closed, documented with FR/issue refs, and the self-run adversarial pass (callback-500, eviction-storm) is exactly the rigor a security cluster deserves. I traced the envelope grammar gate against the illegal compositions (anoncrypt(anoncrypt), sign(anoncrypt), authcrypt(anoncrypt), double-sign) and the legal anoncrypt(authcrypt(sign)) — the walk is correct and the depth-4 bound admits the legal max. The IsControlledBy rule is fail-closed (null DidSubjectOf → reject). CI is green on ubuntu + windows.

One blocking concern before merge, plus two minor notes.

🔴 Blocking — catch (OperationCanceledException) { throw; } re-exposes the 400-vs-500 oracle via did:webvh

Both receive overloads rethrow every OperationCanceledException on the assumption it's a client abort:

That assumption doesn't hold. HttpClient.Timeout surfaces as TaskCanceledException : OperationCanceledException with the caller's token not cancelled — and UnpackAsync resolves the attacker-controlled sender DID (and from_prior issuer) to fetch verification keys. I confirmed the chain in the dependency the PRD steers users toward:

net-didDefaultWebVhHttpClient does new HttpClient() (default 100s Timeout) and catches only HttpRequestException:

// net-did/src/NetDid.Method.WebVh/DefaultWebVhHttpClient.cs:34
using var response = await _httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead, ct);
// ...
catch (HttpRequestException) { return null; }   // TaskCanceledException is NOT caught here

Exploit: peer sends a signed/authcrypt envelope whose sender DID is did:webvh:<attacker-host>, where the host accepts the TCP connection but never responds. Resolution hangs until the 100s HttpClient timeout fires → TaskCanceledException → propagates out of UnpackAsync → caught at line 91/180 → rethrown (because httpContext.RequestAborted is not cancelled) → unhandled → 500 (with the dev exception page if ASPNETCORE_ENVIRONMENT=Development). A normal undecryptable envelope returns 400. That's a clean, attacker-driven 400-vs-500 partition — exactly the "500 + dev-stack-trace escape" #33/#20 set out to eliminate — and it doubles as a ~100s-per-probe hang (slowloris amplification).

It's gated on did:webvh being wired (did:key/did:peer do no I/O), so it's latent in the default config — but the PRD recommends did:webvh as the did:web replacement, so a deployment that follows the guidance is exposed.

Fix — gate the rethrow on the request token so only genuine client aborts propagate and a downstream-timeout OCE collapses to the uniform 400:

catch (OperationCanceledException) when (httpContext.RequestAborted.IsCancellationRequested)
{
    throw; // genuine client abort — propagate
}
catch (Exception ex)
{
    return NormalizedReceiveRejection(logger, ex); // downstream-timeout OCE included
}

Apply at both unpack sites (91, 180); the callback site at line 112 is lower-risk but worth the same treatment for consistency. A regression test (resolver throws TaskCanceledException with RequestAborted un-cancelled → assert 400, not 500) would lock it in — the current suite covers the throwing-callback case but not this one.

🟡 Minor — handler bugs now return 400 instead of 500

In MapDidCommEndpoint, a handler InvalidOperationException (FR-THR-04 rule-2 — a server-side bug) now collapses to 400 via NormalizedReceiveRejection, where it used to be 500. Intentional for oracle-uniformity and it's logged at LogError, but ops dashboards/alerts that key on 5xx will go blind to receiver-side handler bugs. Fine as a deliberate tradeoff — just call it out in the PR/CHANGELOG, and consider an explicit failure metric so operators retain a signal.

🟡 Nit — LRU doc claim is slightly stronger than the guarantee

InMemoryThreadStateStore remarks say an active thread "is touched on every access and so is never evicted while active." Only true within the eviction window — a legit thread idle while a peer floods >low-water fresh thids is among the oldest and gets evicted (this is residual #36). Suggest softening to "...never evicted while actively receiving" so it doesn't read as a hard guarantee.


Everything else looks merge-ready. The blocking item is a small, contained change. Happy to re-review once the OCE rethrow is gated.

Addresses the blocking review finding: both HTTP receive overloads rethrew
EVERY OperationCanceledException assuming a client abort. A downstream
DID-resolution timeout surfaces as TaskCanceledException : OCE with the
request token NOT cancelled (e.g. net-did's webvh client hits its 100s
HttpClient.Timeout against an attacker-controlled host that hangs). That
escaped as an unhandled 500 (+ dev stack trace) while a normal undecryptable
envelope returns 400 — a clean 400-vs-500 oracle plus a ~100s slowloris hang,
re-opening exactly what #20/#33 closed.

- Gate all three OCE catches (overload-1 unpack + callback, overload-2
  unpack+dispatch) with `when (httpContext.RequestAborted.IsCancellationRequested)`
  so only a genuine client abort propagates; a downstream-timeout OCE falls
  through to the uniform 400 (unpack) / logged-202 (callback).
- Regression test: a resolver that throws TaskCanceledException with
  RequestAborted un-cancelled now returns 400, not 500.

Minor review notes:
- CHANGELOG: call out that handler-bug InvalidOperationException now returns
  400 (logged at Error) instead of 500, so operators alert on the log not 5xx.
- Soften the InMemoryThreadStateStore LRU doc: "not evicted while actively
  receiving" holds only within the eviction window (residual #36).

Full suite 601 green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@moisesja

Copy link
Copy Markdown
Owner Author

Thanks for the catch — the did:webvh-timeout 400-vs-500 oracle is real and I've fixed it in 451c4e0.

🔴 Blocking — OCE rethrow gated on the request token

All three OCE catches now use catch (OperationCanceledException) when (httpContext.RequestAborted.IsCancellationRequested):

  • overload-1 unpack (DidCommEndpointRouteBuilderExtensions.cs:91)
  • overload-1 callback (:112)
  • overload-2 unpack+dispatch (:183)

So only a genuine client abort propagates; a downstream-resolution TaskCanceledException (request token not cancelled) falls through to the uniform 400 on the unpack sites and to logged-202 on the callback site. Exactly your suggested shape.

Regression test addedDownstream_resolution_timeout_returns_400_not_a_500_oracle: an IDidResolver that throws TaskCanceledException with RequestAborted un-cancelled now yields 400 (empty body), not 500. It reuses the known-good anoncrypt-for-Bob envelope (proven fully-unpackable by the throwing-callback test), so the only thing producing the 400 is the resolver OCE — not an incidental decrypt failure.

🟡 Handler-bug 400-vs-500

Called out in the CHANGELOG: handler InvalidOperationException (FR-THR-04 rule 2) now returns 400 for response uniformity and is logged at Error on the DidComm.AspNetCore category — operators should alert on that log rather than a 5xx rate. (Left an explicit failure-metric as a future enhancement rather than bolting a counter into this PR.)

🟡 LRU doc nit

Softened — the remark now says a thread "is not evicted" only "within the eviction window," and explicitly notes an idle thread can age out under a >(cap − low-water) flood (residual #36).

Full suite 601 green. Ready for re-review.

@moisesja moisesja merged commit f07a1c3 into main Jun 17, 2026
2 checks passed
@moisesja moisesja deleted the feat/security-medium-cluster branch June 17, 2026 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment