Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,42 @@ All notable changes to didcomm-dotnet are documented here. Format follows

## [Unreleased]

### Security — HTTP receive timing side-channel (`feat/security-receive-timing-floor`)

Closes GitHub issue **#35** (High) — the timing residual the #20 red-team explicitly filed (see
`tasks/lessons.md` L-025). The #20 fix made every HTTP receive failure return a uniform bodyless
`400`, closing the response-content/status oracle, but response **time** still partitioned those 400s:
an envelope addressed to a recipient key the agent holds runs the full ECDH/AEAD path (~360 µs) while
one addressed to a key it does not hold fast-fails before any crypto (~180 µs), so a single timed
request enumerated which recipient keys the agent holds (FR-API-07). A PoC-backed adversarial pass
caught — and the fix folds in — two further residuals.

- **Constant-time rejection floor.** New `DidCommReceiveOptions.ReceiveRejectionFloor` (default
**5 ms**, set `TimeSpan.Zero` to disable). The HTTP receive handler times the unpack window and,
before returning the uniform `400`, pads with a non-cancelable timer-backed delay up to the floor, so
the rejection time is independent of how far envelope processing got. The fast-fail "unheld kid" path
and the full-crypto "held kid + AEAD fail" path now both finish under the floor and pad to the same
duration. **Behavior change:** rejected requests now take ≥ the floor; legitimate received messages
answer `202` and are never padded (the hot path is untouched), and `415`/`413` pre-decrypt rejections
are unchanged. The cost is paid only on rejected — typically hostile or malformed — traffic.
- **Red-team Finding 1 (folded in): floor measured after the body read.** The pad clock starts *after*
`ReadCappedAsync`, not at handler entry — otherwise a peer could pad the envelope toward
`MaxReceiveBytes` so the (kid-independent) body read alone exhausts the floor, collapse the pad to
zero, and re-expose the gap. Covered by `Large_body_rejection_is_still_padded_to_the_floor`.
- **Red-team Finding 2 (documented residual, not closed by a fixed floor).** The held-only path that
decrypts and then does **network** DID resolution (authcrypt sender / FR-CONSIST-06 / `from_prior`)
can run longer than the floor — e.g. an attacker authcrypts to a guessed kid signed by an
attacker-controlled, deliberately-slow `did:webvh` sender, making a held response visibly exceed the
floor while an unheld one stays at it. That residual is network-bound and unbounded, so a fixed floor
cannot close it without an absurd value; the API docs direct operators to authentication / a
rate-limiter in front of the receive endpoint (the deployment tradeoff #35 calls out). Tracked as #44.
- **WebSocket receive is document-only.** The WS loops log-and-discard with nothing written back, so
there is no per-message response to time; the floor is intentionally not applied there (noted in code).
- **Root cause tracked upstream.** The local-crypto gap originates in `JweParser.Parse` /
`FindPresent` (DataProofsDotnet.Jose), which fast-fails before ECDH when no recipient kid is held.
The floor is a compensating edge control; the durable constant-time-recipient-selection fix is filed
upstream (`moisesja/dataproofs-dotnet#12`) and tracked here in #42.

### Security — protocol & transport hardening (`feat/security-protocol-hardening`)

Resolves four scattered Low/Info findings (GitHub issues #27, #30, #31, #34); #32 closed as
Expand Down
71 changes: 68 additions & 3 deletions src/DidComm.AspNetCore/DidCommEndpointRouteBuilderExtensions.cs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
using System.Buffers;
using System.Diagnostics;
using System.Net.Mime;
using System.Net.WebSockets;
using DidComm.Exceptions;
Expand Down Expand Up @@ -83,6 +84,13 @@ public static IEndpointConventionBuilder MapDidCommEndpoint(

var logger = httpContext.RequestServices.GetService<ILoggerFactory>()?.CreateLogger("DidComm.AspNetCore.Receive");

// #35: start the constant-time clock HERE — after the body has been read — so the floor
// budgets only the unpack window where the recipient-kid-dependent timing lives. The body
// read is kid-independent (the peer chose its size), so charging it against the floor would
// let a peer pad the request out to MaxReceiveBytes, exhaust the floor before any crypto,
// and re-expose the held-vs-unheld gap (PR #43 red-team).
var startTimestamp = Stopwatch.GetTimestamp();

UnpackResult unpacked;
try
{
Expand All @@ -99,7 +107,7 @@ public static IEndpointConventionBuilder MapDidCommEndpoint(
// downstream DID-resolution timeout (TaskCanceledException : OperationCanceledException
// with the request token NOT cancelled, e.g. an attacker-controlled did:webvh host that
// hangs). Only a real client abort is rethrown above, so there is no 400-vs-500 oracle.
return NormalizedReceiveRejection(logger, ex);
return await NormalizedReceiveRejectionAsync(logger, ex, startTimestamp, receiveOptions.ReceiveRejectionFloor).ConfigureAwait(false);
}

// The message was received and unpacked; HTTP receive is one-way / fire-and-forget
Expand Down Expand Up @@ -170,6 +178,11 @@ public static IEndpointConventionBuilder MapDidCommEndpoint(
return Results.StatusCode(StatusCodes.Status413PayloadTooLarge);
}

// #35: start the constant-time clock after the (kid-independent) body read, covering only
// the unpack+dispatch window where the recipient-kid-dependent timing lives — see the
// matching note in the callback overload above.
var startTimestamp = Stopwatch.GetTimestamp();

// Unpack AND dispatch share one opaque exit: a handler bug (FR-THR-04 InvalidOperationException),
// an unpack rejection (crypto/malformed/consistency/resolution), and a downstream
// DID-resolution timeout (TaskCanceledException with the request token NOT cancelled) all
Expand All @@ -188,7 +201,7 @@ public static IEndpointConventionBuilder MapDidCommEndpoint(
}
catch (Exception ex)
{
return NormalizedReceiveRejection(logger, ex);
return await NormalizedReceiveRejectionAsync(logger, ex, startTimestamp, receiveOptions.ReceiveRejectionFloor).ConfigureAwait(false);
}
});
}
Expand Down Expand Up @@ -344,6 +357,11 @@ private static async Task ReceiveLoopWithDispatchAsync(

var packed = System.Text.Encoding.UTF8.GetString(ms.ToArray());

// #35: the recipient-kid timing side-channel that the HTTP receive path pads against
// (see DelayToRejectionFloorAsync) does NOT apply here — a failed unpack is logged and
// discarded with NOTHING written back on the socket, so there is no per-message response
// for a peer to time. We deliberately do not pad the discard: it would add latency to the
// whole multiplexed message stream on the connection for a far weaker, noisier signal.
UnpackResult unpacked;
try
{
Expand Down Expand Up @@ -512,8 +530,18 @@ private static void LogOutcome(ILogger? logger, DispatchOutcome outcome, bool sa
/// status code or response text as a decryption / recipient-kid / DID-resolution oracle (issues
/// #20/#28/#33). The detailed reason is logged server-side only. The WebSocket paths already take
/// this posture (log-and-discard, echo nothing); this brings the HTTP paths to parity.
/// <para>
/// #35: the uniform body/status alone is not enough — the response <em>time</em> still partitioned
/// the failures (a held recipient kid runs the full ECDH/AEAD path; an unheld one fast-fails before
/// any crypto), so a single timed request enumerated which keys the agent holds. Before returning
/// the 400 we therefore pad the response out to <paramref name="rejectionFloor"/> measured from
/// <paramref name="startTimestamp"/> (captured after the body read, so the floor budgets only the
/// unpack window), making the rejection time independent of the crypto work done. See
/// <see cref="DelayToRejectionFloorAsync"/>.
/// </para>
/// </summary>
private static IResult NormalizedReceiveRejection(ILogger? logger, Exception ex)
private static async Task<IResult> NormalizedReceiveRejectionAsync(
ILogger? logger, Exception ex, long startTimestamp, TimeSpan rejectionFloor)
{
if (ex is DidCommException)
{
Expand All @@ -527,9 +555,43 @@ private static IResult NormalizedReceiveRejection(ILogger? logger, Exception ex)
logger?.LogError(ex, "DidComm receive: inbound processing failed ({Reason}).", ex.GetType().Name);
}

await DelayToRejectionFloorAsync(startTimestamp, rejectionFloor).ConfigureAwait(false);
return Results.StatusCode(StatusCodes.Status400BadRequest);
}

/// <summary>
/// #35: wait until at least <paramref name="floor"/> has elapsed since <paramref name="startTimestamp"/>
/// (captured after the body read, so the floor budgets only the unpack window) before the caller
/// returns its rejection. This pins the cheap, universal enumeration probe — garbage ciphertext to a
/// guessed kid — to a constant time: the fast-fail "unheld kid" path and the full-crypto "held kid +
/// AEAD fail" path both finish well under the floor and so are padded to the same duration.
/// <para>
/// The wait is deliberately <b>not</b> cancelable: a cancelable delay would let a peer abort the
/// connection early to skip the pad and re-expose the timing gap. <see cref="Task.Delay(TimeSpan)"/>
/// is timer-backed and holds no thread, so padding a soon-to-be-rejected request is cheap.
/// </para>
/// <para>
/// LIMITATION (documented, not fixed by a fixed floor): a fixed floor only masks failures that
/// finish <em>under</em> it. The held-only path that decrypts and then does network DID resolution
/// (authcrypt sender / FR-CONSIST-06 / from_prior) can run <em>longer</em> than the floor — and an
/// attacker who can drive that (e.g. authcrypt to a guessed kid signed by an attacker-controlled,
/// deliberately-slow did:webvh sender) sees "response ≫ floor" for a held kid versus "response ≈
/// floor" for an unheld one. That residual is unbounded (network latency), so a fixed floor cannot
/// close it without an absurd value; mitigate it with auth / a rate-limiter in front of the receive
/// endpoint (the deployment tradeoff the #35 issue calls out, tracked in #44). The durable fix for
/// the local-crypto gap is constant-time recipient-kid selection in the JWE parser, tracked upstream.
/// </para>
/// </summary>
internal static async Task DelayToRejectionFloorAsync(long startTimestamp, TimeSpan floor)
{
if (floor <= TimeSpan.Zero)
return;

var remaining = floor - Stopwatch.GetElapsedTime(startTimestamp);
if (remaining > TimeSpan.Zero)
await Task.Delay(remaining).ConfigureAwait(false);
}

private static async Task ReceiveLoopAsync(
System.Net.WebSockets.WebSocket socket,
DidCommClient client,
Expand Down Expand Up @@ -570,6 +632,9 @@ private static async Task ReceiveLoopAsync(
// FR-TRN-09: the reassembled bytes are exactly one packed envelope.
var packed = System.Text.Encoding.UTF8.GetString(ms.ToArray());

// #35: no constant-time pad here (unlike the HTTP 400 path) — a failed unpack is logged
// and discarded with nothing written back, so there is no per-message response a peer can
// time to enumerate held recipient kids.
try
{
var unpacked = await client.UnpackAsync(packed, ct).ConfigureAwait(false);
Expand Down
36 changes: 36 additions & 0 deletions src/DidComm.AspNetCore/DidCommReceiveOptions.cs
Original file line number Diff line number Diff line change
Expand Up @@ -27,4 +27,40 @@ public sealed class DidCommReceiveOptions
/// (e.g. chat samples) set this to <c>true</c> per endpoint.
/// </summary>
public bool AllowSameSocketReplies { get; set; }

/// <summary>
/// Minimum wall-clock time the HTTP receive endpoint takes before returning the uniform
/// <c>400</c> rejection. The handler times itself from entry and, before answering 400, waits
/// out the remainder of this floor so the response time no longer reveals how far envelope
/// processing got. This closes the recipient-kid timing side-channel (#35): without it, an
/// envelope addressed to a recipient key the server holds runs the full ECDH/AEAD path (~360 µs)
/// while one addressed to a key it does NOT hold fast-fails (~180 µs) — a single timed request
/// then tells an attacker which recipient keys the agent holds, the same enumeration the uniform
/// 400 body (#20) set out to remove. Padding every rejection up to a fixed floor removes that
/// gap (FR-API-07).
///
/// Applies to the 400 path only. A successfully received message answers 202 and is never
/// padded (it stays full-throughput); the 415 / 413 pre-decrypt rejections carry a different
/// status code and reveal nothing about which keys are held, so they are not padded either. The
/// cost is therefore paid only on rejected — typically hostile or malformed — traffic. The floor
/// is measured from just after the request body is read, so a peer cannot exhaust it by padding
/// the body out to the size limit.
///
/// Keep this small (single-digit milliseconds): 5 ms already dominates the local-crypto spread
/// with margin, and the wait is backed by a timer (it holds no thread). Defaults to 5 ms; set
/// <see cref="TimeSpan.Zero"/> to disable the floor entirely. On Windows, <see cref="Task.Delay(TimeSpan)"/>
/// rounds up to the system timer resolution (~15 ms by default), so values below that are silently
/// floored to it — this does not weaken the security property (15 ms still dominates the µs-scale
/// crypto gap) but means rejected requests cost ~15 ms there rather than the configured 5 ms.
///
/// <para><b>What this does and does not close.</b> The floor closes the cheap, universal probe —
/// an unauthenticated peer sending garbage ciphertext to guessed recipient kids and timing the
/// 400 to learn which keys the agent holds. It does NOT, on its own, close the held-only path
/// where a decryptable envelope triggers network DID resolution that runs longer than the floor
/// (an attacker-controlled slow sender DID makes the held response visibly exceed the floor while
/// an unheld one stays at it). That residual is network-bound and unbounded; close it with
/// authentication / a rate-limiter in front of the endpoint, not by inflating this value — see the
/// tracking issue #44.</para>
/// </summary>
public TimeSpan ReceiveRejectionFloor { get; set; } = TimeSpan.FromMilliseconds(5);
}
27 changes: 27 additions & 0 deletions tasks/lessons.md
Original file line number Diff line number Diff line change
Expand Up @@ -548,3 +548,30 @@ Format per entry:
validated URL you surface to a consumer, return the canonical `AbsoluteUri` and reject userinfo. When a
comment claims a fix "mirrors" an existing guard, verify parity — the OOB redirect guard is the scheme
half of TraceObserver's, NOT its allowlist; say so rather than overclaim.

## L-030 — A constant-time floor must cover ONLY the secret-dependent window, and a fixed floor cannot mask an unbounded tail.

- **Lesson:** Closing the #25/L-025 timing residual (#35) with a response-time floor took two red-team
iterations to get right. (a) **Floor the right window.** The first cut captured the start timestamp at
HTTP handler entry — *before* the up-to-`MaxReceiveBytes` (1 MiB) body read. Body-read time is
attacker-controlled and kid-independent, but it was charged against the floor, so a peer padding the
envelope toward the size cap made the read alone exhaust the floor (`remaining ≤ 0`), the pad collapsed
to zero, and the held-vs-unheld crypto gap was re-exposed. Fix: start the clock *after* the body read,
budgeting only the unpack window where the secret-dependent timing actually lives. (b) **A fixed floor
only masks paths that finish UNDER it.** The held-only decrypt-then-network-DID-resolution path
(authcrypt sender / FR-CONSIST-06 / `from_prior`) can run *longer* than any sane floor — an
attacker-controlled slow `did:webvh` sender turns the floor into a clean threshold: "response ≫ floor ⇒
held". My first draft comment rationalized this away as "unreachable for an unheld kid, so masked" —
which is exactly backwards: unreachable-for-the-other-class IS the discriminator. Honest disposition:
document it as an unbounded residual and point operators at auth / a rate-limiter (the deployment
tradeoff the issue itself named), don't claim closure.
- **Why:** A timing defense is only as good as (1) the window it measures — anything constant-time-padded
must exclude attacker-controlled, secret-independent preamble, or the attacker inflates the preamble to
evict the pad; and (2) the assumption that the padded work finishes under the floor — a network/IO tail
on the secret-holding branch breaks that and a fixed floor can't fix unbounded latency.
- **How to apply:** When you add a constant-time floor: measure from immediately before the
secret-dependent work, not from request entry; add a regression test with a *large* input on the padded
path (not just a tiny one) to prove the preamble can't evict the floor; and when a branch has an
unbounded (network) tail reachable only when the secret is present, say plainly that the floor does not
close it and name the compensating control — never argue "the other class can't reach it" as if that
hid the signal. Always re-run the break-it adversarial pass on the *revised* fix, not just the first.
Loading
Loading