
feat(crypto): Session key rotation with forward and backward secrecy#853

Open
dastansam wants to merge 13 commits into main from feat/crypto-session-rotation


@dastansam dastansam commented Apr 21, 2026


closes #843

@dastansam dastansam changed the title from "feat(crypto): Phase 1 — Noise KK session rotation primitives" to "feat(crypto): Session key rotation with forward and backward secrecy" Apr 27, 2026
@dastansam dastansam force-pushed the feat/crypto-session-rotation branch from 22e3269 to d39d77a Compare April 27, 2026 13:53
@dastansam dastansam marked this pull request as ready for review April 27, 2026 13:53
@dastansam dastansam self-assigned this Apr 27, 2026
@dastansam dastansam requested a review from MathJud April 27, 2026 13:59
@MathJud MathJud added the Project: Scaling, Usability & Trust label May 11, 2026
dastansam added 13 commits June 2, 2026 20:24
Implements the primitives described in plan.md for rotating the Noise
KK session between two peers, without wiring any triggers yet. All
behaviour is exercised only from unit tests in this phase; message
dispatch plumbing (periodic trigger, volume counter, grace-window
tick) arrives in Phase 2.

Design choices baked in:
- Session-id collision resolution: lower new_session_id wins
  (symmetric, local, no PeerId ordering required).
- No signature field on RotateHandshakeSecond — Noise KK already
  authenticates both endpoints via their static keys.
- Grace period default 1 h (configurable via CryptoRotation).
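
The collision rule above can be sketched as a pure comparison that each peer evaluates locally. This is a minimal illustration, not the code in this PR: `resolve_collision` and `CollisionOutcome` are hypothetical names, the `u32` id type is assumed, and the `Tie` variant exists only because the description does not say what happens when both ids are equal.

```rust
use std::cmp::Ordering;

/// Outcome of a rotation collision, from the local peer's point of view.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CollisionOutcome {
    /// Our pending rotation has the lower id and survives.
    KeepOurs,
    /// The peer's rotation has the lower id; abandon ours and respond to theirs.
    AdoptTheirs,
    /// Both ids are equal; the description does not cover this case.
    Tie,
}

/// Lower new_session_id wins. Both peers run the same comparison on the
/// same pair of ids, so they reach complementary decisions without any
/// extra round trip or PeerId ordering.
pub fn resolve_collision(our_pending_id: u32, their_new_id: u32) -> CollisionOutcome {
    match our_pending_id.cmp(&their_new_id) {
        Ordering::Less => CollisionOutcome::KeepOurs,
        Ordering::Greater => CollisionOutcome::AdoptTheirs,
        Ordering::Equal => CollisionOutcome::Tie,
    }
}
```

Because the inputs are only the two session ids, the peer holding id 3 keeps its rotation while the peer holding id 7 adopts the other side's.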

Protobuf
  crypto_net.proto: RotateHandshakeFirst / RotateHandshakeSecond
  messages and matching oneof variants on CryptoserviceContainer.

Config
  storage::configuration::CryptoRotation with enabled=false default,
  added to the Configuration struct as crypto_rotation.
  Upgrade migration and config_persistence test updated.

Storage
  New per-user sled tree "rotation_meta" on CryptoAccount.
  RotationMeta { primary_session_id, pending_initiated_session_id,
  draining_session_id, draining_until, draining_remaining_volume }.
  Get/save/delete helpers, delete_state for abandoning a rotation,
  and a test_account() helper for tests that bypass global state.

Primitives (services/crypto/noise.rs)
  rotate_initiate          — create fresh session_id, KK step 1,
                             record pending_initiated on meta.
  rotate_complete_responder — handle incoming rotate_first; on
                             collision, lower session_id wins; on
                             nonce mismatch, abandon; on success
                             emit rotate_second and move primary
                             into the grace window.
  rotate_finalize_initiator — handle rotate_second for our pending;
                             KK step 2; flip primary.
  drain_expired_rotations  — scan rotation_meta and retire any
                             draining session past its deadline or
                             with zero draining_remaining_volume.
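
A rough sketch of the drain decision for a single `rotation_meta` row follows. The field names come from the Storage section; the field types, the helper name `drain_if_expired`, and the strict `>` comparison against the deadline are assumptions made for illustration, and the sled scan itself is omitted.

```rust
/// Simplified stand-in for the RotationMeta row; field types are assumed.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct RotationMeta {
    pub primary_session_id: u32,
    pub pending_initiated_session_id: Option<u32>,
    pub draining_session_id: Option<u32>,
    /// Deadline in milliseconds (unit assumed).
    pub draining_until: Option<u64>,
    pub draining_remaining_volume: Option<u64>,
}

/// Hypothetical helper: retire the draining session when it is past its
/// deadline or its volume budget is zero. Returns the retired session id.
pub fn drain_if_expired(meta: &mut RotationMeta, now_ms: u64) -> Option<u32> {
    // Nothing to do when the row only has a primary session.
    let draining = meta.draining_session_id?;
    let time_expired = meta.draining_until.map_or(false, |deadline| now_ms > deadline);
    let volume_exhausted = meta.draining_remaining_volume == Some(0);
    if !(time_expired || volume_exhausted) {
        return None;
    }
    // Clear every draining field so the row describes the primary only.
    meta.draining_session_id = None;
    meta.draining_until = None;
    meta.draining_remaining_volume = None;
    Some(draining)
}
```

The four branches correspond to the four `drain_*` unit tests listed below: unexpired rows are left alone, time-expired and volume-exhausted rows are retired, and a primary-only row is a no-op.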

Sessionmanager gets a log-and-drop stub for the new oneof variants;
Phase 2 replaces it with real dispatch.

Tests (6, all pass)
  rotation_meta_roundtrip, rotation_meta_keyed_per_peer,
  drain_leaves_unexpired, drain_retires_time_expired,
  drain_retires_volume_exhausted, drain_noop_on_primary_only_meta.

End-to-end rotation tests (clean rotation, collision, late message
within/past grace, replayed nonce) are deferred to Phase 2 / Phase 4
integration tests because the primitives depend on global Users,
Configuration, and CRYPTOSTORAGE state — constructing those is a
libqaul-init operation, not a unit-test operation.

No behaviour change for existing peers: rotate_* frames are never
sent (trigger wiring lands Phase 2), and incoming rotate_* frames
are logged and dropped for now.

Turns the Phase 1 primitives into a live feature. Rotation is still
gated behind `CryptoRotation::enabled` (default false), so unchanged
defaults give byte-identical behaviour to main for existing peers.

What fires rotation now
  - Outbound send: `Crypto::encrypt` post-hook checks session age vs
    `period_seconds` and `index_nonce_out` vs `volume_messages`; on
    trigger, calls `rotate_initiate` and sends the resulting
    `RotateHandshakeFirst` as a `CryptoserviceContainer` through the
    normal `Messaging::pack_and_send_encrypted_data` path, encrypted
    under the currently-primary session.
  - Inbound receive: `Crypto::decrypt` post-hook checks
    `highest_index_nonce_in` vs `volume_messages` for messages
    arriving on the primary and fires a rotation symmetrically.
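
The two trigger checks can be sketched as one predicate. `period_seconds` and `volume_messages` are the config fields named above; `RotationLimits`, `should_rotate`, the `>=` comparisons, and the explicit zero check are assumptions. The zero check models the behaviour this commit describes for legacy rows, where `established_at` deserialises as 0 and must never trip the time-based trigger.

```rust
/// Hypothetical view of the two rotation thresholds from CryptoRotation.
pub struct RotationLimits {
    pub period_seconds: u64,
    pub volume_messages: u64,
}

/// Returns true when either the time-based or the volume-based trigger fires.
/// `nonce_index` stands for index_nonce_out on the send path or
/// highest_index_nonce_in on the receive path.
pub fn should_rotate(
    limits: &RotationLimits,
    established_at_ms: u64,
    now_ms: u64,
    nonce_index: u64,
) -> bool {
    // established_at == 0 marks a row written before the field existed,
    // so only the volume trigger can apply to it.
    let time_due = established_at_ms != 0
        && now_ms.saturating_sub(established_at_ms)
            >= limits.period_seconds.saturating_mul(1000);
    let volume_due = nonce_index >= limits.volume_messages;
    time_due || volume_due
}
```

Keeping the predicate free of I/O means the same check can serve both the encrypt and decrypt post-hooks, with only the nonce counter differing.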

Dispatch of incoming rotation frames
  - `sessionmanager::process_rotate_first` calls
    `rotate_complete_responder`, then encrypts the resulting
    `RotateHandshakeSecond` **under the now-draining old session**
    (the initiator hasn't promoted yet) and sends it.
  - `sessionmanager::process_rotate_second` calls
    `rotate_finalize_initiator` to flip primary on the initiator side.
  - Two new helpers — `create_rotate_first_message` and
    `create_rotate_second_message` — mirror the existing
    `create_second_handshake_message` wrapper pattern.

Primary-session resolution
  - `Crypto::resolve_primary_state` consults `rotation_meta` so the
    post-rotation window (where a responder briefly has two Transport
    rows for the same peer) sends subsequent user traffic on the new
    primary, not whichever row `get_state` happens to find first.
  - `Crypto::encrypt` now uses `resolve_primary_state`; the decrypt
    path is unchanged (it already looks up by `message.session_id`).
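
A minimal sketch of the lookup order, assuming the Transport rows for one peer are available as a map keyed by session id. `SessionState` and `resolve_primary` are hypothetical names, and "first row in the map" stands in for whatever `get_state` returns.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a stored Transport row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionState {
    pub session_id: u32,
}

/// Prefer the row named by rotation_meta; otherwise fall back to the
/// legacy "first row found" behaviour. A stale meta entry that points at
/// a missing row also falls back instead of returning None.
pub fn resolve_primary(
    states: &BTreeMap<u32, SessionState>,
    meta_primary: Option<u32>,
) -> Option<&SessionState> {
    meta_primary
        .and_then(|id| states.get(&id))
        .or_else(|| states.values().next())
}
```

With rows 1 and 2 present and meta naming 2, the new primary is chosen even though row 1 would be found first.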

Draining grace on the decrypt side
  - `Crypto::after_decrypt_rotation` decrements
    `draining_remaining_volume` on each successfully decrypted
    Transport message that arrives on the draining session, so the
    grace budget is honoured per message (separate from the time
    deadline handled by the drain ticker).
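
The per-message budget can be illustrated with a small function over an in-memory row. `DrainingMeta` and `after_decrypt` are hypothetical names and the `u64` budget type is assumed; the use of `saturating_sub` matches the unit test described for this hook.

```rust
/// Hypothetical subset of the rotation_meta row relevant to draining.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DrainingMeta {
    pub primary_session_id: u32,
    pub draining_session_id: Option<u32>,
    pub draining_remaining_volume: u64,
}

/// Call after a Transport message decrypts successfully.
/// Returns true when the message arrived on the draining session and the
/// budget was charged; any other session id leaves the row untouched.
pub fn after_decrypt(meta: &mut DrainingMeta, session_id: u32) -> bool {
    if meta.draining_session_id != Some(session_id) {
        return false;
    }
    // saturating_sub keeps an already exhausted budget at zero.
    meta.draining_remaining_volume = meta.draining_remaining_volume.saturating_sub(1);
    true
}
```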

CryptoState
  - New `established_at: u64` (ms) with `#[serde(default)]` so
    existing on-disk rows deserialise with 0 and therefore never
    trip the time-based trigger until they re-handshake. Set on KK
    step-2 completion on both sides.

Periodic drain
  - New `rotation_ticker` (60 s) added to both `run`/`event_loop` and
    the `start_instance` loop. On tick, iterates
    `UserAccounts::get_all_users()` and calls
    `CryptoNoise::drain_expired_rotations` per account, gated on
    `cfg.crypto_rotation.enabled`.

Deferred to a follow-up
  - End-to-end integration tests (clean rotation, collision, late
    message within/past grace, replayed nonce). These require
    standing up global `Users`, `Configuration`, and `CRYPTOSTORAGE`
    state, which is a libqaul-init operation; tests belong in a
    dedicated integration harness and will land as Task 11 in a
    follow-up commit.

All 27 existing lib tests still pass.

Six new tests exercising the helpers introduced by Phase 2:

  resolve_primary_state
    - resolve_primary_prefers_meta_designated_row — when
      rotation_meta names a primary and both Transport rows exist,
      the meta-designated one is returned (the post-responder-step
      ambiguity fix).
    - resolve_primary_falls_back_without_meta — legacy get_state
      path when no rotation activity has happened.
    - resolve_primary_ignores_missing_state_for_meta_primary —
      stale-meta safety: fall back to get_state rather than
      returning None.

  after_decrypt_rotation
    - after_decrypt_decrements_draining_volume — a message
      decrypted on the draining session decrements
      `draining_remaining_volume` by exactly one; primary fields
      remain untouched.
    - after_decrypt_saturates_at_zero — saturating_sub prevents
      underflow when the budget is already exhausted.
    - after_decrypt_noop_on_unrelated_session — a session_id that
      matches neither primary nor draining is ignored.

To drive `Configuration::get()` from these tests without the full
libqaul init chain, add `Configuration::init_for_tests(cfg)` — a
`#[cfg(test)]` idempotent installer for the `CONFIG` InitCell.
`Configuration::default()` could not be used: `Internet::default`
reads `DEFCONFIGS` which is only populated by `Libqaul::new`, so the
test fixture builds the Configuration struct literally from the
sub-modules' self-contained defaults.

Full end-to-end rotation tests (clean rotation across two in-
process peers running the real Noise handshake, collision-loss
path, replayed nonce rejection, grace-window expiry in the face of
live traffic) require `Users::init`, `DataBase::init`, and
`CryptoStorage::init` against tempdirs — a non-trivial fixture that
belongs in plan.md's Phase 4 local-mesh integration harness rather
than here.

All 33 libqaul lib tests pass.

Exposes the Phase 1/2 CryptoRotation settings to clients via a
standard module-scoped RPC, and a qaul-cli sub-command set. No
event surface yet — a `RotationEvent` log (`Rotated`,
`GraceExpired`) is a plausible Phase 3 follow-up but is split from
this commit to keep the diff focused.

Protobuf
  - rpc/qaul_rpc.proto: `CRYPTO = 16` in the Modules enum.
  - services/crypto/crypto_rpc.proto (new): `Crypto` oneof
    container with `GetConfigRequest`, `GetConfigResponse`,
    `SetConfigRequest`, `SetConfigResponse`. Every SetConfigRequest
    field is `optional`, so clients send *partial* updates —
    libqaul treats unset fields as "leave untouched".

libqaul
  - `Crypto::rpc(data, user_id, request_id)` (services/crypto/mod.rs):
    decodes the Crypto container, routes GetConfig/SetConfig to
    `handle_get_config` / `handle_set_config`.
  - `handle_set_config` validates each numeric field (rejecting
    zero with a per-field error message — rotating on every
    message, or retiring draining on first message, are near-
    certain client mistakes), applies only the present fields,
    persists via `Configuration::save()`, and echoes the post-
    update config in `SetConfigResponse.applied`.
  - `rpc/mod.rs`: dispatches `Ok(Modules::Crypto)` to
    `Crypto::rpc`.
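
The partial-update semantics can be sketched with `Option` fields standing in for the proto `optional` fields. This is an assumption-laden illustration: `apply_set_config` is a hypothetical helper, the two grace field names are guesses based on the CLI sub-commands, and persistence via `Configuration::save()` is left out.

```rust
/// Simplified CryptoRotation; the two grace_* field names are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CryptoRotation {
    pub enabled: bool,
    pub period_seconds: u64,
    pub volume_messages: u64,
    pub grace_seconds: u64,
    pub grace_volume: u64,
}

/// Every field is optional; None means "leave untouched".
#[derive(Debug, Default, Clone)]
pub struct SetConfigRequest {
    pub enabled: Option<bool>,
    pub period_seconds: Option<u64>,
    pub volume_messages: Option<u64>,
    pub grace_seconds: Option<u64>,
    pub grace_volume: Option<u64>,
}

/// Validate first, then apply only the present fields, so a rejected
/// request leaves the current config untouched.
pub fn apply_set_config(
    current: &CryptoRotation,
    req: &SetConfigRequest,
) -> Result<CryptoRotation, String> {
    let numeric = [
        ("period_seconds", req.period_seconds),
        ("volume_messages", req.volume_messages),
        ("grace_seconds", req.grace_seconds),
        ("grace_volume", req.grace_volume),
    ];
    for (name, value) in numeric.iter() {
        if *value == Some(0) {
            return Err(format!("{} must be greater than zero", name));
        }
    }
    let mut next = current.clone();
    if let Some(v) = req.enabled { next.enabled = v; }
    if let Some(v) = req.period_seconds { next.period_seconds = v; }
    if let Some(v) = req.volume_messages { next.volume_messages = v; }
    if let Some(v) = req.grace_seconds { next.grace_seconds = v; }
    if let Some(v) = req.grace_volume { next.grace_volume = v; }
    Ok(next)
}
```

Returning the updated struct rather than mutating in place makes it straightforward to echo the post-update config back to the client.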

CLI
  - `clients/cli/src/crypto.rs` (new): `crypto config`,
    `crypto config enable|disable|period <s>|volume <n>|grace <s>
    |grace-volume <n>`, plus `Crypto::rpc` render for both
    GetConfigResponse and SetConfigResponse.
  - Wired into `cli.rs`, `main.rs`, and the `rpc.rs` response
    dispatch.

Tests (all 36 lib tests pass)
  - `rpc_get_config_returns_installed_config` — round trip through
    the real `Rpc` send/receive channel; verifies the response
    matches the installed CryptoRotation fields.
  - `rpc_set_config_partial_update_preserves_other_fields` —
    sends a SetConfigRequest with only `period_seconds`, asserts
    `success=true`, `applied.period_seconds` updated, every other
    field unchanged. Reverts before releasing the test lock.
  - `rpc_set_config_rejects_zero_fields` — asserts validation
    path: `success=false`, error mentions the offending field,
    config left untouched.

  A module-scoped `CONFIG_LOCK: Mutex<()>` serialises tests that
  mutate the process-global `CONFIG` InitCell so they don't race
  Phase 2's after_decrypt_rotation tests, which also read config.

Remaining for a future Phase 3 bump (deferred)
  - Event surface (Rotated / GraceExpired / MessageDroppedPastGrace)
    — needs a ring-buffer event log + emission points at
    `rotate_finalize_initiator`, `drain_expired_rotations`, and the
    past-grace decrypt path. Does not share code with this commit;
    splitting keeps the diff focused.

Completes the Phase 3 split by exposing the three rotation events
from plan.md (`Rotated`, `GraceExpired`, `MessageDroppedPastGrace`)
to clients via a process-global ring buffer log queried through
the Crypto RPC module.

Protobuf
  - crypto_rpc.proto: `RotationEventKind` enum, `RotationEvent`
    message, `GetRotationEventsRequest { since_ms, limit }`,
    `GetRotationEventsResponse { events }`. New variants on the
    `Crypto` oneof.

libqaul
  - services/crypto/events.rs (new): MAX_EVENTS=256 ring buffer in
    a lazy `InitCell<RwLock<VecDeque<RotationEvent>>>`, `record()`
    with oldest-eviction, `query(since_ms, limit)` with oldest→
    newest ordering. Test-only `clear_for_tests()` resets the log
    between assertions.
  - Three emission sites in `CryptoNoise`:
      - `rotate_finalize_initiator` → `Rotated`
      - `drain_expired_rotations` → `GraceExpired` + stamps
        `last_retired_session_id`/`last_retired_at` on the meta.
      - decrypt "session not found" branch → `MessageDroppedPastGrace`
        when the incoming `session_id` matches `last_retired_*`.
  - `RotationMeta` gets `last_retired_session_id: Option<u32>` and
    `last_retired_at: Option<u64>` (both `#[serde(default)]` so
    existing on-disk rows deserialise cleanly). `Default` derived
    so the many struct-literal sites can use `..Default::default()`.
  - `Crypto::rpc` gains the `GetRotationEventsRequest` arm, routed
    to `handle_get_events` which maps the internal `events::*`
    types onto the proto shapes.
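
The bounded log can be sketched with a plain `VecDeque`. `MAX_EVENTS`, `record`, and `query(since_ms, limit)` follow the names above; the `EventLog` wrapper, the reduced `RotationEvent` fields, the inclusive `since_ms` filter, and the choice to keep the oldest `limit` matches are assumptions. The `InitCell<RwLock<...>>` wrapper is omitted to keep the example self-contained.

```rust
use std::collections::VecDeque;

pub const MAX_EVENTS: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RotationEventKind {
    Rotated,
    GraceExpired,
    MessageDroppedPastGrace,
}

/// Reduced event record; the real message carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RotationEvent {
    pub timestamp_ms: u64,
    pub kind: RotationEventKind,
}

/// Hypothetical wrapper around the ring buffer.
#[derive(Debug, Default)]
pub struct EventLog {
    events: VecDeque<RotationEvent>,
}

impl EventLog {
    /// Append an event, evicting the oldest one when the log is full.
    pub fn record(&mut self, event: RotationEvent) {
        if self.events.len() == MAX_EVENTS {
            self.events.pop_front();
        }
        self.events.push_back(event);
    }

    /// Return up to `limit` events with timestamp >= since_ms,
    /// ordered oldest to newest.
    pub fn query(&self, since_ms: u64, limit: usize) -> Vec<RotationEvent> {
        self.events
            .iter()
            .filter(|e| e.timestamp_ms >= since_ms)
            .take(limit)
            .cloned()
            .collect()
    }
}
```

Recording 300 events into an empty log leaves the newest 256, so the oldest surviving entry is the 45th one recorded.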

CLI (clients/cli/src/crypto.rs)
  - `crypto events [limit]` subcommand fires a
    `GetRotationEventsRequest` and prints a five-column table
    (timestamp_ms, kind, remote_id, primary, draining).

Tests (40 lib tests total, all pass)
  - `event_log_caps_at_max_events` — oldest evicted on overflow.
  - `event_log_query_filters_and_limits` — `since_ms` filter and
    `limit` cap.
  - `drain_emits_grace_expired_and_stamps_meta` — drain path emits
    the event and stamps `last_retired_*`.
  - `rpc_get_events_returns_recorded_events` — end-to-end round
    trip through `Rpc::send_message` / `receive_from_libqaul`.

Tests that mutate the event log hold a dedicated `EVENT_LOG_LOCK`;
`rpc_get_events_returns_recorded_events` additionally holds
`CONFIG_LOCK` (acquired first) to avoid lock-ordering inversions
with Phase 3 config-mutation tests.
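
The lock-ordering rule can be sketched with two process-wide mutexes. The lock names match the ones above; the helper `lock_config_then_events` and the poison recovery are assumptions added for the example.

```rust
use std::sync::{Mutex, MutexGuard};

static CONFIG_LOCK: Mutex<()> = Mutex::new(());
static EVENT_LOG_LOCK: Mutex<()> = Mutex::new(());

/// Hypothetical helper for tests that need both locks. CONFIG_LOCK is
/// always taken first, so no test can hold EVENT_LOG_LOCK while waiting
/// on CONFIG_LOCK, which is the pattern that would allow a deadlock.
pub fn lock_config_then_events() -> (MutexGuard<'static, ()>, MutexGuard<'static, ()>) {
    // A panicking test poisons the mutex; recover the guard so later
    // tests are not failed by an unrelated earlier panic.
    let config = CONFIG_LOCK.lock().unwrap_or_else(|p| p.into_inner());
    let events = EVENT_LOG_LOCK.lock().unwrap_or_else(|p| p.into_inner());
    (config, events)
}
```

Tests that touch only the event log can keep taking `EVENT_LOG_LOCK` alone; the ordering matters only for tests that need both.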

Defaults unchanged — `CryptoRotation::enabled = false` still ships
dormant, so no event is emitted on a stock installation.

Adds a TriggerRotationRequest/Response pair to crypto_rpc.proto and
refactors the trigger-fire path into a shared perform_rotation helper so
the manual RPC and the automatic time/volume triggers share send code.
handle_trigger_rotation resolves the default user, validates the remote
PeerId, and reports the previous/new session ids back to the caller.

Mirrors the existing rust/clients/cli crypto commands into qauld-ctl
(config / enable / disable / set / rotate / events) with JSON output so
the pytest integration harness can drive rotation scenarios.

Unit-tests cover the disabled-config and invalid-remote-id rejection
paths; the end-to-end rotation path requires a live libqaul stack and
lives in the upcoming Phase 4 multi-node tests.

Adds the first of five multi-node rotation scenarios from plan.md Phase 4.
Also extends the pytest Node helper with crypto_config / set_crypto_config
/ rotate_with / crypto_events so subsequent scenarios can reuse the
driving code.

The test converges a line-5 mesh, pins rotation config so automatic
triggers cannot fire, then forces a rotation mid-stream between the two
endpoints. It asserts no message loss across pre-rotation, straddling,
and post-rotation traffic and that both peers log a Rotated event whose
draining_session_id matches the sender's previous primary.

Requires meshnet-lab (Linux netns + sudo); not runnable on CI or on
macOS dev machines.

Partitions the recipient off the mesh by swapping to a line-5 variant
that omits the last link, forces a rotation on the still-connected
sender, emits traffic while the peer is unreachable, then heals the
mesh. Asserts all messages land, both peers log matching Rotated
events, and the new primary session id is reflected on both sides.

Topology swap (rather than kill_node) keeps qauld alive on both ends
so this exercises the messaging buffer / DTN path rather than state
reload on the recipient. The restart scenario is tested separately.

Third Phase 4 scenario: two peers rotate with a 15 s grace window on
the recipient, then the drain ticker (60 s interval) must retire the
old draining session and emit a GraceExpired event for the previous
primary. Also asserts that post-rotation traffic on the new primary
delivers end-to-end, confirming that draining the old state did not
disturb the live session.

Notes in the module docstring why the sibling MessageDroppedPastGrace
event stays in unit-test scope — reproducing it in a live mesh would
require injecting ciphertext on an already-retired session, which no
public API exposes.

Fourth Phase 4 scenario. Both peers trip rotation concurrently from a
thread pool, then both emit bi-directional traffic across the collision
window. Asserts both peers log a Rotated event and every message in
both directions (pre-collision, during-collision, post-collision) is
delivered exactly once.

The collision-resolution rule (lower new_session_id wins, loser drops
its HalfOutgoing and adopts the winner's incoming rotate_first) is the
gnarliest rotation edge case in a DTN-tolerant system; this test pins
the observable convergence contract.

Fifth and final Phase 4 scenario: establish, rotate, then stop qauld
on every namespace and restart while the sled database and config
persist on disk. After reconvergence the test sends on the post-
rotation session in both directions and asserts delivery succeeds —
failure would mean either CryptoState or rotation_meta did not
round-trip through storage and the sender had to fall back to a new
handshake.

The in-memory rotation event ring buffer does not survive restart
(documented), so the test does not assert on crypto_events after
start_qaul.

Adds a UserInfo.capabilities bitset (router_net_info.proto) and an
in-memory Capabilities::{ROTATION, LOCAL, supports} API in
router::users. Local accounts stamp Capabilities::LOCAL into their
User row on create / on Router::init-time reload; incoming UserInfo
updates the remote peer's advertised caps through a new
add_with_check_caps / add_with_caps path.

Crypto::perform_rotation now refuses to rotate with any peer that
has not advertised Capabilities::ROTATION. Without the gate, a
legacy binary on the other end would silently drop the
RotateHandshakeFirst frame and leave the initiator stuck on a
dangling HalfOutgoing row — returning early here lets the caller
keep using the existing legacy session instead.
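
A minimal sketch of the capability gate follows. `Capabilities::{ROTATION, LOCAL, supports}` are the names from this commit, but the bit values, the `u32` width, the reading of `LOCAL` as "everything this binary advertises", and the helper `may_rotate_with` are assumptions.

```rust
/// Namespace for the capability bitset; bit positions are assumed.
pub struct Capabilities;

impl Capabilities {
    /// Peer understands the rotate_* frames.
    pub const ROTATION: u32 = 1 << 0;
    /// Set of capabilities the local binary stamps on its own accounts.
    pub const LOCAL: u32 = Self::ROTATION;

    /// True when every bit of `flag` is present in `caps`.
    pub fn supports(caps: u32, flag: u32) -> bool {
        (caps & flag) == flag
    }
}

/// Hypothetical gate: rotate only with peers that advertised ROTATION.
/// `None` models a peer whose UserInfo carried no capabilities, which is
/// treated like a legacy binary.
pub fn may_rotate_with(peer_caps: Option<u32>) -> bool {
    match peer_caps {
        Some(caps) => Capabilities::supports(caps, Capabilities::ROTATION),
        None => false,
    }
}
```

Returning `false` for unknown peers keeps the existing session in use, which is the safe default when a mesh mixes old and new binaries.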

Also adds Users::{set_capabilities_for_tests, init_for_tests} so
unit tests can simulate UserInfo arrivals without running the full
routing stack, plus three phase5 unit tests covering the gate
rejection, gate acceptance, and bitmask semantics.

Defaults for the Phase 5 rollout are already in place: Phase 1
shipped `crypto_rotation.enabled = false` by default, and the
capability advertisement is a constant-at-compile-time bitset this
binary always includes. Flipping the default to `true` and
enabling on test nodes are operational steps.

Adds docs/protocols/Noise-Session-Rotation.md alongside the existing
messaging and BLE protocol docs. Captures the design separately from
plan.md (which mixes design and delivery): goals, why full session
rotation rather than a per-message ratchet, trigger model, the three
wire frames, receiver routing, rotation_meta layout, the capability
negotiation that gates mixed-version peers, the event surface, the
operator/RPC surface, threat model, and rollback procedure.

References the implementation files and the integration test scenarios
so the doc and the code can be navigated together.
@dastansam dastansam force-pushed the feat/crypto-session-rotation branch from ad496de to ceb837d Compare June 3, 2026 08:23

Labels

Project: Scaling, Usability & Trust

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a Concept for PFS Crypto in Messaging

2 participants