feat(crypto): Session key rotation with forward and backward secrecy #853
Open
dastansam wants to merge 13 commits into
Implements the primitives described in plan.md for rotating the Noise
KK session between two peers, without wiring any triggers yet. All
behaviour is exercised only from unit tests in this phase; message
dispatch plumbing (periodic trigger, volume counter, grace-window
tick) arrives in Phase 2.
Design choices baked in:
- Session-id collision resolution: lower new_session_id wins
(symmetric, local, no PeerId ordering required).
- No signature field on RotateHandshakeSecond — Noise KK already
authenticates both endpoints via their static keys.
- Grace period default 1 h (configurable via CryptoRotation).
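For reviewers, the collision rule above can be sketched as a pure function. This is a minimal illustration, not the code in noise.rs: `resolve_collision` and `CollisionOutcome` are hypothetical names, and because the description does not say what happens when both ids are equal, the sketch reports a tie instead of guessing.

```rust
use std::cmp::Ordering;

/// Outcome of a rotation collision, seen from the local peer.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum CollisionOutcome {
    /// Our pending rotation has the lower id, so it survives.
    KeepOurs,
    /// The remote rotation has the lower id, so we abandon ours and answer theirs.
    AdoptTheirs,
    /// Equal ids: not covered by the description, left to the caller.
    Tie,
}

/// Decide which of two concurrently initiated rotations survives.
/// The lower `new_session_id` wins; no PeerId ordering is consulted.
pub fn resolve_collision(our_pending_id: u32, their_new_id: u32) -> CollisionOutcome {
    match our_pending_id.cmp(&their_new_id) {
        Ordering::Less => CollisionOutcome::KeepOurs,
        Ordering::Greater => CollisionOutcome::AdoptTheirs,
        Ordering::Equal => CollisionOutcome::Tie,
    }
}
```

Both peers evaluate the same comparison with the arguments swapped, so they reach complementary decisions without exchanging any extra frame.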
Protobuf
crypto_net.proto: RotateHandshakeFirst / RotateHandshakeSecond
messages and matching oneof variants on CryptoserviceContainer.
Config
storage::configuration::CryptoRotation with enabled=false default,
added to the Configuration struct as crypto_rotation.
Upgrade migration and config_persistence test updated.
Storage
New per-user sled tree "rotation_meta" on CryptoAccount.
RotationMeta { primary_session_id, pending_initiated_session_id,
draining_session_id, draining_until, draining_remaining_volume }.
Get/save/delete helpers, delete_state for abandoning a rotation,
and a test_account() helper for tests that bypass global state.
Primitives (services/crypto/noise.rs)
rotate_initiate — create fresh session_id, KK step 1,
record pending_initiated on meta.
rotate_complete_responder — handle incoming rotate_first; on
collision, lower session_id wins; on
nonce mismatch, abandon; on success
emit rotate_second and move primary
into the grace window.
rotate_finalize_initiator — handle rotate_second for our pending;
KK step 2; flip primary.
drain_expired_rotations — scan rotation_meta and retire any
draining session past its deadline or
with zero draining_remaining_volume.
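The retirement condition used by the drain primitive can be sketched as follows. This is an illustration under assumptions: the field types (`Option<u32>` ids, millisecond `u64` deadline, plain `u64` volume), the inclusive deadline comparison, and the helper name `drain_if_expired` are not taken from the implementation.

```rust
/// Simplified stand-in for the stored rotation metadata.
/// Field names follow the description; the types are assumptions.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct RotationMeta {
    pub primary_session_id: Option<u32>,
    pub pending_initiated_session_id: Option<u32>,
    pub draining_session_id: Option<u32>,
    /// Assumed to be a deadline in milliseconds.
    pub draining_until: Option<u64>,
    pub draining_remaining_volume: u64,
}

/// Retire the draining session if its deadline has passed or its
/// message budget is used up. Returns the retired session id, if any.
/// Hypothetical helper; the primary is never touched.
pub fn drain_if_expired(meta: &mut RotationMeta, now_ms: u64) -> Option<u32> {
    // Nothing to do when only a primary exists.
    let draining = meta.draining_session_id?;
    // Assumption: the deadline is inclusive.
    let past_deadline = meta.draining_until.map_or(false, |deadline| now_ms >= deadline);
    let volume_exhausted = meta.draining_remaining_volume == 0;
    if past_deadline || volume_exhausted {
        meta.draining_session_id = None;
        meta.draining_until = None;
        meta.draining_remaining_volume = 0;
        Some(draining)
    } else {
        None
    }
}
```

The four branches correspond to the drain_* unit tests listed in this commit: unexpired, time-expired, volume-exhausted, and primary-only meta.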
Sessionmanager gets a log-and-drop stub for the new oneof variants;
Phase 2 replaces it with real dispatch.
Tests (6, all pass)
rotation_meta_roundtrip, rotation_meta_keyed_per_peer,
drain_leaves_unexpired, drain_retires_time_expired,
drain_retires_volume_exhausted, drain_noop_on_primary_only_meta.
End-to-end rotation tests (clean rotation, collision, late message
within/past grace, replayed nonce) are deferred to Phase 2 / Phase 4
integration tests because the primitives depend on global Users,
Configuration, and CRYPTOSTORAGE state — constructing those is a
libqaul-init operation, not a unit-test operation.
No behaviour change for existing peers: rotate_* frames are never
sent (trigger wiring lands Phase 2), and incoming rotate_* frames
are logged and dropped for now.
Turns the Phase 1 primitives into a live feature. Rotation is still
gated behind `CryptoRotation::enabled` (default false), so unchanged
defaults give byte-identical behaviour to main for existing peers.
What fires rotation now
- Outbound send: `Crypto::encrypt` post-hook checks session age vs
`period_seconds` and `index_nonce_out` vs `volume_messages`; on
trigger, calls `rotate_initiate` and sends the resulting
`RotateHandshakeFirst` as a `CryptoserviceContainer` through the
normal `Messaging::pack_and_send_encrypted_data` path, encrypted
under the currently-primary session.
- Inbound receive: `Crypto::decrypt` post-hook checks
`highest_index_nonce_in` vs `volume_messages` for messages
arriving on the primary and fires a rotation symmetrically.
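A rough sketch of the trigger decision described above, for orientation only. `RotationTriggers` and `should_rotate` are hypothetical names, the `>=` comparisons are assumptions, and the explicit zero guard is one way to model the documented behaviour that legacy rows with `established_at = 0` never trip the time-based trigger.

```rust
/// Subset of the rotation config relevant to triggering.
/// `enabled`, `period_seconds` and `volume_messages` are named in the
/// description; the struct itself is a stand-in.
pub struct RotationTriggers {
    pub enabled: bool,
    pub period_seconds: u64,
    pub volume_messages: u64,
}

/// Returns true when either the age or the volume trigger fires.
/// `nonce_index` stands for `index_nonce_out` on the send path or
/// `highest_index_nonce_in` on the receive path.
pub fn should_rotate(
    cfg: &RotationTriggers,
    established_at_ms: u64,
    now_ms: u64,
    nonce_index: u64,
) -> bool {
    if !cfg.enabled {
        return false;
    }
    // A zero timestamp marks a row written before this feature existed.
    let age_trigger = established_at_ms != 0
        && now_ms.saturating_sub(established_at_ms)
            >= cfg.period_seconds.saturating_mul(1_000);
    let volume_trigger = nonce_index >= cfg.volume_messages;
    age_trigger || volume_trigger
}
```

With the feature disabled the function always returns false, which matches the claim that unchanged defaults behave exactly like main.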
Dispatch of incoming rotation frames
- `sessionmanager::process_rotate_first` calls
`rotate_complete_responder`, then encrypts the resulting
`RotateHandshakeSecond` **under the now-draining old session**
(the initiator hasn't promoted yet) and sends it.
- `sessionmanager::process_rotate_second` calls
`rotate_finalize_initiator` to flip primary on the initiator side.
- Two new helpers — `create_rotate_first_message` and
`create_rotate_second_message` — mirror the existing
`create_second_handshake_message` wrapper pattern.
Primary-session resolution
- `Crypto::resolve_primary_state` consults `rotation_meta` so the
post-rotation window (where a responder briefly has two Transport
rows for the same peer) sends subsequent user traffic on the new
primary, not whichever row `get_state` happens to find first.
- `Crypto::encrypt` now uses `resolve_primary_state`; the decrypt
path is unchanged (it already looks up by `message.session_id`).
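The lookup order can be sketched like this. It is a simplified model, not the sled-backed code: `TransportState` is a stand-in, the state table is an in-memory `BTreeMap`, and the legacy `get_state` behaviour is approximated as "first row found".

```rust
use std::collections::BTreeMap;

/// Stand-in for a stored Transport-state row.
#[derive(Debug, Clone, PartialEq)]
pub struct TransportState {
    pub session_id: u32,
}

/// Prefer the row that rotation_meta names as primary; if meta is
/// absent or stale, fall back to the legacy lookup.
pub fn resolve_primary<'a>(
    meta_primary: Option<u32>,
    states: &'a BTreeMap<u32, TransportState>,
) -> Option<&'a TransportState> {
    meta_primary
        .and_then(|id| states.get(&id))
        // Assumption: the legacy path returns whichever row it finds first.
        .or_else(|| states.values().next())
}
```

The stale-meta fallback matters because returning `None` there would make the encrypt path fail even though a usable session still exists.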
Draining grace on the decrypt side
- `Crypto::after_decrypt_rotation` decrements
`draining_remaining_volume` on each successfully decrypted
Transport message that arrives on the draining session, so the
grace budget is honoured per message (separate from the time
deadline handled by the drain ticker).
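As a sketch of the per-message accounting, assuming a plain `u64` budget: `DrainBudget` and `after_decrypt` are hypothetical names that model only the fields this step touches.

```rust
/// Minimal view of the rotation metadata needed for the grace budget.
#[derive(Debug, Clone, PartialEq)]
pub struct DrainBudget {
    pub primary_session_id: Option<u32>,
    pub draining_session_id: Option<u32>,
    pub draining_remaining_volume: u64,
}

/// Charge one message against the draining budget when the decrypted
/// message arrived on the draining session. Returns true if charged.
pub fn after_decrypt(meta: &mut DrainBudget, session_id: u32) -> bool {
    if meta.draining_session_id == Some(session_id) {
        // saturating_sub keeps an already exhausted budget at zero.
        meta.draining_remaining_volume = meta.draining_remaining_volume.saturating_sub(1);
        true
    } else {
        // Primary or unrelated sessions do not consume grace budget.
        false
    }
}
```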
CryptoState
- New `established_at: u64` (ms) with `#[serde(default)]` so
existing on-disk rows deserialise with 0 and therefore never
trip the time-based trigger until they re-handshake. Set on KK
step-2 completion on both sides.
Periodic drain
- New `rotation_ticker` (60 s) added to both `run`/`event_loop` and
the `start_instance` loop. On tick, iterates
`UserAccounts::get_all_users()` and calls
`CryptoNoise::drain_expired_rotations` per account, gated on
`cfg.crypto_rotation.enabled`.
Deferred to a follow-up
- End-to-end integration tests (clean rotation, collision, late
message within/past grace, replayed nonce). These require
standing up global `Users`, `Configuration`, and `CRYPTOSTORAGE`
state, which is a libqaul-init operation; tests belong in a
dedicated integration harness and will land as Task 11 in a
follow-up commit.
All 27 existing lib tests still pass.
Six new tests exercising the helpers introduced by Phase 2:
resolve_primary_state
- resolve_primary_prefers_meta_designated_row — when
rotation_meta names a primary and both Transport rows exist,
the meta-designated one is returned (the post-responder-step
ambiguity fix).
- resolve_primary_falls_back_without_meta — legacy get_state
path when no rotation activity has happened.
- resolve_primary_ignores_missing_state_for_meta_primary —
stale-meta safety: fall back to get_state rather than
returning None.
after_decrypt_rotation
- after_decrypt_decrements_draining_volume — a message
decrypted on the draining session decrements
`draining_remaining_volume` by exactly one; primary fields
remain untouched.
- after_decrypt_saturates_at_zero — saturating_sub prevents
underflow when the budget is already exhausted.
- after_decrypt_noop_on_unrelated_session — a session_id that
matches neither primary nor draining is ignored.
To drive `Configuration::get()` from these tests without the full
libqaul init chain, add `Configuration::init_for_tests(cfg)` — a
`#[cfg(test)]` idempotent installer for the `CONFIG` InitCell.
`Configuration::default()` could not be used: `Internet::default`
reads `DEFCONFIGS` which is only populated by `Libqaul::new`, so the
test fixture builds the Configuration struct literally from the
sub-modules' self-contained defaults.
Full end-to-end rotation tests (clean rotation across two in-
process peers running the real Noise handshake, collision-loss
path, replayed nonce rejection, grace-window expiry in the face of
live traffic) require `Users::init`, `DataBase::init`, and
`CryptoStorage::init` against tempdirs — a non-trivial fixture that
belongs in plan.md's Phase 4 local-mesh integration harness rather
than here.
All 33 libqaul lib tests pass.
Exposes the Phase 1/2 CryptoRotation settings to clients via a
standard module-scoped RPC, and a qaul-cli sub-command set. No
event surface yet — a `RotationEvent` log (`Rotated`,
`GraceExpired`) is a plausible Phase 3 follow-up but is split from
this commit to keep the diff focused.
Protobuf
- rpc/qaul_rpc.proto: `CRYPTO = 16` in the Modules enum.
- services/crypto/crypto_rpc.proto (new): `Crypto` oneof
container with `GetConfigRequest`, `GetConfigResponse`,
`SetConfigRequest`, `SetConfigResponse`. Every SetConfigRequest
field is `optional`, so clients send *partial* updates —
libqaul treats unset fields as "leave untouched".
libqaul
- `Crypto::rpc(data, user_id, request_id)` (services/crypto/mod.rs):
decodes the Crypto container, routes GetConfig/SetConfig to
`handle_get_config` / `handle_set_config`.
- `handle_set_config` validates each numeric field (rejecting
zero with a per-field error message — rotating on every
message, or retiring draining on first message, are near-
certain client mistakes), applies only the present fields,
persists via `Configuration::save()`, and echoes the post-
update config in `SetConfigResponse.applied`.
- `rpc/mod.rs`: dispatches `Ok(Modules::Crypto)` to
`Crypto::rpc`.
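The partial-update semantics can be illustrated with plain structs in place of the generated protobuf types. This is a sketch under assumptions: `grace_seconds` and `grace_volume_messages` are hypothetical field names inferred from the CLI sub-commands, and `apply_set_config` is not the real handler.

```rust
/// Stand-in for the persisted rotation config.
#[derive(Debug, Clone, PartialEq)]
pub struct CryptoRotation {
    pub enabled: bool,
    pub period_seconds: u64,
    pub volume_messages: u64,
    pub grace_seconds: u64,         // hypothetical name
    pub grace_volume_messages: u64, // hypothetical name
}

/// Stand-in for SetConfigRequest: every field is optional.
#[derive(Debug, Default, Clone)]
pub struct SetConfigRequest {
    pub enabled: Option<bool>,
    pub period_seconds: Option<u64>,
    pub volume_messages: Option<u64>,
    pub grace_seconds: Option<u64>,
    pub grace_volume_messages: Option<u64>,
}

/// Validate first, then apply only the fields that are present.
/// On error the current config is left untouched.
pub fn apply_set_config(
    current: &CryptoRotation,
    req: &SetConfigRequest,
) -> Result<CryptoRotation, String> {
    let numeric = [
        ("period_seconds", req.period_seconds),
        ("volume_messages", req.volume_messages),
        ("grace_seconds", req.grace_seconds),
        ("grace_volume_messages", req.grace_volume_messages),
    ];
    for (name, value) in numeric {
        if value == Some(0) {
            return Err(format!("{} must be greater than zero", name));
        }
    }
    let mut next = current.clone();
    if let Some(v) = req.enabled { next.enabled = v; }
    if let Some(v) = req.period_seconds { next.period_seconds = v; }
    if let Some(v) = req.volume_messages { next.volume_messages = v; }
    if let Some(v) = req.grace_seconds { next.grace_seconds = v; }
    if let Some(v) = req.grace_volume_messages { next.grace_volume_messages = v; }
    Ok(next)
}
```

Validating every field before mutating anything is what lets a rejected request leave the config untouched, as the zero-field test below expects.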
CLI
- `clients/cli/src/crypto.rs` (new): `crypto config`,
`crypto config enable|disable|period <s>|volume <n>|grace <s>
|grace-volume <n>`, plus `Crypto::rpc` render for both
GetConfigResponse and SetConfigResponse.
- Wired into `cli.rs`, `main.rs`, and the `rpc.rs` response
dispatch.
Tests (all 36 lib tests pass)
- `rpc_get_config_returns_installed_config` — round trip through
the real `Rpc` send/receive channel; verifies the response
matches the installed CryptoRotation fields.
- `rpc_set_config_partial_update_preserves_other_fields` —
sends a SetConfigRequest with only `period_seconds`, asserts
`success=true`, `applied.period_seconds` updated, every other
field unchanged. Reverts before releasing the test lock.
- `rpc_set_config_rejects_zero_fields` — asserts validation
path: `success=false`, error mentions the offending field,
config left untouched.
A module-scoped `CONFIG_LOCK: Mutex<()>` serialises tests that
mutate the process-global `CONFIG` InitCell so they don't race
Phase 2's after_decrypt_rotation tests, which also read config.
Remaining for a future Phase 3 bump (deferred)
- Event surface (Rotated / GraceExpired / MessageDroppedPastGrace)
— needs a ring-buffer event log + emission points at
`rotate_finalize_initiator`, `drain_expired_rotations`, and the
past-grace decrypt path. Does not share code with this commit;
splitting keeps the diff focused.
Completes the Phase 3 split by exposing the three rotation events
from plan.md (`Rotated`, `GraceExpired`, `MessageDroppedPastGrace`)
to clients via a process-global ring buffer log queried through
the Crypto RPC module.
Protobuf
- crypto_rpc.proto: `RotationEventKind` enum, `RotationEvent`
message, `GetRotationEventsRequest { since_ms, limit }`,
`GetRotationEventsResponse { events }`. New variants on the
`Crypto` oneof.
libqaul
- services/crypto/events.rs (new): MAX_EVENTS=256 ring buffer in
a lazy `InitCell<RwLock<VecDeque<RotationEvent>>>`, `record()`
with oldest-eviction, `query(since_ms, limit)` with oldest→
newest ordering. Test-only `clear_for_tests()` resets the log
between assertions.
- Three emission sites in `CryptoNoise`:
- `rotate_finalize_initiator` → `Rotated`
- `drain_expired_rotations` → `GraceExpired` + stamps
`last_retired_session_id`/`last_retired_at` on the meta.
- decrypt "session not found" branch → `MessageDroppedPastGrace`
when the incoming `session_id` matches `last_retired_*`.
- `RotationMeta` gets `last_retired_session_id: Option<u32>` and
`last_retired_at: Option<u64>` (both `#[serde(default)]` so
existing on-disk rows deserialise cleanly). `Default` derived
so the many struct-literal sites can use `..Default::default()`.
- `Crypto::rpc` gains the `GetRotationEventsRequest` arm, routed
to `handle_get_events` which maps the internal `events::*`
types onto the proto shapes.
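A minimal sketch of the bounded event log, without the `InitCell<RwLock<...>>` wrapper. The event struct is reduced to two fields, and the exact `since_ms` and `limit` semantics (inclusive lower bound, keep the oldest matches) are assumptions rather than documented behaviour.

```rust
use std::collections::VecDeque;

pub const MAX_EVENTS: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RotationEventKind {
    Rotated,
    GraceExpired,
    MessageDroppedPastGrace,
}

/// Reduced event record; the real message carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RotationEvent {
    pub timestamp_ms: u64,
    pub kind: RotationEventKind,
}

#[derive(Default)]
pub struct EventLog {
    events: VecDeque<RotationEvent>,
}

impl EventLog {
    /// Append an event, evicting the oldest one once the cap is reached.
    pub fn record(&mut self, event: RotationEvent) {
        if self.events.len() == MAX_EVENTS {
            self.events.pop_front();
        }
        self.events.push_back(event);
    }

    /// Return up to `limit` events at or after `since_ms`, oldest first.
    pub fn query(&self, since_ms: u64, limit: usize) -> Vec<RotationEvent> {
        self.events
            .iter()
            .filter(|e| e.timestamp_ms >= since_ms)
            .take(limit)
            .cloned()
            .collect()
    }
}
```

A `VecDeque` gives constant-time eviction at the front and insertion at the back, which is why it suits a fixed-size log of this kind.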
CLI (clients/cli/src/crypto.rs)
- `crypto events [limit]` subcommand fires a
`GetRotationEventsRequest` and prints a five-column table
(timestamp_ms, kind, remote_id, primary, draining).
Tests (40 lib tests total, all pass)
- `event_log_caps_at_max_events` — oldest evicted on overflow.
- `event_log_query_filters_and_limits` — `since_ms` filter and
`limit` cap.
- `drain_emits_grace_expired_and_stamps_meta` — drain path emits
the event and stamps `last_retired_*`.
- `rpc_get_events_returns_recorded_events` — end-to-end round
trip through `Rpc::send_message` / `receive_from_libqaul`.
Tests that mutate the event log hold a dedicated `EVENT_LOG_LOCK`;
`rpc_get_events_returns_recorded_events` additionally holds
`CONFIG_LOCK` (acquired first) to avoid lock-ordering inversions
with Phase 3 config-mutation tests.
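The lock-ordering convention can be sketched as a small helper. The two lock names come from this commit; the helper `lock_config_then_events` and the poison handling are assumptions made for illustration.

```rust
use std::sync::{Mutex, MutexGuard};

static CONFIG_LOCK: Mutex<()> = Mutex::new(());
static EVENT_LOG_LOCK: Mutex<()> = Mutex::new(());

/// Acquire both test locks in the agreed order: config first, then
/// the event log. Taking them in one place keeps every test on the
/// same ordering, which rules out an inversion between two tests.
pub fn lock_config_then_events() -> (MutexGuard<'static, ()>, MutexGuard<'static, ()>) {
    // A poisoned lock only means another test panicked; the guarded
    // unit value carries no state, so recovering the guard is safe.
    let config = CONFIG_LOCK.lock().unwrap_or_else(|p| p.into_inner());
    let events = EVENT_LOG_LOCK.lock().unwrap_or_else(|p| p.into_inner());
    (config, events)
}
```

A test that needs both would bind the returned tuple for its whole body, for example `let _guards = lock_config_then_events();`.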
Defaults unchanged — `CryptoRotation::enabled = false` still ships
dormant, so no event is emitted on a stock installation.
Adds a TriggerRotationRequest/Response pair to crypto_rpc.proto and refactors the trigger-fire path into a shared perform_rotation helper so the manual RPC and the automatic time/volume triggers share send code. handle_trigger_rotation resolves the default user, validates the remote PeerId, and reports the previous/new session ids back to the caller. Mirrors the existing rust/clients/cli crypto commands into qauld-ctl (config / enable / disable / set / rotate / events) with JSON output so the pytest integration harness can drive rotation scenarios. Unit tests cover the disabled-config and invalid-remote-id rejection paths; the end-to-end rotation path requires a live libqaul stack and lives in the upcoming Phase 4 multi-node tests.
Adds the first of five multi-node rotation scenarios from plan.md Phase 4. Also extends the pytest Node helper with crypto_config / set_crypto_config / rotate_with / crypto_events so subsequent scenarios can reuse the driving code. The test converges a line-5 mesh, pins rotation config so automatic triggers cannot fire, then forces a rotation mid-stream between the two endpoints. It asserts no message loss across pre-rotation, straddling, and post-rotation traffic and that both peers log a Rotated event whose draining_session_id matches the sender's previous primary. Requires meshnet-lab (Linux netns + sudo); not runnable on CI or on macOS dev machines.
Partitions the recipient off the mesh by swapping to a line-5 variant that omits the last link, forces a rotation on the still-connected sender, emits traffic while the peer is unreachable, then heals the mesh. Asserts all messages land, both peers log matching Rotated events, and the new primary session id is reflected on both sides. Topology swap (rather than kill_node) keeps qauld alive on both ends so this exercises the messaging buffer / DTN path rather than state reload on the recipient. The restart scenario is tested separately.
Third Phase 4 scenario: two peers rotate with a 15 s grace window on the recipient, then the drain ticker (60 s interval) must retire the old draining session and emit a GraceExpired event for the previous primary. Also asserts that post-rotation traffic on the new primary delivers end-to-end, confirming that draining the old state did not disturb the live session. Notes in the module docstring why the sibling MessageDroppedPastGrace event stays in unit-test scope — reproducing it in a live mesh would require injecting ciphertext on an already-retired session, which no public API exposes.
Fourth Phase 4 scenario. Both peers trip rotation concurrently from a thread pool, then both emit bi-directional traffic across the collision window. Asserts both peers log a Rotated event and every message in both directions (pre-collision, during-collision, post-collision) is delivered exactly once. The collision-resolution rule (lower new_session_id wins, loser drops its HalfOutgoing and adopts the winner's incoming rotate_first) is the gnarliest rotation edge case in a DTN-tolerant system; this test pins the observable convergence contract.
Fifth and final Phase 4 scenario: establish, rotate, then stop qauld on every namespace and restart while the sled database and config persist on disk. After reconvergence the test sends on the post-rotation session in both directions and asserts that delivery succeeds. A failure would mean that either CryptoState or rotation_meta did not round-trip through storage and the sender had to fall back to a new handshake. The in-memory rotation event ring buffer does not survive restart (documented), so the test does not assert on crypto_events after start_qaul.
Adds a UserInfo.capabilities bitset (router_net_info.proto) and an
in-memory Capabilities::{ROTATION, LOCAL, supports} API in
router::users. Local accounts stamp Capabilities::LOCAL into their
User row on create / on Router::init-time reload; incoming UserInfo
updates the remote peer's advertised caps through a new
add_with_check_caps / add_with_caps path.
Crypto::perform_rotation now refuses to rotate with any peer that
has not advertised Capabilities::ROTATION. Without the gate, a
legacy binary on the other end would silently drop the
RotateHandshakeFirst frame and leave the initiator stuck on a
dangling HalfOutgoing row — returning early here lets the caller
keep using the existing legacy session instead.
Also adds Users::{set_capabilities_for_tests, init_for_tests} so
unit tests can simulate UserInfo arrivals without running the full
routing stack, plus three phase5 unit tests covering the gate
rejection, gate acceptance, and bitmask semantics.
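The gate can be pictured with a small bitset sketch. The bit position and the reading of `LOCAL` as "the set this binary advertises, which includes ROTATION" are assumptions; `may_rotate_with` is a hypothetical stand-in for the check inside `Crypto::perform_rotation`.

```rust
/// Capability bitset as advertised in UserInfo (simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Capabilities(pub u32);

impl Capabilities {
    /// Hypothetical bit position for rotation support.
    pub const ROTATION: u32 = 1 << 0;
    /// Assumed: the capability set this binary advertises for itself.
    pub const LOCAL: u32 = Self::ROTATION;

    /// True when every bit in `flag` is set.
    pub fn supports(self, flag: u32) -> bool {
        (self.0 & flag) == flag
    }
}

/// Refuse to rotate unless the peer has advertised ROTATION.
/// `None` models a peer whose capabilities are not known yet.
pub fn may_rotate_with(peer_caps: Option<Capabilities>) -> bool {
    peer_caps.map_or(false, |caps| caps.supports(Capabilities::ROTATION))
}
```

Treating an unknown peer like a legacy peer is the conservative choice in this sketch: the caller keeps using the existing session instead of sending a frame that might be dropped.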
Defaults for the Phase 5 rollout are already in place: Phase 1
shipped `crypto_rotation.enabled = false` by default, and the
capability advertisement is a constant-at-compile-time bitset this
binary always includes. Flipping the default to `true` and
enabling on test nodes are operational steps.
Adds docs/protocols/Noise-Session-Rotation.md alongside the existing messaging and BLE protocol docs. Captures the design separately from plan.md (which mixes design and delivery): goals, why full session rotation rather than a per-message ratchet, trigger model, the three wire frames, receiver routing, rotation_meta layout, the capability negotiation that gates mixed-version peers, the event surface, the operator/RPC surface, threat model, and rollback procedure. References the implementation files and the integration test scenarios so the doc and the code can be navigated together.
closes #843