qauld-ctl: analytics dashboard #884
Open
dastansam wants to merge 22 commits into
Implements the primitives described in plan.md for rotating the Noise
KK session between two peers, without wiring any triggers yet. All
behaviour is exercised only from unit tests in this phase; message
dispatch plumbing (periodic trigger, volume counter, grace-window
tick) arrives in Phase 2.
Design choices baked in:
- Session-id collision resolution: lower new_session_id wins
(symmetric, local, no PeerId ordering required).
- No signature field on RotateHandshakeSecond — Noise KK already
authenticates both endpoints via their static keys.
- Grace period default 1 h (configurable via CryptoRotation).
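Each peer can evaluate the collision rule above locally as a pure comparison. This is a minimal sketch. `CollisionOutcome` and `resolve_collision` are hypothetical names, not the identifiers in services/crypto/noise.rs. The commit does not say what happens when both new ids are equal, so the sketch just surfaces a `Tie` for the caller.

```rust
use std::cmp::Ordering;

/// Outcome of two peers initiating a rotation at the same time (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum CollisionOutcome {
    /// Our pending rotation has the lower id: keep it, ignore the peer's rotate_first.
    KeepOurs,
    /// The peer's id is lower: drop our pending rotation and respond to theirs.
    AdoptTheirs,
    /// Identical ids: this sketch leaves the policy to the caller.
    Tie,
}

/// "Lower new_session_id wins". Both peers compare the same two ids, so they
/// reach mirrored decisions without needing any PeerId ordering.
pub fn resolve_collision(our_pending: u32, their_new: u32) -> CollisionOutcome {
    match our_pending.cmp(&their_new) {
        Ordering::Less => CollisionOutcome::KeepOurs,
        Ordering::Greater => CollisionOutcome::AdoptTheirs,
        Ordering::Equal => CollisionOutcome::Tie,
    }
}
```

Peer A evaluates `resolve_collision(a, b)` and peer B evaluates `resolve_collision(b, a)`. For distinct ids, exactly one side keeps its own rotation and the other adopts it.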
Protobuf
crypto_net.proto: RotateHandshakeFirst / RotateHandshakeSecond
messages and matching oneof variants on CryptoserviceContainer.
Config
storage::configuration::CryptoRotation with enabled=false default,
added to the Configuration struct as crypto_rotation.
Upgrade migration and config_persistence test updated.
Storage
New per-user sled tree "rotation_meta" on CryptoAccount.
RotationMeta { primary_session_id, pending_initiated_session_id,
draining_session_id, draining_until, draining_remaining_volume }.
Get/save/delete helpers, delete_state for abandoning a rotation,
and a test_account() helper for tests that bypass global state.
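Here is a minimal sketch of the per-peer layout, with an in-memory `HashMap` standing in for the sled tree. The field names come from the list above. The concrete types (u32 session ids, u64 millisecond timestamps and counters) and the `RotationMetaStore` wrapper are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Per-peer rotation bookkeeping. Field names follow the commit message;
/// the concrete types are assumptions.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RotationMeta {
    pub primary_session_id: u32,
    pub pending_initiated_session_id: Option<u32>,
    pub draining_session_id: Option<u32>,
    pub draining_until: Option<u64>,          // ms deadline for the grace window
    pub draining_remaining_volume: Option<u64>, // messages still allowed on draining
}

/// In-memory stand-in for the "rotation_meta" tree: one entry per remote
/// peer id, so rotations with different peers never interfere.
#[derive(Default)]
pub struct RotationMetaStore {
    tree: HashMap<Vec<u8>, RotationMeta>,
}

impl RotationMetaStore {
    pub fn get(&self, remote_id: &[u8]) -> Option<RotationMeta> {
        self.tree.get(remote_id).cloned()
    }

    pub fn save(&mut self, remote_id: &[u8], meta: RotationMeta) {
        self.tree.insert(remote_id.to_vec(), meta);
    }

    pub fn delete(&mut self, remote_id: &[u8]) {
        self.tree.remove(remote_id);
    }
}
```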
Primitives (services/crypto/noise.rs)
- rotate_initiate: create a fresh session_id, run KK step 1, and
record it as pending_initiated on the meta.
- rotate_complete_responder: handle an incoming rotate_first. On
collision, the lower session_id wins; on a nonce mismatch, abandon;
on success, emit rotate_second and move the old primary into the
grace window.
- rotate_finalize_initiator: handle rotate_second for our pending
rotation; run KK step 2 and flip primary.
- drain_expired_rotations: scan rotation_meta and retire any draining
session that is past its deadline or has zero
draining_remaining_volume left.
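For a single meta entry, the retirement condition in drain_expired_rotations might look like the sketch below. It makes several assumptions. The deadline is treated as inclusive (`now_ms >= draining_until`). The name `drain_if_expired` is hypothetical. The real function scans the whole rotation_meta tree rather than one entry.

```rust
/// Minimal per-peer meta for the drain check (types are assumptions).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RotationMeta {
    pub primary_session_id: u32,
    pub draining_session_id: Option<u32>,
    pub draining_until: Option<u64>, // ms since epoch
    pub draining_remaining_volume: Option<u64>,
}

/// Retire the draining session if its deadline has been reached or its
/// message budget is spent. Returns the retired session id, if any.
pub fn drain_if_expired(meta: &mut RotationMeta, now_ms: u64) -> Option<u32> {
    // Primary-only meta: nothing is draining, so nothing to retire.
    let draining = meta.draining_session_id?;
    let time_expired = meta.draining_until.map_or(false, |t| now_ms >= t);
    let volume_exhausted = meta.draining_remaining_volume == Some(0);
    if time_expired || volume_exhausted {
        meta.draining_session_id = None;
        meta.draining_until = None;
        meta.draining_remaining_volume = None;
        Some(draining)
    } else {
        None
    }
}
```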
Sessionmanager gets a log-and-drop stub for the new oneof variants;
Phase 2 replaces it with real dispatch.
Tests (6, all pass)
rotation_meta_roundtrip, rotation_meta_keyed_per_peer,
drain_leaves_unexpired, drain_retires_time_expired,
drain_retires_volume_exhausted, drain_noop_on_primary_only_meta.
End-to-end rotation tests (clean rotation, collision, late message
within/past grace, replayed nonce) are deferred to Phase 2 / Phase 4
integration tests because the primitives depend on global Users,
Configuration, and CRYPTOSTORAGE state — constructing those is a
libqaul-init operation, not a unit-test operation.
No behaviour change for existing peers: rotate_* frames are never
sent (trigger wiring lands Phase 2), and incoming rotate_* frames
are logged and dropped for now.
Turns the Phase 1 primitives into a live feature. Rotation is still
gated behind `CryptoRotation::enabled` (default false), so unchanged
defaults give byte-identical behaviour to main for existing peers.
What fires rotation now
- Outbound send: `Crypto::encrypt` post-hook checks session age vs
`period_seconds` and `index_nonce_out` vs `volume_messages`; on
trigger, calls `rotate_initiate` and sends the resulting
`RotateHandshakeFirst` as a `CryptoserviceContainer` through the
normal `Messaging::pack_and_send_encrypted_data` path, encrypted
under the currently-primary session.
- Inbound receive: `Crypto::decrypt` post-hook checks
`highest_index_nonce_in` vs `volume_messages` for messages
arriving on the primary and fires a rotation symmetrically.
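The two post-hook checks can be sketched as pure predicates. The config field names (`enabled`, `period_seconds`, `volume_messages`) and the session counters (`index_nonce_out`, `highest_index_nonce_in`, `established_at`) are taken from this PR. The struct shapes and the choice of `>=` over `>` are assumptions. The `established_at == 0` guard reflects the legacy-row behaviour described under CryptoState.

```rust
/// Per-session counters the post-hooks read (hypothetical struct).
pub struct TriggerInputs {
    pub established_at_ms: u64, // 0 for rows written before the field existed
    pub index_nonce_out: u64,
    pub highest_index_nonce_in: u64,
}

/// Subset of CryptoRotation used by the triggers (shape assumed).
pub struct RotationConfig {
    pub enabled: bool,
    pub period_seconds: u64,
    pub volume_messages: u64,
}

/// Outbound check: session age vs period_seconds, outgoing nonce vs volume.
pub fn should_rotate_on_send(cfg: &RotationConfig, s: &TriggerInputs, now_ms: u64) -> bool {
    if !cfg.enabled {
        return false;
    }
    // established_at == 0 marks a legacy row: never trip the time trigger.
    let aged = s.established_at_ms != 0
        && now_ms.saturating_sub(s.established_at_ms) >= cfg.period_seconds.saturating_mul(1_000);
    aged || s.index_nonce_out >= cfg.volume_messages
}

/// Inbound check: only the volume counter, for messages on the primary.
pub fn should_rotate_on_receive(cfg: &RotationConfig, s: &TriggerInputs) -> bool {
    cfg.enabled && s.highest_index_nonce_in >= cfg.volume_messages
}
```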
Dispatch of incoming rotation frames
- `sessionmanager::process_rotate_first` calls
`rotate_complete_responder`, then encrypts the resulting
`RotateHandshakeSecond` **under the now-draining old session**
(the initiator hasn't promoted yet) and sends it.
- `sessionmanager::process_rotate_second` calls
`rotate_finalize_initiator` to flip primary on the initiator side.
- Two new helpers — `create_rotate_first_message` and
`create_rotate_second_message` — mirror the existing
`create_second_handshake_message` wrapper pattern.
Primary-session resolution
- `Crypto::resolve_primary_state` consults `rotation_meta` so the
post-rotation window (where a responder briefly has two Transport
rows for the same peer) sends subsequent user traffic on the new
primary, not whichever row `get_state` happens to find first.
- `Crypto::encrypt` now uses `resolve_primary_state`; the decrypt
path is unchanged (it already looks up by `message.session_id`).
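The lookup order of resolve_primary_state can be sketched as an `Option` chain. `SessionRow` is hypothetical, as is the `first_match` parameter, which stands in for whatever `get_state` returns. The three cases correspond to the three resolve_primary_* tests listed later in this PR.

```rust
use std::collections::HashMap;

/// Stand-in for a stored Noise session row (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionRow {
    pub session_id: u32,
}

/// Prefer the row that rotation_meta names as primary. If there is no meta,
/// or the meta points at a row that no longer exists, fall back to the
/// legacy lookup result instead of returning None.
pub fn resolve_primary_state(
    rows_by_id: &HashMap<u32, SessionRow>,
    meta_primary: Option<u32>,
    first_match: Option<&SessionRow>,
) -> Option<SessionRow> {
    meta_primary
        .and_then(|id| rows_by_id.get(&id))
        .or(first_match)
        .cloned()
}
```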
Draining grace on the decrypt side
- `Crypto::after_decrypt_rotation` decrements
`draining_remaining_volume` on each successfully decrypted
Transport message that arrives on the draining session, so the
grace budget is honoured per message (separate from the time
deadline handled by the drain ticker).
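Below is a minimal sketch of the per-message budget accounting. It assumes `draining_remaining_volume` is an `Option<u64>` and omits loading and saving the meta. Only messages on the draining session spend budget, and `saturating_sub` keeps the counter from underflowing.

```rust
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RotationMeta {
    pub primary_session_id: u32,
    pub draining_session_id: Option<u32>,
    pub draining_remaining_volume: Option<u64>,
}

/// Called after a Transport message decrypts successfully on `session_id`.
pub fn after_decrypt_rotation(meta: &mut RotationMeta, session_id: u32) {
    if meta.draining_session_id != Some(session_id) {
        return; // primary or unrelated session: budget untouched
    }
    if let Some(remaining) = meta.draining_remaining_volume.as_mut() {
        // Saturate at zero once the grace budget is exhausted.
        *remaining = remaining.saturating_sub(1);
    }
}
```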
CryptoState
- New `established_at: u64` (ms) with `#[serde(default)]` so
existing on-disk rows deserialise with 0 and therefore never
trip the time-based trigger until they re-handshake. Set on KK
step-2 completion on both sides.
Periodic drain
- New `rotation_ticker` (60 s) added to both `run`/`event_loop` and
the `start_instance` loop. On tick, iterates
`UserAccounts::get_all_users()` and calls
`CryptoNoise::drain_expired_rotations` per account, gated on
`cfg.crypto_rotation.enabled`.
Deferred to a follow-up
- End-to-end integration tests (clean rotation, collision, late
message within/past grace, replayed nonce). These require
standing up global `Users`, `Configuration`, and `CRYPTOSTORAGE`
state, which is a libqaul-init operation; tests belong in a
dedicated integration harness and will land as Task 11 in a
follow-up commit.
All 27 existing lib tests still pass.
Six new tests exercising the helpers introduced by Phase 2:
resolve_primary_state
- resolve_primary_prefers_meta_designated_row — when
rotation_meta names a primary and both Transport rows exist,
the meta-designated one is returned (the post-responder-step
ambiguity fix).
- resolve_primary_falls_back_without_meta — legacy get_state
path when no rotation activity has happened.
- resolve_primary_ignores_missing_state_for_meta_primary —
stale-meta safety: fall back to get_state rather than
returning None.
after_decrypt_rotation
- after_decrypt_decrements_draining_volume — a message
decrypted on the draining session decrements
`draining_remaining_volume` by exactly one; primary fields
remain untouched.
- after_decrypt_saturates_at_zero — saturating_sub prevents
underflow when the budget is already exhausted.
- after_decrypt_noop_on_unrelated_session — a session_id that
matches neither primary nor draining is ignored.
To drive `Configuration::get()` from these tests without the full
libqaul init chain, add `Configuration::init_for_tests(cfg)` — a
`#[cfg(test)]` idempotent installer for the `CONFIG` InitCell.
`Configuration::default()` could not be used: `Internet::default`
reads `DEFCONFIGS` which is only populated by `Libqaul::new`, so the
test fixture builds the Configuration struct literally from the
sub-modules' self-contained defaults.
Full end-to-end rotation tests (clean rotation across two in-
process peers running the real Noise handshake, collision-loss
path, replayed nonce rejection, grace-window expiry in the face of
live traffic) require `Users::init`, `DataBase::init`, and
`CryptoStorage::init` against tempdirs — a non-trivial fixture that
belongs in plan.md's Phase 4 local-mesh integration harness rather
than here.
All 33 libqaul lib tests pass.
Exposes the Phase 1/2 CryptoRotation settings to clients via a
standard module-scoped RPC, and a qaul-cli sub-command set. No
event surface yet — a `RotationEvent` log (`Rotated`,
`GraceExpired`) is a plausible Phase 3 follow-up but is split from
this commit to keep the diff focused.
Protobuf
- rpc/qaul_rpc.proto: `CRYPTO = 16` in the Modules enum.
- services/crypto/crypto_rpc.proto (new): `Crypto` oneof
container with `GetConfigRequest`, `GetConfigResponse`,
`SetConfigRequest`, `SetConfigResponse`. Every SetConfigRequest
field is `optional`, so clients send *partial* updates —
libqaul treats unset fields as "leave untouched".
libqaul
- `Crypto::rpc(data, user_id, request_id)` (services/crypto/mod.rs):
decodes the Crypto container, routes GetConfig/SetConfig to
`handle_get_config` / `handle_set_config`.
- `handle_set_config` validates each numeric field (rejecting
zero with a per-field error message — rotating on every
message, or retiring draining on first message, are near-
certain client mistakes), applies only the present fields,
persists via `Configuration::save()`, and echoes the post-
update config in `SetConfigResponse.applied`.
- `rpc/mod.rs`: dispatches `Ok(Modules::Crypto)` to
`Crypto::rpc`.
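The validate-then-apply behaviour of handle_set_config can be sketched without the RPC layer. `enabled`, `period_seconds`, and `volume_messages` appear in this PR. The grace field names (`grace_seconds`, `grace_volume`), the `SetConfig` struct, and `apply_set_config` are hypothetical stand-ins for the proto-generated types. Every present field is validated before any is applied, so a rejected request leaves the config untouched.

```rust
/// Current rotation settings (grace field names are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CryptoRotation {
    pub enabled: bool,
    pub period_seconds: u64,
    pub volume_messages: u64,
    pub grace_seconds: u64,
    pub grace_volume: u64,
}

/// Partial update: `None` means "leave untouched", like the optional proto fields.
#[derive(Debug, Default)]
pub struct SetConfig {
    pub enabled: Option<bool>,
    pub period_seconds: Option<u64>,
    pub volume_messages: Option<u64>,
    pub grace_seconds: Option<u64>,
    pub grace_volume: Option<u64>,
}

/// Reject zero in any present numeric field (naming the field), else apply.
pub fn apply_set_config(cfg: &CryptoRotation, req: &SetConfig) -> Result<CryptoRotation, String> {
    let numeric = [
        ("period_seconds", req.period_seconds),
        ("volume_messages", req.volume_messages),
        ("grace_seconds", req.grace_seconds),
        ("grace_volume", req.grace_volume),
    ];
    for (name, value) in numeric {
        if value == Some(0) {
            return Err(format!("{name} must be greater than zero"));
        }
    }
    let mut next = cfg.clone();
    if let Some(v) = req.enabled { next.enabled = v; }
    if let Some(v) = req.period_seconds { next.period_seconds = v; }
    if let Some(v) = req.volume_messages { next.volume_messages = v; }
    if let Some(v) = req.grace_seconds { next.grace_seconds = v; }
    if let Some(v) = req.grace_volume { next.grace_volume = v; }
    Ok(next)
}
```

The returned value plays the role of `SetConfigResponse.applied`. Persisting via `Configuration::save()` is left out of the sketch.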
CLI
- `clients/cli/src/crypto.rs` (new): `crypto config`,
`crypto config enable|disable|period <s>|volume <n>|grace <s>
|grace-volume <n>`, plus `Crypto::rpc` render for both
GetConfigResponse and SetConfigResponse.
- Wired into `cli.rs`, `main.rs`, and the `rpc.rs` response
dispatch.
Tests (all 36 lib tests pass)
- `rpc_get_config_returns_installed_config` — round trip through
the real `Rpc` send/receive channel; verifies the response
matches the installed CryptoRotation fields.
- `rpc_set_config_partial_update_preserves_other_fields` —
sends a SetConfigRequest with only `period_seconds`, asserts
`success=true`, `applied.period_seconds` updated, every other
field unchanged. Reverts before releasing the test lock.
- `rpc_set_config_rejects_zero_fields` — asserts validation
path: `success=false`, error mentions the offending field,
config left untouched.
A module-scoped `CONFIG_LOCK: Mutex<()>` serialises tests that
mutate the process-global `CONFIG` InitCell so they don't race
Phase 2's after_decrypt_rotation tests, which also read config.
Remaining for a future Phase 3 bump (deferred)
- Event surface (Rotated / GraceExpired / MessageDroppedPastGrace)
— needs a ring-buffer event log + emission points at
`rotate_finalize_initiator`, `drain_expired_rotations`, and the
past-grace decrypt path. Does not share code with this commit;
splitting keeps the diff focused.
Completes the Phase 3 split by exposing the three rotation events
from plan.md (`Rotated`, `GraceExpired`, `MessageDroppedPastGrace`)
to clients via a process-global ring buffer log queried through
the Crypto RPC module.
Protobuf
- crypto_rpc.proto: `RotationEventKind` enum, `RotationEvent`
message, `GetRotationEventsRequest { since_ms, limit }`,
`GetRotationEventsResponse { events }`. New variants on the
`Crypto` oneof.
libqaul
- services/crypto/events.rs (new): MAX_EVENTS=256 ring buffer in
a lazy `InitCell<RwLock<VecDeque<RotationEvent>>>`, `record()`
with oldest-eviction, `query(since_ms, limit)` with oldest→
newest ordering. Test-only `clear_for_tests()` resets the log
between assertions.
- Three emission sites in `CryptoNoise`:
- `rotate_finalize_initiator` → `Rotated`
- `drain_expired_rotations` → `GraceExpired` + stamps
`last_retired_session_id`/`last_retired_at` on the meta.
- decrypt "session not found" branch → `MessageDroppedPastGrace`
when the incoming `session_id` matches `last_retired_*`.
- `RotationMeta` gets `last_retired_session_id: Option<u32>` and
`last_retired_at: Option<u64>` (both `#[serde(default)]` so
existing on-disk rows deserialise cleanly). `Default` derived
so the many struct-literal sites can use `..Default::default()`.
- `Crypto::rpc` gains the `GetRotationEventsRequest` arm, routed
to `handle_get_events` which maps the internal `events::*`
types onto the proto shapes.
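Here is a sketch of the bounded event log. It drops the process-global `InitCell<RwLock<VecDeque<RotationEvent>>>` wrapper so the log can be exercised directly. MAX_EVENTS and the `record`/`query` names come from the commit. Two behaviours are assumptions, because the commit does not pin either down: `since_ms` is treated as inclusive, and `limit` keeps the oldest matching events.

```rust
use std::collections::VecDeque;

const MAX_EVENTS: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RotationEventKind {
    Rotated,
    GraceExpired,
    MessageDroppedPastGrace,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RotationEvent {
    pub timestamp_ms: u64,
    pub kind: RotationEventKind,
}

/// Bounded log: recording past MAX_EVENTS evicts the oldest entry.
#[derive(Default)]
pub struct EventLog {
    events: VecDeque<RotationEvent>,
}

impl EventLog {
    pub fn record(&mut self, ev: RotationEvent) {
        if self.events.len() == MAX_EVENTS {
            self.events.pop_front();
        }
        self.events.push_back(ev);
    }

    /// Events with timestamp_ms >= since_ms, oldest to newest, at most `limit`.
    pub fn query(&self, since_ms: u64, limit: usize) -> Vec<RotationEvent> {
        self.events
            .iter()
            .filter(|e| e.timestamp_ms >= since_ms)
            .take(limit)
            .cloned()
            .collect()
    }
}
```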
CLI (clients/cli/src/crypto.rs)
- `crypto events [limit]` subcommand fires a
`GetRotationEventsRequest` and prints a table with columns
(timestamp_ms, kind, remote_id, primary, draining).
Tests (40 lib tests total, all pass)
- `event_log_caps_at_max_events` — oldest evicted on overflow.
- `event_log_query_filters_and_limits` — `since_ms` filter and
`limit` cap.
- `drain_emits_grace_expired_and_stamps_meta` — drain path emits
the event and stamps `last_retired_*`.
- `rpc_get_events_returns_recorded_events` — end-to-end round
trip through `Rpc::send_message` / `receive_from_libqaul`.
Tests that mutate the event log hold a dedicated `EVENT_LOG_LOCK`;
`rpc_get_events_returns_recorded_events` additionally holds
`CONFIG_LOCK` (acquired first) to avoid lock-ordering inversions
with Phase 3 config-mutation tests.
Defaults unchanged — `CryptoRotation::enabled = false` still ships
dormant, so no event is emitted on a stock installation.
Adds a TriggerRotationRequest/Response pair to crypto_rpc.proto and refactors the trigger-fire path into a shared perform_rotation helper so the manual RPC and the automatic time/volume triggers share send code. handle_trigger_rotation resolves the default user, validates the remote PeerId, and reports the previous/new session ids back to the caller. Mirrors the existing rust/clients/cli crypto commands into qauld-ctl (config / enable / disable / set / rotate / events) with JSON output so the pytest integration harness can drive rotation scenarios. Unit tests cover the disabled-config and invalid-remote-id rejection paths; the end-to-end rotation path requires a live libqaul stack and lives in the upcoming Phase 4 multi-node tests.
Adds the first of five multi-node rotation scenarios from plan.md Phase 4. Also extends the pytest Node helper with crypto_config / set_crypto_config / rotate_with / crypto_events so subsequent scenarios can reuse the driving code. The test converges a line-5 mesh, pins rotation config so automatic triggers cannot fire, then forces a rotation mid-stream between the two endpoints. It asserts no message loss across pre-rotation, straddling, and post-rotation traffic and that both peers log a Rotated event whose draining_session_id matches the sender's previous primary. Requires meshnet-lab (Linux netns + sudo); not runnable on CI or on macOS dev machines.
Partitions the recipient off the mesh by swapping to a line-5 variant that omits the last link, forces a rotation on the still-connected sender, emits traffic while the peer is unreachable, then heals the mesh. Asserts all messages land, both peers log matching Rotated events, and the new primary session id is reflected on both sides. Topology swap (rather than kill_node) keeps qauld alive on both ends so this exercises the messaging buffer / DTN path rather than state reload on the recipient. The restart scenario is tested separately.
Third Phase 4 scenario: two peers rotate with a 15 s grace window on the recipient, then the drain ticker (60 s interval) must retire the old draining session and emit a GraceExpired event for the previous primary. Also asserts that post-rotation traffic on the new primary delivers end-to-end, confirming that draining the old state did not disturb the live session. Notes in the module docstring why the sibling MessageDroppedPastGrace event stays in unit-test scope — reproducing it in a live mesh would require injecting ciphertext on an already-retired session, which no public API exposes.
Fourth Phase 4 scenario. Both peers trip rotation concurrently from a thread pool, then both emit bi-directional traffic across the collision window. Asserts both peers log a Rotated event and every message in both directions (pre-collision, during-collision, post-collision) is delivered exactly once. The collision-resolution rule (lower new_session_id wins, loser drops its HalfOutgoing and adopts the winner's incoming rotate_first) is the gnarliest rotation edge case in a DTN-tolerant system; this test pins the observable convergence contract.
Fifth and final Phase 4 scenario: establish, rotate, then stop qauld on every namespace and restart while the sled database and config persist on disk. After reconvergence the test sends on the post-rotation session in both directions and asserts delivery succeeds; failure would mean either CryptoState or rotation_meta did not round-trip through storage and the sender had to fall back to a new handshake. The in-memory rotation event ring buffer does not survive restart (documented), so the test does not assert on crypto_events after start_qaul.
Adds a UserInfo.capabilities bitset (router_net_info.proto) and an
in-memory Capabilities::{ROTATION, LOCAL, supports} API in
router::users. Local accounts stamp Capabilities::LOCAL into their
User row on create / on Router::init-time reload; incoming UserInfo
updates the remote peer's advertised caps through a new
add_with_check_caps / add_with_caps path.
Crypto::perform_rotation now refuses to rotate with any peer that
has not advertised Capabilities::ROTATION. Without the gate, a
legacy binary on the other end would silently drop the
RotateHandshakeFirst frame and leave the initiator stuck on a
dangling HalfOutgoing row — returning early here lets the caller
keep using the existing legacy session instead.
Also adds Users::{set_capabilities_for_tests, init_for_tests} so
unit tests can simulate UserInfo arrivals without running the full
routing stack, plus three phase5 unit tests covering the gate
rejection, gate acceptance, and bitmask semantics.
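Below is a minimal model of the capability gate, with `Capabilities` represented as a newtype over `u32`. The names `ROTATION`, `LOCAL`, and `supports` come from this PR. The bit positions, the representation, and the `rotation_allowed` helper are assumptions. The sketch treats an unknown peer (`None`) like a legacy peer, so rotation only happens with peers that explicitly advertised support.

```rust
/// Capability bits carried in UserInfo.capabilities (bit positions assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Capabilities(pub u32);

impl Capabilities {
    pub const ROTATION: Capabilities = Capabilities(1 << 0);
    pub const LOCAL: Capabilities = Capabilities(1 << 1);

    /// True when every bit of `cap` is set in `self`.
    pub fn supports(self, cap: Capabilities) -> bool {
        (self.0 & cap.0) == cap.0
    }
}

/// Gate in the spirit of perform_rotation: refuse unless the peer advertised
/// ROTATION, so the caller keeps using the existing legacy session instead.
pub fn rotation_allowed(peer_caps: Option<Capabilities>) -> bool {
    peer_caps.map_or(false, |c| c.supports(Capabilities::ROTATION))
}
```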
Defaults for the Phase 5 rollout are already in place: Phase 1
shipped `crypto_rotation.enabled = false` by default, and the
capability advertisement is a constant-at-compile-time bitset this
binary always includes. Flipping the default to `true` and
enabling on test nodes are operational steps.
Adds docs/protocols/Noise-Session-Rotation.md alongside the existing messaging and BLE protocol docs. Captures the design separately from plan.md (which mixes design and delivery): goals, why full session rotation rather than a per-message ratchet, trigger model, the three wire frames, receiver routing, rotation_meta layout, the capability negotiation that gates mixed-version peers, the event surface, the operator/RPC surface, threat model, and rollback procedure. References the implementation files and the integration test scenarios so the doc and the code can be navigated together.
A first-pass analytics view for DTN custody storage. Lives behind the
'DTN' tab next to Users and Feed.
Surfaces:
- DTN state (used MB / message count / unconfirmed count) refreshed
on the normal tick, with the cap from DtnConfig rendered alongside.
- A rolling sparkline of the unconfirmed-count over the last 60
samples so spikes are visible at a glance.
- The configured custodian users (DtnConfigResponse.users) in a
selectable table.
- A live event log fed by routing dtn.delivery_response events out
of the existing subscribe stream into a DTN-specific deque,
leaving the general events panel untouched.
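The 60-sample sparkline history can be kept in a fixed-capacity deque. `RollingSamples` is a hypothetical name. The same structure would also fit the per-transport KPI sparklines on the Network tab.

```rust
use std::collections::VecDeque;

/// Rolling window backing a sparkline: keeps the last `cap` samples and
/// drops the oldest on overflow.
pub struct RollingSamples {
    cap: usize,
    samples: VecDeque<u64>,
}

impl RollingSamples {
    pub fn new(cap: usize) -> Self {
        Self { cap, samples: VecDeque::with_capacity(cap) }
    }

    pub fn push(&mut self, value: u64) {
        if self.cap == 0 {
            return; // a zero-capacity window never stores anything
        }
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    /// Oldest-to-newest values, the order a sparkline draws left to right.
    pub fn as_vec(&self) -> Vec<u64> {
        self.samples.iter().copied().collect()
    }
}
```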
To enable structured routing, the subscribe channel now carries an
EventLine { topic, text } instead of a raw String, and the formatter
gained a dtn.delivery_response arm (accepted/rejected status,
storage node, signature short, reason).
Adds a fourth tab next to Users / Feed / DTN that surfaces
per-transport reachability for this node.
Surfaces:
- Three KPI cards (LAN / Internet / BLE) each with a peer-count
headline and a rolling sparkline of that count over the last 60
refresh samples. LAN also shows a 'local' subline when the
daemon reports same-node peers.
- A peers table populated from Router::ConnectionsRequest: one row
per (peer, transport) pair, showing module, base58 user id,
hop count to that peer, and best-connection RTT.
- A 'Peer events' panel that pulls the live peers.connected
(and reserved peers.disconnected) events out of the subscribe
stream into a network-specific deque.
Routing logic in App::push_event_line now dispatches by topic so
DTN delivery responses, peer events, and everything else each land
in their own panel — the general Events panel stays clean.
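The topic-based routing in App::push_event_line can be pictured as a single `match` on `EventLine.topic`. The topic strings are the ones named in these commits. The `Panel` enum and the `route` function are hypothetical.

```rust
/// Which panel a subscribe line lands in (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum Panel {
    Dtn,
    Peers,
    General,
}

/// Structured subscribe line, as introduced for the DTN tab.
pub struct EventLine {
    pub topic: String,
    pub text: String,
}

/// Dispatch by topic so each kind of event lands in its own panel.
pub fn route(line: &EventLine) -> Panel {
    match line.topic.as_str() {
        "dtn.delivery_response" => Panel::Dtn,
        "peers.connected" | "peers.disconnected" => Panel::Peers,
        _ => Panel::General,
    }
}
```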
Adds a fifth tab next to Users / Feed / DTN / Network for Noise
session rotation telemetry. Pure poll-based — the crypto module
exposes rotation events via GetRotationEventsRequest rather than a
subscribe topic, so we just fetch on the normal refresh tick and
advance a since_ms floor to avoid refetching.
Surfaces:
- Config card: master switch (green/red), period/volume triggers,
grace settings.
- Counts strip: tally of buffered events by kind (rotated /
grace_expired / dropped_past_grace), with dropped events shown
in red when non-zero so silent decrypt failures don't hide.
- Rotation events table (newest first): timestamp, kind colored
by severity, remote peer (short id), primary and draining
session ids.
App::append_crypto_events tracks the newest timestamp_ms it has
seen and the next fetch passes that as since_ms, so the buffer
grows by delta rather than re-pulling the whole log.
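The delta-fetch cursor amounts to tracking a running maximum. This sketch stores bare timestamps. The struct is a hypothetical stand-in for the fields App keeps, and the method name mirrors App::append_crypto_events without claiming to be its real signature.

```rust
/// Poll cursor for the Crypto tab (hypothetical). The newest timestamp_ms
/// seen so far becomes the since_ms of the next GetRotationEventsRequest.
#[derive(Default)]
pub struct CryptoEventCursor {
    pub since_ms: u64,
    pub timestamps: Vec<u64>,
}

impl CryptoEventCursor {
    pub fn append_crypto_events(&mut self, fetched: &[u64]) {
        for &ts in fetched {
            self.timestamps.push(ts);
            // Advance the floor; never move it backwards.
            self.since_ms = self.since_ms.max(ts);
        }
    }
}
```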
Plumbing-only: a new TOPIC_PEERS_DISCONNECTED constant and an emit_peer_disconnected helper that mirrors emit_peer_connected (same PeerEvent wire shape, different topic string). The two helpers now share an emit_peer_event implementation. No call sites yet.
The prune-policy decision (staleness threshold, per-transport vs global, gossip semantics) is a separate design question, but having the wire surface in place means:
- qauld-tui (and any future client) can bind 'peers.disconnected' today; clients will start receiving events as soon as a prune call site fires emit_peer_disconnected.
- The prune logic, when it lands, doesn't have to touch the subscribe layer.
Includes a mirroring unit test and a doc-comment update on PeerEvent in subscribe.proto so the wire-level docs explain the two topics together.
Adds a new 'crypto.rotation' subscribe topic so push-based clients
(qauld-tui's Crypto tab) see rotation events within ms instead of
waiting up to the 3s poll tick. Payload reuses the existing
qaul.rpc.crypto.RotationEvent proto — no new wire type.
libqaul:
- TOPIC_CRYPTO_ROTATION constant + emit_crypto_rotation helper
in rpc/subscribe.rs, mirroring the peers / dtn emitters.
- New events::record_and_emit(Option<&QaulState>, event) helper
that records in the in-memory log AND pushes the subscribe
event in one call. Production sites use it; tests that don't
have a QaulState continue to call events::record directly.
- The three production record sites switched over:
* decrypt's past-grace drop branch (MessageDroppedPastGrace)
* rotate_finalize_initiator success (Rotated)
* drain_expired_rotations grace retirement (GraceExpired)
- drain_expired_rotations now takes Option<&QaulState>; lib.rs's
rotation ticker passes Some, internal unit tests pass None.
- New crypto_rotation_event_is_delivered_to_subscribers unit
test verifies the wire shape.
qauld-tui:
- EventLine gained a structured 'parsed' field so subscribe
payloads can carry typed data alongside the rendered string.
- format_event recognises crypto.rotation and parses the proto
into a CryptoRotationEvent.
- App::push_event_line merges crypto.rotation push events into
the typed crypto_events buffer with (timestamp_ms, kind,
primary, draining) dedup, so push + poll converge on the same
view without double-counting.
- The 3s poll path stays as a backstop for events that fired
before the subscribe stream was up, and as a fallback when
the stream drops.
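Here is a sketch of the dedup merge, keyed on the same (timestamp_ms, kind, primary, draining) tuple. `CryptoEvents`, `merge`, and the field types (a `String` kind, `u32` session ids) are assumptions. An event that arrives on both the push path and the poll path is stored once.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CryptoRotationEvent {
    pub timestamp_ms: u64,
    pub kind: String,
    pub primary: u32,
    pub draining: u32,
}

/// Typed buffer shared by the push stream and the poll backstop.
#[derive(Default)]
pub struct CryptoEvents {
    seen: HashSet<(u64, String, u32, u32)>,
    pub events: Vec<CryptoRotationEvent>,
}

impl CryptoEvents {
    /// Insert unless an event with the same dedup key is already buffered.
    /// Returns true if the event was new.
    pub fn merge(&mut self, ev: CryptoRotationEvent) -> bool {
        let key = (ev.timestamp_ms, ev.kind.clone(), ev.primary, ev.draining);
        if self.seen.insert(key) {
            self.events.push(ev);
            true
        } else {
            false
        }
    }
}
```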
Two interactive affordances that scale with the tables.
Detail drawer (Enter on any row):
- Fullscreen modal listing the selected row's labelled fields,
untruncated. Solves the everywhere-short_id problem so users
can copy full peer ids, signatures, bios, etc.
- Per-tab schema via App::selected_detail() returning labelled
(key, value) pairs.
- Esc / Enter / q dismiss.
Filter ('/'):
- / opens a text input; rows substring-match (case-insensitive)
against a concatenation of the tab's relevant fields.
- Cursor clamps to the filtered count and resets on each
keystroke so the user is always on the first visible row.
- Filter persists while navigating; Esc clears it and exits the
filter mode; Tab switching also clears it (each tab starts
fresh).
- Each table's title shows 'filtered N/M for "foo"' when a
filter is active.
Internals: App grew filtered_users / filtered_feed /
filtered_dtn_custodians / filtered_peers / filtered_crypto_events
iterator helpers so the render fns and selected_detail consume the
same view. InputMode gained Filtering and Viewing variants; the
key handler treats them as exclusive modes that take precedence
over the Normal-mode bindings.
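The filter and cursor clamp can be sketched with each row modelled as a `Vec<String>` of its searchable fields. The names `filtered` and `clamp_cursor` are hypothetical. Joining fields with a space is an assumption about how the concatenation is built, and it means a query can match across a field boundary.

```rust
/// Case-insensitive substring match against each row's joined fields.
/// An empty query matches every row.
pub fn filtered<'a>(rows: &'a [Vec<String>], query: &str) -> Vec<&'a Vec<String>> {
    let needle = query.to_lowercase();
    rows.iter()
        .filter(|row| row.join(" ").to_lowercase().contains(&needle))
        .collect()
}

/// Keep the selection inside the filtered view; None when nothing is visible.
pub fn clamp_cursor(cursor: usize, visible: usize) -> Option<usize> {
    if visible == 0 {
        None
    } else {
        Some(cursor.min(visible - 1))
    }
}
```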
Catches the generated file up to the PeerEvent docstring change in subscribe.proto (10b2f2e); the source change was committed but the build output wasn't regenerated and re-committed at the same time.
The merge brought app.rs over from integration where it referenced 'crate::data' (sibling at the qauld-tui bin root). After the move into qauld-ctl/src/tui/, data is one level up via super, not at the crate root.
… (`qauld-ctl tui`):
- `peers.disconnected` plumbing
- Detail drawer (`Enter`) for inspecting the focused row in any tab
- Filter (`/`) for narrowing rows by text match
- `peers.disconnected` subscribe topic plumbing (emitter + topic registration; prune call-sites land separately)
- `crypto.rotation` subscribe topic; the Crypto tab consumes this instead of polling

Depends on (and is based on) these open PRs:
- feat: qauldctl revamp #872
- feat(crypto): Session key rotation with forward and backward secrecy #853