Skip to content

feat: 010 scalability improvements#301

Open
zeljkoX wants to merge 12 commits into
mainfrom
010-scalability-improvements
Open

feat: 010 scalability improvements#301
zeljkoX wants to merge 12 commits into
mainfrom
010-scalability-improvements

Conversation

@zeljkoX

@zeljkoX zeljkoX commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Horizontal scaling correctness across replicas

Makes the Guardian server correct under 2–6 replicas behind a non-sticky load balancer. Several subsystems were written for a single instance, so a request that started on one replica and continued on another could fail, and background work ran redundantly on every replica.

What changed

  • Auth — operator/EVM challenges and sessions move from per-process memory to shared storage, so login (challenge → verify) and authenticated requests work on any replica; logout is honored fleet-wide.
  • Canonicalization — now runs on exactly one replica via a leader lease with a fencing token, concurrent renewal, and cooperative cancellation, so each candidate is promoted/discarded once (no double-processing or retry double-counts).
  • Rate limiting — partitioned by GUARDIAN_MAX_REPLICAS (wired from the autoscaling max capacity) so aggregate enforcement stays at/below the global limit instead of multiplying per replica.
  • Guardrails & ops — prod refuses the filesystem backend, migrations are serialized across simultaneous boots, a single startup log line reports the resolved coordination mode, plus an HA runbook and a local docker-compose guide.

Design note: the goal was explicitly not to introduce Redis (or any new infrastructure dependency). All coordination — sessions, challenges, and the canonicalization lease — reuses the existing Postgres backend that the prod image already ships with. The deployment surface is unchanged; coordination mode is derived from the storage backend alone (Postgres = shared, filesystem = in-memory single-process), so dev/local needs no new services.

Summary by CodeRabbit

  • New Features

    • Added support for running the server across multiple replicas with shared session, challenge, and worker coordination.
    • Added fleet-wide rate limit partitioning based on maximum replica capacity.
    • Added horizontal-scaling setup and validation guides, including local multi-replica examples.
  • Bug Fixes

    • Logout now reports failures when session revocation does not succeed.
    • Production startup now guards against unsupported configurations and missing shared settings.
    • Background processing now uses lease protection to avoid duplicate work across replicas.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73a53a82-cbd3-48da-9800-44813da0d9db

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 010-scalability-improvements

Comment @coderabbitai help to get the list of available commands.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the first slice of feature 010 horizontal scaling correctness for the Guardian server by moving previously per-process coordination state (auth sessions/challenges and canonicalization ownership) into shared Postgres-backed coordination, adding per-replica rate-limit partitioning keyed by an infra-sourced max replica capacity, and documenting the required HA configuration.

Changes:

  • Introduces a coordination subsystem with in-memory (filesystem/dev) and Postgres (prod) implementations for sessions, challenges, and canonicalization leadership.
  • Updates canonicalization to run under a shared lease with renewal + fencing checks, and adds periodic sweeps for expired sessions/challenges.
  • Wires GUARDIAN_MAX_REPLICAS through Terraform and documents HA/runbook guidance + prod-stage startup guards.

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
speckit/features/010-horizontal-scaling/spec.md New feature spec describing HA correctness goals, scenarios, and requirements.
speckit/features/010-horizontal-scaling/data-model.md Defines Postgres tables + concurrency approach (advisory lock) for replica-safe coordination.
speckit/features/010-horizontal-scaling/contracts/db-schema.md Migration contract for new coordination tables + required atomic operations.
speckit/features/010-horizontal-scaling/contracts/coordination-traits.md Contract for new internal coordination traits and their selection rules.
speckit/features/010-horizontal-scaling/contracts/config-contract.md Contract for env vars and prod-stage startup guards/log line for HA mode.
packages/miden-multisig-client/test-results/.last-run.json Adds a test-results artifact file under the TS multisig client.
infra/variables.tf Adds guardian_max_replicas Terraform variable and validation.
infra/ecs.tf Injects GUARDIAN_MAX_REPLICAS into the ECS task definition environment.
infra/data.tf Computes effective_guardian_max_replicas (clamped to autoscaling max capacity).
docs/SERVER_AWS_DEPLOY.md Adds a section explaining HA behavior for the prod profile and links the runbook.
docs/runbooks/horizontal-scaling.md New operator runbook for multi-replica HA deployments and trade-offs/failure modes.
docs/CONFIGURATION.md Documents GUARDIAN_MAX_REPLICAS, prod-stage guards, and HA behavior notes.
crates/server/src/storage/postgres.rs Serializes Diesel migrations across replicas using a Postgres advisory lock.
crates/server/src/schema.rs Adds Diesel schema definitions for new coordination tables.
crates/server/src/middleware/rate_limit.rs Partitions rate limits by GUARDIAN_MAX_REPLICAS + adds tests and warnings.
crates/server/src/main.rs Threads coordination handles from storage builder into server builder.
crates/server/src/lib.rs Exposes new config and coordination modules.
crates/server/src/jobs/canonicalization/worker.rs Acquires/renews a shared lease and cancels work if leadership is lost.
crates/server/src/jobs/canonicalization/processor.rs Adds fencing checks and cooperative cancellation checkpoints during canonicalization.
crates/server/src/evm/session.rs Moves EVM auth challenges/sessions to coordination stores; adds sweep support.
crates/server/src/evm/mod.rs Allows constructing EVM app state with injected session/challenge stores.
crates/server/src/dashboard/types.rs Removes now-obsolete in-memory session/challenge record types.
crates/server/src/dashboard/state.rs Moves operator challenges/sessions to coordination stores; adds prod cursor-secret guard and sweeps.
crates/server/src/coordination/session_store.rs New SessionStore trait + in-memory implementation.
crates/server/src/coordination/postgres/session_store.rs New Postgres SessionStore implementation backed by auth_sessions.
crates/server/src/coordination/postgres/mod.rs New Postgres coordination module utilities and exports.
crates/server/src/coordination/postgres/lease.rs New Postgres lease elector implementation backed by worker_leases.
crates/server/src/coordination/postgres/challenge_store.rs New Postgres ChallengeStore implementation backed by auth_challenges.
crates/server/src/coordination/mod.rs New coordination handle wiring + backend-derived coordination mode selection.
crates/server/src/coordination/leader.rs New LeaderElector trait + AlwaysLeader in-memory implementation.
crates/server/src/coordination/challenge_store.rs New ChallengeStore trait + in-memory implementation + payload encoding/decoding.
crates/server/src/config/stage.rs Centralizes GUARDIAN_ENV=prod detection used by startup guards.
crates/server/src/config/mod.rs New config module root.
crates/server/src/builder/storage.rs Returns CoordinationHandles from storage builder (Postgres vs in-memory).
crates/server/src/builder/startup.rs Logs a single “coordination” line describing resolved mode/backend/stage/max_replicas/secret source.
crates/server/src/builder/mod.rs Wires coordination into dashboard/EVM state construction and adds prod guard for rate-limit partitioning to zero.
crates/server/src/builder/handle.rs Starts canonicalization with a leader handle and adds a periodic session/challenge sweep worker.
crates/server/src/api/evm.rs Adjusts EVM logout to handle possible revoke errors (currently logs + continues).
crates/server/src/api/dashboard.rs Adjusts operator logout to handle possible revoke errors (currently logs + continues).
crates/server/src/ack/mod.rs Reuses the new config::stage::is_prod() helper for ACK secret selection.
crates/server/migrations/2026-06-24-000001_worker_leases/up.sql New migration creating worker_leases.
crates/server/migrations/2026-06-24-000001_worker_leases/down.sql New migration dropping worker_leases.
crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql New migration creating auth_challenges + indexes.
crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql New migration dropping auth_challenges + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql New migration creating auth_sessions + indexes.
crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql New migration dropping auth_sessions + indexes.
crates/server/Cargo.toml Adds tokio-util dependency for cancellation support in canonicalization.
Cargo.lock Updates lockfile for new dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/server/src/api/dashboard.rs Outdated
Comment thread crates/server/src/api/evm.rs Outdated
Comment thread crates/server/src/dashboard/state.rs
Comment thread infra/variables.tf
Comment thread packages/miden-multisig-client/test-results/.last-run.json Outdated
Comment thread infra/variables.tf

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/server/src/jobs/canonicalization/processor.rs (2)

359-400: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Avoid partial canonicalization when the lease drops mid-sequence.

submit_state(...) commits before the second lease check at Line 400. If the lease is lost after Line 359 but before Line 400, the processor aborts with the account state advanced while the delta remains non-canonical. This needs an atomic, fenced write path for state + canonical delta + metadata, or an explicit idempotent recovery strategy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 359 - 400,
The canonicalization flow in `processor.rs` can leave state partially applied
because `submit_state`, auth syncing, and the canonical delta update happen
before the final lease validation. Update the `canonicalization` path around
`ensure_lease_held`, `submit_state`, and `update_auth` so the state write and
delta canonicalization are fenced atomically under a single lease check, or add
an idempotent recovery mechanism that safely reconciles `updated_state`,
`canonical_delta`, and metadata if the lease is lost mid-sequence. Ensure the
fix keeps `submit_state`, `update_auth`, and the final `DeltaStatus::canonical`
transition consistent with one another.

266-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fence metadata writes as well as storage writes.

set_has_pending_candidate(...) and update_auth(...) are state-mutating writes but are not guarded immediately before execution. A replica can pass an earlier fence, lose the lease, then still mutate metadata. Add ensure_lease_held(&delta).await? directly before each metadata mutation, or move these writes into a single fenced storage operation.

Also applies to: 383-386, 409-412

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 266 - 280,
The discard path in processor.rs mutates metadata after a storage delete without
re-checking the lease, so a replica can stale-write state after losing
ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.
🧹 Nitpick comments (1)
crates/server/src/dashboard/state.rs (1)

871-901: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Make the digest-keying test observe the inserted store key.

This test would still pass if the implementation regressed to storing and looking up plaintext tokens, because it only checks session_digest(token) != token and then authenticates through the same production path. Inject a small recording SessionStore and assert insert(...) receives session_digest(token), never the raw token bytes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/server/src/dashboard/state.rs` around lines 871 - 901, The
digest-keying test currently only proves the digest differs from the token and
that authentication succeeds, so it can miss a regression where plaintext tokens
are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql`:
- Around line 7-16: The auth_sessions table is still keyed only by token_digest,
so realm-scoped sessions can collide across operator and EVM namespaces. Update
the migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.

In `@crates/server/src/api/dashboard.rs`:
- Around line 183-188: The logout flow in the dashboard API currently logs
revoke failures and still reports success, so it should fail closed instead.
Update the handler in the logout path that calls state.dashboard.logout(...) to
return an error response when revocation fails instead of continuing to a
success payload, and make sure the response contract reflects this change. If
the documented responses change, update the utoipa annotations on the relevant
endpoint and regenerate the committed OpenAPI specs with the project’s
gen-openapi command.

In `@crates/server/src/api/evm.rs`:
- Around line 183-190: The EVM logout handler currently swallows failures from
state.evm.sessions.logout and still returns success; restore the previous
fail-closed behavior by propagating that error out of the logout path instead of
only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...).
Update the logout endpoint logic in evm.rs so a revoke failure results in an
error response from the handler, and if you intentionally keep success responses
then also update the OpenAPI response docs/specs to match the new HTTP contract.

In `@crates/server/src/builder/mod.rs`:
- Around line 455-505: The Postgres coordination guard in the builder is too
late because it only runs inside the dashboard branch, allowing a manual
Postgres build with a custom dashboard to still proceed with AlwaysLeader and
per-process state. Move the missing-coordination check in the builder flow
immediately after `let coordination = self.coordination;` in
`crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or
`dashboard` are derived, so any Postgres build without coordination handles
fails closed regardless of how `dashboard` is provided.

In `@crates/server/src/config/stage.rs`:
- Around line 9-11: `is_prod` currently checks ENV_GUARDIAN_ENV directly, so
values like "prod " are treated as non-production and bypass startup guards.
Update the logic in `stage.rs` to trim the environment value before comparing it
in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for
`VarError::NotPresent` and other errors unchanged.

In `@crates/server/src/coordination/postgres/challenge_store.rs`:
- Around line 76-121: The issue is that concurrent issue() calls in
ChallengeStore::issue can bypass the per-principal cap because each transaction
sees stale outstanding counts before commit. Add a transaction-scoped lock keyed
by (realm, principal) at the start of the transaction in
PostgresChallengeStore::issue, before the INSERT/DELETE cleanup sequence, so
only one issuance for the same principal runs at a time. Use the existing
transaction block around conn.transaction and place the lock before the
auth_challenges statements that enforce max_outstanding.

In `@crates/server/src/coordination/postgres/lease.rs`:
- Around line 55-69: The lease flow in acquire/release is resetting fence_token
instead of keeping it strictly increasing, which breaks monotonic fencing across
reacquire. Update the lease handling in the try_acquire logic and the release
method so a holder’s reacquisition always advances fence_token, and avoid
deleting the worker_leases row on release if that would reset the counter. Use
the existing worker_leases upsert and fence_token update logic as the place to
enforce the monotonic increment on every acquisition path.

In `@crates/server/src/middleware/rate_limit.rs`:
- Around line 463-465: The env-mutation lock is not being applied consistently
across the rate-limit test suite, so `test_rate_limit_layer_from_env` can still
race with other tests. Update that test to acquire `ENV_LOCK` before touching
shared environment state, and make sure it also clears `GUARDIAN_MAX_REPLICAS`
along with the existing rate-limit env vars. Review the other env-mutating tests
in `rate_limit.rs` and wrap them with the same `ENV_LOCK` pattern so all
shared-process-environment changes are serialized.

In `@crates/server/src/storage/postgres.rs`:
- Around line 45-51: The migration lock acquisition in the postgres startup path
uses pg_advisory_lock and can block forever, so update the lock logic in the
connection/setup flow around the advisory lock call to fail fast instead of
waiting indefinitely. Use pg_try_advisory_lock with a retry deadline/backoff, or
apply a statement/lock timeout before attempting the lock, and keep the existing
error propagation through the migration lock acquisition code so startup can
recover when the lock is held too long.

In `@docs/CONFIGURATION.md`:
- Line 288: The Multi-replica (HA) configuration row overstates what the prod
Terraform profile currently provides; update the wording in the documentation
table to reflect only the variables actually injected by the ECS task
definition. Use the existing configuration entry for the HA quick-start to make
it clear that `GUARDIAN_MAX_REPLICAS` is supplied, while
`GUARDIAN_DASHBOARD_CURSOR_SECRET` must be set separately and is not wired by
the prod Terraform profile.

In `@docs/SERVER_AWS_DEPLOY.md`:
- Around line 556-558: Update the Terraform/runtime docs to use the exact
`effective_guardian_max_replicas` terminology and clarify the
`GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain
that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from
`effective_server_autoscaling_max_capacity`, can be overridden, and is clamped
upward) rather than implying a direct autoscaling-max-only mapping, so the
wording matches the deployed behavior.

In `@infra/variables.tf`:
- Around line 305-318: The guardian_max_replicas variable currently accepts
fractional numbers, but infra/ecs.tf stringifies it and
RateLimitConfig::from_env later parses it as u32, so non-integers can fall back
to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.

In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md`:
- Around line 40-42: The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.

In `@speckit/features/010-horizontal-scaling/data-model.md`:
- Line 14: The document has a heading level jump because the “Migration
concurrency (multi-replica startup) — REQUIRED” section uses a third-level
heading directly under the H1. Update that heading in the data-model markdown to
the correct level so the outline stays consistent and markdownlint MD001 is
satisfied.

In `@speckit/features/010-horizontal-scaling/spec.md`:
- Around line 413-417: The shared-state assumption in the horizontal-scaling
spec incorrectly includes rate-limit counters, which conflicts with FR-010 and
the later dependency text. Update the sentence in the spec so the
Postgres-backed shared store is described for sessions, challenges, leadership,
and cursors only, and remove any mention of rate-limit counters from that shared
coordination/state store assumption.

---

Outside diff comments:
In `@crates/server/src/jobs/canonicalization/processor.rs`:
- Around line 359-400: The canonicalization flow in `processor.rs` can leave
state partially applied because `submit_state`, auth syncing, and the canonical
delta update happen before the final lease validation. Update the
`canonicalization` path around `ensure_lease_held`, `submit_state`, and
`update_auth` so the state write and delta canonicalization are fenced
atomically under a single lease check, or add an idempotent recovery mechanism
that safely reconciles `updated_state`, `canonical_delta`, and metadata if the
lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`,
and the final `DeltaStatus::canonical` transition consistent with one another.
- Around line 266-280: The discard path in processor.rs mutates metadata after a
storage delete without re-checking the lease, so a replica can stale-write state
after losing ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.

---

Nitpick comments:
In `@crates/server/src/dashboard/state.rs`:
- Around line 871-901: The digest-keying test currently only proves the digest
differs from the token and that authentication succeeds, so it can miss a
regression where plaintext tokens are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 58f01332-ef5e-48a4-a152-748af84dfb23

📥 Commits

Reviewing files that changed from the base of the PR and between 57a43d1 and 7e4de78.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (47)
  • crates/server/Cargo.toml
  • crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql
  • crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
  • crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql
  • crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql
  • crates/server/migrations/2026-06-24-000001_worker_leases/down.sql
  • crates/server/migrations/2026-06-24-000001_worker_leases/up.sql
  • crates/server/src/ack/mod.rs
  • crates/server/src/api/dashboard.rs
  • crates/server/src/api/evm.rs
  • crates/server/src/builder/handle.rs
  • crates/server/src/builder/mod.rs
  • crates/server/src/builder/startup.rs
  • crates/server/src/builder/storage.rs
  • crates/server/src/config/mod.rs
  • crates/server/src/config/stage.rs
  • crates/server/src/coordination/challenge_store.rs
  • crates/server/src/coordination/leader.rs
  • crates/server/src/coordination/mod.rs
  • crates/server/src/coordination/postgres/challenge_store.rs
  • crates/server/src/coordination/postgres/lease.rs
  • crates/server/src/coordination/postgres/mod.rs
  • crates/server/src/coordination/postgres/session_store.rs
  • crates/server/src/coordination/session_store.rs
  • crates/server/src/dashboard/state.rs
  • crates/server/src/dashboard/types.rs
  • crates/server/src/evm/mod.rs
  • crates/server/src/evm/session.rs
  • crates/server/src/jobs/canonicalization/processor.rs
  • crates/server/src/jobs/canonicalization/worker.rs
  • crates/server/src/lib.rs
  • crates/server/src/main.rs
  • crates/server/src/middleware/rate_limit.rs
  • crates/server/src/schema.rs
  • crates/server/src/storage/postgres.rs
  • docs/CONFIGURATION.md
  • docs/SERVER_AWS_DEPLOY.md
  • docs/runbooks/horizontal-scaling.md
  • infra/data.tf
  • infra/ecs.tf
  • infra/variables.tf
  • packages/miden-multisig-client/test-results/.last-run.json
  • speckit/features/010-horizontal-scaling/contracts/config-contract.md
  • speckit/features/010-horizontal-scaling/contracts/coordination-traits.md
  • speckit/features/010-horizontal-scaling/contracts/db-schema.md
  • speckit/features/010-horizontal-scaling/data-model.md
  • speckit/features/010-horizontal-scaling/spec.md
💤 Files with no reviewable changes (1)
  • crates/server/src/dashboard/types.rs

Comment thread crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
Comment thread crates/server/src/api/dashboard.rs Outdated
Comment thread crates/server/src/api/evm.rs Outdated
Comment thread crates/server/src/builder/mod.rs
Comment thread crates/server/src/config/stage.rs
Comment thread docs/SERVER_AWS_DEPLOY.md Outdated
Comment thread infra/variables.tf Outdated
Comment thread speckit/features/010-horizontal-scaling/contracts/config-contract.md Outdated
Comment thread speckit/features/010-horizontal-scaling/data-model.md Outdated
Comment thread speckit/features/010-horizontal-scaling/spec.md
@codecov-commenter

codecov-commenter commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.24750% with 187 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.84%. Comparing base (dd66999) to head (025f07f).

❌ Your patch status has failed because the patch coverage (73.24%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
- Coverage   76.95%   76.84%   -0.11%     
==========================================
  Files         160      165       +5     
  Lines       28565    29172     +607     
==========================================
+ Hits        21981    22417     +436     
- Misses       6584     6755     +171     

Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd66999...025f07f. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zeljkoX added a commit that referenced this pull request Jun 24, 2026
- Attribute cursor-secret enforcement to the Compose ${VAR:?} expansion rather
  than a server-side prod guard (the server cursor secret is optional on main;
  the hard requirement lands with #301).
- Production compose: require an explicit GUARDIAN_VERSION (drop the :latest
  default) and bind the metrics port to loopback (127.0.0.1:9464).
- Track B smoke: drop the unconfirmed "storage encryption" log grep; rely on
  ECDSA-signer-ready + clean startup.
- Remove the design-spec artifact from the PR (brainstorming doc, not repo
  content).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zeljkoX zeljkoX marked this pull request as ready for review June 25, 2026 07:26
@zeljkoX zeljkoX requested a review from haseebrabbani as a code owner June 25, 2026 07:26
@zeljkoX zeljkoX requested a review from Copilot June 25, 2026 07:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 57 out of 60 changed files in this pull request and generated 8 comments.

Files not reviewed (1)
  • examples/operator-smoke-web/package-lock.json: Generated file

Comment thread crates/server/src/builder/mod.rs
Comment thread speckit/features/010-horizontal-scaling/contracts/config-contract.md Outdated
Comment thread speckit/features/010-horizontal-scaling/contracts/db-schema.md Outdated
Comment thread speckit/features/010-horizontal-scaling/data-model.md Outdated
Comment thread speckit/features/010-horizontal-scaling/spec.md Outdated
Comment thread speckit/features/010-horizontal-scaling/spec.md Outdated
Comment thread speckit/features/010-horizontal-scaling/spec.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/guides/horizontal-scaling/Caddyfile (1)

14-20: 🩺 Stability & Availability | 🔵 Trivial

Decouple health checks from the public root route.

This matches the current contracts in crates/server/src/builder/handle.rs Line 231 and infra/alb.tf Line 33, but it also means liveness is gated by the same rate-limited / path the comment is working around. A dedicated unthrottled /healthz endpoint would remove that coupling and avoid future 429-driven false negatives as rate limits or probe sources change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/guides/horizontal-scaling/Caddyfile` around lines 14 - 20, The current
health probe configuration is coupled to the public root route, which is
rate-limited and can cause false unhealthy signals. Update the Caddy health
check in the Caddyfile to use a dedicated unthrottled health endpoint such as
/healthz, and make sure the matching server routing in handle and the ALB health
check configuration are aligned with that endpoint so probes no longer depend on
the rate-limited / path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/guides/horizontal-scaling/README.md`:
- Line 7: The fenced block in the README is missing a language tag, causing
markdownlint to fail. Update the opening fence in the affected Markdown snippet
to include an appropriate hint such as text or sh, and keep the rest of the
block unchanged.

---

Nitpick comments:
In `@docs/guides/horizontal-scaling/Caddyfile`:
- Around line 14-20: The current health probe configuration is coupled to the
public root route, which is rate-limited and can cause false unhealthy signals.
Update the Caddy health check in the Caddyfile to use a dedicated unthrottled
health endpoint such as /healthz, and make sure the matching server routing in
handle and the ALB health check configuration are aligned with that endpoint so
probes no longer depend on the rate-limited / path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 93c2015e-1754-455e-abfe-2510cfe4b945

📥 Commits

Reviewing files that changed from the base of the PR and between 7e4de78 and f0dc738.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • examples/operator-smoke-web/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (24)
  • .gitignore
  • crates/server/Cargo.toml
  • crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
  • crates/server/src/ack/mod.rs
  • crates/server/src/api/dashboard.rs
  • crates/server/src/api/evm.rs
  • crates/server/src/coordination/postgres/session_store.rs
  • crates/server/src/dashboard/state.rs
  • crates/server/src/jobs/canonicalization/processor.rs
  • crates/server/src/schema.rs
  • docs/CONFIGURATION.md
  • docs/guides/README.md
  • docs/guides/horizontal-scaling/.env.example
  • docs/guides/horizontal-scaling/Caddyfile
  • docs/guides/horizontal-scaling/README.md
  • docs/guides/horizontal-scaling/docker-compose.override.yml.example
  • docs/guides/horizontal-scaling/docker-compose.yml
  • docs/guides/horizontal-scaling/operators.example.json
  • docs/openapi-dashboard.json
  • docs/openapi-evm.json
  • docs/openapi.json
  • docs/runbooks/horizontal-scaling.md
  • infra/variables.tf
  • packages/miden-multisig-client/.gitignore
✅ Files skipped from review due to trivial changes (7)
  • docs/guides/README.md
  • docs/guides/horizontal-scaling/operators.example.json
  • docs/guides/horizontal-scaling/docker-compose.override.yml.example
  • .gitignore
  • docs/guides/horizontal-scaling/docker-compose.yml
  • docs/guides/horizontal-scaling/.env.example
  • packages/miden-multisig-client/.gitignore
🚧 Files skipped from review as they are similar to previous changes (10)
  • crates/server/Cargo.toml
  • crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
  • infra/variables.tf
  • crates/server/src/schema.rs
  • docs/CONFIGURATION.md
  • docs/runbooks/horizontal-scaling.md
  • crates/server/src/dashboard/state.rs
  • crates/server/src/coordination/postgres/session_store.rs
  • crates/server/src/jobs/canonicalization/processor.rs
  • crates/server/src/ack/mod.rs

Comment thread docs/guides/horizontal-scaling/README.md Outdated
@zeljkoX zeljkoX self-assigned this Jun 25, 2026
@zeljkoX zeljkoX moved this from Backlog to Review in OZ Development for Miden Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Review

Development

Successfully merging this pull request may close these issues.

Ensure horizontal scaling works correctly across multiple Guardian instances Decouple Guardian canonicalization from API instances

3 participants