feat: 010 scalability improvements#301
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR implements the first slice of feature 010 horizontal scaling correctness for the Guardian server by moving previously per-process coordination state (auth sessions/challenges and canonicalization ownership) into shared Postgres-backed coordination, adding per-replica rate-limit partitioning keyed by an infra-sourced max replica capacity, and documenting the required HA configuration.
Changes:
- Introduces a
coordinationsubsystem with in-memory (filesystem/dev) and Postgres (prod) implementations for sessions, challenges, and canonicalization leadership. - Updates canonicalization to run under a shared lease with renewal + fencing checks, and adds periodic sweeps for expired sessions/challenges.
- Wires
GUARDIAN_MAX_REPLICASthrough Terraform and documents HA/runbook guidance + prod-stage startup guards.
Reviewed changes
Copilot reviewed 47 out of 48 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| speckit/features/010-horizontal-scaling/spec.md | New feature spec describing HA correctness goals, scenarios, and requirements. |
| speckit/features/010-horizontal-scaling/data-model.md | Defines Postgres tables + concurrency approach (advisory lock) for replica-safe coordination. |
| speckit/features/010-horizontal-scaling/contracts/db-schema.md | Migration contract for new coordination tables + required atomic operations. |
| speckit/features/010-horizontal-scaling/contracts/coordination-traits.md | Contract for new internal coordination traits and their selection rules. |
| speckit/features/010-horizontal-scaling/contracts/config-contract.md | Contract for env vars and prod-stage startup guards/log line for HA mode. |
| packages/miden-multisig-client/test-results/.last-run.json | Adds a test-results artifact file under the TS multisig client. |
| infra/variables.tf | Adds guardian_max_replicas Terraform variable and validation. |
| infra/ecs.tf | Injects GUARDIAN_MAX_REPLICAS into the ECS task definition environment. |
| infra/data.tf | Computes effective_guardian_max_replicas (clamped to autoscaling max capacity). |
| docs/SERVER_AWS_DEPLOY.md | Adds a section explaining HA behavior for the prod profile and links the runbook. |
| docs/runbooks/horizontal-scaling.md | New operator runbook for multi-replica HA deployments and trade-offs/failure modes. |
| docs/CONFIGURATION.md | Documents GUARDIAN_MAX_REPLICAS, prod-stage guards, and HA behavior notes. |
| crates/server/src/storage/postgres.rs | Serializes Diesel migrations across replicas using a Postgres advisory lock. |
| crates/server/src/schema.rs | Adds Diesel schema definitions for new coordination tables. |
| crates/server/src/middleware/rate_limit.rs | Partitions rate limits by GUARDIAN_MAX_REPLICAS + adds tests and warnings. |
| crates/server/src/main.rs | Threads coordination handles from storage builder into server builder. |
| crates/server/src/lib.rs | Exposes new config and coordination modules. |
| crates/server/src/jobs/canonicalization/worker.rs | Acquires/renews a shared lease and cancels work if leadership is lost. |
| crates/server/src/jobs/canonicalization/processor.rs | Adds fencing checks and cooperative cancellation checkpoints during canonicalization. |
| crates/server/src/evm/session.rs | Moves EVM auth challenges/sessions to coordination stores; adds sweep support. |
| crates/server/src/evm/mod.rs | Allows constructing EVM app state with injected session/challenge stores. |
| crates/server/src/dashboard/types.rs | Removes now-obsolete in-memory session/challenge record types. |
| crates/server/src/dashboard/state.rs | Moves operator challenges/sessions to coordination stores; adds prod cursor-secret guard and sweeps. |
| crates/server/src/coordination/session_store.rs | New SessionStore trait + in-memory implementation. |
| crates/server/src/coordination/postgres/session_store.rs | New Postgres SessionStore implementation backed by auth_sessions. |
| crates/server/src/coordination/postgres/mod.rs | New Postgres coordination module utilities and exports. |
| crates/server/src/coordination/postgres/lease.rs | New Postgres lease elector implementation backed by worker_leases. |
| crates/server/src/coordination/postgres/challenge_store.rs | New Postgres ChallengeStore implementation backed by auth_challenges. |
| crates/server/src/coordination/mod.rs | New coordination handle wiring + backend-derived coordination mode selection. |
| crates/server/src/coordination/leader.rs | New LeaderElector trait + AlwaysLeader in-memory implementation. |
| crates/server/src/coordination/challenge_store.rs | New ChallengeStore trait + in-memory implementation + payload encoding/decoding. |
| crates/server/src/config/stage.rs | Centralizes GUARDIAN_ENV=prod detection used by startup guards. |
| crates/server/src/config/mod.rs | New config module root. |
| crates/server/src/builder/storage.rs | Returns CoordinationHandles from storage builder (Postgres vs in-memory). |
| crates/server/src/builder/startup.rs | Logs a single “coordination” line describing resolved mode/backend/stage/max_replicas/secret source. |
| crates/server/src/builder/mod.rs | Wires coordination into dashboard/EVM state construction and adds prod guard for rate-limit partitioning to zero. |
| crates/server/src/builder/handle.rs | Starts canonicalization with a leader handle and adds a periodic session/challenge sweep worker. |
| crates/server/src/api/evm.rs | Adjusts EVM logout to handle possible revoke errors (currently logs + continues). |
| crates/server/src/api/dashboard.rs | Adjusts operator logout to handle possible revoke errors (currently logs + continues). |
| crates/server/src/ack/mod.rs | Reuses the new config::stage::is_prod() helper for ACK secret selection. |
| crates/server/migrations/2026-06-24-000001_worker_leases/up.sql | New migration creating worker_leases. |
| crates/server/migrations/2026-06-24-000001_worker_leases/down.sql | New migration dropping worker_leases. |
| crates/server/migrations/2026-06-23-000002_auth_challenges/up.sql | New migration creating auth_challenges + indexes. |
| crates/server/migrations/2026-06-23-000002_auth_challenges/down.sql | New migration dropping auth_challenges + indexes. |
| crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql | New migration creating auth_sessions + indexes. |
| crates/server/migrations/2026-06-23-000001_auth_sessions/down.sql | New migration dropping auth_sessions + indexes. |
| crates/server/Cargo.toml | Adds tokio-util dependency for cancellation support in canonicalization. |
| Cargo.lock | Updates lockfile for new dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/server/src/jobs/canonicalization/processor.rs (2)
359-400: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftAvoid partial canonicalization when the lease drops mid-sequence.
submit_state(...)commits before the second lease check at Line 400. If the lease is lost after Line 359 but before Line 400, the processor aborts with the account state advanced while the delta remains non-canonical. This needs an atomic, fenced write path for state + canonical delta + metadata, or an explicit idempotent recovery strategy.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 359 - 400, The canonicalization flow in `processor.rs` can leave state partially applied because `submit_state`, auth syncing, and the canonical delta update happen before the final lease validation. Update the `canonicalization` path around `ensure_lease_held`, `submit_state`, and `update_auth` so the state write and delta canonicalization are fenced atomically under a single lease check, or add an idempotent recovery mechanism that safely reconciles `updated_state`, `canonical_delta`, and metadata if the lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`, and the final `DeltaStatus::canonical` transition consistent with one another.
266-280: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winFence metadata writes as well as storage writes.
set_has_pending_candidate(...)andupdate_auth(...)are state-mutating writes but are not guarded immediately before execution. A replica can pass an earlier fence, lose the lease, then still mutate metadata. Addensure_lease_held(&delta).await?directly before each metadata mutation, or move these writes into a single fenced storage operation.Also applies to: 383-386, 409-412
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/jobs/canonicalization/processor.rs` around lines 266 - 280, The discard path in processor.rs mutates metadata after a storage delete without re-checking the lease, so a replica can stale-write state after losing ownership. In the discard flow around ensure_lease_held, call ensure_lease_held(&delta).await? immediately before set_has_pending_candidate, and apply the same fencing to every other state mutation in this file, especially update_auth in the other affected branches, or consolidate those writes into one fenced operation.
🧹 Nitpick comments (1)
crates/server/src/dashboard/state.rs (1)
871-901: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winMake the digest-keying test observe the inserted store key.
This test would still pass if the implementation regressed to storing and looking up plaintext tokens, because it only checks
session_digest(token) != tokenand then authenticates through the same production path. Inject a small recordingSessionStoreand assertinsert(...)receivessession_digest(token), never the raw token bytes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/server/src/dashboard/state.rs` around lines 871 - 901, The digest-keying test currently only proves the digest differs from the token and that authentication succeeds, so it can miss a regression where plaintext tokens are stored. Update the test around session_is_keyed_by_digest_not_plaintext_token to use a small recording SessionStore (or equivalent test double) so you can inspect the key passed into insert and confirm it matches crate::secret::session_digest(token) rather than the raw token bytes. Keep the existing round-trip check, but add a direct assertion on the recorded inserted key to verify DashboardState::verify stores only the digest.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql`:
- Around line 7-16: The auth_sessions table is still keyed only by token_digest,
so realm-scoped sessions can collide across operator and EVM namespaces. Update
the migration to make the primary key composite on (realm, token_digest), then
mirror that change in crates/server/src/schema.rs and any code or SQL that
assumes a single-column key, such as auth session insert/delete/load paths, so
Diesel and the database agree on the new realm-scoped identifier.
In `@crates/server/src/api/dashboard.rs`:
- Around line 183-188: The logout flow in the dashboard API currently logs
revoke failures and still reports success, so it should fail closed instead.
Update the handler in the logout path that calls state.dashboard.logout(...) to
return an error response when revocation fails instead of continuing to a
success payload, and make sure the response contract reflects this change. If
the documented responses change, update the utoipa annotations on the relevant
endpoint and regenerate the committed OpenAPI specs with the project’s
gen-openapi command.
In `@crates/server/src/api/evm.rs`:
- Around line 183-190: The EVM logout handler currently swallows failures from
state.evm.sessions.logout and still returns success; restore the previous
fail-closed behavior by propagating that error out of the logout path instead of
only logging it with tracing::warn!(auth_event = "evm_logout_failed", ...).
Update the logout endpoint logic in evm.rs so a revoke failure results in an
error response from the handler, and if you intentionally keep success responses
then also update the OpenAPI response docs/specs to match the new HTTP contract.
In `@crates/server/src/builder/mod.rs`:
- Around line 455-505: The Postgres coordination guard in the builder is too
late because it only runs inside the dashboard branch, allowing a manual
Postgres build with a custom dashboard to still proceed with AlwaysLeader and
per-process state. Move the missing-coordination check in the builder flow
immediately after `let coordination = self.coordination;` in
`crates/server/src/builder/mod.rs`, before `coordination_mode`, `leader`, or
`dashboard` are derived, so any Postgres build without coordination handles
fails closed regardless of how `dashboard` is provided.
In `@crates/server/src/config/stage.rs`:
- Around line 9-11: `is_prod` currently checks ENV_GUARDIAN_ENV directly, so
values like "prod " are treated as non-production and bypass startup guards.
Update the logic in `stage.rs` to trim the environment value before comparing it
in `eq_ignore_ascii_case(PROD_ENV)`, keeping the existing handling for
`VarError::NotPresent` and other errors unchanged.
In `@crates/server/src/coordination/postgres/challenge_store.rs`:
- Around line 76-121: The issue is that concurrent issue() calls in
ChallengeStore::issue can bypass the per-principal cap because each transaction
sees stale outstanding counts before commit. Add a transaction-scoped lock keyed
by (realm, principal) at the start of the transaction in
PostgresChallengeStore::issue, before the INSERT/DELETE cleanup sequence, so
only one issuance for the same principal runs at a time. Use the existing
transaction block around conn.transaction and place the lock before the
auth_challenges statements that enforce max_outstanding.
In `@crates/server/src/coordination/postgres/lease.rs`:
- Around line 55-69: The lease flow in acquire/release is resetting fence_token
instead of keeping it strictly increasing, which breaks monotonic fencing across
reacquire. Update the lease handling in the try_acquire logic and the release
method so a holder’s reacquisition always advances fence_token, and avoid
deleting the worker_leases row on release if that would reset the counter. Use
the existing worker_leases upsert and fence_token update logic as the place to
enforce the monotonic increment on every acquisition path.
In `@crates/server/src/middleware/rate_limit.rs`:
- Around line 463-465: The env-mutation lock is not being applied consistently
across the rate-limit test suite, so `test_rate_limit_layer_from_env` can still
race with other tests. Update that test to acquire `ENV_LOCK` before touching
shared environment state, and make sure it also clears `GUARDIAN_MAX_REPLICAS`
along with the existing rate-limit env vars. Review the other env-mutating tests
in `rate_limit.rs` and wrap them with the same `ENV_LOCK` pattern so all
shared-process-environment changes are serialized.
In `@crates/server/src/storage/postgres.rs`:
- Around line 45-51: The migration lock acquisition in the postgres startup path
uses pg_advisory_lock and can block forever, so update the lock logic in the
connection/setup flow around the advisory lock call to fail fast instead of
waiting indefinitely. Use pg_try_advisory_lock with a retry deadline/backoff, or
apply a statement/lock timeout before attempting the lock, and keep the existing
error propagation through the migration lock acquisition code so startup can
recover when the lock is held too long.
In `@docs/CONFIGURATION.md`:
- Line 288: The Multi-replica (HA) configuration row overstates what the prod
Terraform profile currently provides; update the wording in the documentation
table to reflect only the variables actually injected by the ECS task
definition. Use the existing configuration entry for the HA quick-start to make
it clear that `GUARDIAN_MAX_REPLICAS` is supplied, while
`GUARDIAN_DASHBOARD_CURSOR_SECRET` must be set separately and is not wired by
the prod Terraform profile.
In `@docs/SERVER_AWS_DEPLOY.md`:
- Around line 556-558: Update the Terraform/runtime docs to use the exact
`effective_guardian_max_replicas` terminology and clarify the
`GUARDIAN_MAX_REPLICAS` contract. In the `SERVER_AWS_DEPLOY` guidance, explain
that `GUARDIAN_MAX_REPLICAS` comes from the effective value (derived from
`effective_server_autoscaling_max_capacity`, can be overridden, and is clamped
upward) rather than implying a direct autoscaling-max-only mapping, so the
wording matches the deployed behavior.
In `@infra/variables.tf`:
- Around line 305-318: The guardian_max_replicas variable currently accepts
fractional numbers, but infra/ecs.tf stringifies it and
RateLimitConfig::from_env later parses it as u32, so non-integers can fall back
to 1 and break rate-limit partitioning. Update the validation on
guardian_max_replicas in variables.tf to require whole numbers (and keep the
existing >= 1 rule), so only integer values are allowed. Use the existing
guardian_max_replicas variable block and its validation stanza as the place to
enforce this.
In `@speckit/features/010-horizontal-scaling/contracts/config-contract.md`:
- Around line 40-42: The `effective_guardian_max_replicas` example in the config
contract is missing the upward clamp and should match the implemented behavior.
Update the expression near `local.effective_guardian_max_replicas` so it
preserves the explicit `var.guardian_max_replicas` override only when it is at
least the autoscaling max capacity, otherwise falls back to
`local.effective_server_autoscaling_max_capacity`; this keeps the documented
contract aligned with the runbook and the actual validation logic.
In `@speckit/features/010-horizontal-scaling/data-model.md`:
- Line 14: The document has a heading level jump because the “Migration
concurrency (multi-replica startup) — REQUIRED” section uses a third-level
heading directly under the H1. Update that heading in the data-model markdown to
the correct level so the outline stays consistent and markdownlint MD001 is
satisfied.
In `@speckit/features/010-horizontal-scaling/spec.md`:
- Around line 413-417: The shared-state assumption in the horizontal-scaling
spec incorrectly includes rate-limit counters, which conflicts with FR-010 and
the later dependency text. Update the sentence in the spec so the
Postgres-backed shared store is described for sessions, challenges, leadership,
and cursors only, and remove any mention of rate-limit counters from that shared
coordination/state store assumption.
---
Outside diff comments:
In `@crates/server/src/jobs/canonicalization/processor.rs`:
- Around line 359-400: The canonicalization flow in `processor.rs` can leave
state partially applied because `submit_state`, auth syncing, and the canonical
delta update happen before the final lease validation. Update the
`canonicalization` path around `ensure_lease_held`, `submit_state`, and
`update_auth` so the state write and delta canonicalization are fenced
atomically under a single lease check, or add an idempotent recovery mechanism
that safely reconciles `updated_state`, `canonical_delta`, and metadata if the
lease is lost mid-sequence. Ensure the fix keeps `submit_state`, `update_auth`,
and the final `DeltaStatus::canonical` transition consistent with one another.
- Around line 266-280: The discard path in processor.rs mutates metadata after a
storage delete without re-checking the lease, so a replica can stale-write state
after losing ownership. In the discard flow around ensure_lease_held, call
ensure_lease_held(&delta).await? immediately before set_has_pending_candidate,
and apply the same fencing to every other state mutation in this file,
especially update_auth in the other affected branches, or consolidate those
writes into one fenced operation.
---
Nitpick comments:
In `@crates/server/src/dashboard/state.rs`:
- Around line 871-901: The digest-keying test currently only proves the digest
differs from the token and that authentication succeeds, so it can miss a
regression where plaintext tokens are stored. Update the test around
session_is_keyed_by_digest_not_plaintext_token to use a small recording
SessionStore (or equivalent test double) so you can inspect the key passed into
insert and confirm it matches crate::secret::session_digest(token) rather than
the raw token bytes. Keep the existing round-trip check, but add a direct
assertion on the recorded inserted key to verify DashboardState::verify stores
only the digest.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 58f01332-ef5e-48a4-a152-748af84dfb23
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (47)
crates/server/Cargo.tomlcrates/server/migrations/2026-06-23-000001_auth_sessions/down.sqlcrates/server/migrations/2026-06-23-000001_auth_sessions/up.sqlcrates/server/migrations/2026-06-23-000002_auth_challenges/down.sqlcrates/server/migrations/2026-06-23-000002_auth_challenges/up.sqlcrates/server/migrations/2026-06-24-000001_worker_leases/down.sqlcrates/server/migrations/2026-06-24-000001_worker_leases/up.sqlcrates/server/src/ack/mod.rscrates/server/src/api/dashboard.rscrates/server/src/api/evm.rscrates/server/src/builder/handle.rscrates/server/src/builder/mod.rscrates/server/src/builder/startup.rscrates/server/src/builder/storage.rscrates/server/src/config/mod.rscrates/server/src/config/stage.rscrates/server/src/coordination/challenge_store.rscrates/server/src/coordination/leader.rscrates/server/src/coordination/mod.rscrates/server/src/coordination/postgres/challenge_store.rscrates/server/src/coordination/postgres/lease.rscrates/server/src/coordination/postgres/mod.rscrates/server/src/coordination/postgres/session_store.rscrates/server/src/coordination/session_store.rscrates/server/src/dashboard/state.rscrates/server/src/dashboard/types.rscrates/server/src/evm/mod.rscrates/server/src/evm/session.rscrates/server/src/jobs/canonicalization/processor.rscrates/server/src/jobs/canonicalization/worker.rscrates/server/src/lib.rscrates/server/src/main.rscrates/server/src/middleware/rate_limit.rscrates/server/src/schema.rscrates/server/src/storage/postgres.rsdocs/CONFIGURATION.mddocs/SERVER_AWS_DEPLOY.mddocs/runbooks/horizontal-scaling.mdinfra/data.tfinfra/ecs.tfinfra/variables.tfpackages/miden-multisig-client/test-results/.last-run.jsonspeckit/features/010-horizontal-scaling/contracts/config-contract.mdspeckit/features/010-horizontal-scaling/contracts/coordination-traits.mdspeckit/features/010-horizontal-scaling/contracts/db-schema.mdspeckit/features/010-horizontal-scaling/data-model.mdspeckit/features/010-horizontal-scaling/spec.md
💤 Files with no reviewable changes (1)
- crates/server/src/dashboard/types.rs
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (73.24%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #301 +/- ##
==========================================
- Coverage 76.95% 76.84% -0.11%
==========================================
Files 160 165 +5
Lines 28565 29172 +607
==========================================
+ Hits 21981 22417 +436
- Misses 6584 6755 +171 Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
- Attribute cursor-secret enforcement to the Compose ${VAR:?} expansion rather
than a server-side prod guard (the server cursor secret is optional on main;
the hard requirement lands with #301).
- Production compose: require an explicit GUARDIAN_VERSION (drop the :latest
default) and bind the metrics port to loopback (127.0.0.1:9464).
- Track B smoke: drop the unconfirmed "storage encryption" log grep; rely on
ECDSA-signer-ready + clean startup.
- Remove the design-spec artifact from the PR (brainstorming doc, not repo
content).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/guides/horizontal-scaling/Caddyfile (1)
14-20: 🩺 Stability & Availability | 🔵 TrivialDecouple health checks from the public root route.
This matches the current contracts in
crates/server/src/builder/handle.rsLine 231 andinfra/alb.tfLine 33, but it also means liveness is gated by the same rate-limited/path the comment is working around. A dedicated unthrottled/healthzendpoint would remove that coupling and avoid future 429-driven false negatives as rate limits or probe sources change.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/guides/horizontal-scaling/Caddyfile` around lines 14 - 20, The current health probe configuration is coupled to the public root route, which is rate-limited and can cause false unhealthy signals. Update the Caddy health check in the Caddyfile to use a dedicated unthrottled health endpoint such as /healthz, and make sure the matching server routing in handle and the ALB health check configuration are aligned with that endpoint so probes no longer depend on the rate-limited / path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/guides/horizontal-scaling/README.md`:
- Line 7: The fenced block in the README is missing a language tag, causing
markdownlint to fail. Update the opening fence in the affected Markdown snippet
to include an appropriate hint such as text or sh, and keep the rest of the
block unchanged.
---
Nitpick comments:
In `@docs/guides/horizontal-scaling/Caddyfile`:
- Around line 14-20: The current health probe configuration is coupled to the
public root route, which is rate-limited and can cause false unhealthy signals.
Update the Caddy health check in the Caddyfile to use a dedicated unthrottled
health endpoint such as /healthz, and make sure the matching server routing in
handle and the ALB health check configuration are aligned with that endpoint so
probes no longer depend on the rate-limited / path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 93c2015e-1754-455e-abfe-2510cfe4b945
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockexamples/operator-smoke-web/package-lock.jsonis excluded by!**/package-lock.json
📒 Files selected for processing (24)
.gitignorecrates/server/Cargo.tomlcrates/server/migrations/2026-06-23-000001_auth_sessions/up.sqlcrates/server/src/ack/mod.rscrates/server/src/api/dashboard.rscrates/server/src/api/evm.rscrates/server/src/coordination/postgres/session_store.rscrates/server/src/dashboard/state.rscrates/server/src/jobs/canonicalization/processor.rscrates/server/src/schema.rsdocs/CONFIGURATION.mddocs/guides/README.mddocs/guides/horizontal-scaling/.env.exampledocs/guides/horizontal-scaling/Caddyfiledocs/guides/horizontal-scaling/README.mddocs/guides/horizontal-scaling/docker-compose.override.yml.exampledocs/guides/horizontal-scaling/docker-compose.ymldocs/guides/horizontal-scaling/operators.example.jsondocs/openapi-dashboard.jsondocs/openapi-evm.jsondocs/openapi.jsondocs/runbooks/horizontal-scaling.mdinfra/variables.tfpackages/miden-multisig-client/.gitignore
✅ Files skipped from review due to trivial changes (7)
- docs/guides/README.md
- docs/guides/horizontal-scaling/operators.example.json
- docs/guides/horizontal-scaling/docker-compose.override.yml.example
- .gitignore
- docs/guides/horizontal-scaling/docker-compose.yml
- docs/guides/horizontal-scaling/.env.example
- packages/miden-multisig-client/.gitignore
🚧 Files skipped from review as they are similar to previous changes (10)
- crates/server/Cargo.toml
- crates/server/migrations/2026-06-23-000001_auth_sessions/up.sql
- infra/variables.tf
- crates/server/src/schema.rs
- docs/CONFIGURATION.md
- docs/runbooks/horizontal-scaling.md
- crates/server/src/dashboard/state.rs
- crates/server/src/coordination/postgres/session_store.rs
- crates/server/src/jobs/canonicalization/processor.rs
- crates/server/src/ack/mod.rs
Horizontal scaling correctness across replicas
Makes the Guardian server correct under 2–6 replicas behind a non-sticky load balancer. Several subsystems were written for a single instance, so a request that started on one replica and continued on another could fail, and background work ran redundantly on every replica.
What changed
Summary by CodeRabbit
New Features
Bug Fixes