Merged
23 commits
bdda79e
feat: full §10.2 HDKD agent bootstrap — broker link-code endpoints + …
hanwencheng May 31, 2026
f99b7a0
docs(runbook): fold #144 gaps into operator-runbook-wire — broker-upg…
hanwencheng May 31, 2026
fc969bf
feat(#144): drive the shipped CLI in the demo + complete the rendezvo…
hanwencheng May 31, 2026
3758261
docs(runbook)+comment: drop no-op --upgrade from setup-broker-host.sh…
hanwencheng May 31, 2026
1a7bdab
fix(setup-cloud): self-heal the SSM precondition for step 15 (prod + …
hanwencheng May 31, 2026
bd62bc9
fix(setup-cloud,docs): drop no-op --upgrade from setup-broker-host.sh…
hanwencheng May 31, 2026
a5abb93
fix(setup-cloud): default HOME in the step-15 SSM remote script (set …
hanwencheng May 31, 2026
91442cc
docs(CLAUDE): generalize the never---upgrade rule to any idempotent s…
hanwencheng May 31, 2026
3eb6f51
fix(setup-cloud): step-15 poll heartbeat + 25-min cap (cold cargo bui…
hanwencheng May 31, 2026
4bd39eb
perf(setup-mcp-host): build agentkeys-mcp-server via cached in-repo c…
hanwencheng May 31, 2026
74a9e30
build+scripts: cache sandbox MCP build, move hosted MCP off cloud set…
hanwencheng May 31, 2026
218ca2e
docs(runbook): from-scratch bring-up order + explicit on-broker setup…
hanwencheng May 31, 2026
67402ea
harness(pair): depair + re-pair each fresh run so P.2 actually tests …
hanwencheng May 31, 2026
f4d4cfa
docs(runbook): document P.depair in the walkthrough + reconcile fresh…
hanwencheng May 31, 2026
5f8051f
harness(P.0): on 404, say 'broker predates #144 — redeploy' instead o…
hanwencheng May 31, 2026
02fb7c0
docs(runbook): don't git-pull by hand on the broker; --ref handles it…
hanwencheng May 31, 2026
c22f299
setup-broker-host: self-heal repo ownership when run via sudo (fixes …
hanwencheng May 31, 2026
c27678a
harness(pair): fail-loud P.depair/P.2 instead of false-passing a fres…
hanwencheng May 31, 2026
b5f55c9
broker(oidc): gate mint-oidc-jwt for agent_hdkd sessions on on-chain …
hanwencheng May 31, 2026
210a051
daemon/mcp/harness: keep the agent bearer in the sandbox via owner-on…
hanwencheng May 31, 2026
cc6dc02
harness: fix P.2 false-FAIL (jq on stderr-mixed output) + 0x-prefix A…
hanwencheng May 31, 2026
d7a4c01
security: address 2nd adversarial review — full oidc gate invariant, …
hanwencheng May 31, 2026
55c8c7d
style: cargo fmt the finding-A oidc gate edit (fixes CI fmt gate)
hanwencheng May 31, 2026
21 changes: 14 additions & 7 deletions .github/workflows/harness-ci.yml
@@ -260,7 +260,12 @@ jobs:
(needs.detect-changes.outputs.broker_changed == 'true' ||
(github.event_name == 'workflow_dispatch' && inputs.force_deploy_broker == 'true'))
runs-on: ubuntu-latest
- timeout-minutes: 15
+ # 25min (was 15): issue #144 adds agentkeys-core to the broker's build closure
+ # (aws-sdk-s3, keyring/zbus, aes-gcm, …), so a COLD cargo cache rebuild runs
+ # longer. Warm (sccache) builds are still ~3min; this headroom only matters on
+ # the first build after a dep change, and prevents the GH-job-cancels-a-still-
+ # running-build race that flaked PR #141's first attempt.
+ timeout-minutes: 25
permissions:
id-token: write
contents: read
@@ -408,7 +413,7 @@ jobs:
# expansion (no modifier bugs per CLAUDE.md heredoc-trap rule).
params=$(jq -n --arg script "$deploy_script" '{
commands: [$script],
- executionTimeout: ["900"]
+ executionTimeout: ["1500"]
}')

cmd_id=$(aws ssm send-command \
@@ -428,10 +433,12 @@ jobs:
INSTANCE_ID: ${{ secrets.TEST_BROKER_INSTANCE_ID }}
run: |
set -euo pipefail
- # Poll every 10s for up to 15 min. The command runs setup-broker-host.sh
- # which rebuilds + restarts broker/signer/4 workers; cold cargo cache
- # can be ~10min, warm ~3min.
- for i in $(seq 1 90); do
+ # Poll every 10s for up to 25 min. The command runs setup-broker-host.sh
+ # which rebuilds + restarts broker/signer/4 workers; a cold cargo cache
+ # is longer since issue #144 grew the broker's dep closure (agentkeys-core
+ # → aws-sdk-s3 / keyring / aes-gcm), warm (sccache) ~3min. Must stay ≤ the
+ # job timeout-minutes (25) and the SSM executionTimeout (1500s).
+ for i in $(seq 1 150); do
sleep 10
status=$(aws ssm get-command-invocation \
--region "$REGION" \
@@ -469,7 +476,7 @@ jobs:
;;
esac
done
- echo "::error::SSM command $SSM_COMMAND_ID did not complete within 15min"
+ echo "::error::SSM command $SSM_COMMAND_ID did not complete within 25min"
exit 1

harness-e2e:
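The three timeouts touched in this workflow (poll loop, SSM `executionTimeout`, job `timeout-minutes`) are coupled by the rule stated in the poll-loop comment. A minimal Rust sketch of that arithmetic follows; the function names are hypothetical and nothing like this exists in the repository, it only illustrates the check a reviewer can do by hand.

```rust
/// Total wall-clock budget of a fixed-interval poll loop, in seconds.
/// (Hypothetical helper, for illustration only.)
fn poll_budget_secs(iterations: u64, interval_secs: u64) -> u64 {
    iterations * interval_secs
}

/// The rule from the workflow comment: the poll budget must stay at or below
/// both the SSM executionTimeout and the job's timeout-minutes.
fn timeouts_consistent(
    iterations: u64,
    interval_secs: u64,
    ssm_timeout_secs: u64,
    job_timeout_min: u64,
) -> bool {
    let poll = poll_budget_secs(iterations, interval_secs);
    poll <= ssm_timeout_secs && poll <= job_timeout_min * 60
}
```

With the new values, `timeouts_consistent(150, 10, 1500, 25)` holds with equality on both sides (150 x 10s = 1500s = 25min). Because `timeout-minutes` presumably covers the whole job and not only the poll step, an exact match leaves no slack for the earlier steps, which may be worth double-checking.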
2 changes: 2 additions & 0 deletions CLAUDE.md
@@ -90,6 +90,8 @@ Also: never gloss over a partial implementation in a demo doc or runbook. If the
## Remote broker host (single entry point)
All remote-host changes (binary upgrades, systemd edits, nginx/certbot, env tweaks, mock-server redeploys) MUST go through `bash scripts/setup-broker-host.sh` — it's idempotent and auto-detects bootstrap vs upgrade. No ad-hoc `systemctl` edits or hand-built `scp`.

**NEVER pass `--upgrade` (or `--skip-pull`) to any idempotent setup script** (`setup-broker-host.sh`, `setup-cloud.sh`, the `heima-*` / `setup-heima.sh` helpers, etc.). They are back-compat **no-ops** — these scripts are idempotent and auto-detect bootstrap vs upgrade; there is no "upgrade mode" to opt into. Invoke them **plain** (optionally with `--test` / `--yes` / `--clean` / `--only-step N`), or pass **`--ref main`** to `setup-broker-host.sh` when you also want it to fetch + checkout + redeploy `main`. Do not add an `--upgrade` flag to any new script, runbook, doc, or CLI guidance; if you find an existing `--upgrade` reference in an active (non-archived) operator path, replace it with the idempotent invocation (`--ref main` for deploy, plain for ensure) in the same change.

### SSH access to the remote broker host
On the operator machine, **SSH into the prod broker with the zsh alias `ssh-agentkeys`** (= `bash $AGENTKEYS_REPO/scripts/ssh-broker.sh prod`, which uses EC2 Instance Connect under AWS profile `agentkeys-broker`). Use it for read-only diagnostics (worker logs, env, status) — it is the sanctioned remote-shell entry point; do not hand-roll `aws ec2-instance-connect ssh` or raw `ssh`. Pass a trailing command to run non-interactively: `ssh-agentkeys 'systemctl status agentkeys-worker-memory'`. The login user is `agentkey` (uid 1001); it is in the `sudo` group but sudo **requires a password and a TTY**, so `journalctl`/reading `/etc/agentkeys/*.env` (owned `agentkeys:agentkeys 0600`) need an interactive session — non-interactive `ssh-agentkeys '<cmd>'` can only run unprivileged commands. For privileged log reads, open an interactive `ssh-agentkeys` shell and run `sudo` there. (`ssh-broker.sh test` / `--fallback` reach the test stack / use the `.pem` when EC2-IC is down.)

1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions crates/agentkeys-broker-server/Cargo.toml
@@ -13,6 +13,10 @@ path = "src/lib.rs"

[dependencies]
agentkeys-types = { workspace = true }
# Issue #144 — shared device_crypto (EIP-191 ecrecover for link-code redeem
# pop_sig verify) + HDKD child_omni derivation. One source of truth across
# daemon/CLI/broker.
agentkeys-core = { workspace = true }
axum = { version = "0.7", features = ["json"] }
tokio = { workspace = true }
serde = { workspace = true }
21 changes: 20 additions & 1 deletion crates/agentkeys-broker-server/src/boot.rs
@@ -29,7 +29,7 @@ use crate::jwt::SessionKeypair;
use crate::oidc::OidcKeypair;
use crate::plugins::audit::{AuditAnchor, AuditPolicy};
use crate::plugins::PluginRegistry;
- use crate::storage::{AuthNonceStore, GrantStore, IdentityLinkStore, WalletStore};
+ use crate::storage::{AuthNonceStore, GrantStore, IdentityLinkStore, LinkCodeStore, WalletStore};

/// Outcome of the synchronous Tier-1 boot phase.
pub struct BootArtifacts {
@@ -41,6 +41,8 @@ pub struct BootArtifacts {
pub nonce_store: Arc<AuthNonceStore>,
pub grant_store: Arc<GrantStore>,
pub identity_link_store: Arc<IdentityLinkStore>,
/// §10.2 agent-bootstrap link-code + pending-binding store (issue #144).
pub link_code_store: Arc<LinkCodeStore>,
/// Concrete EmailLink plugin handle (Phase A.1, US-018). Populated
/// when `email_link` is in `BROKER_AUTH_METHODS` AND the
/// `auth-email-link` feature is compiled in. The registry's auth
@@ -183,6 +185,14 @@ pub fn run_tier1(config: &BrokerConfig) -> anyhow::Result<BootArtifacts> {
)
})?,
);
let link_code_store = Arc::new(LinkCodeStore::open(&link_codes_path(config)).map_err(|e| {
boot_fail(
env::BROKER_AUDIT_DB_PATH,
&config.audit_db_path.display().to_string(),
format!("LinkCodeStore: {}", e),
"link-codes-db",
)
})?);

// 5. Validate + parse plugin selection env vars. Every name in each
// list must resolve at compile time (i.e. the corresponding
@@ -225,6 +235,7 @@ pub fn run_tier1(config: &BrokerConfig) -> anyhow::Result<BootArtifacts> {
nonce_store,
grant_store,
identity_link_store,
link_code_store,
#[cfg(feature = "auth-email-link")]
email_link: built.email_link,
#[cfg(feature = "auth-oauth2")]
@@ -300,6 +311,14 @@ fn identity_links_path(config: &BrokerConfig) -> std::path::PathBuf {
.unwrap_or_else(|| std::path::PathBuf::from("identity_links.sqlite"))
}

fn link_codes_path(config: &BrokerConfig) -> std::path::PathBuf {
config
.audit_db_path
.parent()
.map(|p| p.join("link_codes.sqlite"))
.unwrap_or_else(|| std::path::PathBuf::from("link_codes.sqlite"))
}

#[cfg(feature = "audit-sqlite")]
fn open_sqlite_anchor(config: &BrokerConfig) -> Result<Arc<dyn AuditAnchor>, anyhow::Error> {
use crate::plugins::audit::sqlite::SqliteAnchor;
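`link_codes_path` places `link_codes.sqlite` next to the audit database, falling back to a bare relative filename when the audit path has no parent. The sketch below reproduces that derivation with only `std::path`; `sibling_db_path` is a hypothetical generalization (the PR keeps one function per store), and the directory used in the usage note is made up.

```rust
use std::path::{Path, PathBuf};

/// Resolve `file_name` as a sibling of `audit_db_path`.
/// Falls back to a bare relative path when there is no parent
/// (for example when the audit path is "/" or empty).
fn sibling_db_path(audit_db_path: &Path, file_name: &str) -> PathBuf {
    audit_db_path
        .parent()
        .map(|p| p.join(file_name))
        .unwrap_or_else(|| PathBuf::from(file_name))
}
```

For an audit path such as `/var/lib/agentkeys/audit.sqlite` (an example path, not taken from the PR) this yields `/var/lib/agentkeys/link_codes.sqlite`. A bare `audit.sqlite` has an empty parent, so the result is the relative `link_codes.sqlite`, resolved against the process working directory.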
79 changes: 79 additions & 0 deletions crates/agentkeys-broker-server/src/handlers/agent/create.rs
@@ -0,0 +1,79 @@
//! `POST /v1/agent/create` — master mints a one-time link code (issue #144 §10.2).
//!
//! Gated by the master's `J1` session bearer. Derives the HDKD child omni
//! `O_agent = SHA256(HDKD_DOMAIN || O_master || "//label")`, mints a single-use
//! link code bound to it (TTL 600s), and records the scope the master wants the
//! agent to have (like an app manifest). The master hands the code to the agent
//! out-of-band; the agent redeems it at `/v1/auth/link-code/redeem`.

use axum::{extract::State, http::HeaderMap, http::StatusCode, response::IntoResponse, Json};
use serde::Deserialize;
use serde_json::json;

use crate::error::BrokerError;
use crate::handlers::agent::unix_now;
use crate::handlers::grant::{random_b64url, require_session_jwt};
use crate::state::SharedState;
use crate::storage::LINK_CODE_TTL_SECONDS;

#[derive(Debug, Deserialize)]
pub struct AgentCreateBody {
/// HDKD child label, e.g. `"agent-a"` (`^[a-z0-9-]{1,32}$`).
pub label: String,
/// Scope the master intends to grant the agent (the "app manifest").
/// Defaults to `"memory"`. Comma-separated service list mirrors
/// `heima-scope-set.sh --services`.
#[serde(default)]
pub requested_scope: Option<String>,
}

pub async fn agent_create(
State(state): State<SharedState>,
headers: HeaderMap,
Json(body): Json<AgentCreateBody>,
) -> Result<impl IntoResponse, BrokerError> {
let session = require_session_jwt(&headers, &state)?;
let master_omni = session.agentkeys.omni_account;

agentkeys_core::actor_omni::validate_label(&body.label)
.map_err(|e| BrokerError::BadRequest(format!("invalid label: {e}")))?;
let child_omni = agentkeys_core::actor_omni::child_omni_hex(&master_omni, &body.label)
.map_err(|e| BrokerError::BadRequest(format!("derive child omni: {e}")))?;

let requested_scope = body
.requested_scope
.filter(|s| !s.trim().is_empty())
.unwrap_or_else(|| "memory".to_string());

let link_code = random_b64url(32);
let now = unix_now()?;
let expires_at = now + LINK_CODE_TTL_SECONDS;
state.link_code_store.issue(
&link_code,
&child_omni,
&master_omni,
&body.label,
&requested_scope,
now,
expires_at,
)?;

tracing::info!(
operator_omni = %master_omni,
child_omni = %child_omni,
label = %body.label,
"issued §10.2 agent link code"
);

Ok((
StatusCode::OK,
Json(json!({
"link_code": link_code,
"child_omni": child_omni,
"operator_omni": master_omni,
"label": body.label,
"requested_scope": requested_scope,
"expires_at": expires_at,
})),
))
}
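Two small input rules in `agent_create` can be illustrated without the broker: the label pattern quoted in the `AgentCreateBody` doc comment (`^[a-z0-9-]{1,32}$`) and the scope defaulting. The sketch below is an assumption about what `agentkeys_core::actor_omni::validate_label` enforces, inferred from that doc comment only; `is_valid_label` and `effective_scope` are hypothetical names.

```rust
/// Sketch of the documented label rule `^[a-z0-9-]{1,32}$`.
/// Assumption: the real `validate_label` may enforce more than this.
fn is_valid_label(label: &str) -> bool {
    (1..=32).contains(&label.len())
        && label
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-')
}

/// Mirrors the handler's defaulting: a missing or whitespace-only scope
/// becomes "memory"; any other value is passed through untrimmed.
fn effective_scope(requested: Option<String>) -> String {
    requested
        .filter(|s| !s.trim().is_empty())
        .unwrap_or_else(|| "memory".to_string())
}
```

Note that the defaulting only tests for emptiness after trimming; a value like `" memory "` is stored with its surrounding spaces, so callers that compare scopes may want to normalize first.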
41 changes: 41 additions & 0 deletions crates/agentkeys-broker-server/src/handlers/agent/mod.rs
@@ -0,0 +1,41 @@
//! §10.2 agent-bootstrap endpoints (issue #144).
//!
//! Three endpoints implement the link-code ceremony with the master submitting
//! the on-chain binding (decision 1 — no contract change, no broker chain key):
//!
//! - `POST /v1/agent/create` (master, `J1_master`-gated) — mint a one-time link
//! code bound to the HDKD child omni `O_agent = SHA256(.. || O_master || "//label")`.
//! - `POST /v1/auth/link-code/redeem` (agent, no bearer) — verify the agent's
//! `pop_sig`, consume the code, mint `J1_agent`, and stash the device artifact
//! as a pending binding.
//! - `GET /v1/agent/pending-bindings` (master, `J1_master`-gated) — pull the
//! redeemed-but-unbound rows to approve (the push-notification substrate).
//!
//! The broker never K11-verifies on the agent path — agents are K10-only per the
//! contract (`registerAgentDevice` writes `k11CredId = 0`). The master's K11
//! gesture happens later, when it submits the on-chain binding + scope grant.

pub mod create;
pub mod pending;
pub mod redeem;

use std::time::{SystemTime, UNIX_EPOCH};

use crate::error::{BrokerError, BrokerResult};

/// Unix seconds, mapped to `BrokerError::Internal` on the (impossible) clock-skew error.
pub(crate) fn unix_now() -> BrokerResult<i64> {
Ok(SystemTime::now()
.duration_since(UNIX_EPOCH)
.map_err(|e| BrokerError::Internal(format!("clock before unix epoch: {e}")))?
.as_secs() as i64)
}

/// Session-JWT TTL (seconds) for `J1_agent` — same env + default as the wallet
/// session path (`wallet_verify`), so agent and master sessions age uniformly.
pub(crate) fn session_jwt_ttl_seconds() -> u64 {
std::env::var(crate::env::BROKER_SESSION_JWT_TTL_SECONDS)
.ok()
.and_then(|s| s.parse::<u64>().ok())
.unwrap_or(18_000)
}
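`session_jwt_ttl_seconds` falls back to 18,000 seconds (5 hours) whenever the environment variable is unset or does not parse as a `u64`. The sketch below isolates that parsing step as a pure function so the fallback cases are easy to see; `ttl_from_raw` and the constant name are hypothetical, and the environment lookup is deliberately left out.

```rust
/// Default mirrored from `session_jwt_ttl_seconds` in the diff.
const DEFAULT_SESSION_JWT_TTL_SECONDS: u64 = 18_000;

/// Parse a raw env value into a TTL, falling back to the default when the
/// value is absent or is not a plain unsigned integer.
fn ttl_from_raw(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(DEFAULT_SESSION_JWT_TTL_SECONDS)
}
```

Two edge cases follow from `str::parse::<u64>`: surrounding whitespace is not trimmed, so `" 3600"` silently falls back to the default, and `"0"` parses successfully, so a zero TTL is accepted as written.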
79 changes: 79 additions & 0 deletions crates/agentkeys-broker-server/src/handlers/agent/pending.rs
@@ -0,0 +1,79 @@
//! `GET /v1/agent/pending-bindings` — master pulls redeemed-but-unbound agents
//! (issue #144 §10.2).
//!
//! Gated by the master's `J1` session bearer. Returns the operator's rows that
//! have been redeemed (`device_pubkey` + `pop_sig` captured) but not yet bound
//! on-chain — i.e. "agent-A wants to pair + wants `[requested_scope]`". This is
//! the substrate the production push notification carries; the master pulls it,
//! then approves with one K11 gesture (bind + scope). `device_key_hash` is
//! pre-computed so the master can submit `registerAgentDevice` without recomputing.

use axum::{extract::State, http::HeaderMap, http::StatusCode, response::IntoResponse, Json};
use serde::Deserialize;
use serde_json::json;

use crate::error::BrokerError;
use crate::handlers::agent::unix_now;
use crate::handlers::grant::require_session_jwt;
use crate::state::SharedState;

pub async fn pending_bindings(
State(state): State<SharedState>,
headers: HeaderMap,
) -> Result<impl IntoResponse, BrokerError> {
let session = require_session_jwt(&headers, &state)?;
let master_omni = session.agentkeys.omni_account;

let rows = state.link_code_store.pending_bindings(&master_omni)?;
let pending: Vec<_> = rows
.into_iter()
.map(|b| {
// Best-effort device_key_hash so the master needn't recompute. A
// malformed stored address (shouldn't happen — it round-tripped
// through redeem) degrades to an empty string rather than failing
// the whole list.
let device_key_hash = agentkeys_core::device_crypto::device_key_hash(&b.device_pubkey)
.unwrap_or_default();
json!({
"link_code": b.link_code,
"child_omni": b.child_omni,
"operator_omni": b.operator_omni,
"label": b.label,
"requested_scope": b.requested_scope,
"device_pubkey": b.device_pubkey,
"pop_sig": b.pop_sig,
"device_key_hash": device_key_hash,
})
})
.collect();

Ok((StatusCode::OK, Json(json!({ "pending": pending }))))
}

#[derive(Debug, Deserialize)]
pub struct AckBody {
/// The link code whose redeemed binding the master just submitted on chain.
pub link_code: String,
}

/// `POST /v1/agent/pending-bindings/ack` — the master acks that it submitted
/// `registerAgentDevice` for this binding, so it drops out of the pending list
/// (issue #144). Without this the rendezvous would never clear — every redeemed
/// agent would show as "pending" forever even after it's bound on chain. Scoped
/// to the master's omni; idempotent (a second ack is a no-op → `acked: false`).
pub async fn ack_binding(
State(state): State<SharedState>,
headers: HeaderMap,
Json(body): Json<AckBody>,
) -> Result<impl IntoResponse, BrokerError> {
let session = require_session_jwt(&headers, &state)?;
let master_omni = session.agentkeys.omni_account;
let now = unix_now()?;
let updated = state
.link_code_store
.mark_bound(&body.link_code, &master_omni, now)?;
Ok((
StatusCode::OK,
Json(json!({ "acked": updated > 0, "link_code": body.link_code })),
))
}
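The ack endpoint's contract (scoped to the master's omni, idempotent, a second ack reports `acked: false`) can be modelled with an in-memory map. This is a sketch under stated assumptions: the real `LinkCodeStore` is SQLite-backed and its schema is not shown in this diff, so `InMemoryLinkCodes`, `BindingState`, and the rule that only redeemed rows can be marked bound are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical lifecycle of a link code: issued, redeemed by the agent,
/// then acked as bound on chain by the master.
#[derive(Debug, Clone, PartialEq)]
enum BindingState {
    Issued,
    Redeemed,
    Bound { bound_at: i64 },
}

struct Row {
    operator_omni: String,
    state: BindingState,
}

#[derive(Default)]
struct InMemoryLinkCodes {
    rows: HashMap<String, Row>,
}

impl InMemoryLinkCodes {
    fn insert(&mut self, link_code: &str, operator_omni: &str, state: BindingState) {
        let row = Row { operator_omni: operator_omni.to_string(), state };
        self.rows.insert(link_code.to_string(), row);
    }

    /// Link codes that were redeemed but not yet acked, for one master only.
    fn pending(&self, operator_omni: &str) -> Vec<String> {
        let mut codes: Vec<String> = self
            .rows
            .iter()
            .filter(|(_, r)| r.operator_omni == operator_omni && r.state == BindingState::Redeemed)
            .map(|(code, _)| code.clone())
            .collect();
        codes.sort();
        codes
    }

    /// Returns the number of rows updated (0 or 1), like an SQL UPDATE count.
    /// A repeat ack, or an ack from a different master, updates nothing.
    fn mark_bound(&mut self, link_code: &str, operator_omni: &str, now: i64) -> usize {
        match self.rows.get_mut(link_code) {
            Some(row)
                if row.operator_omni == operator_omni
                    && row.state == BindingState::Redeemed =>
            {
                row.state = BindingState::Bound { bound_at: now };
                1
            }
            _ => 0,
        }
    }
}
```

In this model the handler's `"acked": updated > 0` is true only for the first ack by the owning master; a replay or an ack from another master returns 0 and leaves the row unchanged.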