Skip to content

fix(setup-cloud): self-heal the SSM precondition for step 15 (prod + test brokers)#151

Closed
hanwencheng wants to merge 4 commits into
mainfrom
claude/harden-setup-cloud-ssm
Closed

fix(setup-cloud): self-heal the SSM precondition for step 15 (prod + test brokers)#151
hanwencheng wants to merge 4 commits into
mainfrom
claude/harden-setup-cloud-ssm

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

Harden setup-cloud.sh step 15 — self-heal the SSM precondition (prod + test brokers)

Problem (hit live during a prod setup-cloud.sh run)

Step 15 brings up agentkeys-mcp-server on the broker via aws ssm send-command, but it assumed the broker EC2 was already a registered SSM managed instance and never ensured it. Two distinct failures surfaced, both with the same misleading "does it have amazon-ssm-agent + the SSM instance profile?" message:

  1. Caller IAM gap — the operator's user lacked ssm:SendCommandAccessDenied.
  2. Instance-role IAM gap — the broker-host role (agentkeys-broker-host) was created without AmazonSSMManagedInstanceCore, so the on-host amazon-ssm-agent can't register → empty SSM inventory → SendCommand: InvalidInstanceId. (The instance is running and has the profile — the role just lacked the policy.)

Fix

  • ensure_ssm_managed() runs before SendCommand. It resolves the role from the instance's actual attached profile (naming-agnostic — the same code fixes the prod agentkeys-broker-host and the test broker's own profile, which is named differently), idempotently attaches AmazonSSMManagedInstanceCore if missing, then polls describe-instance-information until PingStatus=Online. If the agent never registers (role now correct ⇒ the agent itself isn't running), it dies with the exact restart remediation (ssh-broker.sh + setup-broker-host.sh --upgrade, or reboot).
  • SendCommand error handling now captures stderr and distinguishes a caller ssm:SendCommand AccessDenied (identity-based policy gap) from a real instance problem, pointing at provision-ci-deploy-role.sh for the exact policy shape — instead of the misleading instance-agent message.
  • aws iam calls are global (no --region); ec2/ssm reads pass --region "$REGION" per the agentkeys-admin-defaults-to-us-west-2 trap (CLAUDE.md).

Works for both envs

The role is derived from $INSTANCE_ID's profile, so setup-cloud.sh (prod) and setup-cloud.sh --test each self-heal their own broker — no hardcoded role name (the test role agentkeys-broker-host-test doesn't even exist; CI's test broker uses a separately-provisioned profile, and this resolves whatever is actually attached).

Idempotent

A re-run with the policy already attached logs skip and the poll converges once the agent is online — satisfies the idempotent-remote-setup rule.

Verification

  • bash -n scripts/setup-cloud.sh — clean.
  • The AWS resolution chain (describe-instances → instance-profile → role → list-attached-role-policies) and the idempotency JMESPath (length(AttachedPolicies[?PolicyArn=='…AmazonSSMManagedInstanceCore'])) were validated live against the prod broker role.
  • scripts/setup-cloud.sh is not in the broker CI path filter (it's a laptop-run operator script), so this doesn't trigger a test-broker redeploy.

🤖 Generated with Claude Code

…test brokers)

setup-cloud.sh step 15 (SSM SendCommand to bring up the MCP server) assumed the
broker EC2 was already a registered SSM managed instance but never ensured it.
Operators hit `SendCommand -> InvalidInstanceId` because the broker-host role was
created WITHOUT AmazonSSMManagedInstanceCore, so the on-host amazon-ssm-agent
can't register. (And separately, a caller lacking ssm:SendCommand got a
misleading "does the instance have the agent?" message.)

- ensure_ssm_managed(): runs before SendCommand. Resolves the role from the
  INSTANCE's attached profile (naming-agnostic, so the SAME code fixes BOTH the
  prod `agentkeys-broker-host` and the test broker's own profile), idempotently
  attaches AmazonSSMManagedInstanceCore if missing, then polls
  describe-instance-information until PingStatus=Online. If the agent never
  registers (role now correct => the agent itself isn't running), it dies with
  the exact restart remediation (ssh-broker.sh + setup-broker-host.sh --upgrade,
  or reboot). Idempotent: a re-run with the policy already attached skips.
- SendCommand now captures stderr and distinguishes a CALLER ssm:SendCommand
  AccessDenied (identity-based policy gap) from a real instance problem, with a
  precise remediation (put-user-policy; see provision-ci-deploy-role.sh for the
  policy shape) instead of the misleading instance-agent message.
- aws iam calls are global (no --region); ec2/ssm reads pass --region "$REGION"
  per the agentkeys-admin-defaults-to-us-west-2 trap (CLAUDE.md).

Env-agnostic + idempotent: works for both broker envs and converges on re-run.
… refs + add CLAUDE.md rule

setup-broker-host.sh treats --upgrade (and --skip-pull) as back-compat NO-OPS
(it is idempotent + auto-detects bootstrap vs upgrade), so emitting it is
misleading. Replace active-path references with --ref main (the canonical
idempotent deploy invocation per CLAUDE.md) and codify the rule:
- setup-cloud.sh: ensure_ssm_managed remediation suggests --ref main.
- docs/ci-setup.md: prod-broker manual deploy uses --ref main.
- CLAUDE.md (Remote broker host): explicit never-pass---upgrade rule.
…-u unbound-var)

SSM RunShellScript executes the step-15 mcp-bring-up script as root with a
MINIMAL env (no HOME). Under set -euo pipefail the first $HOME use
(export PATH=$HOME/.cargo/bin) aborted with 'HOME: unbound variable' — a latent
bug that only surfaced once SSM delivery started working (previously it failed at
send-command). Default HOME to /root before any $HOME use. Also document
--only-step 15 in the script's re-run examples.
…etup script

Broaden the rule from setup-broker-host.sh to all idempotent setup scripts
(setup-cloud.sh, setup-heima.sh, heima-* helpers) and restore the actionable
guidance (invoke plain / --only-step N, or --ref main for a broker redeploy;
replace existing active-path --upgrade references).
@hanwencheng
Copy link
Copy Markdown
Member Author

Folded into #149 — all four commits (step-15 SSM self-heal ensure_ssm_managed, the no-op --upgrade--ref main cleanup + generalized CLAUDE.md rule, and the HOME unbound-var fix) were cherry-picked onto claude/impl-144-hdkd-bootstrap. Continuing on #149.

@hanwencheng hanwencheng deleted the claude/harden-setup-cloud-ssm branch May 31, 2026 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant