fix(setup-cloud): self-heal the SSM precondition for step 15 (prod + test brokers)#151
Closed
hanwencheng wants to merge 4 commits into
Closed
fix(setup-cloud): self-heal the SSM precondition for step 15 (prod + test brokers)#151hanwencheng wants to merge 4 commits into
hanwencheng wants to merge 4 commits into
Conversation
…test brokers) setup-cloud.sh step 15 (SSM SendCommand to bring up the MCP server) assumed the broker EC2 was already a registered SSM managed instance but never ensured it. Operators hit `SendCommand -> InvalidInstanceId` because the broker-host role was created WITHOUT AmazonSSMManagedInstanceCore, so the on-host amazon-ssm-agent can't register. (And separately, a caller lacking ssm:SendCommand got a misleading "does the instance have the agent?" message.) - ensure_ssm_managed(): runs before SendCommand. Resolves the role from the INSTANCE's attached profile (naming-agnostic, so the SAME code fixes BOTH the prod `agentkeys-broker-host` and the test broker's own profile), idempotently attaches AmazonSSMManagedInstanceCore if missing, then polls describe-instance-information until PingStatus=Online. If the agent never registers (role now correct => the agent itself isn't running), it dies with the exact restart remediation (ssh-broker.sh + setup-broker-host.sh --upgrade, or reboot). Idempotent: a re-run with the policy already attached skips. - SendCommand now captures stderr and distinguishes a CALLER ssm:SendCommand AccessDenied (identity-based policy gap) from a real instance problem, with a precise remediation (put-user-policy; see provision-ci-deploy-role.sh for the policy shape) instead of the misleading instance-agent message. - aws iam calls are global (no --region); ec2/ssm reads pass --region "$REGION" per the agentkeys-admin-defaults-to-us-west-2 trap (CLAUDE.md). Env-agnostic + idempotent: works for both broker envs and converges on re-run.
… refs + add CLAUDE.md rule setup-broker-host.sh treats --upgrade (and --skip-pull) as back-compat NO-OPS (it is idempotent + auto-detects bootstrap vs upgrade), so emitting it is misleading. Replace active-path references with --ref main (the canonical idempotent deploy invocation per CLAUDE.md) and codify the rule: - setup-cloud.sh: ensure_ssm_managed remediation suggests --ref main. - docs/ci-setup.md: prod-broker manual deploy uses --ref main. - CLAUDE.md (Remote broker host): explicit never-pass---upgrade rule.
…-u unbound-var) SSM RunShellScript executes the step-15 mcp-bring-up script as root with a MINIMAL env (no HOME). Under set -euo pipefail the first $HOME use (export PATH=$HOME/.cargo/bin) aborted with 'HOME: unbound variable' — a latent bug that only surfaced once SSM delivery started working (previously it failed at send-command). Default HOME to /root before any $HOME use. Also document --only-step 15 in the script's re-run examples.
…etup script Broaden the rule from setup-broker-host.sh to all idempotent setup scripts (setup-cloud.sh, setup-heima.sh, heima-* helpers) and restore the actionable guidance (invoke plain / --only-step N, or --ref main for a broker redeploy; replace existing active-path --upgrade references).
Member
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Harden
setup-cloud.shstep 15 — self-heal the SSM precondition (prod + test brokers)Problem (hit live during a prod
setup-cloud.shrun)Step 15 brings up
agentkeys-mcp-serveron the broker viaaws ssm send-command, but it assumed the broker EC2 was already a registered SSM managed instance and never ensured it. Two distinct failures surfaced, both with the same misleading "does it have amazon-ssm-agent + the SSM instance profile?" message:ssm:SendCommand→AccessDenied.agentkeys-broker-host) was created withoutAmazonSSMManagedInstanceCore, so the on-hostamazon-ssm-agentcan't register → empty SSM inventory →SendCommand: InvalidInstanceId. (The instance is running and has the profile — the role just lacked the policy.)Fix
ensure_ssm_managed()runs beforeSendCommand. It resolves the role from the instance's actual attached profile (naming-agnostic — the same code fixes the prodagentkeys-broker-hostand the test broker's own profile, which is named differently), idempotently attachesAmazonSSMManagedInstanceCoreif missing, then pollsdescribe-instance-informationuntilPingStatus=Online. If the agent never registers (role now correct ⇒ the agent itself isn't running), itdies with the exact restart remediation (ssh-broker.sh+setup-broker-host.sh --upgrade, or reboot).SendCommanderror handling now captures stderr and distinguishes a callerssm:SendCommandAccessDenied(identity-based policy gap) from a real instance problem, pointing atprovision-ci-deploy-role.shfor the exact policy shape — instead of the misleading instance-agent message.aws iamcalls are global (no--region);ec2/ssmreads pass--region "$REGION"per theagentkeys-admin-defaults-to-us-west-2trap (CLAUDE.md).Works for both envs
The role is derived from
$INSTANCE_ID's profile, sosetup-cloud.sh(prod) andsetup-cloud.sh --testeach self-heal their own broker — no hardcoded role name (the test roleagentkeys-broker-host-testdoesn't even exist; CI's test broker uses a separately-provisioned profile, and this resolves whatever is actually attached).Idempotent
A re-run with the policy already attached logs
skipand the poll converges once the agent is online — satisfies the idempotent-remote-setup rule.Verification
bash -n scripts/setup-cloud.sh— clean.describe-instances→ instance-profile → role →list-attached-role-policies) and the idempotency JMESPath (length(AttachedPolicies[?PolicyArn=='…AmazonSSMManagedInstanceCore'])) were validated live against the prod broker role.scripts/setup-cloud.shis not in the broker CI path filter (it's a laptop-run operator script), so this doesn't trigger a test-broker redeploy.🤖 Generated with Claude Code