Raise alarm on integration test failures and attempt auto-fix#3287
Raise alarm on integration test failures and attempt auto-fix#3287prasden wants to merge 4 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3287 +/- ##
==========================================
- Coverage 78.17% 78.15% -0.02%
==========================================
Files 689 689
Lines 123732 123733 +1
Branches 17199 17200 +1
==========================================
- Hits 96723 96705 -18
- Misses 26089 26107 +18
- Partials 920 921 +1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| report-failures: | ||
| name: report-failures | ||
| needs: [integrations, python, ruby] | ||
| if: ${{ always() && github.event_name == 'schedule' && (needs.integrations.result == 'failure' || needs.python.result == 'failure' || needs.ruby.result == 'failure') }} |
There was a problem hiding this comment.
This reports failure, but it's best-practice to also emit metrics on success so you can distinguish between "this is not running" from "this is always succeeding".
There was a problem hiding this comment.
Ack, I modified it so it runs on every scheduled run and emits 0 if no failures are reported, and no artifact is produced.
| ANTHROPIC_DEFAULT_OPUS_MODEL: us.anthropic.claude-opus-4-8 | ||
| ANTHROPIC_DEFAULT_SONNET_MODEL: us.anthropic.claude-sonnet-4-6 | ||
| ANTHROPIC_DEFAULT_HAIKU_MODEL: us.anthropic.claude-haiku-4-5-20251001-v1:0 |
There was a problem hiding this comment.
How often will these need bumping? Bedrock doesn't seem to offer a -latest alias for these, so is there a plan/reminder to keep them current (or pin them somewhere more discoverable)?
There was a problem hiding this comment.
How often we need to update depends on Anthropic. For now, I think we can pin the model/names to this file because nothing else uses Claude Code for any automation. When a new feature in AWS-LC wants to use Claude, I think we can then move these model pins into a more centralized file we can iteratively change. But maybe after Mythos drops we won't need to ever change it again..
| super().__init__(scope, id, env=env, **kwargs) | ||
| self.ignore_failure = ignore_failure | ||
| self.timeout = timeout | ||
| self.env = env |
There was a problem hiding this comment.
Yes it is, I forgot to include this in the call outs, my mistake. CDK was failing to deploy on my local account with the new IAM roles and I traced to this variable. We remove this as CDK already exposes self.env as a built-in read-only property: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk/Stack.html#aws_cdk.Stack.env.
| "enableWeakerNestedSandbox": true, | ||
| "network": { | ||
| "allowedDomains": [ | ||
| "downloads.nwtime.org" |
There was a problem hiding this comment.
This domain allowlist and the Bash(wget ...) allow on line 20 have to be kept in sync by hand, and a new wget-based integration on a different host would need both updated. Is there a way to enforce/centralize that?
There was a problem hiding this comment.
Good call, removed the Bash(wget) from the allowlist and centralized which external domains are allowed in the sandbox in allowedDomains.
| open_pr() { | ||
| local target="$1" | ||
| local branch_name="$2" | ||
| local push_url="https://x-access-token:${GH_TOKEN}@github.com/${REPO}.git" |
There was a problem hiding this comment.
We need to be cautious about exposing the GH_TOKEN. Can we use git -c http.extraheader=... or some credential helper instead?
There was a problem hiding this comment.
Dropped it from the URL and switched to gh auth setup-git, we now read GH_TOKEN from the env
| failure_count=$(gh run view "$GITHUB_RUN_ID" --json jobs \ | ||
| --jq '[.jobs[] | select(.conclusion == "failure" and .name != "report-failures")] | length') |
There was a problem hiding this comment.
Might we be able to filter out failures due to operational flakiness?
There was a problem hiding this comment.
Yes, added an id on each integration test to catch failure only on steps which contain the id for integration tests
| for job_id in $(gh api "/repos/${REPO}/actions/runs/${RUN_ID}/jobs" \ | ||
| --paginate \ | ||
| --jq ".jobs[] | ||
| | select(.conclusion == \"failure\" and (.name | startswith(\"${prefix}\"))) | ||
| | .id") | ||
| do | ||
| gh api "/repos/${REPO}/actions/jobs/${job_id}/logs" \ | ||
| | tail -n 200 | sanitize_log > "${logs_dir}/${job_id}.log" || true | ||
| done | ||
| } | ||
|
|
There was a problem hiding this comment.
This matches failed jobs via startswith("${prefix}") where prefix is the runner-script-derived integration name. Will this assumption always hold? Can we enforce it?
There was a problem hiding this comment.
It's not currently enforced but I think we should for new tests, maybe somewhere like CONTRIBUTING.md. The integration suite has many jobs and each job can have different build flags like ACCP where it has 4 rows across architectures/FIPS and some patches like python_patch have different versions like python_patch/3.13. We should lock down the naming across job, script, and the patch directory.
| "$AUTOFIX_SCRIPT" reason "$integration" "$version" "$TARGET_RUN_ID" | ||
| env: | ||
| TARGET: ${{ matrix.target }} | ||
| GIT_ALLOW_PROTOCOL: file:git:http:https:ssh |
There was a problem hiding this comment.
Do we need ssh here if we deny ssh in claude-settings.json?
There was a problem hiding this comment.
Good call, no we don't
Issues:
Resolves #V2139430768
Description of changes:
This PR adds two things: a CloudWatch alarm that automatically gets raised when an integration test is failing by converting
integration_omnibusinto a nightly run and triggering an alarm in AWS-LC's CloudWatch namespace when there is a failure, and an automated workflow that uses Claude to investigate and raise a draft PR for the fix automatically.A new workflow that triggers Claude to create a PR for failing integration,
autofix-integration-omnibuswas added with three jobs: recognize, reason and resolve.recognizedownloads the artifacts and emits a deduped JSON matrix of(integration, version)targets, dropping any we have no patch for.reasonfans out one job per target, clones the downstream repo and fetches the failure logs, then runs Claude to repair the patch and commit it.resolvescans the patch for leaked secrets, creates a branch, and opens a draft PR. The jobs are split so privilege is isolated from untrusted input. Two IAM roles are added,AwsLcGitHubActionIntegrationFailureReasoningRoleforreason(Bedrock only, no GitHub token) andAwsLcGitHubActionIntegrationFailureResolveRoleforresolve(the bot PAT, only to push and open the PR).Two small composite actions were added to support this.
emit-autofix-targetruns on each failing job and uploads a smallautofix-target-<job>artifact saying which(integration, version)broke.fetch-github-tokengrabs the aws-lc-ci bot PAT from Secrets Manager for the steps that push branches and open PRs. The two workflows talk only through these artifacts, so whenintegration-omnibusfinishes with failures, aworkflow_runtrigger kicks offautofix-integration-omnibus, which reads the artifacts and starts fixing.Two IAM roles are added, split so each job holds only what it needs: AwsLcGitHubActionIntegrationFailureReasoningRole (Bedrock only, no GitHub token) and AwsLcGitHubActionIntegrationFailureResolveRole (the bot PAT, only to push and open the PR).
Call-outs:
The main call out is security, here is how we defend against prompt injection and other vulnerabilities:
reasonjob which runs Claude over untrusted input holds Bedrock access only and no access to Secrets Manager or Github. Theresolvejob, whose job is to create the draft PR of the change only holds the bot PAT but does not have access to Bedrock, meaning it cannot invoke Claude while holding access to the Github bot PAT. Both jobs are independent of each other.CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=1strips AWS/secret env vars from the shells Claude spawnsdenyReadblocks~/.awsand~/.sshanddefaultMode: dontAskmakes the allow list the boundary (deny-by-default)resolvejob runs a secret scan (AWS keys, GitHub tokens/PATs, private keys, bearer tokens) over the patch before pushing and aborts the PR if any are foundreasoning= Bedrock-invoke only,resolve=secretsmanager:GetSecretValueon the one token secret onlyTesting:
To test end-to-end on a fork, I had to set up the same infrastructure that the repo currently follows. In a personal Isengard account I deployed the CDK changes (both IAM roles) following the CDK README. I then created a GitHub PAT for the bot and stored it in Secrets Manager under the expected secret name, and set up the CodeBuild project that backs the self-hosted GitHub Actions runners.
I then ran the full pipeline against broken integration tests by asking Claude to emulate failures and edit
integration_omnibus.ymlto fail deliberately. I dispatchedintegration-omnibusto produce the failed run and itsautofix-targetartifacts, then dispatchedautofix-integration-omnibusagainst that run id. PRs for failing patches were succesfully created, with an example being here.I also verified correctness by re-applying every resulting patch to the real downstream source at the ref the runner uses and confirmed the sandbox blocks egress to non-allowlisted hosts and that re-running autofix skips targets whose branch already exists.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.