Raise alarm on integration test failures and attempt auto-fix #3287

Open
prasden wants to merge 4 commits into aws:main from prasden:autofix-integrations

@prasden prasden commented Jun 4, 2026

Issues:

Resolves #V2139430768

Description of changes:

This PR adds two things. First, it converts integration_omnibus into a nightly run that raises an alarm in AWS-LC's CloudWatch namespace whenever an integration test fails. Second, it adds an automated workflow that uses Claude to investigate the failure and open a draft PR with a fix.

The new workflow, autofix-integration-omnibus, triggers Claude to create a PR for each failing integration. It has three jobs: recognize, reason, and resolve. recognize downloads the artifacts and emits a deduplicated JSON matrix of (integration, version) targets, dropping any we have no patch for. reason fans out one job per target, clones the downstream repo, fetches the failure logs, then runs Claude to repair the patch and commit it. resolve scans the patch for leaked secrets, creates a branch, and opens a draft PR. The jobs are split so that privilege is isolated from untrusted input.
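
As a rough illustration of the recognize step, here is a minimal Python sketch of the dedupe-and-filter logic. It is not the code in this PR: the artifact field names (`integration`, `version`), the `include` matrix shape, and the `example_lib` target are assumptions made for the example.

```python
import json
from typing import Iterable


def build_matrix(targets: Iterable[dict], patched: set) -> str:
    """Return a JSON matrix of unique (integration, version) targets.

    `targets` stands in for the parsed autofix-target artifacts and
    `patched` for the set of (integration, version) pairs that have a
    patch directory. Both shapes are hypothetical.
    """
    seen = set()
    include = []
    for target in targets:
        key = (target["integration"], target.get("version", ""))
        # Skip duplicates and anything we have no patch for.
        if key in seen or key not in patched:
            continue
        seen.add(key)
        include.append({"integration": key[0], "version": key[1]})
    return json.dumps({"include": include}, sort_keys=True)


if __name__ == "__main__":
    artifacts = [
        {"integration": "python_patch", "version": "3.13"},
        {"integration": "python_patch", "version": "3.13"},  # duplicate row
        {"integration": "example_lib", "version": ""},       # no patch
    ]
    print(build_matrix(artifacts, {("python_patch", "3.13")}))
```

Filtering before the fan-out keeps the reason job from starting for targets it could not repair anyway.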

Two small composite actions were added to support this. emit-autofix-target runs on each failing job and uploads a small autofix-target-<job> artifact saying which (integration, version) broke. fetch-github-token fetches the aws-lc-ci bot PAT from Secrets Manager for the steps that push branches and open PRs. The two workflows communicate only through these artifacts: when integration-omnibus finishes with failures, a workflow_run trigger starts autofix-integration-omnibus, which reads the artifacts and begins fixing.

Two IAM roles are added, split so each job holds only what it needs: AwsLcGitHubActionIntegrationFailureReasoningRole (Bedrock only, no GitHub token) and AwsLcGitHubActionIntegrationFailureResolveRole (the bot PAT, only to push and open the PR).

Call-outs:

The main call-out is security. Here is how we defend against prompt injection and other vulnerabilities:

  • Privilege isolation: the reason job, which runs Claude over untrusted input, holds Bedrock access only and has no access to Secrets Manager or GitHub. The resolve job, which creates the draft PR, holds only the bot PAT and has no Bedrock access, so it cannot invoke Claude while holding the GitHub bot PAT. The two jobs are independent of each other.
  • Credential exfiltration via the agent's env: CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=1 strips AWS/secret env vars from the shells Claude spawns.
  • Network egress: the OS-level Claude Code sandbox limits connections to the allowed hosts no matter what the model runs, so the network restriction can't be bypassed.
  • Filesystem and permissions: sandbox denyRead blocks ~/.aws and ~/.ssh, and defaultMode: dontAsk makes the allow list the boundary (deny-by-default).
  • Secret scanning: the resolve job scans the patch for secrets (AWS keys, GitHub tokens/PATs, private keys, bearer tokens) before pushing and aborts the PR if any are found.
  • Untrusted input handling: logs from the downstream failures are byte-sanitized (printable ASCII only), and a system prompt marks all downstream data as untrusted.
  • Least-privilege IAM: two scoped roles. reasoning = Bedrock-invoke only; resolve = secretsmanager:GetSecretValue on the one token secret only.
  • Draft-only PRs: PRs open as draft, and the bot is excluded from CODEOWNERS and can't approve or merge.
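
To make two of these defenses concrete, here is a minimal Python sketch of log sanitization and a patch secret scan. It is illustrative only and not the script in this PR: the regular expressions are assumed patterns for the four secret families listed above, and keeping newlines and tabs during sanitization is also an assumption so the log stays line-oriented.

```python
import re

# Illustrative patterns only; the PR's real rules may differ.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_token": re.compile(
        r"\b(?:gh[pousr]_[A-Za-z0-9]{36,}|github_pat_[A-Za-z0-9_]{22,})\b"
    ),
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
}


def sanitize_log(raw: bytes) -> str:
    """Keep printable ASCII plus newline and tab; drop every other byte."""
    kept = bytes(
        b for b in raw if 0x20 <= b <= 0x7E or b in (0x0A, 0x09)
    )
    return kept.decode("ascii")


def scan_patch(patch_text: str) -> list:
    """Return the sorted names of the secret patterns found in the patch."""
    return sorted(
        name
        for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(patch_text)
    )


if __name__ == "__main__":
    print(repr(sanitize_log(b"build failed\x1b[0m\xff\n")))
    findings = scan_patch("+key = AKIAIOSFODNN7EXAMPLE\n")
    if findings:
        # In a pipeline, a non-empty result would abort before any push.
        print("refusing to open PR, found:", findings)
```

A scan like this is a backstop, not a guarantee: it only catches secrets whose shape it knows, which is why it sits behind the privilege split rather than replacing it.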

Testing:

To test end-to-end on a fork, I had to set up the same infrastructure that the repo currently uses. In a personal Isengard account, I deployed the CDK changes (both IAM roles) following the CDK README. I then created a GitHub PAT for the bot, stored it in Secrets Manager under the expected secret name, and set up the CodeBuild project that backs the self-hosted GitHub Actions runners.

I then ran the full pipeline against broken integration tests by asking Claude to emulate failures and edit integration_omnibus.yml to fail deliberately. I dispatched integration-omnibus to produce the failed run and its autofix-target artifacts, then dispatched autofix-integration-omnibus against that run id. PRs for the failing patches were successfully created; an example is here.

I also verified correctness by re-applying every resulting patch to the real downstream source at the ref the runner uses. In addition, I confirmed that the sandbox blocks egress to non-allowlisted hosts and that re-running autofix skips targets whose branch already exists.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

@prasden prasden requested a review from a team as a code owner June 4, 2026 09:07

codecov-commenter commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.15%. Comparing base (7f7d548) to head (82e008e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3287      +/-   ##
==========================================
- Coverage   78.17%   78.15%   -0.02%     
==========================================
  Files         689      689              
  Lines      123732   123733       +1     
  Branches    17199    17200       +1     
==========================================
- Hits        96723    96705      -18     
- Misses      26089    26107      +18     
- Partials      920      921       +1     

Comment thread .github/workflows/integration_omnibus.yml Outdated
Comment on lines +541 to +544
  report-failures:
    name: report-failures
    needs: [integrations, python, ruby]
    if: ${{ always() && github.event_name == 'schedule' && (needs.integrations.result == 'failure' || needs.python.result == 'failure' || needs.ruby.result == 'failure') }}
Contributor

This reports failure, but it's best practice to also emit metrics on success so you can distinguish "this is not running" from "this is always succeeding".

Contributor Author

Ack. I modified it so that it runs on every scheduled run and emits 0 when no failures are reported; in that case no artifact is produced.
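
A minimal Python sketch of that "always emit, 0 on success" idea follows. It is an assumption-laden illustration rather than the workflow's actual step: the metric name `IntegrationFailures` is hypothetical, and `jobs` stands in for the `jobs` array returned by `gh run view --json jobs`.

```python
def failure_datum(jobs: list, reporter: str = "report-failures") -> dict:
    """Build one CloudWatch metric datum for a nightly run.

    The datum is produced on every run, so a healthy night reports 0
    instead of reporting nothing. The reporting job itself is excluded
    from the count.
    """
    failed = sum(
        1
        for job in jobs
        if job.get("conclusion") == "failure" and job.get("name") != reporter
    )
    # "IntegrationFailures" is a hypothetical metric name.
    return {"MetricName": "IntegrationFailures", "Value": failed, "Unit": "Count"}


if __name__ == "__main__":
    healthy = [{"name": "python", "conclusion": "success"}]
    broken = healthy + [{"name": "ruby", "conclusion": "failure"}]
    print(failure_datum(healthy))
    print(failure_datum(broken))
```

With a datum on every run, an alarm can treat missing data as its own signal (the nightly job did not run) separately from a non-zero failure count.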

Comment on lines +61 to +63
ANTHROPIC_DEFAULT_OPUS_MODEL: us.anthropic.claude-opus-4-8
ANTHROPIC_DEFAULT_SONNET_MODEL: us.anthropic.claude-sonnet-4-6
ANTHROPIC_DEFAULT_HAIKU_MODEL: us.anthropic.claude-haiku-4-5-20251001-v1:0
Contributor

How often will these need bumping? Bedrock doesn't seem to offer a -latest alias for these, so is there a plan/reminder to keep them current (or pin them somewhere more discoverable)?

Contributor Author

@prasden prasden Jun 4, 2026

How often we need to update depends on Anthropic. For now, I think we can pin the model names in this file because nothing else uses Claude Code for automation. When a new feature in AWS-LC wants to use Claude, we can move these model pins into a more centralized file that we update over time. But maybe after Mythos drops we won't ever need to change it again.

    super().__init__(scope, id, env=env, **kwargs)
    self.ignore_failure = ignore_failure
    self.timeout = timeout
    self.env = env
Contributor

Is this change needed?

Contributor Author

Yes, it is. I forgot to include this in the call-outs, my mistake. CDK was failing to deploy to my local account with the new IAM roles, and I traced it to this assignment. We remove it because CDK already exposes self.env as a built-in read-only property: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk/Stack.html#aws_cdk.Stack.env.
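
For readers unfamiliar with this failure mode, here is a small self-contained Python sketch. The `Stack` class below is a stand-in written for this example, not `aws_cdk.Stack`; the real class is generated through jsii, so the exact error raised there may differ.

```python
class Stack:
    """Stand-in base class that exposes `env` as a read-only property."""

    def __init__(self, env=None):
        self._env = env

    @property
    def env(self):
        return self._env


class BrokenStack(Stack):
    def __init__(self, env=None):
        super().__init__(env=env)
        # Fails: the inherited property defines no setter.
        self.env = env


class FixedStack(Stack):
    def __init__(self, env=None):
        # Passing env to the base class is enough; read it via self.env.
        super().__init__(env=env)


if __name__ == "__main__":
    try:
        BrokenStack(env="us-west-2")
    except AttributeError as exc:
        print("assignment rejected:", exc)
    print(FixedStack(env="us-west-2").env)
```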

"enableWeakerNestedSandbox": true,
"network": {
  "allowedDomains": [
    "downloads.nwtime.org"
Contributor

This domain allowlist and the Bash(wget ...) allow on line 20 have to be kept in sync by hand, and a new wget-based integration on a different host would need both updated. Is there a way to enforce/centralize that?

Contributor Author

Good call. I removed Bash(wget) from the allowlist, so the sandbox's allowedDomains is now the single place that defines which external domains are allowed.

open_pr() {
  local target="$1"
  local branch_name="$2"
  local push_url="https://x-access-token:${GH_TOKEN}@github.com/${REPO}.git"
Contributor

We need to be cautious about exposing the GH_TOKEN. Can we use git -c http.extraheader=... or some credential helper instead?

Contributor Author

Dropped it from the URL and switched to gh auth setup-git. We now read GH_TOKEN from the env.

Comment on lines +561 to +562
failure_count=$(gh run view "$GITHUB_RUN_ID" --json jobs \
  --jq '[.jobs[] | select(.conclusion == "failure" and .name != "report-failures")] | length')
Contributor

Might we be able to filter out failures due to operational flakiness?

Contributor Author

Yes. I added an id to each integration test step so that only failures on steps carrying that id are counted.

Comment on lines +49 to +59
  for job_id in $(gh api "/repos/${REPO}/actions/runs/${RUN_ID}/jobs" \
      --paginate \
      --jq ".jobs[]
            | select(.conclusion == \"failure\" and (.name | startswith(\"${prefix}\")))
            | .id")
  do
    gh api "/repos/${REPO}/actions/jobs/${job_id}/logs" \
      | tail -n 200 | sanitize_log > "${logs_dir}/${job_id}.log" || true
  done
}

Contributor

This matches failed jobs via startswith("${prefix}") where prefix is the runner-script-derived integration name. Will this assumption always hold? Can we enforce it?

Contributor Author

It's not currently enforced, but I think we should enforce it for new tests, perhaps by documenting it somewhere like CONTRIBUTING.md. The integration suite has many jobs, and each job can have different build flags: ACCP, for example, has 4 rows across architectures/FIPS, and some patches such as python_patch have per-version directories like python_patch/3.13. We should lock down the naming across the job, the script, and the patch directory.
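
One possible way to enforce the convention is a small check that fails when a job name does not start with any known integration prefix. The Python sketch below is hypothetical and not part of this PR; the job names and prefixes in the usage example are made up for illustration.

```python
def unmatched_jobs(job_names, integration_prefixes):
    """Return the job names that start with none of the known prefixes.

    A non-empty result means some job could never be matched by a
    startswith(prefix) filter, so its logs would be silently skipped.
    """
    prefixes = tuple(integration_prefixes)
    return sorted(name for name in job_names if not name.startswith(prefixes))


if __name__ == "__main__":
    # Hypothetical job names and prefixes, for illustration only.
    jobs = ["accp (x86_64, fips)", "accp (aarch64)", "pyhton 3.13"]
    missing = unmatched_jobs(jobs, ["accp", "python"])
    if missing:
        raise SystemExit(f"jobs without a matching integration prefix: {missing}")
```

Run as a lint step, a check like this would catch a renamed or misspelled job before the autofix workflow quietly stops seeing its failures.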

Comment thread .github/workflows/autofix_integration_omnibus.yml Outdated
Comment thread .github/workflows/integration_omnibus.yml Outdated
    "$AUTOFIX_SCRIPT" reason "$integration" "$version" "$TARGET_RUN_ID"
  env:
    TARGET: ${{ matrix.target }}
    GIT_ALLOW_PROTOCOL: file:git:http:https:ssh
Contributor

Do we need ssh here if we deny ssh in claude-settings.json?

Contributor Author

Good call. No, we don't.
