Raise alarm on integration test failures and attempt auto-fix by prasden · Pull Request #3287 · aws/aws-lc

prasden · 2026-06-04T09:07:58Z

Issues:

Resolves #V2139430768

Description of changes:

This PR adds two things. The first is a CloudWatch alarm that is raised automatically when an integration test fails: integration_omnibus becomes a nightly run, and any failure triggers an alarm in AWS-LC's CloudWatch namespace. The second is an automated workflow that uses Claude to investigate the failure and open a draft PR with a fix.

A new workflow, autofix-integration-omnibus, triggers Claude to create a PR for a failing integration. It has three jobs: recognize, reason, and resolve. recognize downloads the artifacts and emits a deduplicated JSON matrix of (integration, version) targets, dropping any target we have no patch for. reason fans out one job per target, clones the downstream repo, fetches the failure logs, and then runs Claude to repair the patch and commit it. resolve scans the patch for leaked secrets, creates a branch, and opens a draft PR. The jobs are split so that privilege is isolated from untrusted input. Two IAM roles are added: AwsLcGitHubActionIntegrationFailureReasoningRole for reason (Bedrock only, no GitHub token) and AwsLcGitHubActionIntegrationFailureResolveRole for resolve (the bot PAT, only to push and open the PR).
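As a rough illustration of the recognize step described above, the following sketch deduplicates (integration, version) targets and drops any without a patch. The input shape, the `patched` set, the `target` matrix key, and the name `example_integration` are assumptions made for this example; they are not taken from the workflow itself.

```python
import json


def build_matrix(targets, patched):
    """Return a JSON matrix of unique (integration, version) targets.

    `targets` is a list of dicts as they might be read from the
    autofix-target-<job> artifacts (hypothetical shape). `patched` is the
    set of integration names that have a patch we can repair.
    """
    seen = set()
    matrix = []
    for target in targets:
        key = (target["integration"], target.get("version", ""))
        # Skip duplicates (several jobs can report the same target) and
        # integrations that have no patch to fix.
        if key in seen or key[0] not in patched:
            continue
        seen.add(key)
        matrix.append({"integration": key[0], "version": key[1]})
    return json.dumps({"target": matrix}, sort_keys=True)


if __name__ == "__main__":
    reported = [
        {"integration": "python_patch", "version": "3.13"},
        {"integration": "python_patch", "version": "3.13"},  # duplicate report
        {"integration": "example_integration", "version": ""},  # no patch
    ]
    print(build_matrix(reported, patched={"python_patch"}))
```

Keeping this step free of credentials and model access fits the split described above, since it only reshapes data before the fan-out.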

Two small composite actions were added to support this. emit-autofix-target runs on each failing job and uploads a small autofix-target-<job> artifact saying which (integration, version) broke. fetch-github-token grabs the aws-lc-ci bot PAT from Secrets Manager for the steps that push branches and open PRs. The two workflows talk only through these artifacts, so when integration-omnibus finishes with failures, a workflow_run trigger kicks off autofix-integration-omnibus, which reads the artifacts and starts fixing.

Call-outs:

The main call-out is security. Here is how we defend against prompt injection and other vulnerabilities:

Privilege isolation: the reason job, which runs Claude over untrusted input, holds Bedrock access only and has no access to Secrets Manager or GitHub. The resolve job, which creates the draft PR for the change, holds only the bot PAT and has no access to Bedrock, so it cannot invoke Claude while holding the GitHub bot PAT. The two jobs are independent of each other.
Credential exfiltration via the agent's env: CLAUDE_CODE_SUBPROCESS_ENV_SCRUB=1 strips AWS/secret env vars from the shells Claude spawns.
Network restrictions can't be bypassed: the OS-level Claude Code sandbox limits connections to the allowed hosts, no matter what the model runs.
Sandbox denyRead blocks ~/.aws and ~/.ssh, and defaultMode: dontAsk makes the allow list the boundary (deny-by-default).
Secret scan: the resolve job scans the patch for secrets (AWS keys, GitHub tokens/PATs, private keys, bearer tokens) before pushing and aborts the PR if any are found.
Untrusted input: logs from the downstream failures are byte-sanitized (printable ASCII only), and a system prompt marks all downstream data as untrusted.
Least-privilege IAM: two scoped roles, where reasoning = Bedrock-invoke only and resolve = secretsmanager:GetSecretValue on the one token secret only.
Draft PRs: PRs open as drafts, and the bot is excluded from CODEOWNERS and can't approve or merge.
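Two of the defenses listed above, the secret scan and the byte-level log sanitization, can be sketched as follows. The regular expressions are illustrative approximations of common token formats and are not the patterns used in this PR. Keeping newlines in the sanitized output is also an assumption.

```python
import re

# Illustrative patterns only; a real scanner would likely use a broader set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "github_token": re.compile(
        r"\b(?:gh[pousr]_[A-Za-z0-9]{36,}|github_pat_[A-Za-z0-9_]{22,})"
    ),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9\-._~+/]{20,}=*"),
}


def scan_patch(patch_text):
    """Return the sorted names of all secret patterns found in the patch."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items()
        if pattern.search(patch_text)
    )


def sanitize_log(raw):
    """Keep only printable ASCII bytes (0x20-0x7E) and newlines."""
    kept = bytes(b for b in raw if 0x20 <= b <= 0x7E or b == 0x0A)
    return kept.decode("ascii")


if __name__ == "__main__":
    findings = scan_patch("+ token = 'ghp_" + "a" * 36 + "'")
    if findings:
        # A caller would abort before pushing when anything is found.
        print("refusing to push, found:", findings)
    print(sanitize_log(b"build failed\x1b[0m\n"))
```

A scan like this is a last line of defense, which is why it sits in the job that pushes rather than in the one that runs the model.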

Testing:

To test end-to-end on a fork, I had to set up the same infrastructure that the repo currently uses. In a personal Isengard account I deployed the CDK changes (both IAM roles) following the CDK README. I then created a GitHub PAT for the bot, stored it in Secrets Manager under the expected secret name, and set up the CodeBuild project that backs the self-hosted GitHub Actions runners.

I then ran the full pipeline against broken integration tests by asking Claude to emulate failures and edit integration_omnibus.yml to fail deliberately. I dispatched integration-omnibus to produce the failed run and its autofix-target artifacts, then dispatched autofix-integration-omnibus against that run id. PRs for the failing patches were successfully created, with an example being here.

I also verified correctness by re-applying every resulting patch to the real downstream source at the ref the runner uses. In addition, I confirmed that the sandbox blocks egress to non-allowlisted hosts and that re-running autofix skips targets whose branch already exists.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

codecov-commenter · 2026-06-04T09:47:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.15%. Comparing base (7f7d548) to head (82e008e).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3287      +/-   ##
==========================================
- Coverage   78.17%   78.15%   -0.02%     
==========================================
  Files         689      689              
  Lines      123732   123733       +1     
  Branches    17199    17200       +1     
==========================================
- Hits        96723    96705      -18     
- Misses      26089    26107      +18     
- Partials      920      921       +1

justsmth · 2026-06-04T15:59:24Z

+  report-failures:
+    name: report-failures
+    needs: [integrations, python, ruby]
+    if: ${{ always() && github.event_name == 'schedule' && (needs.integrations.result == 'failure' || needs.python.result == 'failure' || needs.ruby.result == 'failure') }}


This reports failure, but it's best practice to also emit metrics on success so you can distinguish "this is not running" from "this is always succeeding".

Ack. I modified it so that it runs on every scheduled run and emits 0 when no failures are reported and no artifact is produced.
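The idea in this thread, always emitting a datapoint so that a missing run and a clean run look different, can be sketched like this. The metric name `IntegrationFailures` is a hypothetical placeholder. The dictionary uses the field names of a CloudWatch metric datum, and publishing it is left out of the sketch.

```python
def failure_metric(job_results):
    """Build one metric datum for a scheduled run.

    `job_results` maps job names to their result strings, for example
    {"integrations": "success", "python": "failure", "ruby": "success"}.
    A value of 0 is still emitted when nothing failed, so an alarm can
    treat missing data as "the run did not happen".
    """
    failures = sum(1 for result in job_results.values() if result == "failure")
    return {
        "MetricName": "IntegrationFailures",  # hypothetical name
        "Value": failures,
        "Unit": "Count",
    }


if __name__ == "__main__":
    print(failure_metric({"integrations": "success", "python": "success", "ruby": "success"}))
    print(failure_metric({"integrations": "failure", "python": "success", "ruby": "failure"}))
```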

justsmth · 2026-06-04T16:01:28Z

+      ANTHROPIC_DEFAULT_OPUS_MODEL: us.anthropic.claude-opus-4-8
+      ANTHROPIC_DEFAULT_SONNET_MODEL: us.anthropic.claude-sonnet-4-6
+      ANTHROPIC_DEFAULT_HAIKU_MODEL: us.anthropic.claude-haiku-4-5-20251001-v1:0


How often will these need bumping? Bedrock doesn't seem to offer a -latest alias for these, so is there a plan/reminder to keep them current (or pin them somewhere more discoverable)?

How often we need to update depends on Anthropic. For now, I think we can pin the model names in this file, because nothing else uses Claude Code for automation. When another feature in AWS-LC wants to use Claude, we can move these model pins into a more centralized file that we can update iteratively. But maybe after Mythos drops we won't ever need to change it again.

justsmth · 2026-06-04T16:02:14Z

        super().__init__(scope, id, env=env, **kwargs)
        self.ignore_failure = ignore_failure
        self.timeout = timeout
-        self.env = env


Is this change needed?

Yes, it is. I forgot to include this in the call-outs, my mistake. CDK was failing to deploy on my local account with the new IAM roles, and I traced it to this variable. We remove the assignment because CDK already exposes self.env as a built-in read-only property: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk/Stack.html#aws_cdk.Stack.env.
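The general problem can be reproduced in plain Python, without CDK. The classes below are hypothetical stand-ins: `StackLike` only mimics a base class that exposes `env` as a read-only property, so the exact error raised by the real CDK `Stack` may differ.

```python
class StackLike:
    """Stand-in for a base class that exposes `env` as a read-only property."""

    def __init__(self, env=None):
        self._env = env

    @property
    def env(self):
        return self._env


class BrokenStack(StackLike):
    def __init__(self, env=None):
        super().__init__(env=env)
        # Redundant: the base class already stores env. The property has
        # no setter, so this assignment raises AttributeError.
        self.env = env


class FixedStack(StackLike):
    def __init__(self, env=None):
        # Passing env to the base class is enough.
        super().__init__(env=env)


def assignment_fails():
    """Return True if constructing BrokenStack raises AttributeError."""
    try:
        BrokenStack(env="example-env")
    except AttributeError:
        return True
    return False


if __name__ == "__main__":
    print("assignment fails:", assignment_fails())
    print("fixed env:", FixedStack(env="example-env").env)
```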

justsmth · 2026-06-04T16:06:02Z

+    "enableWeakerNestedSandbox": true,
+    "network": {
+      "allowedDomains": [
+        "downloads.nwtime.org"


This domain allowlist and the Bash(wget ...) allow on line 20 have to be kept in sync by hand, and a new wget-based integration on a different host would need both updated. Is there a way to enforce/centralize that?

Good call, removed the Bash(wget) from the allowlist and centralized which external domains are allowed in the sandbox in allowedDomains.

justsmth · 2026-06-04T16:08:29Z

+open_pr() {
+  local target="$1"
+  local branch_name="$2"
+  local push_url="https://x-access-token:${GH_TOKEN}@github.com/${REPO}.git"


We need to be cautious about exposing the GH_TOKEN. Can we use git -c http.extraheader=... or some credential helper instead?

Dropped it from the URL and switched to gh auth setup-git. We now read GH_TOKEN from the env.

justsmth · 2026-06-04T16:12:10Z

+          failure_count=$(gh run view "$GITHUB_RUN_ID" --json jobs \
+            --jq '[.jobs[] | select(.conclusion == "failure" and .name != "report-failures")] | length')


Might we be able to filter out failures due to operational flakiness?

Yes. I added an id to each integration test step, so a failure is counted only when it occurs on a step that carries the integration test id.
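A minimal sketch of that filtering idea, assuming the jobs payload lists each job's steps with a name and a conclusion. Matching on a marker in the step name, and the marker value `integration-test`, are assumptions for illustration. The workflow in this PR keys on a step id, and how that id is surfaced is not shown here.

```python
def count_integration_failures(jobs, marker="integration-test"):
    """Count failed jobs whose failing step is an integration test step.

    Jobs that failed in an unrelated step (checkout, runner setup, and so
    on) are ignored, which filters out much of the operational flakiness.
    """
    count = 0
    for job in jobs:
        if job.get("conclusion") != "failure":
            continue
        steps = job.get("steps") or []
        if any(
            step.get("conclusion") == "failure" and marker in step.get("name", "")
            for step in steps
        ):
            count += 1
    return count


if __name__ == "__main__":
    jobs = [
        {"name": "a", "conclusion": "failure",
         "steps": [{"name": "checkout", "conclusion": "failure"}]},
        {"name": "b", "conclusion": "failure",
         "steps": [{"name": "integration-test: run", "conclusion": "failure"}]},
    ]
    print(count_integration_failures(jobs))
```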

justsmth · 2026-06-04T16:25:42Z

+  for job_id in $(gh api "/repos/${REPO}/actions/runs/${RUN_ID}/jobs" \
+                    --paginate \
+                    --jq ".jobs[]
+                          | select(.conclusion == \"failure\" and (.name | startswith(\"${prefix}\")))
+                          | .id")
+  do
+    gh api "/repos/${REPO}/actions/jobs/${job_id}/logs" \
+      | tail -n 200 | sanitize_log > "${logs_dir}/${job_id}.log" || true
+  done
+}
+


This matches failed jobs via startswith("${prefix}") where prefix is the runner-script-derived integration name. Will this assumption always hold? Can we enforce it?

It's not currently enforced, but I think we should enforce it for new tests, perhaps by documenting it somewhere like CONTRIBUTING.md. The integration suite has many jobs, and each job can have different build flags: ACCP, for example, has 4 rows across architectures/FIPS, and some patches such as python_patch have per-version directories like python_patch/3.13. We should lock down the naming across the job, the script, and the patch directory.
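One possible way to enforce the convention is a small check that fails when a job name does not start with any known integration name. This is only a sketch: the function names and the example job names are hypothetical, and where such a check would run is left open.

```python
def match_integration(job_name, integrations):
    """Return the longest integration name that prefixes job_name, or None."""
    candidates = [name for name in integrations if job_name.startswith(name)]
    return max(candidates, key=len) if candidates else None


def unmatched_jobs(job_names, integrations):
    """Return the sorted job names that match no known integration prefix."""
    return sorted(
        name for name in job_names
        if match_integration(name, integrations) is None
    )


if __name__ == "__main__":
    integrations = ["python_patch", "example_integration"]
    job_names = ["python_patch (3.13)", "renamed-job"]
    print(unmatched_jobs(job_names, integrations))
```

Picking the longest prefix avoids ambiguity when one integration name is itself a prefix of another.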

nhatnghiho · 2026-06-04T19:39:41Z

+          "$AUTOFIX_SCRIPT" reason "$integration" "$version" "$TARGET_RUN_ID"
+        env:
+          TARGET: ${{ matrix.target }}
+          GIT_ALLOW_PROTOCOL: file:git:http:https:ssh


Do we need ssh here if we deny ssh in claude-settings.json?

Good call. No, we don't.

autofix integrations

61d7f40

prasden requested a review from a team as a code owner June 4, 2026 09:07

prasden temporarily deployed to auto-approve June 4, 2026 09:08 — with GitHub Actions Inactive

justsmth reviewed Jun 4, 2026

nhatnghiho reviewed Jun 4, 2026

pr feedback

232e94f

prasden temporarily deployed to auto-approve June 4, 2026 20:22 — with GitHub Actions Inactive

prasden temporarily deployed to auto-approve June 4, 2026 20:23 — with GitHub Actions Inactive

prasden requested review from WillChilds-Klein, justsmth and nhatnghiho June 4, 2026 22:03

Merge branch 'main' into autofix-integrations

e1aa1ed

prasden temporarily deployed to auto-approve June 4, 2026 23:02 — with GitHub Actions Inactive

prasden temporarily deployed to auto-approve June 4, 2026 23:03 — with GitHub Actions Inactive

security review

82e008e

prasden had a problem deploying to auto-approve June 4, 2026 23:58 — with GitHub Actions Failure

prasden temporarily deployed to auto-approve June 4, 2026 23:58 — with GitHub Actions Inactive

prasden deployed to auto-approve June 4, 2026 23:58 — with GitHub Actions Active

prasden temporarily deployed to auto-approve June 4, 2026 23:58 — with GitHub Actions Inactive
