CI Doctor: evidence packs, prow job analysis agent, double check causal chains by pmtk · Pull Request #214 · openshift-eng/edge-tooling

pmtk · 2026-07-03T14:42:24Z

Summary by CodeRabbit

New Features
- Added a new CI analysis workflow that extracts structured evidence from job artifacts and generates standardized failure reports.
- Expanded job troubleshooting support with automatic report validation and retry handling for citation issues.
Documentation
- Added new guidance for interpreting CI artifacts, analyzing failures, and formatting structured summaries.
Bug Fixes
- Improved JSON parsing robustness for report content with malformed control characters.

extract-evidence.py condenses a failed job's downloaded artifacts into one structured evidence file per job: failed steps, test and phase failures, journal alerts, and container restart counts — every entry stamped with its timestamp and merged into a single time-sorted failure timeline. doctor.sh gains an `evidence` phase that runs it for every downloaded job, and the lvms-ci/microshift-ci plugins symlink the shared script. The evidence pack becomes the single starting point for analysis agents instead of each agent re-scanning raw artifacts.
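
As a rough illustration of the timeline merge described above, here is a minimal sketch. It is not the code in extract-evidence.py: the entry schema (`ts`, `source`, `message`) and the function name `build_timeline` are assumptions made for this example.

```python
from datetime import datetime, timezone


def build_timeline(*sources):
    """Merge per-source entry lists into one list sorted by timestamp.

    Each entry is a dict with an ISO-8601 'ts' string (hypothetical schema).
    Entries without a parseable timestamp are dropped rather than guessed.
    """
    merged = []
    for entries in sources:
        for entry in entries:
            try:
                when = datetime.fromisoformat(entry["ts"])
            except (KeyError, TypeError, ValueError):
                continue
            # Treat naive timestamps as UTC so all entries are comparable.
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            merged.append((when, entry))
    # sorted() is stable: entries with equal timestamps keep source order.
    return [entry for _, entry in sorted(merged, key=lambda pair: pair[0])]


# Hypothetical sample data, not taken from a real job.
failed_steps = [
    {"ts": "2026-07-03T10:05:00+00:00", "source": "step", "message": "e2e step failed"},
]
journal_alerts = [
    {"ts": "2026-07-03T10:01:30+00:00", "source": "journal", "message": "OOM kill"},
    {"source": "journal", "message": "no timestamp"},
]
timeline = build_timeline(failed_steps, journal_alerts)
```

Sorting on the parsed timestamp, not the raw string, keeps entries with different UTC offsets in the right order.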

Replace the prow-job skill's inline RCA instructions with a dedicated analyze-evidence agent that starts from the evidence pack and consults the MicroShift CI artifact primer (moved under agents/references/) and a structured-summary contract with tightened causal-chain rules. The doctor skill launches the same agent for its per-job analyses; prow-job becomes a thin wrapper that downloads artifacts, extracts evidence, and spawns the agent. validate-reports.py checks every agent report against the structured summary contract, and the doctor skill re-launches fix agents for reports that fail; parse.py sanitizes structured summaries before parsing.

The validator previously only checked that 'evidence' looked like a path — a hallucinated-but-plausible citation passed. It now resolves each citation against the job's downloaded artifacts (build dir derived from the entry's job_url), checks the file exists, the line is in range, and the quote actually appears near the cited line (timestamps stripped, whitespace normalized). Error messages include where the quote really is so fix agents can re-ground citations instead of guessing. Fix agents are no longer told to delete unsupported links to pass validation — they must re-ground each link or move the claim to analysis_gaps and downgrade confidence, then re-run the validator on their own output. Evidence packs now record the source file for every rf/boot_and_run/ journal alert entry (journal alerts from multiple files are merged, so line numbers alone were ambiguous). Drop missing_patterns from the agent contract: nothing consumed it — parse.py discarded it at aggregation — so it was pure token cost.
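
The citation check described above (file exists, line in range, quote near the cited line after stripping timestamps and normalizing whitespace) could look roughly like the sketch below. This is an illustration under stated assumptions, not validate-reports.py itself: the function name `check_citation`, the timestamp pattern, and the `window` size are all hypothetical.

```python
import re
from pathlib import Path

# Assumed leading-timestamp pattern; real logs may use other formats.
_TS = re.compile(
    r"^\s*\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?\s*"
)


def _normalize(text):
    """Strip a leading timestamp and collapse runs of whitespace."""
    return " ".join(_TS.sub("", text).split())


def check_citation(build_dir, rel_path, line_no, quote, window=3):
    """Return None if the citation resolves, else an error message."""
    path = Path(build_dir) / rel_path
    if not path.is_file():
        return f"{rel_path}: file not found"
    lines = path.read_text(errors="replace").splitlines()
    if not 1 <= line_no <= len(lines):
        return f"{rel_path}:{line_no}: line out of range (file has {len(lines)} lines)"
    wanted = _normalize(quote)
    if not wanted:
        return f"{rel_path}:{line_no}: empty quote"
    hits = [i for i, line in enumerate(lines, 1) if wanted in _normalize(line)]
    if any(abs(i - line_no) <= window for i in hits):
        return None
    if hits:
        # Report where the quote really is so the citation can be re-grounded.
        return f"{rel_path}:{line_no}: quote not near cited line; found at line {hits[0]}"
    return f"{rel_path}:{line_no}: quote not found in file"
```

Returning the actual location in the error message is what lets a fix agent correct the line number instead of guessing.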

openshift-ci · 2026-07-03T14:42:28Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-07-03T14:42:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [pmtk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-07-03T14:42:45Z

Walkthrough

Adds a shared evidence-extraction pipeline (extract-evidence.py, validate-reports.py) under plugins/shared/scripts, wires a new evidence command into doctor.sh, updates parse.py's JSON sanitization, adds analyze-evidence agent/reference docs, rewrites doctor and prow-job SKILL.md workflows, removes the old prow-job primer, and adds symlinks in lvms-ci/microshift-ci plugins.

Changes

Evidence Extraction and Validation Pipeline

| Layer | File(s) | Summary |
|---|---|---|
| Core evidence extraction: metadata, classification, failed-step detection | `plugins/shared/scripts/extract-evidence.py` | Adds imports/constants, JSON/log helpers, metadata extraction, job-type classification, failed-step detection, and infra-indicator scanning. |
| Scenario, conformance, and build-error parsing | `plugins/shared/scripts/extract-evidence.py` | Adds junit/RF-debug/journal parsing, container-restart and sosreport helpers, full scenario extraction, conformance failure extraction, and build/config error extraction. |
| Source context, timeline, orchestration, batch mode, CLI | `plugins/shared/scripts/extract-evidence.py` | Adds source-context extraction, PCP graph discovery, timeline correlation, top-level `extract_evidence()`, batch mode, and CLI argument parsing. |
| Report validation and structured-summary sanitization | `plugins/shared/scripts/validate-reports.py`, `plugins/shared/scripts/parse.py` | Adds validate-reports.py with citation/quote resolution and `validate_file()`/`main()`, and updates `parse_structured_summary()` to sanitize control characters before JSON parsing. |
| doctor.sh evidence command and doctor SKILL.md workflow | `plugins/shared/scripts/doctor.sh`, `plugins/microshift-ci/skills/doctor/SKILL.md` | Adds `cmd_evidence()` with usage/dispatch wiring, and rewrites Step 1c/Step 2 to run evidence extraction, spawn analyze-evidence agents, validate reports, and iterate fix agents. |
| Analyze-evidence agent and reference docs | `plugins/microshift-ci/agents/analyze-evidence.md`, `plugins/microshift-ci/agents/references/*` | Adds the analyze-evidence agent spec, MicroShift CI artifact primer, and structured-summary output contract documentation. |
| prow-job SKILL.md rewrite | `plugins/microshift-ci/skills/prow-job/SKILL.md`, `plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md` | Condenses the prow-job skill into a four-step evidence-based workflow and removes the old primer reference doc. |
| Symlink shims | `plugins/lvms-ci/scripts/extract-evidence.py`, `plugins/microshift-ci/scripts/extract-evidence.py`, `plugins/microshift-ci/scripts/validate-reports.py` | Adds symlinks in lvms-ci and microshift-ci plugin directories pointing to the shared scripts. |

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

openshift-eng/edge-tooling#145: Builds on the shared doctor.sh infrastructure this PR extends with the new evidence subcommand.
openshift-eng/edge-tooling#147: Introduced the lvms-ci plugin's shared-script symlink structure that this PR follows for extract-evidence.py/validate-reports.py.
openshift-eng/edge-tooling#193: Prior changes to parse_structured_summary() in parse.py that this PR's sanitization update builds directly on.

Suggested reviewers: ggiguash, pacevedom

🚥 Pre-merge checks | ✅ 5 | ❌ 6

❌ Failed checks (6 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| No-Weak-Crypto | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| Container-Privileges | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| No-Sensitive-Data-In-Logs | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| No-Hardcoded-Secrets | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| No-Injection-Vectors | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |
| Ai-Attribution | ❓ Inconclusive | Repository clone failed, so this custom check could not run with code access. | Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: evidence packs, the prow job analysis agent, and stronger causal-chain checking. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 markdownlint-cli2 (0.22.1)

plugins/microshift-ci/agents/analyze-evidence.md

markdownlint-cli2 v0.22.1 (markdownlint v0.40.0)
Finding: plugins/microshift-ci/agents/analyze-evidence.md plugins/microshift-ci/agents/references/microshift-ci-primer.md plugins/microshift-ci/agents/references/structured-summary.md plugins/microshift-ci/skills/doctor/SKILL.md plugins/microshift-ci/skills/prow-job/SKILL.md !node_modules/** !two-node-toolbox/**
Linting: 5 file(s)
Summary: 0 error(s)
AggregateError: Unable to import module 'markdownlint-cli2-formatter-pretty'.
at importModule (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:90:11)
at async Promise.all (index 0)
at async outputResults (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:838:9)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:1029:5)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
[errors]: [
Error: Cannot find module 'markdownlint-cli2-formatt

... [truncated 1314 characters] ...

node:internal/modules/esm/resolve:271:11)
at moduleResolve (node:internal/modules/esm/resolve:861:10)
at defaultResolve (node:internal/modules/esm/resolve:988:11)
at #cachedDefaultResolve (node:internal/modules/esm/loader:697:20)
at #resolveAndMaybeBlockOnLoaderThread (node:internal/modules/esm/loader:714:38)
at ModuleLoader.resolveSync (node:internal/modules/esm/loader:746:52)
at #resolve (node:internal/modules/esm/loader:679:17)
at ModuleLoader.getOrCreateModuleJob (node:internal/modules/esm/loader:599:35)
at node:internal/modules/esm/loader:628:32
at TracingChannel.tracePromise (node:diagnostics_channel:362:14) {
code: 'ERR_MODULE_NOT_FOUND',
url: 'file:///markdownlint-cli2-formatter-pretty'
}
]
}

The same formatter import failure (identical output) was reported for plugins/microshift-ci/agents/references/microshift-ci-primer.md, plugins/microshift-ci/agents/references/structured-summary.md, and 2 others.

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (2)

plugins/shared/scripts/validate-reports.py (2)
222-223: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Nested err closure captures loop variables (ruff B023).

err closes over sig, li, and cause, and is redefined every iteration. It works today because it's called synchronously within the same iteration, but ruff flags it (B023) and the file must pass ruff. Hoist it to a module/nested helper taking explicit args.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/validate-reports.py` around lines 222 - 223, The
nested err closure in validate-reports.py is capturing loop variables sig, li,
and cause, which triggers ruff B023. Replace the inner closure with a helper
that takes sig, li, and cause as explicit parameters, and update the call site
inside the surrounding loop so the formatted error string is built from
passed-in values instead of closed-over state.
Source: Linters/SAST tools
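
One way to satisfy B023 is sketched below. The real loop in validate-reports.py is not shown in this review, so the surrounding function, the data shape, and the message format are hypothetical; only the names `sig`, `li`, and `cause` come from the comment.

```python
def format_err(sig, li, cause, message):
    """Build an error string from explicit arguments (no closed-over state)."""
    return f"[{sig}] link {li} ({cause}): {message}"


def validate_links(entries):
    """Hypothetical loop shaped like the one the comment describes."""
    errors = []
    for sig, links in entries.items():
        for li, link in enumerate(links):
            cause = link.get("cause", "?")
            if not link.get("evidence"):
                # Values are passed in, so ruff B023 has nothing to flag.
                errors.append(format_err(sig, li, cause, "missing evidence"))
    return errors
```

Because `format_err` is defined once at module level, it is no longer rebuilt on every iteration, and it cannot accidentally observe a later value of a loop variable if it is ever called after the loop moves on.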

170-195: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicated STRUCTURED SUMMARY extraction/sanitization — consume the canonical parser instead.

This block re-implements the exact marker regex + tab/control-char sanitization + json.loads that already lives in parse.py's parse_structured_summary. Keeping two copies means the sanitization rules (e.g. the tab handling above) drift between the two consumers.

Per CONTRIBUTING.md Review Principles — "Single source of truth — no derived state": extract a shared helper (e.g. _extract_summary_json(content) in parse.py) and call it from both, so validation and parsing stay in lock-step.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/validate-reports.py` around lines 170 - 195, The
`_load_entries` logic is duplicating STRUCTURED SUMMARY parsing and sanitization
that already exists in `parse.py`’s `parse_structured_summary`, so make the
parser the single source of truth. Extract the shared summary
extraction/sanitization into a helper in `parse.py` (for example, a function
that returns the parsed entries or the normalized JSON text) and have
`_load_entries` call that helper instead of redoing the marker regex,
tab/control-character cleanup, and `json.loads`. Keep the existing fallback
behavior for missing or malformed summaries, but ensure both consumers use the
same canonical implementation.
Source: Coding guidelines
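
A shared helper along the lines suggested above might look like the following sketch. The marker strings are assumptions (the real marker regex lives in parse.py and is not quoted in this review), and `extract_summary_json` is a hypothetical name based on the one proposed in the comment.

```python
import json
import re

# Assumed marker format; substitute the real markers used by parse.py.
_SUMMARY_RE = re.compile(
    r"--- STRUCTURED SUMMARY ---\s*(.*?)\s*--- END STRUCTURED SUMMARY ---",
    re.DOTALL,
)


def extract_summary_json(content):
    """Return the parsed structured summary, or None if absent or malformed.

    Any sanitization would live here, once, so the parser and the
    validator cannot drift apart.
    """
    match = _SUMMARY_RE.search(content)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Both consumers would then call this one function and apply their own fallback when it returns `None`.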

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/microshift-ci/skills/doctor/SKILL.md`:
- Around line 121-129: Step 2 uses undefined output filename placeholders for
JOB_ID and JOB_NAME_SUFFIX, so update the placeholder mapping in SKILL.md to use
only fields introduced from the prepare script JSON or rename them to
already-defined placeholders. Make sure the OUTPUT_FILE examples in the step
reference concrete values derived from job, url, and build_id, and that any
PR-specific suffix logic is expressed using the existing JOB_NAME / PR
placeholders rather than new undefined symbols.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 26-60: The workflow in SKILL.md is missing inline stop conditions
for failure cases and the agent completion check. Update the numbered steps
around the artifact download, evidence extraction, and analyze stages to
explicitly say what to do if gsutil fails, extract-evidence.py fails, or the
spawned agent does not return DONE, using the existing workflow structure and
the analyze-evidence.md-driven agent step as the anchor. Keep the guard
conditions adjacent to their relevant step, and make the steps actionable with
clear fallback/abort behavior rather than deferring these checks to a separate
section.
- Around line 22-33: The workdir setup in the SKILL workflow can fail on a clean
run because mktemp -d assumes the parent directory already exists. Update the
setup steps in SKILL.md so the URL path branch explicitly creates the parent
workdir first with mkdir -p before calling mktemp -d, and keep the existing
WORKDIR/TMP flow intact.
- Around line 17-32: The `SKILL.md` workflow currently only treats `/`-prefixed
inputs as local in the artifact setup logic, so relative directories will be
misclassified as URLs. Update the `<ARGUMENTS>` contract and the Step 1 handling
in the Prow job workflow to either explicitly require an absolute local
artifacts path or accept existing relative directories as local inputs, and keep
the guidance aligned with the `Work Directory` and `Workflow` sections.

In `@plugins/shared/scripts/extract-evidence.py`:
- Line 330: The script has Ruff violations that will fail validation: rename the
ambiguous comprehension variable `l` in `extract-evidence.py` to a clearer name
in the `context_lines` assignment and the other similar spot, prefix the unused
unpacked `infra_fail_count` with `_` where it is assigned, and split the chained
statements separated by semicolons into separate lines in the affected block.
Update the relevant logic in `extract-evidence.py` around `context_lines`,
`infra_fail_count`, and the code near the multiple-statement lines so the file
passes Ruff.
- Around line 182-192: The refs selection logic in extract-evidence.py can still
index into an empty extra_refs array and raise IndexError in single-mode jobs.
Refactor the refs assignment near the prowjob handling to check whether
extra_refs is present and non-empty before accessing its first element, and keep
the existing fallback behavior when spec.refs is missing. Use the refs, prowjob,
and extra_refs lookup path to locate the fix.

In `@plugins/shared/scripts/parse.py`:
- Line 51: The pre-parse normalization in parse.py is escaping tabs in the whole
JSON payload, which breaks valid tab-indented JSON before json.loads() runs.
Update the parse flow in the json_text handling to stop rewriting all tab
characters globally; instead, only handle control characters inside string
values or remove the tab replacement entirely while keeping the existing JSON
parsing path intact.
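
The tab problem in the parse.py item above can be reproduced with the standard library alone. The snippet below is a minimal demonstration with made-up payloads; using `strict=False` is one possible alternative to global replacement, not necessarily the fix this PR adopts.

```python
import json

tab_indented = '{\n\t"cause": "disk full"\n}'   # valid JSON, indented with tabs
raw_tab_in_string = '{"quote": "col1\tcol2"}'    # raw tab inside a string value

# Global replacement turns structural tabs into a backslash followed by "t",
# which is not legal between JSON tokens.
try:
    json.loads(tab_indented.replace("\t", "\\t"))
    global_replace_ok = True
except json.JSONDecodeError:
    global_replace_ok = False

# strict=False accepts control characters inside strings and leaves
# whitespace between tokens untouched.
parsed_indented = json.loads(tab_indented, strict=False)
parsed_quote = json.loads(raw_tab_in_string, strict=False)
```

Here the globally rewritten payload fails to parse, while both original payloads load with `strict=False`.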

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a046fc16-c033-4567-a1dd-f6c0b098029e

📥 Commits

Reviewing files that changed from the base of the PR and between 54f8bc8 and 65a528c.

📒 Files selected for processing (13)

plugins/lvms-ci/scripts/extract-evidence.py
plugins/microshift-ci/agents/analyze-evidence.md
plugins/microshift-ci/agents/references/microshift-ci-primer.md
plugins/microshift-ci/agents/references/structured-summary.md
plugins/microshift-ci/scripts/extract-evidence.py
plugins/microshift-ci/scripts/validate-reports.py
plugins/microshift-ci/skills/doctor/SKILL.md
plugins/microshift-ci/skills/prow-job/SKILL.md
plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
plugins/shared/scripts/doctor.sh
plugins/shared/scripts/extract-evidence.py
plugins/shared/scripts/parse.py
plugins/shared/scripts/validate-reports.py

💤 Files with no reviewable changes (1)

plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md

coderabbitai · 2026-07-03T14:52:35Z

+   Substitute these placeholders from the prepare script's JSON output (`job`, `url`, `build_id` fields):

-   **For release jobs:**
+   | Placeholder | Value |
+   |---|---|
+   | `{EVIDENCE_PACK}` | `<WORKDIR>/evidence/evidence-<BUILD_ID>.json` |
+   | `{JOB_NAME}` | `job` field (for PR jobs, append a space and `(PR #<PR>)`) |
+   | `{JOB_URL}` | `url` field |
+   | `{OUTPUT_FILE}` | Release: `<WORKDIR>/jobs/release-<RELEASE>-job-<N>-<JOB_ID>.txt`. PR: `<WORKDIR>/jobs/prs-job-<N>-pr<PR>-<JOB_NAME_SUFFIX>.txt` |
+


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail
printf '--- file outline ---\n'
ast-grep outline plugins/microshift-ci/skills/doctor/SKILL.md || true
printf '\n--- search for JOB_ID / JOB_NAME_SUFFIX / build_id ---\n'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|\bbuild_id\b|\bjob\b|\burl\b' plugins/microshift-ci/skills/doctor/SKILL.md
printf '\n--- surrounding lines 110-140 ---\n'
sed -n '110,140p' plugins/microshift-ci/skills/doctor/SKILL.md

Repository: openshift-eng/edge-tooling

Length of output: 255

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail
printf '--- broader search in plugin docs ---\n'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|JOB_NAME_SUFFIX|JOB_ID' plugins/microshift-ci -g '!**/node_modules/**' || true

Repository: openshift-eng/edge-tooling

Length of output: 255

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail
printf '--- relevant section with line numbers ---\n'
cat -n plugins/microshift-ci/skills/doctor/SKILL.md | sed -n '118,132p'

Repository: openshift-eng/edge-tooling

Length of output: 255

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail
echo '--- relevant section ---'
sed -n '118,132p' plugins/microshift-ci/skills/doctor/SKILL.md
echo
echo '--- placeholder search in file ---'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|\bbuild_id\b|\bjob\b|\burl\b' plugins/microshift-ci/skills/doctor/SKILL.md
echo
echo '--- broader search in plugin docs ---'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b' plugins/microshift-ci -g '!**/node_modules/**' || true

Repository: openshift-eng/edge-tooling

Length of output: 4967

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
path = Path('plugins/microshift-ci/skills/doctor/SKILL.md')
text = path.read_text()
for needle in ['JOB_ID', 'JOB_NAME_SUFFIX', 'build_id', 'job', 'url']:
    print(f'-- {needle} --')
    for i, line in enumerate(text.splitlines(), 1):
        if needle in line:
            print(f'{i}: {line}')
PY

Repository: openshift-eng/edge-tooling

Length of output: 5417

Define the output filename placeholders. JOB_ID and JOB_NAME_SUFFIX are not defined anywhere in Step 2, while the surrounding table only maps job, url, and build_id. Map them to concrete fields or rename them to placeholders already introduced.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/doctor/SKILL.md` around lines 121 - 129, Step 2 uses undefined output filename placeholders for JOB_ID and JOB_NAME_SUFFIX, so update the placeholder mapping in SKILL.md to use only fields introduced from the prepare script JSON or rename them to already-defined placeholders. Make sure the OUTPUT_FILE examples in the step reference concrete values derived from job, url, and build_id, and that any PR-specific suffix logic is expressed using the existing JOB_NAME / PR placeholders rather than new undefined symbols.
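
To make the undefined-placeholder problem concrete, here is a small sketch of a substitution helper that refuses to leave a placeholder unresolved. The helper name `fill` and the sample values are hypothetical; the template strings are the ones from the table under review.

```python
import re

_PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def fill(template, values):
    """Replace <NAME> placeholders; raise KeyError naming any undefined ones."""
    missing = sorted({n for n in _PLACEHOLDER.findall(template) if n not in values})
    if missing:
        raise KeyError(", ".join(missing))
    return _PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)


# Hypothetical values; only BUILD_ID maps to a field named in the review (build_id).
values = {"WORKDIR": "/tmp/work", "RELEASE": "4.19", "N": 1, "BUILD_ID": "123"}
evidence = fill("<WORKDIR>/evidence/evidence-<BUILD_ID>.json", values)

try:
    fill("<WORKDIR>/jobs/release-<RELEASE>-job-<N>-<JOB_ID>.txt", values)
    undefined = None
except KeyError as exc:
    undefined = exc.args[0]  # names the placeholder with no defined source
```

With these values the evidence path resolves, while the release output template fails on `JOB_ID`, which is the gap the comment points at.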

coderabbitai · 2026-07-03T14:52:35Z

+`<ARGUMENTS>`: Prow URL, GCS web URL, or local artifacts directory.

- `<TMP>/build-log.txt`: Log containing prow job output and most likely place to identify AWS infra related or hypervisor related errors.
- `<STEP>/build-log.txt`: Each step in the CI job is individually logged in a build-log.txt file.
- `<TMP>/artifacts/<TEST_NAME>/openshift-microshift-infra-sos-aws/artifacts/sosreport-*.tar.xz`: Compressed archive containing select portions of the test host's filesystem, relevant logs, and system configurations. `<TEST_NAME>` varies by job (e.g., `e2e-aws-tests`, `e2e-aws-ovn-ocp-conformance-arm64`).
- `<TMP>/artifacts/<TEST_NAME>/openshift-microshift-e2e-origin-conformance/build-log.txt`: Step-specific build log for origin conformance tests.
-
-## Important Links
-
-**Step Diagram URL** (found at the end of the main build-log):
-
-```text
-https://steps.ci.openshift.org/job?org=openshift&repo=microshift&branch=release-4.19&test=e2e-aws-tests-bootc-nightly&variant=periodics
-```
-
-This link provides a diagram of the steps that make up the test. Think about reading this diagram when identifying step failures because not all fatal errors cause the current step to fail but may cause the next step to fail.
-
-**SOS Report** (contains pod/container logs and cluster-scoped resources)
-
-**Journals:** use the plain-text `journal_*.log` files next to the sosreport tarballs (e.g., `scenario-info/<scenario>/vms/host1/sos/journal_*.log`). These are readable directly with Read/Grep and contain the journal evidence you need (service failures, x509 errors, OOM kills, microshift unit logs).
-
-**Pod logs, cluster state, inspect outputs:** extract a specific sosreport tarball when you need pod logs (container crashes, restarts, probe failures). The extraction script pulls pod logs, inspect outputs, and cluster-scoped resources.
-
-**When to extract a sosreport:** when the journal shows `CrashLoopBackOff`, `Back-off restarting`, repeated `Created container` events, or probe failures after readiness. Pod and container logs — in particular `previous.log`, the only record of WHY a dead container exited — exist exclusively inside the sosreport tarball.
-
-**How to extract:** find the tarball for the scenario, then run the extraction script on that single tarball:
-
-```bash
-# Find sosreport tarballs for the scenario
-find <scenario-dir>/.. -name 'sosreport-*.tar.xz'
-
-# Extract only pod logs, inspect outputs, and cluster-scoped resources
-bash plugins/shared/scripts/extract-sosreport.sh <tarball-path>
-```
-
-The script prints the extraction directory to stdout. Extracted files land in `<tarball-parent>/sos-extracted/<sosreport-name>/`. The extraction is idempotent — running it again on the same tarball is a no-op. Inside the extracted tree:
-
- `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/{current,previous}.log` — container logs
- `sos_commands/microshift/namespaces/<namespace>/core/{pods.yaml,events.yaml}` — pod status and events
- `sos_commands/microshift/cluster-scoped-resources/` — nodes, CRDs, webhooks
- `sos_commands/*/inspect_*` — component command outputs
-
-**There may be several sosreports for a single scenario**: the test framework's sos-on-failure listener (`test/resources/sos-on-failure-listener.py` in openshift/microshift) captures a sosreport at the moment of each test failure, in addition to the one collected at the end of the scenario. **Prefer the on-failure sosreport when investigating a specific test failure**: it contains the pods and container logs of the namespaces created specifically for that test (suite), which are absent from the end-of-scenario sosreport because they have already been cleaned up by then. Match a sosreport to its test failure by capture time.
-
-Correlate journal entries with the failure timestamp recorded during the Characterize phase.
-
-## Performance Graphs
-
-When the input is a local artifacts directory of the form `<WORKDIR>/artifacts/<BUILD_ID>` (the doctor workflow), pre-generated PCP performance graphs may exist in the sibling directory:
-
-```text
-<WORKDIR>/graphs/<BUILD_ID>/
-  1_cpu_usage.png    — CPU usage (user, system, I/O wait)
-  2_mem_usage.png    — Memory usage (used, cached)
-  3_disk_io.png      — Disk I/O (read/write OPS, await)
-  4_disk_usage.png   — Disk usage by partition (% fill)
-```
-
-Use the Read tool to view these PNGs during the drill-down phase whenever the failure involves a timeout, slowness, readiness/health-check expiry, eviction, OOM, or any resource-related error. Look for CPU saturation, memory exhaustion, or disk I/O stalls overlapping the failure window. If the directory does not exist (e.g., standalone URL invocation), skip graph correlation — do not attempt to generate graphs.
+URL formats — periodic: `.../logs/{JOB_NAME}/{JOB_ID}`, presubmit: `.../pr-logs/pull/openshift_microshift/{PR}/{JOB_NAME}/{JOB_ID}`.
+Hosts: `prow.ci.openshift.org/view/gs/test-platform-results/...` or `gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/...`.

 ## Work Directory

-Compute once at the start by running `date +%y%m%d` and substituting into the path below. In all commands, replace `<WORKDIR>` with the computed path — do not store the work directory in a shell variable.
-
-```text
-/tmp/microshift-ci-claude-workdir.<YYMMDD>
-```
-
-## Common Commands
-
-Scan the build log for arbitrary text:
-
-```bash
-grep '${SOME_TEXT}' ${GREP_OPTS} ${TMP}/build-log.txt
-```
-
-Download all prow job artifacts (only needed when given a URL, not a local path):
-
-```bash
-GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
-gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
-```
+`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.

 ## Workflow

 The user argument is: `<ARGUMENTS>`

-0. **Determine input type and set up artifacts directory**:
-   - If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
-   - If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
+1. **Set up artifacts**:
+   - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
+   - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
git ls-files plugins/microshift-ci/skills/prow-job/SKILL.md plugins/docs/SKILL-GUIDELINES.md CONTRIBUTING.md
echo
echo "=== plugins/microshift-ci/skills/prow-job/SKILL.md (selected lines) ==="
nl -ba plugins/microshift-ci/skills/prow-job/SKILL.md | sed -n '1,120p'
echo
echo "=== Search for local-path / absolute-path wording in related skill docs ==="
rg -n "local path|absolute path|starts with /|artifacts directory|TMP|mktemp -d|prow-job" plugins/microshift-ci/skills plugins/docs CONTRIBUTING.md

Repository: openshift-eng/edge-tooling

Length of output: 373

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "=== plugins/microshift-ci/skills/prow-job/SKILL.md ==="
cat -n plugins/microshift-ci/skills/prow-job/SKILL.md | sed -n '1,140p'
echo
echo "=== Relevant matches ==="
rg -n "local path|absolute path|starts with /|artifacts directory|TMP|mktemp -d|existing directory|directory" plugins/microshift-ci/skills/prow-job/SKILL.md plugins/microshift-ci/skills -g 'SKILL.md'

Repository: openshift-eng/edge-tooling

Length of output: 7340

Require an absolute local-path contract.

`<ARGUMENTS>` says “local artifacts directory,” but step 1 treats only `/`-prefixed inputs as local, so a relative path falls through to the URL branch. Either document the absolute-path requirement or accept any existing directory as a local input.
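
One way to read the second option (accept existing directories) is sketched below in Python. This is an illustration under stated assumptions, not the skill's actual logic: the function names `classify_argument` and `local_artifacts_dir` are hypothetical, and the sketch assumes that anything starting with `http://` or `https://` is a job URL.

```python
import os


def classify_argument(argument: str) -> str:
    """Return "url" or "local" for the skill argument; raise if it is neither."""
    if argument.startswith(("http://", "https://")):
        return "url"
    # Accept any existing directory, relative or absolute, as a local path.
    if os.path.isdir(argument):
        return "local"
    raise ValueError(f"not a URL or an existing directory: {argument!r}")


def local_artifacts_dir(argument: str) -> str:
    """Normalize a local artifacts directory to an absolute path."""
    return os.path.abspath(argument)
```

Normalizing to an absolute path up front keeps later steps, which derive the build ID from the last path component, independent of the caller's working directory.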

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 17 - 32, The `SKILL.md` workflow currently only treats `/`-prefixed inputs as local in the artifact setup logic, so relative directories will be misclassified as URLs. Update the `<ARGUMENTS>` contract and the Step 1 handling in the Prow job workflow to either explicitly require an absolute local artifacts path or accept existing relative directories as local inputs, and keep the guidance aligned with the `Work Directory` and `Workflow` sections.

coderabbitai · 2026-07-03T14:52:35Z

 ## Work Directory

-Compute once at the start by running `date +%y%m%d` and substituting into the path below. In all commands, replace `<WORKDIR>` with the computed path — do not store the work directory in a shell variable.
-
-```text
-/tmp/microshift-ci-claude-workdir.<YYMMDD>
-```
-
-## Common Commands
-
-Scan the build log for arbitrary text:
-
-```bash
-grep '${SOME_TEXT}' ${GREP_OPTS} ${TMP}/build-log.txt
-```
-
-Download all prow job artifacts (only needed when given a URL, not a local path):
-
-```bash
-GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
-gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
-```
+`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.

 ## Workflow

 The user argument is: `<ARGUMENTS>`

-0. **Determine input type and set up artifacts directory**:
-   - If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
-   - If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
+1. **Set up artifacts**:
+   - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
+   - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.



🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Create the parent workdir before mktemp -d.

mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX assumes <WORKDIR> already exists, so a clean run can fail before download/extraction starts. Add an explicit mkdir -p "<WORKDIR>" first.

Suggested fix

 1. **Set up artifacts**:
+   - Ensure `<WORKDIR>` exists: `mkdir -p "<WORKDIR>"`.
    - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
    - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.
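
For readers who prefer to see the ordering in code, here is a small Python equivalent of "create the parent, then create the unique subdirectory". It is a sketch only: the skill itself uses shell commands, and the names `workdir_path` and `make_tmp_dir` are hypothetical.

```python
import os
import tempfile
from datetime import date


def workdir_path(base: str = "/tmp") -> str:
    """Build the dated work directory path, like `date +%y%m%d` in the skill."""
    stamp = date.today().strftime("%y%m%d")
    return os.path.join(base, f"microshift-ci-claude-workdir.{stamp}")


def make_tmp_dir(workdir: str) -> str:
    """Create the parent first (mkdir -p), then a unique subdirectory (mktemp -d)."""
    os.makedirs(workdir, exist_ok=True)
    return tempfile.mkdtemp(prefix="openshift-ci-analysis-", dir=workdir)
```

Without the `os.makedirs` call, `tempfile.mkdtemp` fails on a clean machine for the same reason `mktemp -d` does: the parent directory does not exist yet.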

📝 Committable suggestion

Suggested change

## Work Directory

`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.

## Workflow

The user argument is: `<ARGUMENTS>`

1. **Set up artifacts**:
   - Ensure `<WORKDIR>` exists: `mkdir -p "<WORKDIR>"`.
   - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
   - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 22 - 33, The workdir setup in the SKILL workflow can fail on a clean run because mktemp -d assumes the parent directory already exists. Update the setup steps in SKILL.md so the URL path branch explicitly creates the parent workdir first with mkdir -p before calling mktemp -d, and keep the existing WORKDIR/TMP flow intact.

coderabbitai · 2026-07-03T14:52:35Z

 ## Workflow

 The user argument is: `<ARGUMENTS>`

-0. **Determine input type and set up artifacts directory**:
-   - If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
-   - If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
+1. **Set up artifacts**:
+   - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
+   - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.

-1. **Download all artifacts** (skip if using pre-downloaded artifacts from step 0):
-   Download all prow job artifacts using `gsutil -q -m cp -r` into the temporary working directory. Derive the GCS path by stripping the web prefix from the job URL (handles both Prow and GCS web URL formats):
+2. **Download** (URL only):

   ```bash
-   GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
-   gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
+   GCS_PATH=$(echo "<URL>" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' \
+                                  -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
+   gsutil -q -m cp -r "${GCS_PATH}/" <TMP>/
   ```

-   This works for both periodic (`logs/...`) and presubmit PR (`pr-logs/pull/...`) job URLs, and for both Prow and GCS web URL formats.
-   This makes all build logs, step logs, and SOS reports available locally for analysis.
-
-2. **Localize — identify the failed step and the anchor error**:
-   - Scan the top level `build-log.txt` to determine the step where the failure occurred (the last `Running step ...` line before the container logs is a quick anchor — see Tips), then open that step's own `build-log.txt`.
-   - Record each candidate error with its filepath, line number, and timestamp. Read 50 lines before and 50 lines after each to separate the fatal error from setup/teardown noise.
-   - Select the **anchor error**: the first fatal error that caused the step to fail. This becomes `raw_error` in the report.
-   - **The anchor identifies the failure for deduplication — it is NOT the conclusion of the investigation. The first error found is rarely the root cause.**
-
-3. **Characterize — establish exactly WHAT failed before asking why**:
-   - For test steps with scenarios: enumerate the failing tests from `scenario-info/<scenario>/junit.xml` under the step's artifacts, then read the failing scenario's `rf-debug.log` and `phase_*/` logs (Robot Framework marks failures with `| FAIL |`). Record the failing scenario name(s) — the top-level `testsuite name` in each junit.xml — they populate the `scenarios` field in the report.
-   - For each failing scenario, check the plain-text `journal_*.log` files (next to the sosreport tarballs) for fatal patterns (panics, OOM kills, `leader election lost`, container exits). If the journal shows container crashes or restarts, extract the specific sosreport tarball with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>` and read the pod logs (see SOS Report section).
-   - For conformance steps: extract the failing test names and their failure output from the step's `build-log.txt`.
-   - For build/infra steps: extract the failing command and its complete error output from the step log.
-   - Record the failure timestamp(s) — they drive the journal and graph correlation in the next phase.
-   - When the MicroShift source checkout is available — check with Glob for `<WORKDIR>/src/microshift-release-<RELEASE>/` (release jobs) or `<WORKDIR>/src/microshift/` (main) — read the failing test's source: Robot Framework suites under `test/suites/`, scenario definitions under `test/scenarios*/`. Its assertions, timeouts, and setup are how you distinguish a test bug from a product bug. If the checkout is absent, note `"source checkout not available"` in `analysis_gaps` and continue.
-   - Decide the stack layer: cloud infra, ci-config, hypervisor, or a legitimate test failure — and for test failures, the stage: setup, testing, teardown.
+3. **Extract evidence**:

-4. **Drill down — iterate hypothesis → evidence until the cause is actionable**:
-   Repeat this loop until you reach a cause that is **actionable** (a specific code, configuration, test, or infrastructure problem someone can act on) or until the available evidence is exhausted:
-   - State a hypothesis for WHY the error in hand occurred.
-   - Seek confirming or refuting evidence ONE LAYER DEEPER than the current log:
-     - **Journal** — ALWAYS check the plain-text `journal_*.log` files for the scenario (see SOS Report section). Correlate with the failure timestamp (entries within ±5 minutes) and scan for OOM kills, segfaults, service restarts, and disk pressure.
-     - **Sosreport** — when the journal shows container crashes or restarts, extract the specific sosreport tarball with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>` (see SOS Report section for how to pick the right one when several exist). Read the pod/container logs of the failing workload.
-     - **Performance graphs** — when the failure involves a timeout, slowness, readiness/health-check expiry, eviction, or any resource error, Read the PNGs (see Performance Graphs section) and look for saturation overlapping the failure window.
-   - Treat restating errors as symptoms: an error like "timed out waiting for X" is NOT a root cause — explain why X was slow or absent, or explicitly record that the evidence ran out.
-   - **A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves a product component that was unavailable, not ready, crashed, or slow ("no endpoints available", "connection refused", "not ready", "CrashLoopBackOff", probe failures), you MUST reconstruct that component's story from the journal and its pod logs before concluding. Build an exact timestamped timeline: when was the pod created, when did each container start, when did it become ready, did probes fail afterwards, did it restart, and why. Only then attribute the failure:
-     - **Product defect** — the component became ready and later flapped, crashed, or stopped serving (e.g., readiness flips back to not-ready, liveness probe connection refused after startup, container exits and restarts). Report the product mechanism as the root cause even if a test-side wait would also "fix" the symptom.
-     - **Test defect** — the component was still starting up normally and the test simply ran too early against a documented startup sequence.
-   - **Always check for container restarts.** Grep the journal for repeated `Created container`/`Started container` (crio) and `RemoveContainer`/PLEG events (kubelet) for the same pod. Two container instances for one pod means the first one DIED — a single startup story is the wrong narrative. Extract the sosreport (`bash plugins/shared/scripts/extract-sosreport.sh <tarball>`) and read the dead container's log at `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/previous.log` (`current.log` is the running instance). The last ~20 lines of `previous.log` usually state the exit reason (fatal error, leader election lost, panic, OOM).
-   - Record every accepted hop as a causal-chain link with its evidence file and line — these become `causal_chain` in the report. Discarded hypotheses do not go into the chain.
+   ```bash
+   python3 plugins/shared/scripts/extract-evidence.py --artifacts-dir <TMP> --workdir <WORKDIR>
+   ```

-5. **Corroborate — cross-check the explanation**:
-   - When the source checkout is available, list commits from the last month that could be related:
+   Produces `<WORKDIR>/evidence/evidence-<BUILD_ID>.json`. The `<BUILD_ID>` is the last path component of `<TMP>`.

-     ```text
-     bash plugins/microshift-ci/scripts/repo-log.sh <SRC_DIR> --since <1_MONTH_BEFORE_FINISHED> --until <FINISHED_DATE> --paths test/
-     ```
+4. **Analyze**: Read `plugins/microshift-ci/agents/analyze-evidence.md`. Substitute placeholders:

-     Derive `FINISHED_DATE` from the job's `finished.json` timestamp. Drop `--paths` to see all changes. Name candidate commits in the causal chain when their timing and touched paths match the failure.
-   - If multiple scenarios in this job failed, decide cascade vs independent using the **timeline** (which failed first; did the earlier failure poison shared state?), not just error-text similarity.
+   | Placeholder | Value |
+   |---|---|
+   | `{EVIDENCE_PACK}` | `<WORKDIR>/evidence/evidence-<BUILD_ID>.json` |
+   | `{JOB_NAME}` | job name extracted from URL or directory path |
+   | `{JOB_URL}` | the original URL (or reconstruct from artifacts path) |
+   | `{OUTPUT_FILE}` | `<WORKDIR>/report-<BUILD_ID>.txt` |

-6. **Produce a report**: Create a concise report of the failure. The report MUST specify:
-   - Where in the pipeline the error occurred
-   - The specific step the error occurred in
-   - Whether the test failure was legitimate (i.e., a test failed) or due to an infrastructure failure (i.e., build image was not found, AWS infra failed due to quota, hypervisor failed to create test host VM, etc.)
-   - The causal chain from the observed symptom to the root cause, each link backed by evidence (file and line)
-   - A confidence rating for the root cause (see the field rules below)
+   Spawn the agent with the substituted content. When it replies `DONE`, read the output file and present the report to the user.
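
Steps 3 and 4 in this hunk boil down to deriving the evidence pack path from `<TMP>` and filling four placeholders in the agent prompt. A minimal Python sketch of that bookkeeping follows. The function names are hypothetical, and the simple string replacement is an assumption about how substitution could be done, not a description of how the skill does it.

```python
import os

REQUIRED = ("EVIDENCE_PACK", "JOB_NAME", "JOB_URL", "OUTPUT_FILE")


def evidence_pack_path(workdir: str, tmp: str) -> str:
    """<BUILD_ID> is the last path component of <TMP>."""
    build_id = os.path.basename(os.path.normpath(tmp))
    return os.path.join(workdir, "evidence", f"evidence-{build_id}.json")


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replace each {NAME} placeholder; refuse to run with a value missing."""
    missing = [name for name in REQUIRED if name not in values]
    if missing:
        raise ValueError(f"missing placeholder values: {missing}")
    for name in REQUIRED:
        template = template.replace("{" + name + "}", values[name])
    return template
```

Checking for missing values before spawning the agent avoids handing it a prompt that still contains a literal `{OUTPUT_FILE}`.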



🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore inline stop conditions for the agent workflow.

This compact rewrite no longer says what to do when gsutil, evidence extraction, or the agent step fails, or when the agent never returns DONE. Per CONTRIBUTING.md: “Steps/workflow must be numbered and actionable with clear stop conditions” and “Orchestrator skills should include output structure plus Edge Cases and guard checks for sub-agent output.” Based on learnings, keep those guards inline with the relevant step rather than moving them to a separate section.
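
To make the idea of inline stop conditions concrete, here is a Python sketch of per-step guards. It is illustrative only: the skill is a prose workflow, and `StepFailed`, `run_step`, and `check_agent_result` are hypothetical names. The 600-second default timeout is likewise an assumption.

```python
import os
import subprocess


class StepFailed(RuntimeError):
    """Raised when a workflow step must abort the run."""


def run_step(name: str, argv: list[str], timeout: float = 600.0) -> str:
    """Run one step; stop the workflow on a missing command, timeout, or non-zero exit."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        raise StepFailed(f"{name}: timed out after {timeout}s") from exc
    except FileNotFoundError as exc:
        raise StepFailed(f"{name}: command not found: {argv[0]}") from exc
    if result.returncode != 0:
        raise StepFailed(f"{name}: exit {result.returncode}: {result.stderr.strip()}")
    return result.stdout


def check_agent_result(reply: str, output_file: str) -> None:
    """Guard the sub-agent hand-off: require DONE and a non-empty report file."""
    if reply.strip() != "DONE":
        raise StepFailed(f"analyze: agent did not reply DONE (got {reply!r})")
    if not os.path.isfile(output_file) or os.path.getsize(output_file) == 0:
        raise StepFailed(f"analyze: report missing or empty: {output_file}")
```

Each guard sits next to the step it protects, which is the layout the review asks for: download, evidence extraction, and analysis each fail with a message naming the step.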

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 26 - 60, The workflow in SKILL.md is missing inline stop conditions for failure cases and the agent completion check. Update the numbered steps around the artifact download, evidence extraction, and analyze stages to explicitly say what to do if gsutil fails, extract-evidence.py fails, or the spawned agent does not return DONE, using the existing workflow structure and the analyze-evidence.md-driven agent step as the anchor. Keep the guard conditions adjacent to their relevant step, and make the steps actionable with clear fallback/abort behavior rather than deferring these checks to a separate section.

Sources: Path instructions, Learnings

coderabbitai · 2026-07-03T14:52:35Z

+    refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
+    if isinstance(refs, dict):
+        org = refs.get("org", "")
+        repo = refs.get("repo", "")
+        if job_name and build_id:
+            pulls = refs.get("pulls", [])
+            if pulls:
+                pr_num = pulls[0].get("number", "")
+                job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr_num}/{job_name}/{build_id}"
+            else:
+                job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}"


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="plugins/shared/scripts/extract-evidence.py"
printf '\n== Context around lines 182-192 ==\n'
nl -ba "$FILE" | sed -n '165,205p'
printf '\n== Search for refs/extra_refs usage ==\n'
rg -n "extra_refs|refs =" "$FILE"
printf '\n== Small Python probe for the expression shape ==\n'
python3 - <<'PY'
cases = [
    {"prowjob": None},
    {"prowjob": {"spec": {}}},
    {"prowjob": {"spec": {"refs": {"org": "o"}}}},
    {"prowjob": {"spec": {"extra_refs": []}}},
    {"prowjob": {"spec": {"extra_refs": [{}]}}},
    {"prowjob": {"spec": {"extra_refs": [{"org": "o"}]}}},
]
for case in cases:
    prowjob = case["prowjob"]
    try:
        refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
        print(case, "=>", refs)
    except Exception as e:
        print(case, "=>", type(e).__name__, e)
PY

Repository: openshift-eng/edge-tooling

Length of output: 240

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="plugins/shared/scripts/extract-evidence.py"
printf '\n== Context around lines 175-205 ==\n'
awk 'NR>=175 && NR<=205 { printf "%d:%s\n", NR, $0 }' "$FILE"
printf '\n== Exact refs/extra_refs expression ==\n'
grep -n "extra_refs\|refs =" "$FILE" || true
printf '\n== Python probe for expression behavior ==\n'
python3 - <<'PY'
cases = [
    ("no prowjob", None),
    ("empty spec", {"spec": {}}),
    ("refs present", {"spec": {"refs": {"org": "o"}}}),
    ("extra_refs empty", {"spec": {"extra_refs": []}}),
    ("extra_refs one empty dict", {"spec": {"extra_refs": [{}]}}),
    ("extra_refs one entry", {"spec": {"extra_refs": [{"org": "o"}]}}),
]
for label, prowjob in cases:
    try:
        refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
        print(label, "=>", refs)
    except Exception as e:
        print(label, "=>", type(e).__name__, str(e))
PY

Repository: openshift-eng/edge-tooling

Length of output: 2026

Guard extra_refs before indexing

A present-but-empty extra_refs: [] still reaches [0] and raises IndexError in single-mode jobs. Split this into two steps so the fallback checks extra_refs before indexing.

Suggested guard

-    refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
+    spec = prowjob.get("spec", {}) if prowjob else {}
+    extra_refs = spec.get("extra_refs") or []
+    refs = spec.get("refs") or (extra_refs[0] if extra_refs else {})
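
The guarded lookup can be exercised in isolation. The sketch below wraps the suggested three lines in a function (the name `select_refs` is hypothetical and does not appear in `extract-evidence.py`) so the empty-list case can be checked directly.

```python
def select_refs(prowjob):
    """Pick spec.refs, else the first extra_refs entry, else an empty dict."""
    spec = (prowjob or {}).get("spec") or {}
    # A present-but-empty extra_refs list must not be indexed.
    extra_refs = spec.get("extra_refs") or []
    return spec.get("refs") or (extra_refs[0] if extra_refs else {})


if __name__ == "__main__":
    for prowjob in (None, {"spec": {"extra_refs": []}}, {"spec": {"refs": {"org": "o"}}}):
        print(prowjob, "=>", select_refs(prowjob))
```

Checking `extra_refs` for emptiness before indexing is the whole fix; the `.get("extra_refs", [{}])` default in the original only helps when the key is absent, not when it is present and empty.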

📝 Committable suggestion

Suggested change

    spec = prowjob.get("spec", {}) if prowjob else {}
    extra_refs = spec.get("extra_refs") or []
    refs = spec.get("refs") or (extra_refs[0] if extra_refs else {})
    if isinstance(refs, dict):
        org = refs.get("org", "")
        repo = refs.get("repo", "")
        if job_name and build_id:
            pulls = refs.get("pulls", [])
            if pulls:
                pr_num = pulls[0].get("number", "")
                job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr_num}/{job_name}/{build_id}"
            else:
                job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/extract-evidence.py` around lines 182 - 192, The refs selection logic in extract-evidence.py can still index into an empty extra_refs array and raise IndexError in single-mode jobs. Refactor the refs assignment near the prowjob handling to check whether extra_refs is present and non-empty before accessing its first element, and keep the existing fallback behavior when spec.refs is missing. Use the refs, prowjob, and extra_refs lookup path to locate the fix.

coderabbitai · 2026-07-03T14:52:35Z

+    if anchor_line_idx >= 0:
+        start = max(0, anchor_line_idx - 5)
+        end = min(len(lines), anchor_line_idx + 6)
+        context_lines = [l.rstrip() for l in lines[start:end]]


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Ruff violations — the file will fail ruff validation.

Lines 330 & 679: ambiguous variable name l (E741).

Line 535: infra_fail_count unpacked but unused (RUF059); prefix with _.

Lines 1006, 1008, 1010, 1012, 1014: multiple statements per line via ; (E702).

Per CONTRIBUTING.md Code Standards: "Python — Must pass ruff".

Also applies to: 535-535, 679-679, 1006-1014
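
As an illustration of the three rule classes named above, here is a short Python sketch of Ruff-clean equivalents. Only the `context_lines` logic is taken from the diff; `context_window` and `count_failures` are hypothetical names, and the counter values are made up for the example.

```python
def context_window(lines: list[str], anchor_line_idx: int, radius: int = 5) -> list[str]:
    """Return up to `radius` lines either side of the anchor, right-stripped."""
    if anchor_line_idx < 0:
        return []
    start = max(0, anchor_line_idx - radius)
    end = min(len(lines), anchor_line_idx + radius + 1)
    # E741: use a descriptive loop variable instead of `l`.
    return [line.rstrip() for line in lines[start:end]]


def count_failures() -> tuple[int, int]:
    """Hypothetical stand-in for a function returning two counters."""
    return 3, 1


# RUF059: prefix an unpacked-but-unused name with an underscore.
test_fail_count, _infra_fail_count = count_failures()

# E702: one statement per line instead of `total = 0; total += n`.
total = 0
total += test_fail_count
```

None of these changes alter behavior; they only rename, prefix, or split statements.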

🧰 Tools

🪛 Ruff (0.15.20)

[error] 330-330: Ambiguous variable name: l (E741)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/extract-evidence.py` at line 330, The script has Ruff violations that will fail validation: rename the ambiguous comprehension variable `l` in `extract-evidence.py` to a clearer name in the `context_lines` assignment and the other similar spot, prefix the unused unpacked `infra_fail_count` with `_` where it is assigned, and split the chained statements separated by semicolons into separate lines in the affected block. Update the relevant logic in `extract-evidence.py` around `context_lines`, `infra_fail_count`, and the code near the multiple-statement lines so the file passes Ruff.

Sources: Coding guidelines, Linters/SAST tools

coderabbitai · 2026-07-03T14:52:35Z

+    json_text = m.group(1)
+    # LLM agents sometimes copy raw control characters (tabs, CRs) from
+    # build logs into JSON string values.  Sanitize before parsing.
+    json_text = json_text.replace('\t', '\\t').replace('\r', '\\r')


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

sed -n '1,140p' plugins/shared/scripts/parse.py
printf '\n---\n'
rg -n "parse\.py|json_text\.replace|replace\('\\t'" -S plugins tests . -g '!**/.git/**'

Repository: openshift-eng/edge-tooling

Length of output: 5882

🏁 Script executed:

python3 - <<'PY'
import json
original = '{\n\t"a": 1,\n\t"b": 2\n}'
sanitized = original.replace('\t', '\\t').replace('\r', '\\r')
print("original parses:", json.loads(original))
try:
    json.loads(sanitized)
    print("sanitized parses: yes")
except json.JSONDecodeError as e:
    print("sanitized parses: no")
    print(type(e).__name__, e.msg)
PY

Repository: openshift-eng/edge-tooling

Length of output: 285

Avoid escaping tabs before parsing.

`json_text.replace('\t', '\\t')` rewrites every tab in the payload, including tabs that are valid JSON whitespace between tokens, so tab-indented payloads become invalid and `json.loads()` fails. Escape control characters only inside string values, or remove the pre-parse rewrite.
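
One possible way to tolerate raw control characters inside string values without touching structural whitespace is the standard library's `strict=False` decoder option. The sketch below shows that option; it is not what the PR does, and the wrapper name `parse_summary` is hypothetical.

```python
import json


def parse_summary(json_text: str):
    """Parse JSON, tolerating raw control characters inside string values."""
    # strict=False lets the decoder accept raw tabs, CRs, and newlines in strings;
    # whitespace between tokens is untouched, so tab-indented JSON still parses.
    return json.loads(json_text, strict=False)


if __name__ == "__main__":
    payload = '{\n\t"raw_error": "step\tfailed\r"\n}'
    print(parse_summary(payload))
```

Because nothing is rewritten before parsing, a tab-indented payload and a payload with a raw tab inside a string value are both accepted.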

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/parse.py` at line 51, The pre-parse normalization in parse.py is escaping tabs in the whole JSON payload, which breaks valid tab-indented JSON before json.loads() runs. Update the parse flow in the json_text handling to stop rewriting all tab characters globally; instead, only handle control characters inside string values or remove the tab replacement entirely while keeping the existing JSON parsing path intact.

pmtk added 3 commits July 3, 2026 11:58

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2026

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2026

pmtk mentioned this pull request Jul 3, 2026

CI Doctor: failure fingerprint grouping, slim doctor prose #215

Draft

coderabbitai Bot requested changes Jul 3, 2026

pmtk wants to merge 3 commits into openshift-eng:main from pmtk:ci-doctor-shiftweek-1

❌ Failed checks (6 inconclusive)
