Skip to content

CI Doctor: evidence packs, prow job analysis agent, double check causal chains#214

Draft
pmtk wants to merge 3 commits into
openshift-eng:mainfrom
pmtk:ci-doctor-shiftweek-1
Draft

CI Doctor: evidence packs, prow job analysis agent, double check causal chains#214
pmtk wants to merge 3 commits into
openshift-eng:mainfrom
pmtk:ci-doctor-shiftweek-1

Conversation

@pmtk

@pmtk pmtk commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Added a new CI analysis workflow that extracts structured evidence from job artifacts and generates standardized failure reports.
    • Expanded job troubleshooting support with automatic report validation and retry handling for citation issues.
  • Documentation

    • Added new guidance for interpreting CI artifacts, analyzing failures, and formatting structured summaries.
  • Bug Fixes

    • Improved JSON parsing robustness for report content with malformed control characters.

pmtk added 3 commits July 3, 2026 11:58
extract-evidence.py condenses a failed job's downloaded artifacts into
one structured evidence file per job: failed steps, test and phase
failures, journal alerts, and container restart counts — every entry
stamped with its timestamp and merged into a single time-sorted
failure timeline. doctor.sh gains an `evidence` phase that runs it for
every downloaded job, and the lvms-ci/microshift-ci plugins symlink
the shared script.

The evidence pack becomes the single starting point for analysis
agents instead of each agent re-scanning raw artifacts.
Replace the prow-job skill's inline RCA instructions with a dedicated
analyze-evidence agent that starts from the evidence pack and consults
the MicroShift CI artifact primer (moved under agents/references/) and
a structured-summary contract with tightened causal-chain rules. The
doctor skill launches the same agent for its per-job analyses;
prow-job becomes a thin wrapper that downloads artifacts, extracts
evidence, and spawns the agent.

validate-reports.py checks every agent report against the structured
summary contract, and the doctor skill re-launches fix agents for
reports that fail; parse.py sanitizes structured summaries before
parsing.
The validator previously only checked that 'evidence' looked like a
path — a hallucinated-but-plausible citation passed. It now resolves
each citation against the job's downloaded artifacts (build dir derived
from the entry's job_url), checks the file exists, the line is in
range, and the quote actually appears near the cited line (timestamps
stripped, whitespace normalized). Error messages include where the
quote really is so fix agents can re-ground citations instead of
guessing.

Fix agents are no longer told to delete unsupported links to pass
validation — they must re-ground each link or move the claim to
analysis_gaps and downgrade confidence, then re-run the validator on
their own output.

Evidence packs now record the source file for every rf/boot_and_run/
journal alert entry (journal alerts from multiple files are merged, so
line numbers alone were ambiguous).

Drop missing_patterns from the agent contract: nothing consumed it —
parse.py discarded it at aggregation — so it was pure token cost.
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2026
@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2026
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds a shared evidence-extraction pipeline (extract-evidence.py, validate-reports.py) under plugins/shared/scripts, wires a new evidence command into doctor.sh, updates parse.py's JSON sanitization, adds analyze-evidence agent/reference docs, rewrites doctor and prow-job SKILL.md workflows, removes the old prow-job primer, and adds symlinks in lvms-ci/microshift-ci plugins.

Changes

Evidence Extraction and Validation Pipeline

Layer / File(s) Summary
Core evidence extraction: metadata, classification, failed-step detection
plugins/shared/scripts/extract-evidence.py
Adds imports/constants, JSON/log helpers, metadata extraction, job-type classification, failed-step detection, and infra-indicator scanning.
Scenario, conformance, and build-error parsing
plugins/shared/scripts/extract-evidence.py
Adds junit/RF-debug/journal parsing, container-restart and sosreport helpers, full scenario extraction, conformance failure extraction, and build/config error extraction.
Source context, timeline, orchestration, batch mode, CLI
plugins/shared/scripts/extract-evidence.py
Adds source-context extraction, PCP graph discovery, timeline correlation, top-level extract_evidence(), batch mode, and CLI argument parsing.
Report validation and structured-summary sanitization
plugins/shared/scripts/validate-reports.py, plugins/shared/scripts/parse.py
Adds validate-reports.py with citation/quote resolution and validate_file()/main(), and updates parse_structured_summary() to sanitize control characters before JSON parsing.
doctor.sh evidence command and doctor SKILL.md workflow
plugins/shared/scripts/doctor.sh, plugins/microshift-ci/skills/doctor/SKILL.md
Adds cmd_evidence() with usage/dispatch wiring, and rewrites Step 1c/Step 2 to run evidence extraction, spawn analyze-evidence agents, validate reports, and iterate fix agents.
Analyze-evidence agent and reference docs
plugins/microshift-ci/agents/analyze-evidence.md, plugins/microshift-ci/agents/references/*
Adds the analyze-evidence agent spec, MicroShift CI artifact primer, and structured-summary output contract documentation.
prow-job SKILL.md rewrite
plugins/microshift-ci/skills/prow-job/SKILL.md, plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
Condenses the prow-job skill into a four-step evidence-based workflow and removes the old primer reference doc.
Symlink shims
plugins/lvms-ci/scripts/extract-evidence.py, plugins/microshift-ci/scripts/extract-evidence.py, plugins/microshift-ci/scripts/validate-reports.py
Adds symlinks in lvms-ci and microshift-ci plugin directories pointing to the shared scripts.

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

Suggested reviewers: ggiguash, pacevedom

🚥 Pre-merge checks | ✅ 5 | ❌ 6

❌ Failed checks (6 inconclusive)

Check name Status Explanation Resolution
No-Weak-Crypto ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Hardcoded-Secrets ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Injection-Vectors ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ai-Attribution ❓ Inconclusive Repository clone failed, so this custom check could not run with code access. Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: evidence packs, the prow job analysis agent, and stronger causal-chain checking.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 markdownlint-cli2 (0.22.1)
plugins/microshift-ci/agents/analyze-evidence.md

markdownlint-cli2 v0.22.1 (markdownlint v0.40.0)
Finding: plugins/microshift-ci/agents/analyze-evidence.md plugins/microshift-ci/agents/references/microshift-ci-primer.md plugins/microshift-ci/agents/references/structured-summary.md plugins/microshift-ci/skills/doctor/SKILL.md plugins/microshift-ci/skills/prow-job/SKILL.md !node_modules/** !two-node-toolbox/**
Linting: 5 file(s)
Summary: 0 error(s)
AggregateError: Unable to import module 'markdownlint-cli2-formatter-pretty'.
at importModule (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:90:11)
at async Promise.all (index 0)
at async outputResults (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:838:9)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:1029:5)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
[errors]: [
Error: Cannot find module 'markdownlint-cli2-formatt

... [truncated 1314 characters] ...

node:internal/modules/esm/resolve:271:11)
at moduleResolve (node:internal/modules/esm/resolve:861:10)
at defaultResolve (node:internal/modules/esm/resolve:988:11)
at #cachedDefaultResolve (node:internal/modules/esm/loader:697:20)
at #resolveAndMaybeBlockOnLoaderThread (node:internal/modules/esm/loader:714:38)
at ModuleLoader.resolveSync (node:internal/modules/esm/loader:746:52)
at #resolve (node:internal/modules/esm/loader:679:17)
at ModuleLoader.getOrCreateModuleJob (node:internal/modules/esm/loader:599:35)
at node:internal/modules/esm/loader:628:32
at TracingChannel.tracePromise (node:diagnostics_channel:362:14) {
code: 'ERR_MODULE_NOT_FOUND',
url: 'file:///markdownlint-cli2-formatter-pretty'
}
]
}

plugins/microshift-ci/agents/references/microshift-ci-primer.md

markdownlint-cli2 v0.22.1 (markdownlint v0.40.0)
Finding: plugins/microshift-ci/agents/analyze-evidence.md plugins/microshift-ci/agents/references/microshift-ci-primer.md plugins/microshift-ci/agents/references/structured-summary.md plugins/microshift-ci/skills/doctor/SKILL.md plugins/microshift-ci/skills/prow-job/SKILL.md !node_modules/** !two-node-toolbox/**
Linting: 5 file(s)
Summary: 0 error(s)
AggregateError: Unable to import module 'markdownlint-cli2-formatter-pretty'.
at importModule (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:90:11)
at async Promise.all (index 0)
at async outputResults (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:838:9)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:1029:5)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
[errors]: [
Error: Cannot find module 'markdownlint-cli2-formatt

... [truncated 1314 characters] ...

node:internal/modules/esm/resolve:271:11)
at moduleResolve (node:internal/modules/esm/resolve:861:10)
at defaultResolve (node:internal/modules/esm/resolve:988:11)
at #cachedDefaultResolve (node:internal/modules/esm/loader:697:20)
at #resolveAndMaybeBlockOnLoaderThread (node:internal/modules/esm/loader:714:38)
at ModuleLoader.resolveSync (node:internal/modules/esm/loader:746:52)
at #resolve (node:internal/modules/esm/loader:679:17)
at ModuleLoader.getOrCreateModuleJob (node:internal/modules/esm/loader:599:35)
at node:internal/modules/esm/loader:628:32
at TracingChannel.tracePromise (node:diagnostics_channel:362:14) {
code: 'ERR_MODULE_NOT_FOUND',
url: 'file:///markdownlint-cli2-formatter-pretty'
}
]
}

plugins/microshift-ci/agents/references/structured-summary.md

markdownlint-cli2 v0.22.1 (markdownlint v0.40.0)
Finding: plugins/microshift-ci/agents/analyze-evidence.md plugins/microshift-ci/agents/references/microshift-ci-primer.md plugins/microshift-ci/agents/references/structured-summary.md plugins/microshift-ci/skills/doctor/SKILL.md plugins/microshift-ci/skills/prow-job/SKILL.md !node_modules/** !two-node-toolbox/**
Linting: 5 file(s)
Summary: 0 error(s)
AggregateError: Unable to import module 'markdownlint-cli2-formatter-pretty'.
at importModule (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:90:11)
at async Promise.all (index 0)
at async outputResults (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:838:9)
at async main (file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2.mjs:1029:5)
at async file:///usr/local/lib/node_modules/markdownlint-cli2/markdownlint-cli2-bin.mjs:14:22 {
[errors]: [
Error: Cannot find module 'markdownlint-cli2-formatt

... [truncated 1314 characters] ...

node:internal/modules/esm/resolve:271:11)
at moduleResolve (node:internal/modules/esm/resolve:861:10)
at defaultResolve (node:internal/modules/esm/resolve:988:11)
at #cachedDefaultResolve (node:internal/modules/esm/loader:697:20)
at #resolveAndMaybeBlockOnLoaderThread (node:internal/modules/esm/loader:714:38)
at ModuleLoader.resolveSync (node:internal/modules/esm/loader:746:52)
at #resolve (node:internal/modules/esm/loader:679:17)
at ModuleLoader.getOrCreateModuleJob (node:internal/modules/esm/loader:599:35)
at node:internal/modules/esm/loader:628:32
at TracingChannel.tracePromise (node:diagnostics_channel:362:14) {
code: 'ERR_MODULE_NOT_FOUND',
url: 'file:///markdownlint-cli2-formatter-pretty'
}
]
}

  • 2 others

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (2)
plugins/shared/scripts/validate-reports.py (2)

222-223: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Nested err closure captures loop variables (ruff B023).

err closes over sig, li, and cause, and is redefined every iteration. It works today because it's called synchronously within the same iteration, but ruff flags it (B023) and the file must pass ruff. Hoist it to a module/nested helper taking explicit args.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/validate-reports.py` around lines 222 - 223, The
nested err closure in validate-reports.py is capturing loop variables sig, li,
and cause, which triggers ruff B023. Replace the inner closure with a helper
that takes sig, li, and cause as explicit parameters, and update the call site
inside the surrounding loop so the formatted error string is built from
passed-in values instead of closed-over state.

Source: Linters/SAST tools


170-195: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Duplicated STRUCTURED SUMMARY extraction/sanitization — consume the canonical parser instead.

This block re-implements the exact marker regex + tab/control-char sanitization + json.loads that already lives in parse.py's parse_structured_summary. Keeping two copies means the sanitization rules (e.g. the tab handling above) drift between the two consumers.

Per CONTRIBUTING.md Review Principles — "Single source of truth — no derived state": extract a shared helper (e.g. _extract_summary_json(content) in parse.py) and call it from both, so validation and parsing stay in lock-step.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/validate-reports.py` around lines 170 - 195, The
`_load_entries` logic is duplicating STRUCTURED SUMMARY parsing and sanitization
that already exists in `parse.py`’s `parse_structured_summary`, so make the
parser the single source of truth. Extract the shared summary
extraction/sanitization into a helper in `parse.py` (for example, a function
that returns the parsed entries or the normalized JSON text) and have
`_load_entries` call that helper instead of redoing the marker regex,
tab/control-character cleanup, and `json.loads`. Keep the existing fallback
behavior for missing or malformed summaries, but ensure both consumers use the
same canonical implementation.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugins/microshift-ci/skills/doctor/SKILL.md`:
- Around line 121-129: Step 2 uses undefined output filename placeholders for
JOB_ID and JOB_NAME_SUFFIX, so update the placeholder mapping in SKILL.md to use
only fields introduced from the prepare script JSON or rename them to
already-defined placeholders. Make sure the OUTPUT_FILE examples in the step
reference concrete values derived from job, url, and build_id, and that any
PR-specific suffix logic is expressed using the existing JOB_NAME / PR
placeholders rather than new undefined symbols.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md`:
- Around line 26-60: The workflow in SKILL.md is missing inline stop conditions
for failure cases and the agent completion check. Update the numbered steps
around the artifact download, evidence extraction, and analyze stages to
explicitly say what to do if gsutil fails, extract-evidence.py fails, or the
spawned agent does not return DONE, using the existing workflow structure and
the analyze-evidence.md-driven agent step as the anchor. Keep the guard
conditions adjacent to their relevant step, and make the steps actionable with
clear fallback/abort behavior rather than deferring these checks to a separate
section.
- Around line 22-33: The workdir setup in the SKILL workflow can fail on a clean
run because mktemp -d assumes the parent directory already exists. Update the
setup steps in SKILL.md so the URL path branch explicitly creates the parent
workdir first with mkdir -p before calling mktemp -d, and keep the existing
WORKDIR/TMP flow intact.
- Around line 17-32: The `SKILL.md` workflow currently only treats `/`-prefixed
inputs as local in the artifact setup logic, so relative directories will be
misclassified as URLs. Update the `<ARGUMENTS>` contract and the Step 1 handling
in the Prow job workflow to either explicitly require an absolute local
artifacts path or accept existing relative directories as local inputs, and keep
the guidance aligned with the `Work Directory` and `Workflow` sections.

In `@plugins/shared/scripts/extract-evidence.py`:
- Line 330: The script has Ruff violations that will fail validation: rename the
ambiguous comprehension variable `l` in `extract-evidence.py` to a clearer name
in the `context_lines` assignment and the other similar spot, prefix the unused
unpacked `infra_fail_count` with `_` where it is assigned, and split the chained
statements separated by semicolons into separate lines in the affected block.
Update the relevant logic in `extract-evidence.py` around `context_lines`,
`infra_fail_count`, and the code near the multiple-statement lines so the file
passes Ruff.
- Around line 182-192: The refs selection logic in extract-evidence.py can still
index into an empty extra_refs array and raise IndexError in single-mode jobs.
Refactor the refs assignment near the prowjob handling to check whether
extra_refs is present and non-empty before accessing its first element, and keep
the existing fallback behavior when spec.refs is missing. Use the refs, prowjob,
and extra_refs lookup path to locate the fix.

In `@plugins/shared/scripts/parse.py`:
- Line 51: The pre-parse normalization in parse.py is escaping tabs in the whole
JSON payload, which breaks valid tab-indented JSON before json.loads() runs.
Update the parse flow in the json_text handling to stop rewriting all tab
characters globally; instead, only handle control characters inside string
values or remove the tab replacement entirely while keeping the existing JSON
parsing path intact.

---

Nitpick comments:
In `@plugins/shared/scripts/validate-reports.py`:
- Around line 222-223: The nested err closure in validate-reports.py is
capturing loop variables sig, li, and cause, which triggers ruff B023. Replace
the inner closure with a helper that takes sig, li, and cause as explicit
parameters, and update the call site inside the surrounding loop so the
formatted error string is built from passed-in values instead of closed-over
state.
- Around line 170-195: The `_load_entries` logic is duplicating STRUCTURED
SUMMARY parsing and sanitization that already exists in `parse.py`’s
`parse_structured_summary`, so make the parser the single source of truth.
Extract the shared summary extraction/sanitization into a helper in `parse.py`
(for example, a function that returns the parsed entries or the normalized JSON
text) and have `_load_entries` call that helper instead of redoing the marker
regex, tab/control-character cleanup, and `json.loads`. Keep the existing
fallback behavior for missing or malformed summaries, but ensure both consumers
use the same canonical implementation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a046fc16-c033-4567-a1dd-f6c0b098029e

📥 Commits

Reviewing files that changed from the base of the PR and between 54f8bc8 and 65a528c.

📒 Files selected for processing (13)
  • plugins/lvms-ci/scripts/extract-evidence.py
  • plugins/microshift-ci/agents/analyze-evidence.md
  • plugins/microshift-ci/agents/references/microshift-ci-primer.md
  • plugins/microshift-ci/agents/references/structured-summary.md
  • plugins/microshift-ci/scripts/extract-evidence.py
  • plugins/microshift-ci/scripts/validate-reports.py
  • plugins/microshift-ci/skills/doctor/SKILL.md
  • plugins/microshift-ci/skills/prow-job/SKILL.md
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md
  • plugins/shared/scripts/doctor.sh
  • plugins/shared/scripts/extract-evidence.py
  • plugins/shared/scripts/parse.py
  • plugins/shared/scripts/validate-reports.py
💤 Files with no reviewable changes (1)
  • plugins/microshift-ci/skills/prow-job/references/microshift-ci-primer.md

Comment on lines +121 to +129
Substitute these placeholders from the prepare script's JSON output (`job`, `url`, `build_id` fields):

**For release jobs:**
| Placeholder | Value |
|---|---|
| `{EVIDENCE_PACK}` | `<WORKDIR>/evidence/evidence-<BUILD_ID>.json` |
| `{JOB_NAME}` | `job` field (for PR jobs, append a space and `(PR #<PR>)`) |
| `{JOB_URL}` | `url` field |
| `{OUTPUT_FILE}` | Release: `<WORKDIR>/jobs/release-<RELEASE>-job-<N>-<JOB_ID>.txt`. PR: `<WORKDIR>/jobs/prs-job-<N>-pr<PR>-<JOB_NAME_SUFFIX>.txt` |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

printf '--- file outline ---\n'
ast-grep outline plugins/microshift-ci/skills/doctor/SKILL.md || true

printf '\n--- search for JOB_ID / JOB_NAME_SUFFIX / build_id ---\n'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|\bbuild_id\b|\bjob\b|\burl\b' plugins/microshift-ci/skills/doctor/SKILL.md

printf '\n--- surrounding lines 110-140 ---\n'
sed -n '110,140p' plugins/microshift-ci/skills/doctor/SKILL.md

Repository: openshift-eng/edge-tooling

Length of output: 255


🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

printf '--- broader search in plugin docs ---\n'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|JOB_NAME_SUFFIX|JOB_ID' plugins/microshift-ci -g '!**/node_modules/**' || true

Repository: openshift-eng/edge-tooling

Length of output: 255


🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

printf '--- relevant section with line numbers ---\n'
cat -n plugins/microshift-ci/skills/doctor/SKILL.md | sed -n '118,132p'

Repository: openshift-eng/edge-tooling

Length of output: 255


🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

echo '--- relevant section ---'
sed -n '118,132p' plugins/microshift-ci/skills/doctor/SKILL.md

echo
echo '--- placeholder search in file ---'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b|\bbuild_id\b|\bjob\b|\burl\b' plugins/microshift-ci/skills/doctor/SKILL.md

echo
echo '--- broader search in plugin docs ---'
rg -n '\bJOB_ID\b|\bJOB_NAME_SUFFIX\b' plugins/microshift-ci -g '!**/node_modules/**' || true

Repository: openshift-eng/edge-tooling

Length of output: 4967


🏁 Script executed:

#!/usr/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
path = Path('plugins/microshift-ci/skills/doctor/SKILL.md')
text = path.read_text()
for needle in ['JOB_ID', 'JOB_NAME_SUFFIX', 'build_id', 'job', 'url']:
    print(f'-- {needle} --')
    for i, line in enumerate(text.splitlines(), 1):
        if needle in line:
            print(f'{i}: {line}')
PY

Repository: openshift-eng/edge-tooling

Length of output: 5417


Define the output filename placeholders. JOB_ID and JOB_NAME_SUFFIX are not defined anywhere in Step 2, while the surrounding table only maps job, url, and build_id. Map them to concrete fields or rename them to placeholders already introduced.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/doctor/SKILL.md` around lines 121 - 129, Step 2
uses undefined output filename placeholders for JOB_ID and JOB_NAME_SUFFIX, so
update the placeholder mapping in SKILL.md to use only fields introduced from
the prepare script JSON or rename them to already-defined placeholders. Make
sure the OUTPUT_FILE examples in the step reference concrete values derived from
job, url, and build_id, and that any PR-specific suffix logic is expressed using
the existing JOB_NAME / PR placeholders rather than new undefined symbols.

Comment on lines +17 to +32
`<ARGUMENTS>`: Prow URL, GCS web URL, or local artifacts directory.

- `<TMP>/build-log.txt`: Log containing prow job output and most likely place to identify AWS infra related or hypervisor related errors.
- `<STEP>/build-log.txt`: Each step in the CI job is individually logged in a build-log.txt file.
- `<TMP>/artifacts/<TEST_NAME>/openshift-microshift-infra-sos-aws/artifacts/sosreport-*.tar.xz`: Compressed archive containing select portions of the test host's filesystem, relevant logs, and system configurations. `<TEST_NAME>` varies by job (e.g., `e2e-aws-tests`, `e2e-aws-ovn-ocp-conformance-arm64`).
- `<TMP>/artifacts/<TEST_NAME>/openshift-microshift-e2e-origin-conformance/build-log.txt`: Step-specific build log for origin conformance tests.

## Important Links

**Step Diagram URL** (found at the end of the main build-log):

```text
https://steps.ci.openshift.org/job?org=openshift&repo=microshift&branch=release-4.19&test=e2e-aws-tests-bootc-nightly&variant=periodics
```

This link provides a diagram of the steps that make up the test. Think about reading this diagram when identifying step failures because not all fatal errors cause the current step to fail but may cause the next step to fail.

**SOS Report** (contains pod/container logs and cluster-scoped resources)

**Journals:** use the plain-text `journal_*.log` files next to the sosreport tarballs (e.g., `scenario-info/<scenario>/vms/host1/sos/journal_*.log`). These are readable directly with Read/Grep and contain the journal evidence you need (service failures, x509 errors, OOM kills, microshift unit logs).

**Pod logs, cluster state, inspect outputs:** extract a specific sosreport tarball when you need pod logs (container crashes, restarts, probe failures). The extraction script pulls pod logs, inspect outputs, and cluster-scoped resources.

**When to extract a sosreport:** when the journal shows `CrashLoopBackOff`, `Back-off restarting`, repeated `Created container` events, or probe failures after readiness. Pod and container logs — in particular `previous.log`, the only record of WHY a dead container exited — exist exclusively inside the sosreport tarball.

**How to extract:** find the tarball for the scenario, then run the extraction script on that single tarball:

```bash
# Find sosreport tarballs for the scenario
find <scenario-dir>/.. -name 'sosreport-*.tar.xz'

# Extract only pod logs, inspect outputs, and cluster-scoped resources
bash plugins/shared/scripts/extract-sosreport.sh <tarball-path>
```

The script prints the extraction directory to stdout. Extracted files land in `<tarball-parent>/sos-extracted/<sosreport-name>/`. The extraction is idempotent — running it again on the same tarball is a no-op. Inside the extracted tree:

- `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/{current,previous}.log` — container logs
- `sos_commands/microshift/namespaces/<namespace>/core/{pods.yaml,events.yaml}` — pod status and events
- `sos_commands/microshift/cluster-scoped-resources/` — nodes, CRDs, webhooks
- `sos_commands/*/inspect_*` — component command outputs

**There may be several sosreports for a single scenario**: the test framework's sos-on-failure listener (`test/resources/sos-on-failure-listener.py` in openshift/microshift) captures a sosreport at the moment of each test failure, in addition to the one collected at the end of the scenario. **Prefer the on-failure sosreport when investigating a specific test failure**: it contains the pods and container logs of the namespaces created specifically for that test (suite), which are absent from the end-of-scenario sosreport because they have already been cleaned up by then. Match a sosreport to its test failure by capture time.

Correlate journal entries with the failure timestamp recorded during the Characterize phase.

## Performance Graphs

When the input is a local artifacts directory of the form `<WORKDIR>/artifacts/<BUILD_ID>` (the doctor workflow), pre-generated PCP performance graphs may exist in the sibling directory:

```text
<WORKDIR>/graphs/<BUILD_ID>/
1_cpu_usage.png — CPU usage (user, system, I/O wait)
2_mem_usage.png — Memory usage (used, cached)
3_disk_io.png — Disk I/O (read/write OPS, await)
4_disk_usage.png — Disk usage by partition (% fill)
```

Use the Read tool to view these PNGs during the drill-down phase whenever the failure involves a timeout, slowness, readiness/health-check expiry, eviction, OOM, or any resource-related error. Look for CPU saturation, memory exhaustion, or disk I/O stalls overlapping the failure window. If the directory does not exist (e.g., standalone URL invocation), skip graph correlation — do not attempt to generate graphs.
URL formats — periodic: `.../logs/{JOB_NAME}/{JOB_ID}`, presubmit: `.../pr-logs/pull/openshift_microshift/{PR}/{JOB_NAME}/{JOB_ID}`.
Hosts: `prow.ci.openshift.org/view/gs/test-platform-results/...` or `gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/...`.

## Work Directory

Compute once at the start by running `date +%y%m%d` and substituting into the path below. In all commands, replace `<WORKDIR>` with the computed path — do not store the work directory in a shell variable.

```text
/tmp/microshift-ci-claude-workdir.<YYMMDD>
```

## Common Commands

Scan the build log for arbitrary text:

```bash
grep '${SOME_TEXT}' ${GREP_OPTS} ${TMP}/build-log.txt
```

Download all prow job artifacts (only needed when given a URL, not a local path):

```bash
GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
```
`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.

## Workflow

The user argument is: `<ARGUMENTS>`

0. **Determine input type and set up artifacts directory**:
- If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
- If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
1. **Set up artifacts**:
- Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
- URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

git ls-files plugins/microshift-ci/skills/prow-job/SKILL.md plugins/docs/SKILL-GUIDELINES.md CONTRIBUTING.md
echo
echo "=== plugins/microshift-ci/skills/prow-job/SKILL.md (selected lines) ==="
nl -ba plugins/microshift-ci/skills/prow-job/SKILL.md | sed -n '1,120p'
echo
echo "=== Search for local-path / absolute-path wording in related skill docs ==="
rg -n "local path|absolute path|starts with /|artifacts directory|TMP|mktemp -d|prow-job" plugins/microshift-ci/skills plugins/docs CONTRIBUTING.md

Repository: openshift-eng/edge-tooling

Length of output: 373


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== plugins/microshift-ci/skills/prow-job/SKILL.md ==="
cat -n plugins/microshift-ci/skills/prow-job/SKILL.md | sed -n '1,140p'

echo
echo "=== Relevant matches ==="
rg -n "local path|absolute path|starts with /|artifacts directory|TMP|mktemp -d|existing directory|directory" plugins/microshift-ci/skills/prow-job/SKILL.md plugins/microshift-ci/skills -g 'SKILL.md'

Repository: openshift-eng/edge-tooling

Length of output: 7340


Require an absolute local-path contract. <ARGUMENTS> says “local artifacts directory,” but step 1 only treats /-prefixed inputs as local. Relative paths will go down the URL branch. Document the absolute-path requirement or accept existing directories.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 17 - 32, The
`SKILL.md` workflow currently only treats `/`-prefixed inputs as local in the
artifact setup logic, so relative directories will be misclassified as URLs.
Update the `<ARGUMENTS>` contract and the Step 1 handling in the Prow job
workflow to either explicitly require an absolute local artifacts path or accept
existing relative directories as local inputs, and keep the guidance aligned
with the `Work Directory` and `Workflow` sections.

Comment on lines 22 to 33
## Work Directory

Compute once at the start by running `date +%y%m%d` and substituting into the path below. In all commands, replace `<WORKDIR>` with the computed path — do not store the work directory in a shell variable.

```text
/tmp/microshift-ci-claude-workdir.<YYMMDD>
```

## Common Commands

Scan the build log for arbitrary text:

```bash
grep '${SOME_TEXT}' ${GREP_OPTS} ${TMP}/build-log.txt
```

Download all prow job artifacts (only needed when given a URL, not a local path):

```bash
GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
```
`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.

## Workflow

The user argument is: `<ARGUMENTS>`

0. **Determine input type and set up artifacts directory**:
- If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
- If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
1. **Set up artifacts**:
- Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
- URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Create the parent workdir before mktemp -d.

mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX assumes <WORKDIR> already exists, so a clean run can fail before download/extraction starts. Add an explicit mkdir -p "<WORKDIR>" first.

Suggested fix
 1. **Set up artifacts**:
+   - Ensure `<WORKDIR>` exists: `mkdir -p "<WORKDIR>"`.
    - Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
    - URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## Work Directory
Compute once at the start by running `date +%y%m%d` and substituting into the path below. In all commands, replace `<WORKDIR>` with the computed path — do not store the work directory in a shell variable.
```text
/tmp/microshift-ci-claude-workdir.<YYMMDD>
```
## Common Commands
Scan the build log for arbitrary text:
```bash
grep '${SOME_TEXT}' ${GREP_OPTS} ${TMP}/build-log.txt
```
Download all prow job artifacts (only needed when given a URL, not a local path):
```bash
GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
```
`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.
## Workflow
The user argument is: `<ARGUMENTS>`
0. **Determine input type and set up artifacts directory**:
- If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
- If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
1. **Set up artifacts**:
- Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
- URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.
## Work Directory
`/tmp/microshift-ci-claude-workdir.<YYMMDD>` — compute `<YYMMDD>` once via `date +%y%m%d`.
## Workflow
The user argument is: `<ARGUMENTS>`
1. **Set up artifacts**:
- Ensure `<WORKDIR>` exists: `mkdir -p "<WORKDIR>"`.
- Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
- URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 22 - 33, The
workdir setup in the SKILL workflow can fail on a clean run because mktemp -d
assumes the parent directory already exists. Update the setup steps in SKILL.md
so the URL path branch explicitly creates the parent workdir first with mkdir -p
before calling mktemp -d, and keep the existing WORKDIR/TMP flow intact.

Comment on lines 26 to 60
## Workflow

The user argument is: `<ARGUMENTS>`

0. **Determine input type and set up artifacts directory**:
- If `<ARGUMENTS>` is a **local directory path** (starts with `/` and contains `build-log.txt`): set `TMP` to that directory. Skip step 1.
- If `<ARGUMENTS>` is a **URL** (starts with `http`): create a temporary working directory with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`, set `TMP` to that directory, and proceed to step 1.
1. **Set up artifacts**:
- Local path (starts with `/`): use it as `<TMP>`. Skip step 2.
- URL: create `<TMP>` with `mktemp -d <WORKDIR>/openshift-ci-analysis-XXXX`.

1. **Download all artifacts** (skip if using pre-downloaded artifacts from step 0):
Download all prow job artifacts using `gsutil -q -m cp -r` into the temporary working directory. Derive the GCS path by stripping the web prefix from the job URL (handles both Prow and GCS web URL formats):
2. **Download** (URL only):

```bash
GCS_PATH=$(echo "${PROW_URL}" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' -e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
gsutil -q -m cp -r "${GCS_PATH}/" ${TMP}/
GCS_PATH=$(echo "<URL>" | sed -e 's|https://prow.ci.openshift.org/view/gs/|gs://|' \
-e 's|https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/|gs://|')
gsutil -q -m cp -r "${GCS_PATH}/" <TMP>/
```

This works for both periodic (`logs/...`) and presubmit PR (`pr-logs/pull/...`) job URLs, and for both Prow and GCS web URL formats.
This makes all build logs, step logs, and SOS reports available locally for analysis.

2. **Localize — identify the failed step and the anchor error**:
- Scan the top level `build-log.txt` to determine the step where the failure occurred (the last `Running step ...` line before the container logs is a quick anchor — see Tips), then open that step's own `build-log.txt`.
- Record each candidate error with its filepath, line number, and timestamp. Read 50 lines before and 50 lines after each to separate the fatal error from setup/teardown noise.
- Select the **anchor error**: the first fatal error that caused the step to fail. This becomes `raw_error` in the report.
- **The anchor identifies the failure for deduplication — it is NOT the conclusion of the investigation. The first error found is rarely the root cause.**

3. **Characterize — establish exactly WHAT failed before asking why**:
- For test steps with scenarios: enumerate the failing tests from `scenario-info/<scenario>/junit.xml` under the step's artifacts, then read the failing scenario's `rf-debug.log` and `phase_*/` logs (Robot Framework marks failures with `| FAIL |`). Record the failing scenario name(s) — the top-level `testsuite name` in each junit.xml — they populate the `scenarios` field in the report.
- For each failing scenario, check the plain-text `journal_*.log` files (next to the sosreport tarballs) for fatal patterns (panics, OOM kills, `leader election lost`, container exits). If the journal shows container crashes or restarts, extract the specific sosreport tarball with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>` and read the pod logs (see SOS Report section).
- For conformance steps: extract the failing test names and their failure output from the step's `build-log.txt`.
- For build/infra steps: extract the failing command and its complete error output from the step log.
- Record the failure timestamp(s) — they drive the journal and graph correlation in the next phase.
- When the MicroShift source checkout is available — check with Glob for `<WORKDIR>/src/microshift-release-<RELEASE>/` (release jobs) or `<WORKDIR>/src/microshift/` (main) — read the failing test's source: Robot Framework suites under `test/suites/`, scenario definitions under `test/scenarios*/`. Its assertions, timeouts, and setup are how you distinguish a test bug from a product bug. If the checkout is absent, note `"source checkout not available"` in `analysis_gaps` and continue.
- Decide the stack layer: cloud infra, ci-config, hypervisor, or a legitimate test failure — and for test failures, the stage: setup, testing, teardown.
3. **Extract evidence**:

4. **Drill down — iterate hypothesis → evidence until the cause is actionable**:
Repeat this loop until you reach a cause that is **actionable** (a specific code, configuration, test, or infrastructure problem someone can act on) or until the available evidence is exhausted:
- State a hypothesis for WHY the error in hand occurred.
- Seek confirming or refuting evidence ONE LAYER DEEPER than the current log:
- **Journal** — ALWAYS check the plain-text `journal_*.log` files for the scenario (see SOS Report section). Correlate with the failure timestamp (entries within ±5 minutes) and scan for OOM kills, segfaults, service restarts, and disk pressure.
- **Sosreport** — when the journal shows container crashes or restarts, extract the specific sosreport tarball with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>` (see SOS Report section for how to pick the right one when several exist). Read the pod/container logs of the failing workload.
- **Performance graphs** — when the failure involves a timeout, slowness, readiness/health-check expiry, eviction, or any resource error, Read the PNGs (see Performance Graphs section) and look for saturation overlapping the failure window.
- Treat restating errors as symptoms: an error like "timed out waiting for X" is NOT a root cause — explain why X was slow or absent, or explicitly record that the evidence ran out.
- **A test-layer fix is never the bottom when a product component misbehaved.** When the failure involves a product component that was unavailable, not ready, crashed, or slow ("no endpoints available", "connection refused", "not ready", "CrashLoopBackOff", probe failures), you MUST reconstruct that component's story from the journal and its pod logs before concluding. Build an exact timestamped timeline: when was the pod created, when did each container start, when did it become ready, did probes fail afterwards, did it restart, and why. Only then attribute the failure:
- **Product defect** — the component became ready and later flapped, crashed, or stopped serving (e.g., readiness flips back to not-ready, liveness probe connection refused after startup, container exits and restarts). Report the product mechanism as the root cause even if a test-side wait would also "fix" the symptom.
- **Test defect** — the component was still starting up normally and the test simply ran too early against a documented startup sequence.
- **Always check for container restarts.** Grep the journal for repeated `Created container`/`Started container` (crio) and `RemoveContainer`/PLEG events (kubelet) for the same pod. Two container instances for one pod means the first one DIED — a single startup story is the wrong narrative. Extract the sosreport (`bash plugins/shared/scripts/extract-sosreport.sh <tarball>`) and read the dead container's log at `sos_commands/microshift/namespaces/<namespace>/pods/<pod>/<container>/<container>/logs/previous.log` (`current.log` is the running instance). The last ~20 lines of `previous.log` usually state the exit reason (fatal error, leader election lost, panic, OOM).
- Record every accepted hop as a causal-chain link with its evidence file and line — these become `causal_chain` in the report. Discarded hypotheses do not go into the chain.
```bash
python3 plugins/shared/scripts/extract-evidence.py --artifacts-dir <TMP> --workdir <WORKDIR>
```

5. **Corroborate — cross-check the explanation**:
- When the source checkout is available, list commits from the last month that could be related:
Produces `<WORKDIR>/evidence/evidence-<BUILD_ID>.json`. The `<BUILD_ID>` is the last path component of `<TMP>`.

```text
bash plugins/microshift-ci/scripts/repo-log.sh <SRC_DIR> --since <1_MONTH_BEFORE_FINISHED> --until <FINISHED_DATE> --paths test/
```
4. **Analyze**: Read `plugins/microshift-ci/agents/analyze-evidence.md`. Substitute placeholders:

Derive `FINISHED_DATE` from the job's `finished.json` timestamp. Drop `--paths` to see all changes. Name candidate commits in the causal chain when their timing and touched paths match the failure.
- If multiple scenarios in this job failed, decide cascade vs independent using the **timeline** (which failed first; did the earlier failure poison shared state?), not just error-text similarity.
| Placeholder | Value |
|---|---|
| `{EVIDENCE_PACK}` | `<WORKDIR>/evidence/evidence-<BUILD_ID>.json` |
| `{JOB_NAME}` | job name extracted from URL or directory path |
| `{JOB_URL}` | the original URL (or reconstruct from artifacts path) |
| `{OUTPUT_FILE}` | `<WORKDIR>/report-<BUILD_ID>.txt` |

6. **Produce a report**: Create a concise report of the failure. The report MUST specify:
- Where in the pipeline the error occurred
- The specific step the error occurred in
- Whether the test failure was legitimate (i.e., a test failed) or due to an infrastructure failure (i.e., build image was not found, AWS infra failed due to quota, hypervisor failed to create test host VM, etc.)
- The causal chain from the observed symptom to the root cause, each link backed by evidence (file and line)
- A confidence rating for the root cause (see the field rules below)
Spawn the agent with the substituted content. When it replies `DONE`, read the output file and present the report to the user.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Restore inline stop conditions for the agent workflow.

This compact rewrite no longer says what to do when gsutil, evidence extraction, or the agent step fails, or when the agent never returns DONE. Per CONTRIBUTING.md: “Steps/workflow must be numbered and actionable with clear stop conditions” and “Orchestrator skills should include output structure plus Edge Cases and guard checks for sub-agent output.” Based on learnings, keep those guards inline with the relevant step rather than moving them to a separate section.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/microshift-ci/skills/prow-job/SKILL.md` around lines 26 - 60, The
workflow in SKILL.md is missing inline stop conditions for failure cases and the
agent completion check. Update the numbered steps around the artifact download,
evidence extraction, and analyze stages to explicitly say what to do if gsutil
fails, extract-evidence.py fails, or the spawned agent does not return DONE,
using the existing workflow structure and the analyze-evidence.md-driven agent
step as the anchor. Keep the guard conditions adjacent to their relevant step,
and make the steps actionable with clear fallback/abort behavior rather than
deferring these checks to a separate section.

Sources: Path instructions, Learnings

Comment on lines +182 to +192
refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
if isinstance(refs, dict):
org = refs.get("org", "")
repo = refs.get("repo", "")
if job_name and build_id:
pulls = refs.get("pulls", [])
if pulls:
pr_num = pulls[0].get("number", "")
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr_num}/{job_name}/{build_id}"
else:
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="plugins/shared/scripts/extract-evidence.py"

printf '\n== Context around lines 182-192 ==\n'
nl -ba "$FILE" | sed -n '165,205p'

printf '\n== Search for refs/extra_refs usage ==\n'
rg -n "extra_refs|refs =" "$FILE"

printf '\n== Small Python probe for the expression shape ==\n'
python3 - <<'PY'
cases = [
    {"prowjob": None},
    {"prowjob": {"spec": {}}},
    {"prowjob": {"spec": {"refs": {"org": "o"}}}},
    {"prowjob": {"spec": {"extra_refs": []}}},
    {"prowjob": {"spec": {"extra_refs": [{}]}}},
    {"prowjob": {"spec": {"extra_refs": [{"org": "o"}]}}},
]

for case in cases:
    prowjob = case["prowjob"]
    try:
        refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
        print(case, "=>", refs)
    except Exception as e:
        print(case, "=>", type(e).__name__, e)
PY

Repository: openshift-eng/edge-tooling

Length of output: 240


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="plugins/shared/scripts/extract-evidence.py"

printf '\n== Context around lines 175-205 ==\n'
awk 'NR>=175 && NR<=205 { printf "%d:%s\n", NR, $0 }' "$FILE"

printf '\n== Exact refs/extra_refs expression ==\n'
grep -n "extra_refs\|refs =" "$FILE" || true

printf '\n== Python probe for expression behavior ==\n'
python3 - <<'PY'
cases = [
    ("no prowjob", None),
    ("empty spec", {"spec": {}}),
    ("refs present", {"spec": {"refs": {"org": "o"}}}),
    ("extra_refs empty", {"spec": {"extra_refs": []}}),
    ("extra_refs one empty dict", {"spec": {"extra_refs": [{}]}}),
    ("extra_refs one entry", {"spec": {"extra_refs": [{"org": "o"}]}}),
]

for label, prowjob in cases:
    try:
        refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
        print(label, "=>", refs)
    except Exception as e:
        print(label, "=>", type(e).__name__, str(e))
PY

Repository: openshift-eng/edge-tooling

Length of output: 2026


Guard extra_refs before indexing

A present-but-empty extra_refs: [] still reaches [0] and raises IndexError in single-mode jobs. Split this into two steps so the fallback checks extra_refs before indexing.

Suggested guard
-    refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
+    spec = prowjob.get("spec", {}) if prowjob else {}
+    extra_refs = spec.get("extra_refs") or []
+    refs = spec.get("refs") or (extra_refs[0] if extra_refs else {})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
refs = prowjob.get("spec", {}).get("refs") or prowjob.get("spec", {}).get("extra_refs", [{}])[0] if prowjob else {}
if isinstance(refs, dict):
org = refs.get("org", "")
repo = refs.get("repo", "")
if job_name and build_id:
pulls = refs.get("pulls", [])
if pulls:
pr_num = pulls[0].get("number", "")
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr_num}/{job_name}/{build_id}"
else:
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}"
spec = prowjob.get("spec", {}) if prowjob else {}
extra_refs = spec.get("extra_refs") or []
refs = spec.get("refs") or (extra_refs[0] if extra_refs else {})
if isinstance(refs, dict):
org = refs.get("org", "")
repo = refs.get("repo", "")
if job_name and build_id:
pulls = refs.get("pulls", [])
if pulls:
pr_num = pulls[0].get("number", "")
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/{org}_{repo}/{pr_num}/{job_name}/{build_id}"
else:
job_url = f"https://prow.ci.openshift.org/view/gs/test-platform-results/logs/{job_name}/{build_id}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/extract-evidence.py` around lines 182 - 192, The refs
selection logic in extract-evidence.py can still index into an empty extra_refs
array and raise IndexError in single-mode jobs. Refactor the refs assignment
near the prowjob handling to check whether extra_refs is present and non-empty
before accessing its first element, and keep the existing fallback behavior when
spec.refs is missing. Use the refs, prowjob, and extra_refs lookup path to
locate the fix.

if anchor_line_idx >= 0:
start = max(0, anchor_line_idx - 5)
end = min(len(lines), anchor_line_idx + 6)
context_lines = [l.rstrip() for l in lines[start:end]]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Ruff violations — the file will fail ruff validation.

  • Lines 330 & 679: ambiguous variable name l (E741).
  • Line 535: infra_fail_count unpacked but unused (RUF059); prefix with _.
  • Lines 1006, 1008, 1010, 1012, 1014: multiple statements per line via ; (E702).

Per CONTRIBUTING.md Code Standards: "Python — Must pass ruff".

Also applies to: 535-535, 679-679, 1006-1014

🧰 Tools
🪛 Ruff (0.15.20)

[error] 330-330: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/extract-evidence.py` at line 330, The script has Ruff
violations that will fail validation: rename the ambiguous comprehension
variable `l` in `extract-evidence.py` to a clearer name in the `context_lines`
assignment and the other similar spot, prefix the unused unpacked
`infra_fail_count` with `_` where it is assigned, and split the chained
statements separated by semicolons into separate lines in the affected block.
Update the relevant logic in `extract-evidence.py` around `context_lines`,
`infra_fail_count`, and the code near the multiple-statement lines so the file
passes Ruff.

Sources: Coding guidelines, Linters/SAST tools

json_text = m.group(1)
# LLM agents sometimes copy raw control characters (tabs, CRs) from
# build logs into JSON string values. Sanitize before parsing.
json_text = json_text.replace('\t', '\\t').replace('\r', '\\r')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

sed -n '1,140p' plugins/shared/scripts/parse.py
printf '\n---\n'
rg -n "parse\.py|json_text\.replace|replace\('\\t'" -S plugins tests . -g '!**/.git/**'

Repository: openshift-eng/edge-tooling

Length of output: 5882


🏁 Script executed:

python3 - <<'PY'
import json

original = '{\n\t"a": 1,\n\t"b": 2\n}'
sanitized = original.replace('\t', '\\t').replace('\r', '\\r')

print("original parses:", json.loads(original))
try:
    json.loads(sanitized)
    print("sanitized parses: yes")
except json.JSONDecodeError as e:
    print("sanitized parses: no")
    print(type(e).__name__, e.msg)
PY

Repository: openshift-eng/edge-tooling

Length of output: 285


Avoid escaping tabs before parsing. json_text.replace('\t', '\\t') rewrites valid JSON whitespace, so tab-indented payloads become invalid and json.loads() fails. Escape control chars only inside string values, or remove the pre-parse rewrite.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/shared/scripts/parse.py` at line 51, The pre-parse normalization in
parse.py is escaping tabs in the whole JSON payload, which breaks valid
tab-indented JSON before json.loads() runs. Update the parse flow in the
json_text handling to stop rewriting all tab characters globally; instead, only
handle control characters inside string values or remove the tab replacement
entirely while keeping the existing JSON parsing path intact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant