openshift-eng · pmtk · Jul 3, 2026
diff --git a/plugins/lvms-ci/scripts/extract-evidence.py b/plugins/lvms-ci/scripts/extract-evidence.py
@@ -0,0 +1 @@
+../../shared/scripts/extract-evidence.py
diff --git a/plugins/lvms-ci/scripts/plan-analysis.py b/plugins/lvms-ci/scripts/plan-analysis.py
@@ -0,0 +1 @@
+../../shared/scripts/plan-analysis.py
diff --git a/plugins/microshift-ci/agents/analyze-evidence.md b/plugins/microshift-ci/agents/analyze-evidence.md
@@ -0,0 +1,74 @@
+# Analyze Evidence Agent
+
+Analyze a group of MicroShift Prow CI jobs that share the same deterministic failure fingerprint — they failed the same way, so ONE analysis covers all of them. Your goal is the UNDERLYING root cause, not the first error in the log. Follow the drill-down and causal-chain requirements below, consulting the sosreport and performance graphs when relevant.
+
+## Inputs
+
+Jobs in this group (same failure fingerprint):
+
+{GROUP_JOBS}
+
+Output file for the analysis report: `{OUTPUT_FILE}`
+
+## Instructions
+
+### 1. Read the evidence and references
+
+Read `plugins/microshift-ci/agents/references/microshift-ci-primer.md` and every evidence pack listed above. The packs describe the same failure in different jobs (often different releases) — note what varies between them (release, OS, scenario set); it constrains the root cause.
+
+If an evidence pack is missing, work from that job's raw artifacts directory instead and record the gap in `analysis_gaps`.
+
+### 2. Assess the failure
+
+- `scenario-e2e` → examine each scenario's alerts, failures, and journal. Use `failure_timeline` to distinguish cascade from independent failures.
+- `conformance` → examine `conformance_failures`.
+- `build`/`config`/`rebase` → examine `build_errors`.
+- `infrastructure_indicators.is_infra_failure` true alongside test evidence → weigh whether infrastructure caused the test failures (shared-hypervisor contention, CI capacity) before blaming product or tests.
+- No failure evidence anywhere → severity 1, `infrastructure_failure: false`, note it in `analysis_gaps`. Do NOT drill down.
+
+### 3. Drill down
+
+Drill down in the group member with the most complete evidence (journal + sosreport + graphs). Iterate hypothesis → evidence until the cause is actionable. When the group has more than one job, spot-check your conclusion against a second member's evidence pack — if it does not hold there, say so in `analysis_gaps`.
+
+**Mandatory raw-log verification** — BEFORE concluding, even when the evidence pack looks sufficient:
+
+- Read ~200 lines of raw journal around the failure timestamp — look for patterns NOT in the evidence pack (authorization denials, scheduler errors, admission failures, kubelet sandbox errors).
+- When a sosreport exists, check **kube-apiserver** pod logs for authorization/admission/scheduling decisions.
+- "Timed out waiting for X" is a symptom — read raw logs to find WHY X was slow or absent.
+
+**Deeper investigation** via raw artifacts:
+
+- **Sosreport pod logs**: read from `extracted_sosreport_dirs` when available, or run `bash plugins/microshift-ci/scripts/extract-sosreport.sh <tarball>` on paths in `sosreport_paths`.
+- **PCP graphs**: read PNGs listed in `pcp_graphs` when the failure involves timeouts, slowness, or resource exhaustion.
+- **Source code**: use `source_checkout.path` to read `test/suites/` or product code. Check `recent_commits` for related changes.
+
+**Critical rules**:
+
+- A test-layer fix is never the bottom when a product component misbehaved — reconstruct the component's story from journal and pod logs before concluding.
+- Two `Created container` events for the same pod = the first instance died. Read `previous.log` for the exit reason.
+- Multiple scenario failures: decide cascade vs independent using the **timeline**, not error-text similarity.
+- **Every causal-chain link MUST cite an artifact file path** (e.g., `artifacts/.../boot_and_run.log:4629`). Do NOT cite the evidence JSON, general knowledge, or architectural statements. The evidence pack includes `file` and `line` for each match — trace back to those. Drop unsupported links or record as analysis gaps.
+
+### 4. Validate causal chain
+
+Before producing the report, validate every causal-chain link:
+
+- Every link MUST have an `evidence` field containing an artifact file path with `:line` (e.g., `artifacts/.../boot_and_run.log:4629`).
+- Every link MUST have a `quote` field with verbatim text from that file.
+- If any link cites the evidence JSON, general knowledge, or architectural statements instead of an artifact file — fix it now by finding the actual artifact file, or drop the link.
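The validation rules above can be checked mechanically before the report is written. The following is a minimal sketch, not the logic of `validate-reports.py`: the helper name `check_link`, the `root` argument, and the single-line quote check are assumptions made for illustration (a multi-line quote would need a wider comparison).

```python
import re
from pathlib import Path
from typing import List

# Assumed citation shape: "<artifact path>:<line number>".
CITATION = re.compile(r"^(?P<path>.+):(?P<line>\d+)$")


def check_link(link: dict, root: Path) -> List[str]:
    """Return the problems found in one causal-chain link (hypothetical helper)."""
    match = CITATION.match(link.get("evidence", ""))
    if not match:
        return ["evidence is not an artifact path with :line"]
    problems = []
    quote = link.get("quote", "")
    if not quote:
        problems.append("quote is empty")
    path = root / match["path"]
    if not path.is_file():
        problems.append(f"cited file not found: {match['path']}")
        return problems
    lines = path.read_text(errors="replace").splitlines()
    lineno = int(match["line"])
    if not 1 <= lineno <= len(lines):
        problems.append(f"line {lineno} is out of range")
    elif quote and quote not in lines[lineno - 1]:
        # The quote must be verbatim text from the cited line.
        problems.append("quote does not appear on the cited line")
    return problems
```

An empty list means the link passes. A citation such as `evidence.json` with no `:line` suffix is rejected before any file is opened.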
+
+### 5. Produce the report
+
+Write the report per `plugins/microshift-ci/agents/references/structured-summary.md`. Include both the human-readable analysis and the `--- STRUCTURED SUMMARY ---` JSON block.
+
+Group handling in the structured summary:
+
+- Emit one entry per INDEPENDENT failure (not per job — the group shares the failures).
+- Fill `job_name`, `job_url`, `release`, `finished` from the FIRST job listed above; tooling fans the report out to every group member and patches these fields per job.
+- Do NOT emit a `fingerprint` field; tooling injects it.
+
+### 6. Save and reply
+
+Save the FULL report output (including the `--- STRUCTURED SUMMARY ---` block) to `{OUTPUT_FILE}` using the Write tool. The file must contain the complete analysis report.
+
+After saving, reply with EXACTLY one line: `DONE {OUTPUT_FILE}`. Do NOT include the report text in your reply.
diff --git a/plugins/microshift-ci/agents/references/microshift-ci-primer.md b/plugins/microshift-ci/agents/references/microshift-ci-primer.md
@@ -0,0 +1,92 @@
+# MicroShift CI Artifact Primer
+
+Reference for analyzing MicroShift Prow job artifacts — which file answers which question.
+
+## Job types
+
+- **Scenario-based e2e** (`e2e-aws-tests-*`): the `openshift-microshift-e2e-metal-tests` step boots ~20 VM-based test scenarios on a shared hypervisor. Failures are per-scenario.
+- **Direct-test** (`*-ocp-conformance-*`, `e2e-aws-ai-model-serving-*`, `e2e-aws-footprint-*`): run their test suite directly, no scenario fan-out.
+
+## Test framework
+
+Tests use [Robot Framework](https://robotframework.org). Suites: `test/suites/*.robot`. Shared keywords: `test/resources/`. Scenario definitions: `test/scenarios*/`.
+
+`TEST_EXECUTION_TIMEOUT` (default `30m`) wraps Robot Framework in `timeout`. When exceeded, the current test dies with `Execution terminated by signal` and every subsequent test reports `Test execution stopped due to a fatal error` — a cascade with ONE root cause (the time budget).
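The cascade described above has a recognizable shape: one test killed by the signal, then every later test stopped by the fatal error. A rough sketch of that check follows. The two message strings come from the paragraph above; the `(test name, failure message)` input shape and the function name `find_timeout_root` are assumptions, not part of the test harness.

```python
from typing import List, Optional, Tuple

SIGNAL_MSG = "Execution terminated by signal"
FATAL_MSG = "Test execution stopped due to a fatal error"


def find_timeout_root(results: List[Tuple[str, str]]) -> Optional[str]:
    """Return the test that hit the time budget, or None if there is no cascade.

    `results` is an ordered list of (test name, failure message) pairs;
    an empty message means the test passed. This input shape is assumed.
    """
    for index, (name, message) in enumerate(results):
        if SIGNAL_MSG in message:
            later = results[index + 1:]
            # A timeout cascade: every later test carries the fatal-error text.
            if all(FATAL_MSG in msg for _, msg in later):
                return name
            return None
    return None
```

When the function returns a test name, the remaining failures are not independent, and the open question is what consumed the time before that test was killed.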
+
+## Deployment types
+
+Three deployment pipelines: **ostree** (scenarios in `test/scenarios/`), **bootc** (`test/scenarios-bootc/`), **RPM** (`test/suites/rpm/`). Job name indicates which (e.g. `e2e-aws-tests-bootc-*`). All produce the same artifact layout.
+
+## Scenario naming
+
+Scenario names encode OS, MicroShift version source, and suite. The `@` separator chains stages left-to-right; the **last segment** is always the test suite.
+
+### Version-source markers
+
+| Marker | Meaning |
+|---|---|
+| `src` | Built from source (PR or branch) |
+| `base` | Built from PR's target branch |
+| `prel` | Previous minor release (Y-1) |
+| `crel` | Current minor release (EC/RC/z-stream) |
+| `lrel` | Latest available release from staging repos |
+| `zprel` | Latest z-stream from rhocp |
+| `y1`/`y2` | One or two minor versions back, i.e. Y-1/Y-2 (also written `yminus1`/`yminus2`) |
+
+### OS tokens
+
+`el94`/`el96`/`el98`/`el102`: RHEL 9.4/9.6/9.8/10.2
+
+### Reading multi-@ names
+
+| Name | Meaning |
+|---|---|
+| `el96-lrel@standard1` | RHEL 9.6 + latest release, standard suite 1 |
+| `el94-y2@el96-lrel@standard1` | Start Y-2 on RHEL 9.4, upgrade to RHEL 9.6 + latest release, run standard1 |
+| `el96-yminus2@prel@src@delta-upgrade-ok` | Y-2 → Y-1 (prel) → source, static delta upgrade |
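The naming rules above are regular enough to parse. The sketch below splits a scenario name on `@`, treats the last segment as the suite, and labels the OS and version-source tokens in each earlier stage. The marker set is copied from the table above. The function name `parse_scenario` and the rule "last digit of an `elNN` token is the minor version" are assumptions; that rule fits `el94`, `el96`, `el98`, and `el102` but would misread a two-digit minor version.

```python
import re
from typing import Dict, List

# Version-source markers from the table above.
MARKERS = {"src", "base", "prel", "crel", "lrel", "zprel",
           "y1", "y2", "yminus1", "yminus2"}
# Assumption: the last digit is the minor version (el96 -> 9.6, el102 -> 10.2).
OS_TOKEN = re.compile(r"^el(\d+)(\d)$")


def parse_scenario(name: str) -> Dict[str, object]:
    """Split a scenario name into its stages and its test suite."""
    *stage_names, suite = name.split("@")
    stages: List[Dict[str, str]] = []
    for stage in stage_names:
        info: Dict[str, str] = {}
        for token in stage.split("-"):
            os_match = OS_TOKEN.match(token)
            if os_match:
                info["os"] = f"RHEL {os_match[1]}.{os_match[2]}"
            elif token in MARKERS:
                info["source"] = token
        stages.append(info)
    return {"stages": stages, "suite": suite}
```

For `el94-y2@el96-lrel@standard1` this yields two stages (RHEL 9.4 with `y2`, then RHEL 9.6 with `lrel`) and the suite `standard1`. Splitting stages on `-` is safe only because the suite segment, which may itself contain hyphens, is removed first.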
+
+## Artifact layout
+
+Per scenario, under `artifacts/<TEST_NAME>/openshift-microshift-e2e-metal-tests/artifacts/scenario-info/<scenario>/`:
+
+| File | Answers |
+|---|---|
+| `junit.xml` | Which tests failed; `testsuite name` = scenario name |
+| `rf-debug.log` | Robot Framework trace — failures marked `\| FAIL \|` |
+| `boot_and_run.log` | VM boot + orchestration; scenario-killing timeouts appear here |
+| `phase_create/junit.xml` | Infra junit from VM creation (greenboot check) |
+| `phase_run/junit.xml` | Infra junit from test run phase |
+| `vms/host1/sos/journal_*.log` | Plain-text journal exports — check FIRST for service failures, OOM, x509 |
+| `vms/host1/sos/sosreport-*.tar.xz` | Full sosreports (see below) |
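The layout above maps directly onto path construction. This is an illustrative sketch only: `scenario_dir` and `triage_files` are hypothetical helpers, and `root` is assumed to be the directory that contains the job's `artifacts/` folder.

```python
from pathlib import Path
from typing import Dict, List, Union

STEP = "openshift-microshift-e2e-metal-tests"


def scenario_dir(root: Path, test_name: str, scenario: str) -> Path:
    """Build the per-scenario directory described in the layout above."""
    return (root / "artifacts" / test_name / STEP
            / "artifacts" / "scenario-info" / scenario)


def triage_files(root: Path, test_name: str,
                 scenario: str) -> Dict[str, Union[Path, List[Path]]]:
    """Collect the files the table says to read for one scenario."""
    base = scenario_dir(root, test_name, scenario)
    return {
        "junit": base / "junit.xml",
        "rf_debug": base / "rf-debug.log",
        "boot": base / "boot_and_run.log",
        # Plain-text journals: the table says to check these first.
        "journals": sorted((base / "vms" / "host1" / "sos").glob("journal_*.log")),
    }
```

The returned paths are not checked for existence, so a caller still has to handle a scenario that died before writing `junit.xml`.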
+
+## Sosreports
+
+Two types: **on-failure** (captured at each test failure, includes test-created namespaces — **prefer this one**) and **end-of-scenario** (captured at teardown, may lack test workloads). Match a sosreport to a failure by comparing its capture timestamp with the failure time in `rf-debug.log`.
+
+**Journals**: use plain-text `journal_*.log` next to tarballs — no extraction needed.
+
+**Pod logs**: extract with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>`. Output lands in `<tarball-parent>/sos-extracted/<sosreport-name>/`:
+
+- Pod logs: `sos_commands/microshift/namespaces/<ns>/pods/<pod>/<container>/<container>/logs/{current,previous}.log`
+- `previous.log` tail states why a dead container exited (fatal error, leader election lost, panic)
+- Cluster resources: `sos_commands/microshift/cluster-scoped-resources/`
+
+## Greenboot
+
+Before tests, the scenario waits for `greenboot-healthcheck.service` to exit. Failure → `pre_test_greenboot_check FAILED` in `phase_create/junit.xml`, no tests run. In the journal, `40_microshift_running_check.sh` lines show which deployments were waited on.
+
+## Journal reading
+
+Reconstruct a timestamped timeline before attributing fault:
+
+- Pod lifecycle: `Created container`/`Started container` (crio), `SyncLoop (PLEG)`, probe readiness transitions
+- Two `Created container` events for the same pod = first instance died — read `previous.log`
+- `apply request took too long` = apiserver/etcd latency (can cause leader-election loss)
+
+## Common patterns
+
+**Timeout cascade**: `TEST_EXECUTION_TIMEOUT` expires → one test gets `Execution terminated by signal`, all subsequent get `Test execution stopped due to a fatal error`. ONE root cause — find what consumed the time budget.
+
+**Greenboot masking**: greenboot failure → no tests run → only `phase_create/junit.xml` has the failure. Root cause is in the journal.
+
+**Shared-hypervisor contention**: all scenarios share one host. CPU/memory/disk contention → greenboot timeouts, etcd pressure, image pull timeouts. Attribute to infrastructure, not product/test.
diff --git a/plugins/microshift-ci/agents/references/structured-summary.md b/plugins/microshift-ci/agents/references/structured-summary.md
@@ -0,0 +1,96 @@
+# Structured Summary Output Format
+
+Output contract for CI job analysis skills, consumed by `aggregate.py`, `search-bugs.py`, and `create-report.py`.
+
+## Output Template
+
+```text
+Error Severity: {1-5}
+Stack Layer: {AWS Infra | External Infrastructure | build phase | deploy phase | test setup phase | Test Configuration | test | teardown}
+Step Name: {CI step where the error occurred}
+Error: {Exact error with log context}
+Causal Chain: {numbered list, each link cites file:line}
+Confidence: {high | medium | low}
+Suggested Remediation: {fix direction; do NOT propose test tolerance (waits/retries/timeouts) unless the product behaved correctly}
+```
+
+| Severity | Meaning |
+|---|---|
+| 5 | Release-blocking product regression — no workaround |
+| 4 | Persistent product or test failure — no workaround |
+| 3 | Persistent failure with workaround, or scoped to single scenario/arch |
+| 2 | Intermittent failure / likely flake |
+| 1 | Infrastructure noise or self-healing condition |
+
+## STRUCTURED SUMMARY JSON
+
+Append after all prose. **Both markers are required** — the parser skips the report if either is missing.
+
+```text
+--- STRUCTURED SUMMARY ---
+[ { ... } ]
+--- END STRUCTURED SUMMARY ---
+```
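A consumer could locate the block between the two markers roughly as follows. This is a sketch under assumptions, not the parser used by `aggregate.py`: the function name `extract_summary` and the choice to return `None` when a marker is missing (mirroring "the parser skips the report") are illustrative.

```python
import json
import re
from typing import List, Optional

# Both markers must sit on their own lines; the JSON array lies between them.
BLOCK = re.compile(
    r"^--- STRUCTURED SUMMARY ---\s*$(.*?)^--- END STRUCTURED SUMMARY ---\s*$",
    re.DOTALL | re.MULTILINE,
)


def extract_summary(report: str) -> Optional[List[dict]]:
    """Return the parsed JSON array, or None when either marker is missing."""
    match = BLOCK.search(report)
    if match is None:
        return None
    entries = json.loads(match.group(1))
    if not isinstance(entries, list):
        # Even a single failure is reported as a one-element array.
        raise ValueError("structured summary must be a JSON array")
    return entries
```

A report that ends after the opening marker returns `None`, which is why the closing marker matters even when the JSON itself is complete.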
+
+### Fields
+
+| Field | Description |
+|---|---|
+| `severity` | 1-5 per rubric above |
+| `stack_layer` | One of the values from the template |
+| `step_name` | CI step where the error occurred |
+| `error_signature` | Concise one-line description for dedup and bug titles |
+| `root_cause` | WHY it failed — mechanism, not symptom (~80 chars, see rules below) |
+| `raw_error` | Verbatim log text — deterministic anchor (see rules below) |
+| `infrastructure_failure` | `true` if AWS/CI infra caused it, `false` otherwise |
+| `job_url` | Full Prow job URL |
+| `job_name` | Full job name |
+| `release` | Release branch (e.g. `4.22`, `main`) |
+| `remediation` | Fix direction (~120 chars). Infra → infra action. Product → code fix direction |
+| `finished` | Job finish date, `YYYY-MM-DD` |
+| `causal_chain` | Array of `{"cause", "evidence", "quote"}`. `evidence` = artifact file path with `:line`. `quote` = verbatim excerpt, no labels/commentary. **Re-read every cited file:line before finalizing** — wrong citations destroy trust. The `cause` text must use terms from actual log messages, not vague categories |
+| `confidence` | `high` / `medium` / `low` (see rules below) |
+| `analysis_gaps` | Array of strings naming missing evidence. Empty `[]` when nothing skipped |
+| `scenarios` | Array of scenario names where this failure occurred. Empty `[]` for non-scenario jobs |
+
+### CONFIDENCE rules
+
+- **high**: every causal-chain link directly evidenced by a quoted artifact line or graph
+- **medium**: mechanism is inferred but consistent with all evidence; citations still required — `medium` means the *interpretation* is inferred, not that citations can be omitted
+- **low**: symptom-level only — chain stops before actionable cause; `analysis_gaps` MUST be populated
+
+Do NOT inflate confidence — downstream automation acts on it.
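The confidence rules translate into a few checks on one summary entry. The sketch below uses the field names from the table above; `confidence_problems` itself is a hypothetical helper, and it only checks that citations are present, not that they are correct.

```python
from typing import List

LEVELS = {"high", "medium", "low"}


def confidence_problems(entry: dict) -> List[str]:
    """List rule violations for one structured-summary entry (illustrative)."""
    problems = []
    confidence = entry.get("confidence")
    if confidence not in LEVELS:
        problems.append(f"unknown confidence: {confidence!r}")
    # high and medium both require a citation on every causal-chain link.
    if confidence in {"high", "medium"}:
        for index, link in enumerate(entry.get("causal_chain", []), start=1):
            if not link.get("evidence") or not link.get("quote"):
                problems.append(f"link {index} lacks evidence or quote")
    # low means the chain stops early, so the gaps must be named.
    if confidence == "low" and not entry.get("analysis_gaps"):
        problems.append("low confidence requires analysis_gaps")
    return problems
```

Note that `medium` gets the same citation check as `high`: the rules relax the interpretation, not the evidence requirement.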
+
+### RAW_ERROR rules
+
+The verbatim anchor readers use to match the report against logs.
+
+1. **Copy-paste exact error text** — do NOT paraphrase
+2. **Pick ONE error** — the first fatal one
+3. **Only strip timestamps** — keep everything else verbatim
+4. **Never concatenate** multiple errors
+5. **Truncate to ~150 chars** if very long — keep the distinctive part
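Rules 3 and 5 can be approximated in a few lines. This sketch assumes ISO-8601-style timestamps (for example `2026-01-02T03:04:05.678Z`); other log formats would need their own pattern. It also truncates from the start of the line, which keeps the distinctive part only when that part comes early, so a human should still review long errors. The name `normalize_raw_error` is hypothetical.

```python
import re

# Assumption: timestamps are ISO-8601-like, with optional fraction and zone.
TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?\s*"
)


def normalize_raw_error(line: str, limit: int = 150) -> str:
    """Strip timestamps, keep everything else verbatim, cap the length."""
    text = TIMESTAMP.sub("", line).strip()
    return text[:limit]
```

Everything other than the timestamp is left untouched, so the result still matches the log when searched for as a substring.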
+
+### ROOT_CAUSE rules
+
+| Field | Purpose |
+|---|---|
+| `error_signature` | WHAT failed (bug titles) |
+| `root_cause` | WHY it failed (mechanism) |
+| `raw_error` | Verbatim log text (anchor) |
+
+1. **~80 chars max**
+2. **Focus on mechanism**, not symptom
+3. **Stable terms** — no version numbers, timestamps, or job names
+
+Describe the specific mechanism, not architectural generalizations ("framework expects annotation X which MicroShift does not set", not "MicroShift is single-node").
+
+Grouping and cross-release deduplication key on the deterministic failure `fingerprint` injected by tooling — do NOT emit a `fingerprint` field yourself.
+
+### Multiple independent failures
+
+1. One entry per independent failure (different scenarios, different root causes)
+2. Same root cause = one entry — do NOT split
+3. At most 5 entries per job
+4. Cascading failures are NOT independent — report only the root failure
+5. Single failures are still a JSON array
diff --git a/plugins/microshift-ci/scripts/extract-evidence.py b/plugins/microshift-ci/scripts/extract-evidence.py
@@ -0,0 +1 @@
+../../shared/scripts/extract-evidence.py
diff --git a/plugins/microshift-ci/scripts/plan-analysis.py b/plugins/microshift-ci/scripts/plan-analysis.py
@@ -0,0 +1 @@
+../../shared/scripts/plan-analysis.py
diff --git a/plugins/microshift-ci/scripts/validate-reports.py b/plugins/microshift-ci/scripts/validate-reports.py
@@ -0,0 +1 @@
+../../shared/scripts/validate-reports.py