Draft
1 change: 1 addition & 0 deletions plugins/lvms-ci/scripts/extract-evidence.py
1 change: 1 addition & 0 deletions plugins/lvms-ci/scripts/plan-analysis.py
74 changes: 74 additions & 0 deletions plugins/microshift-ci/agents/analyze-evidence.md
@@ -0,0 +1,74 @@
# Analyze Evidence Agent

Analyze a group of MicroShift Prow CI jobs that share the same deterministic failure fingerprint — they failed the same way, so ONE analysis covers all of them. Your goal is the UNDERLYING root cause, not the first error in the log. Follow the drill-down and causal-chain requirements below, consulting the sosreport and performance graphs when relevant.

## Inputs

Jobs in this group (same failure fingerprint):

{GROUP_JOBS}

Output file for the analysis report: `{OUTPUT_FILE}`

## Instructions

### 1. Read the evidence and references

Read `plugins/microshift-ci/agents/references/microshift-ci-primer.md` and every evidence pack listed above. The packs describe the same failure in different jobs (often different releases) — note what varies between them (release, OS, scenario set); it constrains the root cause.

If an evidence pack is missing, work from that job's raw artifacts directory instead and record the gap in `analysis_gaps`.

### 2. Assess the failure

- `scenario-e2e` → examine each scenario's alerts, failures, and journal. Use `failure_timeline` to distinguish cascade from independent failures.
- `conformance` → examine `conformance_failures`.
- `build`/`config`/`rebase` → examine `build_errors`.
- `infrastructure_indicators.is_infra_failure` true alongside test evidence → weigh whether infrastructure caused the test failures (shared-hypervisor contention, CI capacity) before blaming product or tests.
- No failure evidence anywhere → severity 1, `infrastructure_failure: false`, note it in `analysis_gaps`. Do NOT drill down.

### 3. Drill down

Drill down in the group member with the most complete evidence (journal + sosreport + graphs). Iterate hypothesis → evidence until the cause is actionable. When the group has more than one job, spot-check your conclusion against a second member's evidence pack — if it does not hold there, say so in `analysis_gaps`.

**Mandatory raw-log verification** — BEFORE concluding, even when the evidence pack looks sufficient:

- Read ~200 lines of raw journal around the failure timestamp — look for patterns NOT in the evidence pack (authorization denials, scheduler errors, admission failures, kubelet sandbox errors).
- When a sosreport exists, check **kube-apiserver** pod logs for authorization/admission/scheduling decisions.
- "Timed out waiting for X" is a symptom — read raw logs to find WHY X was slow or absent.
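
The "~200 lines around the failure" read can be sketched as a small windowing helper. This is an illustrative sketch only: `window_around` and its `radius` parameter are hypothetical names, and it assumes the journal has already been read into a list of lines and the failure located by its 1-based line number.

```python
def window_around(lines, center_line, radius=100):
    """Return (line_number, text) pairs for the lines surrounding center_line.

    center_line is 1-based, matching the file:line citations used in reports.
    radius=100 gives roughly 200 lines of context (hypothetical default).
    """
    start = max(center_line - 1 - radius, 0)     # 0-based index of first line kept
    end = min(center_line + radius, len(lines))  # exclusive 0-based end index
    return list(enumerate(lines[start:end], start=start + 1))
```

Keeping the original line numbers in the result makes it easy to cite `file:line` for anything found in the window.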

**Deeper investigation** via raw artifacts:

- **Sosreport pod logs**: read from `extracted_sosreport_dirs` when available, or run `bash plugins/microshift-ci/scripts/extract-sosreport.sh <tarball>` on paths in `sosreport_paths`.
- **PCP graphs**: read PNGs listed in `pcp_graphs` when the failure involves timeouts, slowness, or resource exhaustion.
- **Source code**: use `source_checkout.path` to read `test/suites/` or product code. Check `recent_commits` for related changes.

**Critical rules**:

- A test-layer fix is never the bottom when a product component misbehaved — reconstruct the component's story from journal and pod logs before concluding.
- Two `Created container` events for the same pod = the first instance died. Read `previous.log` for the exit reason.
- Multiple scenario failures: decide cascade vs independent using the **timeline**, not error-text similarity.
- **Every causal-chain link MUST cite an artifact file path** (e.g., `artifacts/.../boot_and_run.log:4629`). Do NOT cite the evidence JSON, general knowledge, or architectural statements. The evidence pack includes `file` and `line` for each match — trace back to those. Drop unsupported links or record as analysis gaps.

### 4. Validate causal chain

Before producing the report, validate every causal-chain link:

- Every link MUST have an `evidence` field containing an artifact file path with `:line` (e.g., `artifacts/.../boot_and_run.log:4629`).
- Every link MUST have a `quote` field with verbatim text from that file.
- If any link cites the evidence JSON, general knowledge, or architectural statements instead of an artifact file — fix it now by finding the actual artifact file, or drop the link.
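
The checks above can be sketched as a small pre-flight function. This is a hedged illustration, not the project's validator: `chain_problems` and `CITATION_RE` are hypothetical names, and the regular expression assumes artifact paths contain no whitespace or colons.

```python
import re

# Assumption: a valid citation looks like "<artifact path>:<line number>".
CITATION_RE = re.compile(r"^[^\s:]+:\d+$")


def chain_problems(causal_chain):
    """Return human-readable problems for a list of causal-chain links."""
    problems = []
    for index, link in enumerate(causal_chain, start=1):
        evidence = str(link.get("evidence", "")).strip()
        quote = str(link.get("quote", "")).strip()
        if not CITATION_RE.match(evidence):
            problems.append(f"link {index}: evidence is not '<artifact path>:<line>'")
        if not quote:
            problems.append(f"link {index}: quote is missing or empty")
    return problems
```

An empty result only shows that the links are well-formed; whether each quote really appears at the cited line still has to be confirmed by re-reading the file.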

### 5. Produce the report

Write the report per `plugins/microshift-ci/agents/references/structured-summary.md`. Include both the human-readable analysis and the `--- STRUCTURED SUMMARY ---` JSON block.

Group handling in the structured summary:

- Emit one entry per INDEPENDENT failure (not per job — the group shares the failures).
- Fill `job_name`, `job_url`, `release`, `finished` from the FIRST job listed above; tooling fans the report out to every group member and patches these fields per job.
- Do NOT emit a `fingerprint` field; tooling injects it.

### 6. Save and reply

Save the FULL report output (including the `--- STRUCTURED SUMMARY ---` block) to `{OUTPUT_FILE}` using the Write tool. The file must contain the complete analysis report.

After saving, reply with EXACTLY one line: `DONE {OUTPUT_FILE}`. Do NOT include the report text in your reply.
92 changes: 92 additions & 0 deletions plugins/microshift-ci/agents/references/microshift-ci-primer.md
@@ -0,0 +1,92 @@
# MicroShift CI Artifact Primer

Reference for analyzing MicroShift Prow job artifacts — which file answers which question.

## Job types

- **Scenario-based e2e** (`e2e-aws-tests-*`): the `openshift-microshift-e2e-metal-tests` step boots ~20 VM-based test scenarios on a shared hypervisor. Failures are per-scenario.
- **Direct-test** (`*-ocp-conformance-*`, `e2e-aws-ai-model-serving-*`, `e2e-aws-footprint-*`): run their test suite directly, no scenario fan-out.

## Test framework

Tests use [Robot Framework](https://robotframework.org). Suites: `test/suites/*.robot`. Shared keywords: `test/resources/`. Scenario definitions: `test/scenarios*/`.

`TEST_EXECUTION_TIMEOUT` (default `30m`) wraps Robot Framework in `timeout`. When exceeded, the current test dies with `Execution terminated by signal` and every subsequent test reports `Test execution stopped due to a fatal error` — a cascade with ONE root cause (the time budget).
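
As a rough sketch, the cascade can be recognized from an ordered list of test results. The function name and the input shape (ordered `(test_name, failure_message)` pairs, with an empty message for passing tests) are assumptions made for illustration; only the two message strings come from the description above.

```python
KILLED = "Execution terminated by signal"
STOPPED = "Test execution stopped due to a fatal error"


def find_timeout_cascade(results):
    """Return (killed_test, cascaded_tests), or None when no test was killed.

    results: (test_name, failure_message) pairs in execution order.
    """
    for index, (name, message) in enumerate(results):
        if KILLED in message:
            # Every later test carrying the STOPPED message shares this root cause.
            cascaded = [n for n, m in results[index + 1:] if STOPPED in m]
            return name, cascaded
    return None
```

The killed test marks where the time budget ran out; the tests before it are where to look for what consumed it.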

## Deployment types

Three deployment pipelines: **ostree** (scenarios in `test/scenarios/`), **bootc** (`test/scenarios-bootc/`), **RPM** (`test/suites/rpm/`). Job name indicates which (e.g. `e2e-aws-tests-bootc-*`). All produce the same artifact layout.

## Scenario naming

Scenario names encode OS, MicroShift version source, and suite. The `@` separator chains stages left-to-right; the **last segment** is always the test suite.

### Version-source markers

| Marker | Meaning |
|---|---|
| `src` | Built from source (PR or branch) |
| `base` | Built from PR's target branch |
| `prel` | Previous minor release (Y-1) |
| `crel` | Current minor release (EC/RC/z-stream) |
| `lrel` | Latest available release from staging repos |
| `zprel` | Latest z-stream from rhocp |
| `y1`/`y2` | One/two minor versions back, i.e. Y-1/Y-2 (also `yminus1`/`yminus2`) |

### OS tokens

`el96`/`el98`/`el102`: RHEL 9.6/9.8/10.2. Other releases use the same form, e.g. `el94` is RHEL 9.4.

### Reading multi-@ names

| Name | Meaning |
|---|---|
| `el96-lrel@standard1` | RHEL 9.6 + latest release, standard suite 1 |
| `el94-y2@el96-lrel@standard1` | Start Y-2 on RHEL 9.4, upgrade to RHEL 9.6 + latest release, run standard1 |
| `el96-yminus2@prel@src@delta-upgrade-ok` | Y-2 → Y-1 (prel) → source, static delta upgrade |
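
The naming rules can be sketched as a small parser. This is an illustrative sketch under stated assumptions: `parse_scenario` is a hypothetical helper, tokens inside a stage are assumed to be separated by `-`, and OS tokens are assumed to match `el<digits>`. The marker set is copied from the table above.

```python
import re

VERSION_MARKERS = {
    "src", "base", "prel", "crel", "lrel", "zprel",
    "y1", "y2", "yminus1", "yminus2",
}
OS_TOKEN_RE = re.compile(r"^el\d+$")  # assumption: OS tokens look like el96, el102


def parse_scenario(name):
    """Split a scenario name into its stages and its test suite."""
    *stages, suite = name.split("@")  # the last segment is always the suite
    parsed = []
    for stage in stages:
        tokens = stage.split("-")
        parsed.append({
            "os": next((t for t in tokens if OS_TOKEN_RE.match(t)), None),
            "marker": next((t for t in tokens if t in VERSION_MARKERS), None),
        })
    return {"stages": parsed, "suite": suite}
```

For `el96-yminus2@prel@src@delta-upgrade-ok` this yields three stages (only the first carries an OS token) and the suite `delta-upgrade-ok`.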

## Artifact layout

Per scenario, under `artifacts/<TEST_NAME>/openshift-microshift-e2e-metal-tests/artifacts/scenario-info/<scenario>/`:

| File | Answers |
|---|---|
| `junit.xml` | Which tests failed; `testsuite name` = scenario name |
| `rf-debug.log` | Robot Framework trace — failures marked `\| FAIL \|` |
| `boot_and_run.log` | VM boot + orchestration; scenario-killing timeouts appear here |
| `phase_create/junit.xml` | Infra junit from VM creation (greenboot check) |
| `phase_run/junit.xml` | Infra junit from test run phase |
| `vms/host1/sos/journal_*.log` | Plain-text journal exports — check FIRST for service failures, OOM, x509 |
| `vms/host1/sos/sosreport-*.tar.xz` | Full sosreports (see below) |
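
A minimal sketch of answering "which tests failed" from a scenario's `junit.xml`, using Python's standard `xml.etree.ElementTree`. It assumes the conventional JUnit layout (`testsuite` elements containing `testcase` elements with `failure` or `error` children); `failed_tests` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def failed_tests(junit_xml):
    """Return (scenario_name, test_name) pairs for failed or errored test cases."""
    root = ET.fromstring(junit_xml)
    failures = []
    # iter() also yields the root itself when it is a <testsuite>.
    for suite in root.iter("testsuite"):
        for case in suite.findall("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                failures.append((suite.get("name", ""), case.get("name", "")))
    return failures
```

Because `testsuite name` is the scenario name, the first element of each pair can be used directly in the `scenarios` field of the report.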

## Sosreports

Two types: **on-failure** (captured at each test failure, includes test-created namespaces — **prefer this one**) and **end-of-scenario** (teardown, may lack test workloads). Match to failure by comparing capture timestamp with `rf-debug.log` failure time.
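
Matching a sosreport to a failure can be sketched as a nearest-timestamp lookup. How capture timestamps are obtained is not specified here, so this sketch simply takes them as input; `closest_sosreport` and the "smallest absolute time difference" rule are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Optional


def closest_sosreport(captures: Dict[str, datetime], failure_time: datetime) -> Optional[str]:
    """Return the sosreport path whose capture time is nearest to failure_time.

    captures maps sosreport path -> capture timestamp (supplied by the caller).
    """
    if not captures:
        return None
    return min(captures, key=lambda path: abs(captures[path] - failure_time))
```

If the nearest capture turns out to be the end-of-scenario report, check whether the test-created namespaces are still present before relying on it.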

**Journals**: use plain-text `journal_*.log` next to tarballs — no extraction needed.

**Pod logs**: extract with `bash plugins/shared/scripts/extract-sosreport.sh <tarball>`. Output lands in `<tarball-parent>/sos-extracted/<sosreport-name>/`:

- Pod logs: `sos_commands/microshift/namespaces/<ns>/pods/<pod>/<container>/<container>/logs/{current,previous}.log`
- `previous.log` tail states why a dead container exited (fatal error, leader election lost, panic)
- Cluster resources: `sos_commands/microshift/cluster-scoped-resources/`

## Greenboot

Before tests, the scenario waits for `greenboot-healthcheck.service` to exit. Failure → `pre_test_greenboot_check FAILED` in `phase_create/junit.xml`, no tests run. In the journal, `40_microshift_running_check.sh` lines show which deployments were waited on.

## Journal reading

Reconstruct a timestamped timeline before attributing fault:

- Pod lifecycle: `Created container`/`Started container` (crio), `SyncLoop (PLEG)`, probe readiness transitions
- Two `Created container` events for the same pod = first instance died — read `previous.log`
- `apply request took too long` = apiserver/etcd latency (can cause leader-election loss)
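
The "two `Created container` events" rule can be sketched once journal lines have been reduced to events. The journal line format is not described here, so this sketch assumes the caller has already extracted `(pod, event)` pairs; `restarted_pods` is a hypothetical name.

```python
from collections import Counter


def restarted_pods(events):
    """Return pods with two or more 'Created container' events, sorted by name.

    events: (pod_name, event_text) pairs already extracted from the journal.
    """
    created = Counter(pod for pod, event in events if event == "Created container")
    return sorted(pod for pod, count in created.items() if count >= 2)
```

Each pod returned is a candidate for reading `previous.log` to find the exit reason of the first instance.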

## Common patterns

**Timeout cascade**: `TEST_EXECUTION_TIMEOUT` expires → one test gets `Execution terminated by signal`, all subsequent get `Test execution stopped due to a fatal error`. ONE root cause — find what consumed the time budget.

**Greenboot masking**: greenboot failure → no tests run → only `phase_create/junit.xml` has the failure. Root cause is in the journal.

**Shared-hypervisor contention**: all scenarios share one host. CPU/memory/disk contention → greenboot timeouts, etcd pressure, image pull timeouts. Attribute to infrastructure, not product/test.
96 changes: 96 additions & 0 deletions plugins/microshift-ci/agents/references/structured-summary.md
@@ -0,0 +1,96 @@
# Structured Summary Output Format

Output contract for CI job analysis skills, consumed by `aggregate.py`, `search-bugs.py`, and `create-report.py`.

## Output Template

```text
Error Severity: {1-5}
Stack Layer: {AWS Infra | External Infrastructure | build phase | deploy phase | test setup phase | Test Configuration | test | teardown}
Step Name: {CI step where the error occurred}
Error: {Exact error with log context}
Causal Chain: {numbered list, each link cites file:line}
Confidence: {high | medium | low}
Suggested Remediation: {fix direction; do NOT propose test tolerance (waits/retries/timeouts) unless the product behaved correctly}
```

| Severity | Meaning |
|---|---|
| 5 | Release-blocking product regression — no workaround |
| 4 | Persistent product or test failure — no workaround |
| 3 | Persistent failure with workaround, or scoped to single scenario/arch |
| 2 | Intermittent failure / likely flake |
| 1 | Infrastructure noise or self-healing condition |

## STRUCTURED SUMMARY JSON

Append after all prose. **Both markers are required** — the parser skips the report if either is missing.

```text
--- STRUCTURED SUMMARY ---
[ { ... } ]
--- END STRUCTURED SUMMARY ---
```
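
To illustrate why both markers matter, here is a minimal sketch of how a consumer might extract the block. It is not the parser used by `aggregate.py`; `extract_summary` is a hypothetical function that returns `None` when either marker is absent.

```python
import json

BEGIN = "--- STRUCTURED SUMMARY ---"
END = "--- END STRUCTURED SUMMARY ---"


def extract_summary(report):
    """Return the parsed JSON between the markers, or None if a marker is missing."""
    begin = report.find(BEGIN)
    if begin == -1:
        return None
    body_start = begin + len(BEGIN)
    end = report.find(END, body_start)
    if end == -1:
        return None
    return json.loads(report[body_start:end])
```

A report that passes this round trip before being saved is less likely to be skipped downstream.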

### Fields

| Field | Description |
|---|---|
| `severity` | 1-5 per rubric above |
| `stack_layer` | One of the values from the template |
| `step_name` | CI step where the error occurred |
| `error_signature` | Concise one-line description for dedup and bug titles |
| `root_cause` | WHY it failed — mechanism, not symptom (~80 chars, see rules below) |
| `raw_error` | Verbatim log text — deterministic anchor (see rules below) |
| `infrastructure_failure` | `true` if AWS/CI infra caused it, `false` otherwise |
| `job_url` | Full Prow job URL |
| `job_name` | Full job name |
| `release` | Release branch (e.g. `4.22`, `main`) |
| `remediation` | Fix direction (~120 chars). Infra → infra action. Product → code fix direction |
| `finished` | Job finish date, `YYYY-MM-DD` |
| `causal_chain` | Array of `{"cause", "evidence", "quote"}`. `evidence` = artifact file path with `:line`. `quote` = verbatim excerpt, no labels/commentary. **Re-read every cited file:line before finalizing** — wrong citations destroy trust. The `cause` text must use terms from actual log messages, not vague categories |
| `confidence` | `high` / `medium` / `low` (see rules below) |
| `analysis_gaps` | Array of strings naming missing evidence. Empty `[]` when nothing skipped |
| `scenarios` | Array of scenario names where this failure occurred. Empty `[]` for non-scenario jobs |
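
A hedged sketch of a per-entry sanity check built from the table above. Treating every listed field as required is an assumption of this sketch, and `entry_problems` is a hypothetical helper, not part of the tooling.

```python
import re

# Field names and allowed values are copied from this document.
FIELDS = {
    "severity", "stack_layer", "step_name", "error_signature", "root_cause",
    "raw_error", "infrastructure_failure", "job_url", "job_name", "release",
    "remediation", "finished", "causal_chain", "confidence", "analysis_gaps",
    "scenarios",
}
STACK_LAYERS = {
    "AWS Infra", "External Infrastructure", "build phase", "deploy phase",
    "test setup phase", "Test Configuration", "test", "teardown",
}
CONFIDENCE = {"high", "medium", "low"}
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def entry_problems(entry):
    """Return a list of problems found in one structured-summary entry."""
    problems = [f"missing field: {name}" for name in sorted(FIELDS - set(entry))]
    if "fingerprint" in entry:
        problems.append("fingerprint must not be emitted; tooling injects it")
    if entry.get("severity") not in (1, 2, 3, 4, 5):
        problems.append("severity must be an integer from 1 to 5")
    if entry.get("stack_layer") not in STACK_LAYERS:
        problems.append("stack_layer is not one of the template values")
    if entry.get("confidence") not in CONFIDENCE:
        problems.append("confidence must be high, medium, or low")
    if not DATE_RE.match(str(entry.get("finished", ""))):
        problems.append("finished must be YYYY-MM-DD")
    if entry.get("confidence") == "low" and not entry.get("analysis_gaps"):
        problems.append("low confidence requires populated analysis_gaps")
    return problems
```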

### CONFIDENCE rules

- **high**: every causal-chain link directly evidenced by a quoted artifact line or graph
- **medium**: mechanism is inferred but consistent with all evidence; citations still required — `medium` means the *interpretation* is inferred, not that citations can be omitted
- **low**: symptom-level only — chain stops before actionable cause; `analysis_gaps` MUST be populated

Do NOT inflate confidence — downstream automation acts on it.

### RAW_ERROR rules

The verbatim anchor readers use to match the report against logs.

1. **Copy-paste exact error text** — do NOT paraphrase
2. **Pick ONE error** — the first fatal one
3. **Only strip timestamps** — keep everything else verbatim
4. **Never concatenate** multiple errors
5. **Truncate to ~150 chars** if very long — keep the distinctive part

### ROOT_CAUSE rules

| Field | Purpose |
|---|---|
| `error_signature` | WHAT failed (bug titles) |
| `root_cause` | WHY it failed (mechanism) |
| `raw_error` | Verbatim log text (anchor) |

1. **~80 chars max**
2. **Focus on mechanism**, not symptom
3. **Stable terms** — no version numbers, timestamps, or job names

Describe the specific mechanism, not architectural generalizations ("framework expects annotation X which MicroShift does not set", not "MicroShift is single-node").

Grouping and cross-release deduplication key on the deterministic failure `fingerprint` injected by tooling — do NOT emit a `fingerprint` field yourself.

### Multiple independent failures

1. One entry per independent failure (different scenarios, different root causes)
2. Same root cause = one entry — do NOT split
3. At most 5 entries per job
4. Cascading failures are NOT independent — report only the root failure
5. Single failures are still a JSON array
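
The array-level rules above can be sketched as a short check. `summary_problems` is a hypothetical helper, and comparing `root_cause` strings for exact equality is only a crude stand-in for "same root cause"; judging whether two failures are truly independent still requires the timeline.

```python
MAX_ENTRIES = 5  # "At most 5 entries per job"


def summary_problems(entries):
    """Return problems with the overall shape of a structured summary."""
    if not isinstance(entries, list):
        return ["summary must be a JSON array, even for a single failure"]
    problems = []
    if len(entries) > MAX_ENTRIES:
        problems.append(f"{len(entries)} entries exceed the limit of {MAX_ENTRIES}")
    seen = set()
    for index, entry in enumerate(entries, start=1):
        cause = entry.get("root_cause")
        if cause in seen:
            problems.append(f"entry {index}: root_cause repeats an earlier entry; merge them")
        seen.add(cause)
    return problems
```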
1 change: 1 addition & 0 deletions plugins/microshift-ci/scripts/extract-evidence.py
1 change: 1 addition & 0 deletions plugins/microshift-ci/scripts/plan-analysis.py
1 change: 1 addition & 0 deletions plugins/microshift-ci/scripts/validate-reports.py