CI Doctor: failure fingerprint grouping, slim doctor prose #215
extract-evidence.py condenses a failed job's downloaded artifacts into one structured evidence file per job: failed steps, test and phase failures, journal alerts, and container restart counts — every entry stamped with its timestamp and merged into a single time-sorted failure timeline. doctor.sh gains an `evidence` phase that runs it for every downloaded job, and the lvms-ci/microshift-ci plugins symlink the shared script. The evidence pack becomes the single starting point for analysis agents instead of each agent re-scanning raw artifacts.
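
The timeline merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in extract-evidence.py: the entry shape (`timestamp`, `message`), the `source` key, and the sample data are all assumptions made for the example.

```python
from datetime import datetime


def build_timeline(sources):
    """Merge per-source evidence entries into one list sorted by timestamp.

    `sources` maps a hypothetical source name to a list of entries, each a
    dict with an ISO-8601 `timestamp` string (assumed format).
    """
    merged = []
    for source, entries in sources.items():
        for entry in entries:
            # Tag each entry with where it came from before merging.
            merged.append({**entry, "source": source})
    # Parse the timestamps so ordering does not depend on string layout.
    merged.sort(key=lambda e: datetime.fromisoformat(e["timestamp"]))
    return merged


# Invented sample data, for illustration only.
sources = {
    "failed_steps": [{"timestamp": "2024-05-01T10:05:00", "message": "step failed"}],
    "journal_alerts": [{"timestamp": "2024-05-01T10:01:30", "message": "alert"}],
    "container_restarts": [{"timestamp": "2024-05-01T10:03:00", "message": "restart"}],
}
timeline = build_timeline(sources)
```

Keeping the source name on every entry is what lets a reader of the merged timeline trace an event back to the artifact it came from.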
Replace the prow-job skill's inline RCA instructions with a dedicated analyze-evidence agent that starts from the evidence pack and consults the MicroShift CI artifact primer (moved under agents/references/) and a structured-summary contract with tightened causal-chain rules. The doctor skill launches the same agent for its per-job analyses; prow-job becomes a thin wrapper that downloads artifacts, extracts evidence, and spawns the agent. validate-reports.py checks every agent report against the structured summary contract, and the doctor skill re-launches fix agents for reports that fail; parse.py sanitizes structured summaries before parsing.
The validator previously only checked that 'evidence' looked like a path — a hallucinated-but-plausible citation passed. It now resolves each citation against the job's downloaded artifacts (build dir derived from the entry's job_url), checks the file exists, the line is in range, and the quote actually appears near the cited line (timestamps stripped, whitespace normalized). Error messages include where the quote really is so fix agents can re-ground citations instead of guessing. Fix agents are no longer told to delete unsupported links to pass validation — they must re-ground each link or move the claim to analysis_gaps and downgrade confidence, then re-run the validator on their own output. Evidence packs now record the source file for every rf/boot_and_run/ journal alert entry (journal alerts from multiple files are merged, so line numbers alone were ambiguous). Drop missing_patterns from the agent contract: nothing consumed it — parse.py discarded it at aggregation — so it was pure token cost.
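
A rough sketch of the citation check described above, under stated assumptions: the timestamp pattern, the `window` size, and the function names are hypothetical, and resolving the file from the entry's job_url is left to the caller. It is not the validate-reports.py implementation.

```python
import re

# Assumed timestamp shape; the real validator may strip other formats.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")


def normalize(text):
    """Strip timestamps and collapse whitespace."""
    return " ".join(TIMESTAMP.sub("", text).split())


def check_citation(lines, line_no, quote, window=3):
    """Return (ok, message) for a citation of `quote` at 1-based `line_no`.

    `lines` is the already-loaded content of the cited file.
    """
    if not 1 <= line_no <= len(lines):
        return False, f"line {line_no} out of range (file has {len(lines)} lines)"
    needle = normalize(quote)
    if not needle:
        return False, "empty quote"
    # Record every line where the quote really occurs, for the error message.
    hits = [i + 1 for i, line in enumerate(lines) if needle in normalize(line)]
    if any(abs(h - line_no) <= window for h in hits):
        return True, "ok"
    if hits:
        return False, f"quote not near line {line_no}; found at line(s) {hits}"
    return False, "quote not found in file"
```

Reporting where the quote actually appears is the part that lets a fix agent re-ground the citation rather than guess a new line number.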
Grouping and cross-release dedup previously keyed on LLM-authored text (raw_error + root_cause) with a 0.5 token-similarity threshold. That demanded cross-run determinism a sampled model cannot give, while the truly deterministic key (which step/tests/scenarios failed) already sat in the evidence pack. extract-evidence.py now computes a failure fingerprint from artifact facts only (job type, failed step, failing test names, phase failures, timeout cascade, greenboot verdict, infra indicator labels, first build error; all normalized, no job names/builds/timestamps).

New doctor.sh plan/fanout phases (plan-analysis.py):
- plan groups all failed jobs (releases + PRs) by fingerprint, writes template verdicts for pure-infrastructure and no-failure groups (no agent at all), and renders one fully substituted agent prompt file per remaining group
- ONE agent analyzes each distinct failure instead of one per job, so cross-release verdicts are consistent by construction and fewer agents run against the CI session's 45-minute budget
- fanout explodes each validated group report into the per-job report files aggregate.py/search-bugs.py/create-report.py already consume, patching job fields and injecting 'fingerprint' (+ entry ordinal so independent failures stay separate issues)

parse.py groups by fingerprint when present; token similarity remains as fallback for legacy reports. The validator resolves citations against all group members' build dirs. The analyze-evidence agent template is now group-native; prow-job renders it as a group of one. lvms-ci symlinks the new shared plan-analysis.py so its doctor flow resolves it too.
Verified on a synthetic workdir: 5 jobs → 3 groups (1 agent), two consecutive runs produce byte-identical grouping.
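
To illustrate why a fingerprint built from artifact facts groups deterministically, here is a minimal sketch. The field subset, the SHA-256 hash, the 16-character truncation, and the job dict shape are assumptions chosen for the example; the actual normalization in extract-evidence.py covers more fields and may differ.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(facts):
    """Hash a canonical form of the facts; job names, builds, and timestamps are never included."""
    canonical = {
        "job_type": facts.get("job_type", ""),
        "failed_step": facts.get("failed_step", ""),
        # Sort and dedupe so discovery order cannot change the key.
        "failing_tests": sorted(set(facts.get("failing_tests", []))),
        "infra_labels": sorted(set(facts.get("infra_labels", []))),
    }
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]


def group_jobs(jobs):
    """Group job names by fingerprint, with stable ordering of keys and members."""
    groups = defaultdict(list)
    for job in jobs:
        groups[fingerprint(job["facts"])].append(job["name"])
    return {fp: sorted(names) for fp, names in sorted(groups.items())}
```

Because the key depends only on sorted, normalized facts, feeding the same jobs in any order yields the same grouping, which is the property the byte-identical check above relies on.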
fanout used to exit 0 even when group reports were missing or unparseable, merely listing them in its JSON — easy for the orchestrating session to ignore, silently dropping every job in those groups. Now it exits 3 and emits a retry_groups array with each group's prompt_file (null for deterministic groups) so the orchestrator can re-launch the failed analysis agents directly.
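
The exit-code contract above might look roughly like this. Only the exit code 3 and the `retry_groups` and `prompt_file` names come from the description; the `fingerprint` key, the function name, and the sample paths are hypothetical.

```python
import json

EXIT_OK = 0
EXIT_RETRY = 3  # exit code named in the description above


def fanout_status(groups, parsed_reports):
    """Return (exit_code, payload) for planned groups and the set of
    fingerprints whose group report was found and parsed."""
    retry = [
        # prompt_file stays None (JSON null) for deterministic groups.
        {"fingerprint": g["fingerprint"], "prompt_file": g.get("prompt_file")}
        for g in groups
        if g["fingerprint"] not in parsed_reports
    ]
    payload = {"retry_groups": retry}
    return (EXIT_RETRY if retry else EXIT_OK), payload


# Invented sample plan: one report parsed, two missing.
groups = [
    {"fingerprint": "aaa", "prompt_file": "prompts/aaa.md"},
    {"fingerprint": "bbb", "prompt_file": None},
    {"fingerprint": "ccc", "prompt_file": "prompts/ccc.md"},
]
code, payload = fanout_status(groups, {"aaa"})
summary = json.dumps(payload)
```

A non-zero exit forces the orchestrator to handle the failure, whereas a list buried in otherwise successful JSON output is easy to overlook.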
The group-first flow made most of the orchestration prose obsolete: the orchestrator no longer reads job JSON fields or builds agent prompts, so the field-name warnings, evidence-content inventories, duplicate examples, and step-restating notes are gone (318 → ~200 lines). The '-p mode' turn-keeping scaffolding stays until the CI step takes over the deterministic phases.
In CI the deterministic phases (prepare, graphs, evidence, fetch-previous, finalize) burn the Claude session's 45-minute wall clock while the model just waits on downloads. With --prepared the CI step runs them in bash around the session, and the skill covers only what needs a model: planning-driven agent launches, validation, fan-out, and bug correlation. Interactive use without the flag is unchanged.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: pmtk
Depends on #214