CI Doctor: Don't loose jobs that failed analysis#217
Conversation
download-jobs.sh silently excluded jobs whose artifact download failed from the jobs.json it emits, so a detected failed job could vanish from the pipeline before analysis even started. Keep such jobs in the list with an empty artifacts_dir and a download_error marker, and have plan-analysis skip scheduling analysis for them (there are no artifacts to read) — the aggregation overlay will surface them as unanalyzed instead of losing them.
… placeholders total_failed used to count parsed STRUCTURED SUMMARY entries, so a job whose analysis report was missing or unparseable silently disappeared from the summary (and one multi-entry report could overcount). Treat the prepare phase's jobs/release-<v>-jobs.json and prs-jobs.json as ground truth: join parsed analyses onto the detected list by build_id (from the per-job filename, with job_url fallback) and emit an explicit missing_analysis placeholder for every detected job with no analysis. Placeholders live in a separate array with no fingerprint or severity so they never enter issues[], history.json, or bug correlation. Summaries are now written even when zero reports parsed (all placeholders) or the detected list is empty (total_failed 0) — but not when a collection error was recorded, which must keep rendering as such. affected_jobs entries gain build_id so the HTML can link detected jobs to issues.
The finalize step only ran aggregate.py --prs when at least one prs-job-*.txt file existed, so if every PR analysis went missing no PR summary was written at all and the detected failures vanished from the report. Gate on detected PR jobs too — aggregation now emits an all-placeholder summary in that case.
…rkers Each release and PR section now lists every detected failed job: analyzed jobs link to their issue card's anchor, and jobs from missing_analysis carry a badge-nodata "not analyzed" marker with the reason — so a job that slipped through the analysis pipeline is visible in the report instead of silently absent. The breakdown row gains a "Not analyzed" count (deliberately without a bd-* class so the today-filter JS recompute ignores it), and the rdata-None fallback copy now points at aggregation errors, the only thing it can mean with the overlay in place. Old summaries without the new fields render unchanged.
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pmtk The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
WalkthroughThis PR introduces a deterministic "detected failures" overlay in aggregate.py, sourced from release/PR job JSON files, ensuring failed jobs appear even when analysis reports are missing or unparseable via new ChangesDetected failures overlay
Estimated code review effort: 4 (Complex) | ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
plugins/shared/scripts/create-report.py (1)
1242-1247: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueUse iterable unpacking instead of list concatenation (ruff RUF005).
♻️ Proposed fix
return ( - [f' <p><strong>Detected failed jobs ({total})</strong></p>', - ' <ul class="detected-jobs">'] - + rows - + [' </ul>'] + [f' <p><strong>Detected failed jobs ({total})</strong></p>', + ' <ul class="detected-jobs">', + *rows, + ' </ul>'] )As per coding guidelines, "Python code must pass ruff validation," and per CONTRIBUTING.md: "Python: follow PEP 8 and ensure ruff passes."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/shared/scripts/create-report.py` around lines 1242 - 1247, The detected-jobs HTML builder is using list concatenation in the return expression, which triggers Ruff RUF005. Update the logic in the function that assembles the report rows to use iterable unpacking in a single list literal instead of combining separate lists with plus, while keeping the existing rows content and order unchanged.Sources: Coding guidelines, Path instructions, Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/shared/scripts/doctor.sh`:
- Around line 341-349: Add direct test coverage for the new PR-detection branch
in doctor.sh, since the jq-based gate in the PR aggregation logic now changes
behavior based on prs-jobs.json contents. Update plugins/tests/test_aggregate.py
to cover the relevant path by exercising the doctor.sh aggregation behavior with
missing, empty, and populated prs-jobs.json cases, and verify the branch that
sets pr_detected in the aggregation flow. Include both positive and negative
cases so the new jq -e 'length > 0' check is fully validated.
In `@plugins/shared/scripts/download-jobs.sh`:
- Around line 167-174: The jq transform in the failed/successful job merge is
not null-safe on the success path: the `then` branch already normalizes
`.build_id`, but the `else` branch in the jq expression still concatenates the
raw `.build_id`, which can crash when it is missing or null. Update the jq
filter used after `failed_json`/`jobs_json` to normalize `build_id` once (as
done in the `index($b)` check) and reuse that safe value when building
`artifacts_dir`, or otherwise guard the concatenation so null/missing IDs
produce a harmless fallback instead of a jq runtime error.
In `@plugins/tests/test_aggregate.py`:
- Around line 72-99: The release aggregation test coverage is missing the
negative case for the placeholder branch that reports a missing build ID. Add a
focused test in test_aggregate.py alongside
test_release_missing_analyses_become_placeholders to exercise _placeholder()
with a detected job that has an empty build_id and assert
summary["missing_analysis"][0]["reason"] is "no build id". Keep the existing
download_error and analysis-missing assertions, and verify the new case through
the same _run and _release_summary flow so the missing-analysis placeholder
logic is fully covered.
- Around line 101-111: The test in test_release_placeholders_have_no_fingerprint
can pass without checking anything if _release_summary()["missing_analysis"] is
empty. Update the assertion logic to first verify that missing_analysis contains
the expected placeholder entry before iterating, using the existing
_release_summary() helper and the release job setup in
test_release_placeholders_have_no_fingerprint so the test fails when the job is
incorrectly treated as covered.
---
Nitpick comments:
In `@plugins/shared/scripts/create-report.py`:
- Around line 1242-1247: The detected-jobs HTML builder is using list
concatenation in the return expression, which triggers Ruff RUF005. Update the
logic in the function that assembles the report rows to use iterable unpacking
in a single list literal instead of combining separate lists with plus, while
keeping the existing rows content and order unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: dd980f52-f8a1-43f0-a832-b4624ee12600
📒 Files selected for processing (5)
plugins/shared/scripts/aggregate.pyplugins/shared/scripts/create-report.pyplugins/shared/scripts/doctor.shplugins/shared/scripts/download-jobs.shplugins/tests/test_aggregate.py
| # Aggregate PRs when per-job report files exist OR failed PR jobs were | ||
| # detected — even if every analysis went missing, the summary must list | ||
| # the detected jobs as unanalyzed rather than not exist at all. | ||
| local pr_files pr_detected="" | ||
| pr_files=$(find "${WORKDIR}/jobs" -name 'prs-job-*.txt' 2>/dev/null | head -1) | ||
| if [[ -n "${pr_files}" ]]; then | ||
| if jq -e 'length > 0' "${WORKDIR}/jobs/prs-jobs.json" >/dev/null 2>&1; then | ||
| pr_detected="yes" | ||
| fi | ||
| if [[ -n "${pr_files}" ]] || [[ -n "${pr_detected}" ]]; then |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether doctor.sh's PR-detection branch has any dedicated test coverage
fd -e bats -e sh . plugins/tests 2>/dev/null
rg -n 'prs-jobs\.json|pr_detected' plugins/tests -g '*.py' -g '*.sh' -g '*.bats' 2>/dev/nullRepository: openshift-eng/edge-tooling
Length of output: 423
🏁 Script executed:
#!/usr/bin/bash
set -euo pipefail
sed -n '1,260p' plugins/tests/test_aggregate.pyRepository: openshift-eng/edge-tooling
Length of output: 8588
🏁 Script executed:
#!/usr/bin/bash
set -euo pipefail
rg -n --hidden --glob '!**/.git/**' 'doctor\.sh|aggregate\.py|prs-jobs\.json|prs-summary\.json' plugins tests .claude 2>/dev/nullRepository: openshift-eng/edge-tooling
Length of output: 5821
Add tests for the new PR-detection branch in plugins/shared/scripts/doctor.sh
Per CONTRIBUTING.md: “any parsing/regex/validation logic introduced or changed must include positive and negative cases.” plugins/tests/test_aggregate.py covers aggregate.py’s prs-jobs.json overlay, but this new jq -e 'length > 0' gate still needs direct coverage for missing, empty, and populated prs-jobs.json cases.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/shared/scripts/doctor.sh` around lines 341 - 349, Add direct test
coverage for the new PR-detection branch in doctor.sh, since the jq-based gate
in the PR aggregation logic now changes behavior based on prs-jobs.json
contents. Update plugins/tests/test_aggregate.py to cover the relevant path by
exercising the doctor.sh aggregation behavior with missing, empty, and populated
prs-jobs.json cases, and verify the branch that sets pr_detected in the
aggregation flow. Include both positive and negative cases so the new jq -e
'length > 0' check is fully validated.
Source: Path instructions
| # Keep failed downloads in the list (marked) so downstream reporting | ||
| # never loses a detected job; only successful jobs get an artifacts_dir. | ||
| local failed_json | ||
| failed_json=$(printf '%s\n' "${failed_ids}" | jq -R -s '[split("\n")[] | select(length > 0)]') | ||
| echo "${jobs_json}" | jq --arg workdir "${WORKDIR}" --argjson failed "${failed_json}" ' | ||
| [.[] | if ((.build_id // "") as $b | $failed | index($b)) | ||
| then . + {artifacts_dir: "", download_error: "artifact download failed"} | ||
| else . + {artifacts_dir: ($workdir + "/artifacts/" + .build_id)} end]' |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Inconsistent null-safety between branches can crash the jq transform.
The then branch guards the lookup key with (.build_id // ""), but the else branch concatenates raw .build_id into artifacts_dir. If a job has a null/missing build_id and its failure was filtered out of $failed (since failed_json drops empty strings via select(length > 0)), it falls into the else branch and null + string will throw a jq runtime error, aborting the whole pipeline under set -e/pipefail.
🛠️ Proposed fix
echo "${jobs_json}" | jq --arg workdir "${WORKDIR}" --argjson failed "${failed_json}" '
[.[] | if ((.build_id // "") as $b | $failed | index($b))
then . + {artifacts_dir: "", download_error: "artifact download failed"}
- else . + {artifacts_dir: ($workdir + "/artifacts/" + .build_id)} end]'
+ else . + {artifacts_dir: ($workdir + "/artifacts/" + (.build_id // ""))} end]'📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Keep failed downloads in the list (marked) so downstream reporting | |
| # never loses a detected job; only successful jobs get an artifacts_dir. | |
| local failed_json | |
| failed_json=$(printf '%s\n' "${failed_ids}" | jq -R -s '[split("\n")[] | select(length > 0)]') | |
| echo "${jobs_json}" | jq --arg workdir "${WORKDIR}" --argjson failed "${failed_json}" ' | |
| [.[] | if ((.build_id // "") as $b | $failed | index($b)) | |
| then . + {artifacts_dir: "", download_error: "artifact download failed"} | |
| else . + {artifacts_dir: ($workdir + "/artifacts/" + .build_id)} end]' | |
| # Keep failed downloads in the list (marked) so downstream reporting | |
| # never loses a detected job; only successful jobs get an artifacts_dir. | |
| local failed_json | |
| failed_json=$(printf '%s\n' "${failed_ids}" | jq -R -s '[split("\n")[] | select(length > 0)]') | |
| echo "${jobs_json}" | jq --arg workdir "${WORKDIR}" --argjson failed "${failed_json}" ' | |
| [.[] | if ((.build_id // "") as $b | $failed | index($b)) | |
| then . + {artifacts_dir: "", download_error: "artifact download failed"} | |
| else . + {artifacts_dir: ($workdir + "/artifacts/" + (.build_id // ""))} end]' |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/shared/scripts/download-jobs.sh` around lines 167 - 174, The jq
transform in the failed/successful job merge is not null-safe on the success
path: the `then` branch already normalizes `.build_id`, but the `else` branch in
the jq expression still concatenates the raw `.build_id`, which can crash when
it is missing or null. Update the jq filter used after `failed_json`/`jobs_json`
to normalize `build_id` once (as done in the `index($b)` check) and reuse that
safe value when building `artifacts_dir`, or otherwise guard the concatenation
so null/missing IDs produce a harmless fallback instead of a jq runtime error.
| def test_release_missing_analyses_become_placeholders(self): | ||
| detected = [ | ||
| _detected("j-analyzed", "https://prow/111", "111"), | ||
| _detected("j-lost", "https://prow/222", "222"), | ||
| _detected("j-dl-failed", "https://prow/333", "333", | ||
| artifacts_dir="", | ||
| download_error="artifact download failed"), | ||
| ] | ||
| (self.jobs_dir / "release-4.99-jobs.json").write_text( | ||
| json.dumps(detected)) | ||
| (self.jobs_dir / "release-4.99-job-1-111.txt").write_text( | ||
| _report("j-analyzed", "https://prow/111")) | ||
|
|
||
| result = _run(self.workdir, "--release", "4.99") | ||
| self.assertEqual(result.returncode, 0, result.stderr) | ||
| summary = self._release_summary() | ||
|
|
||
| self.assertEqual(summary["total_failed"], 3) | ||
| self.assertEqual(len(summary["issues"]), 1) | ||
| self.assertEqual( | ||
| summary["issues"][0]["affected_jobs"][0]["build_id"], "111") | ||
|
|
||
| reasons = {m["build_id"]: m["reason"] | ||
| for m in summary["missing_analysis"]} | ||
| self.assertEqual(reasons, { | ||
| "222": "analysis report missing or unparseable", | ||
| "333": "artifact download failed", | ||
| }) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Missing negative-case coverage for the "no build id" placeholder reason.
_placeholder()'s three reason branches (download_error / missing build_id / missing-or-unparseable analysis) are only exercised for two of the three here — the "no build id" branch is never tested. Per CONTRIBUTING.md: "any parsing/regex/validation logic introduced or changed must include positive and negative cases (relevant to placeholder/missing-analysis and detected-overlay behavior)."
def test_release_placeholder_reports_no_build_id(self):
detected = [_detected("j-no-build", "https://prow/000", "")]
(self.jobs_dir / "release-4.99-jobs.json").write_text(json.dumps(detected))
result = _run(self.workdir, "--release", "4.99")
self.assertEqual(result.returncode, 0, result.stderr)
summary = self._release_summary()
self.assertEqual(summary["missing_analysis"][0]["reason"], "no build id")🧰 Tools
🪛 ast-grep (0.44.0)
[info] 80-80: use jsonify instead of json.dumps for JSON output
Context: json.dumps(detected)
Note: [CWE-116] Improper Encoding or Escaping of Output.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/tests/test_aggregate.py` around lines 72 - 99, The release
aggregation test coverage is missing the negative case for the placeholder
branch that reports a missing build ID. Add a focused test in test_aggregate.py
alongside test_release_missing_analyses_become_placeholders to exercise
_placeholder() with a detected job that has an empty build_id and assert
summary["missing_analysis"][0]["reason"] is "no build id". Keep the existing
download_error and analysis-missing assertions, and verify the new case through
the same _run and _release_summary flow so the missing-analysis placeholder
logic is fully covered.
Source: Path instructions
| def test_release_placeholders_have_no_fingerprint(self): | ||
| # Downstream tooling keys on issue fingerprints; placeholders must | ||
| # never grow one. | ||
| (self.jobs_dir / "release-4.99-jobs.json").write_text( | ||
| json.dumps([_detected("j-lost", "https://prow/222", "222")])) | ||
|
|
||
| result = _run(self.workdir, "--release", "4.99") | ||
| self.assertEqual(result.returncode, 0, result.stderr) | ||
| for placeholder in self._release_summary()["missing_analysis"]: | ||
| self.assertNotIn("fingerprint", placeholder) | ||
| self.assertNotIn("severity", placeholder) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Loop over missing_analysis can pass vacuously.
If missing_analysis unexpectedly comes back empty (e.g., a regression that treats the job as "covered"), the for loop simply doesn't execute and the test passes without verifying anything.
Proposed fix
result = _run(self.workdir, "--release", "4.99")
self.assertEqual(result.returncode, 0, result.stderr)
- for placeholder in self._release_summary()["missing_analysis"]:
+ placeholders = self._release_summary()["missing_analysis"]
+ self.assertEqual(len(placeholders), 1)
+ for placeholder in placeholders:
self.assertNotIn("fingerprint", placeholder)
self.assertNotIn("severity", placeholder)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_release_placeholders_have_no_fingerprint(self): | |
| # Downstream tooling keys on issue fingerprints; placeholders must | |
| # never grow one. | |
| (self.jobs_dir / "release-4.99-jobs.json").write_text( | |
| json.dumps([_detected("j-lost", "https://prow/222", "222")])) | |
| result = _run(self.workdir, "--release", "4.99") | |
| self.assertEqual(result.returncode, 0, result.stderr) | |
| for placeholder in self._release_summary()["missing_analysis"]: | |
| self.assertNotIn("fingerprint", placeholder) | |
| self.assertNotIn("severity", placeholder) | |
| def test_release_placeholders_have_no_fingerprint(self): | |
| # Downstream tooling keys on issue fingerprints; placeholders must | |
| # never grow one. | |
| (self.jobs_dir / "release-4.99-jobs.json").write_text( | |
| json.dumps([_detected("j-lost", "https://prow/222", "222")])) | |
| result = _run(self.workdir, "--release", "4.99") | |
| self.assertEqual(result.returncode, 0, result.stderr) | |
| placeholders = self._release_summary()["missing_analysis"] | |
| self.assertEqual(len(placeholders), 1) | |
| for placeholder in placeholders: | |
| self.assertNotIn("fingerprint", placeholder) | |
| self.assertNotIn("severity", placeholder) |
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 104-104: use jsonify instead of json.dumps for JSON output
Context: json.dumps([_detected("j-lost", "https://prow/222", "222")])
Note: [CWE-116] Improper Encoding or Escaping of Output.
(use-jsonify)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/tests/test_aggregate.py` around lines 101 - 111, The test in
test_release_placeholders_have_no_fingerprint can pass without checking anything
if _release_summary()["missing_analysis"] is empty. Update the assertion logic
to first verify that missing_analysis contains the expected placeholder entry
before iterating, using the existing _release_summary() helper and the release
job setup in test_release_placeholders_have_no_fingerprint so the test fails
when the job is incorrectly treated as covered.
Summary by CodeRabbit
New Features
Bug Fixes
Documentation