Skip to content

feat(health): add tools/health generator and recipe-health matrix#1304

Open
njhensley wants to merge 1 commit into
NVIDIA:mainfrom
njhensley:feat/recipe-health-matrix
Open

feat(health): add tools/health generator and recipe-health matrix#1304
njhensley wants to merge 1 commit into
NVIDIA:mainfrom
njhensley:feat/recipe-health-matrix

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Add a hermetic, offline recipe-health generator (tools/health) plus the publication machinery (make targets, public doc, nav/ownership) that surfaces per-recipe structural health across the whole criteria matrix — mirroring the BOM generate-and-PR precedent, but with no live Helm render. The matrix body is intentionally deferred (see Implementation Notes); the tooling itself is complete and functional.

Motivation / Context

ADR-009 (Recipe Health Tracking, V1) calls for a hermetic generator, BOM-style make targets that splice into a marked region, and a public matrix doc. This PR delivers the tools/health generator + targets + doc; the validation logic it consumes (pkg/health.Compute) already landed via #1226/#1225.

Fixes: N/A — implements the #1229 tooling; public-matrix publication is deferred (see Implementation Notes), so I left #1229 open for the maintainer to close after the matrix publishes.
Related: #1229, #1224 (epic), #1226/#1227 (deps), #1230 (blocked: weekly refresh)

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update
  • Build/CI/tooling

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: tools/health generator + Makefile targets + tools/check-docs-mdx

Implementation Notes

  • Generator (tools/health) parallels tools/bom minus the network render: it calls pkg/health.Compute() against the embedded catalog (Provider: nil) and renders a Markdown matrix — per-recipe rolled-up status, a declared-coverage column (R:n D:n P:n C:n), and a literal pending Evidence column. Flags -deterministic/-no-title make the output byte-stable and embeddable.
  • Targets clone bom-docs/bom-check: make recipe-health-docs splices the body between <!-- BEGIN/END AICR-HEALTH -->; make recipe-health-check is advisory (not wired into make qualify/lint/merge gate). Named recipe-health-*, not the ADR's original health-*, to avoid the check-health/check-health-all/component-health chainsaw family (homophone footgun); ADR-009 §5 updated to match.
  • Matrix publication is intentionally deferred behind an AICR-HEALTH-DEFERRED sentinel. pkg/health's chart_pinned signal currently over-reports fail for manifest-only Helm components (declared type: Helm but shipping local manifests with no external chart to pin), which would make 17/32 recipes show fail and render a misleading public matrix. The generator/targets are fully functional; recipe-health-check skips while the sentinel is present. Publishing is a follow-up gated on fixing classifyChartPinned (a pkg/health change, CODEOWNERS-pinned to the pkg/health: core Compute() loop, resolves signal, rollup, determinism #1225/pkg/health: chart_pinned + declared_coverage signals #1226 owners).
  • pkg/health: constraints_wellformed signal (parse-only, hermetic) #1227 forward-compat: the renderer reads only the rolled-up Status (+ coverage), never per-dimension keys, and pkg/health.rollup() is generic over the dimension map — so constraints_wellformed folds in with zero code change, just a doc regen.
  • docs/user/recipe-health.md excluded from the MDX check (carries splice-marker HTML comments, exactly like container-images.md).

Testing

make lint                                   # golangci-lint 0 issues, MDX, license, agents-sync, bom-pinning
go test -race ./tools/health/...            # pass (CGO race)
go test -coverprofile ./tools/health/...    # 72.1%
make recipe-health-docs                     # splices the 32-row matrix cleanly (verified, then reverted to placeholder)
make recipe-health-check                    # skips honestly while deferral sentinel present (exit 0)

Note: I ran make lint + race/coverage on tools/health + the target/splice behavior, not the full make qualify (e2e/scan) — CI covers those. New package coverage 72.1% (above the 70% floor); no pre-existing package's coverage changed (no edits outside tools/health + docs/build files).

Risk Assessment

  • Low — Isolated, additive tooling; no CLI/server/runtime path touched. The advisory check is unwired from any gate, and the matrix body is a placeholder, so nothing here can turn CI red or affect deployments. Trivially revertible.

Rollout notes: No migration. The public matrix stays a placeholder until a follow-up fixes the chart_pinned false-positive and runs make recipe-health-docs.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Add a hermetic, offline recipe-health generator paralleling tools/bom
(minus the live Helm render). tools/health calls pkg/health.Compute()
and renders a Markdown matrix — per-recipe rolled-up status, a declared-
coverage column (R:n D:n P:n C:n), and a literal `pending` Evidence
column — with -deterministic/-no-title flags for committable output.

Publication mirrors the BOM precedent:
- make recipe-health-docs splices the matrix into the marked region of
  docs/user/recipe-health.md; make recipe-health-check is an advisory
  staleness check (not wired into make qualify). Named recipe-health-*
  to avoid the existing check-health/component-health chainsaw family.
- docs/user/recipe-health.md: hand-written header (layer-1 chart_pinned
  and uniform pending-Evidence notes) plus the spliced region.
- Registered in docs/index.yml, cross-linked from component-catalog.md,
  owned via CODEOWNERS, and excluded from the MDX check (splice
markers).

Tests cover render content, byte-stable determinism, the marker-presence
guard, and a compute-budget gate over the full catalog (ADR-009 §2).

The matrix body is intentionally deferred behind an AICR-HEALTH-DEFERRED
sentinel: pkg/health's chart_pinned signal currently over-reports fail
for manifest-only Helm components (no external chart to pin), so the
published matrix would mislead. The tooling is fully functional;
recipe-health-check skips while the sentinel is present.

Implements NVIDIA#1229. Design: ADR-009.
@njhensley njhensley requested review from a team as code owners June 11, 2026 00:30
@njhensley njhensley added theme/ci-dx CI pipelines, developer experience, and build tooling theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification labels Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This pull request introduces a recipe-health documentation tool and integrates it into the project. It adds a new Go CLI (tools/health) that computes structural health for resolvable recipes using health.Compute and renders the results into a Markdown matrix. The implementation includes comprehensive unit and integration tests. The Makefile is extended with make recipe-health-docs to regenerate the documentation and make recipe-health-check to verify it remains current. A new user documentation page (docs/user/recipe-health.md) describes the health matrix columns and intentionally defers publication of the generated matrix body. Supporting changes include updates to the ADR, documentation navigation, linting exclusions, and project ownership/ignore rules.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1293: Introduces health.Compute and chart_pinned/coverage field changes that this PR's tool directly depends on to render the recipe-health matrix.
  • NVIDIA/aicr#1291: Adds core pkg/health compute and rollup logic that the new tools/health generator calls to produce health reports.

Suggested labels

area/recipes, area/docs, size/L, theme/recipes

Suggested reviewers

  • yuanchen8911
  • dims
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a tools/health generator and recipe-health matrix documentation, which aligns with the primary deliverable in the changeset.
Description check ✅ Passed The description comprehensively explains the changes, motivation, implementation details, testing, and risk assessment, all directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health/main_test.go`:
- Around line 169-181: TestCoverageCell currently contains two inline
assertions; refactor it into a table-driven test to match the style of
TestDimCell and make adding cases easier. Replace the body of TestCoverageCell
with a slice of test cases (each with name, input *health.DeclaredCoverage,
expected string) and iterate using t.Run for each case, calling coverageCell and
comparing results; keep references to the existing coverageCell function and
health.DeclaredCoverage/health.PhaseCoverage types so the same inputs ("nil" and
the sample cov with Readiness/Deployment/Conformance checks) are used as test
cases.

In `@tools/health/main.go`:
- Around line 88-90: The current code double-wraps errors from renderMatrix by
wrapping mdErr again with errors.Wrap(errors.ErrCodeInternal, "render
markdown"); instead, detect if mdErr != nil and return mdErr directly to
preserve the original structured pkg/errors context—update the error handling in
the block that checks mdErr (the variable returned from renderMatrix) to return
mdErr rather than wrapping it, leaving other error paths unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: db6fee11-770a-432e-a132-1de1fd5d3cbd

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14530 and 2a5b5a8.

📒 Files selected for processing (11)
  • .github/CODEOWNERS
  • .gitignore
  • Makefile
  • docs/design/009-recipe-health-tracking.md
  • docs/index.yml
  • docs/user/component-catalog.md
  • docs/user/recipe-health.md
  • tools/check-docs-mdx
  • tools/health/main.go
  • tools/health/main_test.go
  • tools/health/markdown.go

Comment thread tools/health/main_test.go
Comment on lines +169 to +181
func TestCoverageCell(t *testing.T) {
if got := coverageCell(nil); got != "—" {
t.Errorf("coverageCell(nil) = %q, want em dash", got)
}
cov := &health.DeclaredCoverage{
Readiness: health.PhaseCoverage{Checks: []string{"r1", "r2"}},
Deployment: health.PhaseCoverage{Checks: []string{"d1"}},
Conformance: health.PhaseCoverage{Checks: []string{"c1", "c2", "c3"}},
}
if got := coverageCell(cov); got != "R:2 D:1 P:0 C:3" {
t.Errorf("coverageCell() = %q, want %q", got, "R:2 D:1 P:0 C:3")
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider table-driven structure for consistency.

This test validates two independent scenarios (nil coverage and non-nil coverage). Converting to table-driven format would improve consistency with TestDimCell and make it easier to add coverage edge cases in the future.

♻️ Suggested table-driven refactor
 func TestCoverageCell(t *testing.T) {
-	if got := coverageCell(nil); got != "—" {
-		t.Errorf("coverageCell(nil) = %q, want em dash", got)
-	}
-	cov := &health.DeclaredCoverage{
-		Readiness:   health.PhaseCoverage{Checks: []string{"r1", "r2"}},
-		Deployment:  health.PhaseCoverage{Checks: []string{"d1"}},
-		Conformance: health.PhaseCoverage{Checks: []string{"c1", "c2", "c3"}},
-	}
-	if got := coverageCell(cov); got != "R:2 D:1 P:0 C:3" {
-		t.Errorf("coverageCell() = %q, want %q", got, "R:2 D:1 P:0 C:3")
-	}
+	tests := []struct {
+		name string
+		in   *health.DeclaredCoverage
+		want string
+	}{
+		{"nil coverage is em dash", nil, "—"},
+		{"non-nil formats per-phase counts", &health.DeclaredCoverage{
+			Readiness:   health.PhaseCoverage{Checks: []string{"r1", "r2"}},
+			Deployment:  health.PhaseCoverage{Checks: []string{"d1"}},
+			Conformance: health.PhaseCoverage{Checks: []string{"c1", "c2", "c3"}},
+		}, "R:2 D:1 P:0 C:3"},
+	}
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			if got := coverageCell(tt.in); got != tt.want {
+				t.Errorf("coverageCell() = %q, want %q", got, tt.want)
+			}
+		})
+	}
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestCoverageCell(t *testing.T) {
if got := coverageCell(nil); got != "—" {
t.Errorf("coverageCell(nil) = %q, want em dash", got)
}
cov := &health.DeclaredCoverage{
Readiness: health.PhaseCoverage{Checks: []string{"r1", "r2"}},
Deployment: health.PhaseCoverage{Checks: []string{"d1"}},
Conformance: health.PhaseCoverage{Checks: []string{"c1", "c2", "c3"}},
}
if got := coverageCell(cov); got != "R:2 D:1 P:0 C:3" {
t.Errorf("coverageCell() = %q, want %q", got, "R:2 D:1 P:0 C:3")
}
}
func TestCoverageCell(t *testing.T) {
tests := []struct {
name string
in *health.DeclaredCoverage
want string
}{
{"nil coverage is em dash", nil, "—"},
{"non-nil formats per-phase counts", &health.DeclaredCoverage{
Readiness: health.PhaseCoverage{Checks: []string{"r1", "r2"}},
Deployment: health.PhaseCoverage{Checks: []string{"d1"}},
Conformance: health.PhaseCoverage{Checks: []string{"c1", "c2", "c3"}},
}, "R:2 D:1 P:0 C:3"},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
if got := coverageCell(tt.in); got != tt.want {
t.Errorf("coverageCell() = %q, want %q", got, tt.want)
}
})
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health/main_test.go` around lines 169 - 181, TestCoverageCell currently
contains two inline assertions; refactor it into a table-driven test to match
the style of TestDimCell and make adding cases easier. Replace the body of
TestCoverageCell with a slice of test cases (each with name, input
*health.DeclaredCoverage, expected string) and iterate using t.Run for each
case, calling coverageCell and comparing results; keep references to the
existing coverageCell function and health.DeclaredCoverage/health.PhaseCoverage
types so the same inputs ("nil" and the sample cov with
Readiness/Deployment/Conformance checks) are used as test cases.

Comment thread tools/health/main.go
Comment on lines +88 to +90
if mdErr != nil {
return errors.Wrap(errors.ErrCodeInternal, "render markdown", mdErr)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and show the relevant section in tools/health/main.go
FILE="tools/health/main.go"
echo "=== File: $FILE (lines 1-200) ==="
sed -n '1,200p' "$FILE" | nl -ba | sed -n '1,200p'

# Show the exact lines around 88-90 (defensive; line numbers in snippet are 88-90)
echo "=== File: $FILE (lines 70-110) ==="
sed -n '70,110p' "$FILE" | nl -ba

# Find renderMatrix definition and show its body
echo "=== renderMatrix usages/definition ==="
rg -n "func\s+renderMatrix|renderMatrix\s*\(" "$FILE"

# Show renderMatrix function body (best-effort: from definition to next func)
if rg -n "func\s+renderMatrix" "$FILE" >/dev/null; then
  START=$(rg -n "func\s+renderMatrix" "$FILE" | head -n1 | cut -d: -f1)
  echo "renderMatrix starts at line: $START"
  # print a window around it
  sed -n "$START,$((START+200))p" "$FILE" | nl -ba
fi

# Also check for pkg/errors import and error code conventions in this file
echo "=== Imports (top of file) ==="
sed -n '1,60p' "$FILE" | nl -ba

Repository: NVIDIA/aicr

Length of output: 165


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/health/main.go"

if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

echo "=== File head (1-120) ==="
awk '{printf "%5d\t%s\n", NR, $0}' "$FILE" | sed -n '1,120p'

echo "=== Imports (top block) ==="
awk '{printf "%5d\t%s\n", NR, $0}' "$FILE" | sed -n '1,80p'

echo "=== renderMatrix definition (search) ==="
rg -n "func[[:space:]]+renderMatrix" "$FILE" || true

# If found, print from definition start up to next func
START_LINE="$(rg -n "func[[:space:]]+renderMatrix" "$FILE" | head -n1 | cut -d: -f1)"
if [ -n "${START_LINE:-}" ]; then
  echo "=== renderMatrix body (from $START_LINE for ~250 lines) ==="
  awk -v start="$START_LINE" 'NR>=start && NR<start+250{printf "%5d\t%s\n", NR, $0}' "$FILE"
fi

echo "=== Exact block around 80-100 ==="
awk '{printf "%5d\t%s\n", NR, $0}' "$FILE" | sed -n '80,110p'

Repository: NVIDIA/aicr

Length of output: 8039


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/health/main.go"
echo "=== wc -l ==="
wc -l "$FILE"

echo "=== tail 120 lines ==="
tail -n 120 "$FILE" | awk '{printf "%5d\t%s\n", NR+('$(
  wc -l "$FILE" | awk '{print $1}' )'-120), $0}' || true

echo "=== search for renderMatrix in repo ==="
rg -n "renderMatrix" tools/health || true
rg -n "func[[:space:]]+renderMatrix" . || true

Repository: NVIDIA/aicr

Length of output: 5692


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tools/health/markdown.go"
echo "=== wc -l markdown.go ==="
wc -l "$FILE"

echo "=== renderMatrix definition + body ==="
# Print from renderMatrix start to end of function (best-effort by window)
START_LINE="$(rg -n "func[[:space:]]+renderMatrix" "$FILE" | head -n1 | cut -d: -f1)"
echo "renderMatrix starts at: $START_LINE"
awk -v start="$START_LINE" 'NR>=start && NR<start+220{printf "%5d\t%s\n", NR, $0}' "$FILE" | sed -n '1,220p'

echo "=== exact error return sites inside renderMatrix ==="
awk -v start="$START_LINE" 'NR>=start && NR<start+220{print}' "$FILE" | rg -n "return .*errors\.|Wrap|Propagate|ErrCode" || true

Repository: NVIDIA/aicr

Length of output: 3692


Avoid double-wrapping renderMatrix errors in tools/health/main.go

renderMatrix already returns a structured pkg/errors error (errors.Wrap(errors.ErrCodeInternal, "failed to write recipe-health markdown", sw.err)), so wrapping mdErr again in the if mdErr != nil block is redundant. Return mdErr directly to preserve the original error context.

Proposed fix
  if mdErr != nil {
-   return errors.Wrap(errors.ErrCodeInternal, "render markdown", mdErr)
+   return mdErr
  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if mdErr != nil {
return errors.Wrap(errors.ErrCodeInternal, "render markdown", mdErr)
}
if mdErr != nil {
return mdErr
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health/main.go` around lines 88 - 90, The current code double-wraps
errors from renderMatrix by wrapping mdErr again with
errors.Wrap(errors.ErrCodeInternal, "render markdown"); instead, detect if mdErr
!= nil and return mdErr directly to preserve the original structured pkg/errors
context—update the error handling in the block that checks mdErr (the variable
returned from renderMatrix) to return mdErr rather than wrapping it, leaving
other error paths unchanged.

Source: Learnings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/docs size/XL theme/ci-dx CI pipelines, developer experience, and build tooling theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant