feat: add cascade delete observability metrics #1113

Open
ishtoo1 wants to merge 1 commit into feat/cascade-delete-children from feat/cascade-delete-metrics

Contributor

@ishtoo1 ishtoo1 commented Apr 22, 2026

What type of PR is this? (check all applicable)

  • Feature

What changed?

  • Add four Prometheus instruments to go/components/pipeline/metrics.go:
    • pipeline_cascade_delete_started_total{namespace, pipeline}
    • pipeline_cascade_delete_completed_total{namespace, pipeline}
    • pipeline_cascade_delete_error_total{namespace, pipeline, reason} (reasons: list_error, delete_error, update_error, kill_timeout)
    • pipeline_cascade_delete_active_children{namespace, pipeline, kind} (kinds: trigger_run, pipeline_run)
  • Emit metrics at each phase transition and error path in handleDeletion.
  • Add cascade-delete-started-at annotation so the "started" counter fires exactly once per cascade (not once per requeue).
  • Add cascadeDeleteKillTimeout = 30m. After the timeout, the controller stops trying to kill child workflows and forcefully deletes the child CRs, recording reason="kill_timeout" on the error counter, so the Pipeline CR never stays stuck in Terminating forever.
  • Add TestCascadeDelete_KillTimeout and TestCascadeDelete_CounterIncrementsOncePerCascade.

Why?
Addressing #1091.

  • Observability for cascade-delete operations in dashboards / SLOs.
  • Bound the controller's worst-case behavior when child workflows can't reach terminal state.

How did you test it?

  • bazel test //go/components/pipeline/... — all tests pass.

  • bazel build //go/... — no build errors.

  • End-to-end behavior of the full cascade-delete stack was verified on a sandbox cluster; results are attached to "docs: document pipeline cascade delete behavior" (#1114).

Potential risks

  • New metric series. Cardinality is namespace × pipeline × (reason|kind). For typical deployments this is bounded; for operators managing thousands of pipelines the series count grows linearly with the number of pipelines. This follows the same pattern as go/components/pipelinerun/metrics.go.
  • Forceful delete after 30 minutes: a TR/PR CR may be removed while its underlying workflow is still running. This is intentional — the alternative is holding the pipeline CR in Terminating indefinitely. Operators can investigate stuck workflows via logs and the kill_timeout metric.
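As a rough worked example of the cardinality note above, the sketch below computes a worst-case series count from the label values listed in this PR. It assumes every reason and every kind is eventually observed for every pipeline, and `maxSeries` is a hypothetical helper, not part of the change.

```go
package main

import "fmt"

// Label values listed in the PR description.
var (
	errorReasons = []string{"list_error", "delete_error", "update_error", "kill_timeout"}
	childKinds   = []string{"trigger_run", "pipeline_run"}
)

// maxSeries (hypothetical helper) returns the worst-case number of time series
// for the given number of distinct (namespace, pipeline) pairs:
// one "started" series, one "completed" series, one "error" series per reason,
// and one "active_children" series per kind.
func maxSeries(pipelines int) int {
	perPipeline := 1 + 1 + len(errorReasons) + len(childKinds)
	return pipelines * perPipeline
}

func main() {
	for _, n := range []int{100, 5000} {
		fmt.Printf("%d pipelines: at most %d series\n", n, maxSeries(n))
	}
}
```

Under these assumptions the bound is 8 series per pipeline, so growth is linear in the number of pipelines rather than multiplicative across labels.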

Release notes
New metrics emitted under the pipeline_cascade_delete_* prefix. No user-facing behavior change beyond the 30-minute kill timeout fallback.

Documentation Changes
N/A.


Stacked on top of #1112.

@CLAassistant

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

github-actions Bot commented Apr 23, 2026

Go Coverage Report (Bazel)

Total Coverage: 63.4%

Coverage Policy:

  • Baseline (existing code): ≥60% (current coverage)
  • New/changed code: ≥90% ✅ STRICTLY ENFORCED
  • Long-term goal: Improve baseline to 90%

View detailed HTML report in artifacts

@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from c2f628c to a97b06a Compare April 23, 2026 20:55
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from 6795311 to 852f830 Compare April 23, 2026 20:55
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from a97b06a to 17a9b4c Compare April 23, 2026 22:01
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from 852f830 to d29a625 Compare April 23, 2026 22:01
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from 17a9b4c to 1a4d392 Compare April 23, 2026 22:54
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from d29a625 to a869aa8 Compare April 23, 2026 22:54
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from 1a4d392 to a7ac07c Compare April 24, 2026 22:10
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from a869aa8 to bbaa8f6 Compare April 24, 2026 22:10
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from a7ac07c to 1abceee Compare April 25, 2026 03:50
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from bbaa8f6 to a996226 Compare April 25, 2026 03:51
Summary:
Intent:
- Add observability for pipeline cascade delete operations

Changes:
- Add 3 counters (started, completed, error) and 1 gauge (active_children) for cascade delete
- Wire metrics into handleDeletion at each phase transition and error path
- Register new metrics in RegisterPipelineMetrics

Test Plan:
- go test ./components/pipeline/... -v -count=1 (all tests pass)
- go build ./components/pipeline/... (builds successfully)

Revert Plan:
Revert this PR via git revert.

Jira Issues:
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-children branch from 1abceee to c248111 Compare April 25, 2026 05:57
@ishtoo1 ishtoo1 force-pushed the feat/cascade-delete-metrics branch from a996226 to a0c2eb4 Compare April 25, 2026 05:57