77 changes: 71 additions & 6 deletions plugins/gitops-kubernetes/skills/gitops-cluster-debug/SKILL.md
@@ -1,13 +1,13 @@
---
compatibility: Requires flux-operator-mcp
description: |
Debug and troubleshoot Flux CD on live Kubernetes clusters (not local repo files) via the Flux MCP server — inspects Flux resource status, reads controller logs, traces dependency chains, and performs installation health checks. Use when users report failing, stuck, or not-ready Flux resources on a cluster, reconciliation errors, controller issues, artifact pull failures, or need live cluster Flux Operator troubleshooting.
Debug and troubleshoot Flux CD on live Kubernetes clusters (not local repo files) via the Flux MCP server — inspects Flux resource status, reads controller logs, traces dependency chains, and performs installation health checks. Use when users report failing, stuck, or not-ready Flux resources on a cluster, reconciliation errors, controller issues, artifact pull failures, image automation not updating tags, alerts or webhooks not being delivered, or need live cluster Flux Operator troubleshooting.
license: Apache-2.0
metadata:
github-path: skills/gitops-cluster-debug
github-ref: refs/tags/v0.0.4
github-ref: refs/tags/v0.1.0
github-repo: https://github.com/fluxcd/agent-skills
github-tree-sha: 434d303add349e64129103242e7ec7d58d60f8a1
github-tree-sha: 5d21c19fc629322cb58d56fe6fe23b9702becacc
name: gitops-cluster-debug
---
# Flux Cluster Debugger
@@ -26,8 +26,9 @@ root causes.
- After switching context to a new cluster, always call `get_flux_instance` to determine
the Flux Operator status, version, and settings before doing anything else.
- When creating or updating resources on the cluster, generate a Kubernetes YAML manifest
and call the `apply_kubernetes_resource` tool. Do not apply resources unless explicitly
requested by the user. Before generating any YAML manifest, read the relevant OpenAPI
and call the `apply_kubernetes_manifest` tool. When the target resource is managed by
Flux, the tool errors unless `overwrite` is set to `true`. Do not apply resources unless
explicitly requested by the user. Before generating any YAML manifest, read the relevant OpenAPI
schema from `assets/schemas/` to verify the exact field names
and nesting. Schema files follow the naming convention `{kind}-{group}-{version}.json`
(see the CRD reference table below).
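
The naming convention above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function name `schema_path` is hypothetical, and it assumes that `{kind}` is the lowercased kind and that `{group}` is the first label of the API group (for example `source` for `source.toolkit.fluxcd.io`). Check both assumptions against the CRD reference table before relying on them.

```python
from pathlib import Path


def schema_path(kind: str, api_version: str, root: str = "assets/schemas") -> Path:
    """Build the expected schema file path for a Flux resource.

    Assumptions (not confirmed by the skill text): the kind is lowercased
    and only the first label of the API group is used in the file name.
    """
    # "source.toolkit.fluxcd.io/v1" -> group "source.toolkit.fluxcd.io", version "v1"
    group, _, version = api_version.rpartition("/")
    short_group = group.split(".")[0]  # assumed short form of the group
    return Path(root) / f"{kind.lower()}-{short_group}-{version}.json"


if __name__ == "__main__":
    print(schema_path("GitRepository", "source.toolkit.fluxcd.io/v1").as_posix())
```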
@@ -123,7 +124,71 @@ Follow these steps when troubleshooting a ResourceSet:
7. Create a root cause analysis report. Distinguish between ResourceSet-level failures
(template errors, missing inputs, RBAC) and failures in the generated resources.

### Workflow 5: Kubernetes Logs Analysis
### Workflow 5: Source Debugging

Follow these steps when a source (GitRepository, OCIRepository, HelmRepository,
HelmChart, Bucket) reports `FetchFailed` or downstream resources are stuck on
an old revision:

1. Call `get_flux_instance` to check the source-controller deployment status and
the `apiVersion` of the source kind.
2. Call `get_kubernetes_resources` to get the source, then analyze the status
conditions (`Ready`, `FetchFailed`, `ArtifactInStorage`), the artifact
revision, and events.
3. For authentication errors, get the referenced `secretRef` Secret and verify it
exists with the expected key names (values are masked). For cloud registries
with no secret, check `.spec.provider` and workload identity.
4. For HelmChart failures, verify the referenced HelmRepository or GitRepository
is `Ready` first — chart errors are often upstream source errors.
5. Compare the last reconcile time against `.spec.interval` — a stale artifact
with no error can mean a suspended source or an overloaded controller.
6. Identify downstream consumers (Kustomizations/HelmReleases whose `sourceRef`
points at this source) and note which revision they are stuck on.
7. Create a root cause analysis report. Load `references/troubleshooting.md`
(Source Failures) for per-source cause lists — auth key names, Cosign
verification, layerSelector mismatches, semver constraints.
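
Steps 2 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the resource is assumed to be a plain dict with the standard Kubernetes `status.conditions` list, the helper names are hypothetical, the interval parser handles only `h`, `m`, and `s` units, and the `slack` factor of 2 is an arbitrary threshold chosen for the example. Which timestamp counts as the last reconcile time is left to the caller.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_interval(value: str) -> timedelta:
    """Parse a simple Go-style duration such as "5m" or "1h30m".

    Simplification: only h, m and s units with integer values are accepted.
    """
    parts = re.findall(r"(\d+)([hms])", value)
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported interval: {value!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def ready_condition(resource: dict) -> Optional[dict]:
    """Return the Ready condition from status.conditions, if present."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def is_stale(last_reconcile: datetime, interval: str, now: datetime,
             slack: float = 2.0) -> bool:
    """True when the last reconcile is older than slack * interval.

    The slack factor is an assumption made for this sketch.
    """
    return now - last_reconcile > parse_interval(interval) * slack


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(minutes=30), "5m", now))
```

A source that is stale by this check but whose `Ready` condition is `True` matches the "suspended source or overloaded controller" case in step 5.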

### Workflow 6: Image Automation Debugging

Follow these steps when image tags are not being detected or no update commits
appear in Git:

1. Call `get_flux_instance` and verify `image-reflector-controller` and
`image-automation-controller` are listed in the components and running.
2. Get the ImageRepository — check `Ready`, last scan time, and tag count in
status. Auth failures point to the `secretRef` or `.spec.provider`.
3. Get the ImagePolicy — check `Ready` and `status.latestImage`. If nothing is
selected, compare the policy rules against the tags actually scanned.
4. Get the ImageUpdateAutomation — check `Ready`, last push time, and events.
Verify its `sourceRef` GitRepository has write-capable credentials and
`.spec.git.push.branch` is the branch the user is watching.
5. If everything is `Ready` but no commits appear: verify manifests under
`.spec.update.path` contain `$imagepolicy` markers for the right
`<namespace>:<policy-name>` and that `latestImage` differs from the image
reference currently committed in Git.
6. Create a root cause analysis report tracing ImageRepository → ImagePolicy →
ImageUpdateAutomation → GitRepository.
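
The marker check in step 5 can be sketched as a scan over manifest text. This is a minimal illustration: the function name `find_policy_markers` is hypothetical, and the regular expression assumes the marker is written as an inline JSON comment of the form `# {"$imagepolicy": "<namespace>:<policy-name>"}` with an optional suffix such as `:tag` or `:name`.

```python
import re
from typing import List, Optional, Tuple

# Matches {"$imagepolicy": "namespace:name"} with an optional ":suffix".
_MARKER = re.compile(
    r'\{"\$imagepolicy":\s*"([^":\s]+):([^":\s]+)(?::([a-z]+))?"\}'
)


def find_policy_markers(text: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (namespace, policy_name, suffix) for each marker found."""
    return [m.groups() for m in _MARKER.finditer(text)]


if __name__ == "__main__":
    manifest = (
        'image: ghcr.io/example/app:1.0.0 # {"$imagepolicy": "flux-system:app"}\n'
        'tag: 1.0.0 # {"$imagepolicy": "flux-system:app:tag"}\n'
    )
    for marker in find_policy_markers(manifest):
        print(marker)
```

An empty result for the files under `.spec.update.path`, or markers that name a different namespace or policy than the ImagePolicy being debugged, would explain why no commit is produced.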

### Workflow 7: Notification Debugging

Follow these steps when alerts are not being delivered or a webhook Receiver
does not trigger reconciliation:

1. Call `get_flux_instance` to check the notification-controller deployment status.
2. Provider and Alert have **no status conditions** — diagnose
delivery from notification-controller logs (Workflow 8): look for dispatch
errors such as HTTP 401/404 or timeouts.
3. Get the Alert and verify `.spec.eventSources` matches the resources expected
to produce events and `.spec.eventSeverity` is not filtering them out.
4. Get the referenced Provider and verify `.spec.type`, `.spec.address`, and the
`secretRef` Secret key names.
5. For Receivers (these do have a `Ready` condition): verify `status.webhookPath`
and the webhook Secret, then check logs for incoming requests to that path —
none means the external service is not calling the webhook.
6. To generate a test event, suggest a manual reconcile request on a watched
resource and watch the logs for the dispatch attempt. Load
`references/troubleshooting.md` (Notification Failures) for cause lists.
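
The check in step 3 can be sketched as a function that decides whether an Alert would select a given event. This is a simplified illustration, not the notification-controller's logic: the function name `alert_matches` and the event dict shape are hypothetical, an event source without a `namespace` is assumed to default to the Alert's own namespace, and `matchLabels` and message inclusion or exclusion lists are ignored.

```python
def alert_matches(alert: dict, event: dict) -> bool:
    """Return True if the Alert's severity filter and eventSources select the event.

    Simplified sketch: matchLabels and inclusion/exclusion lists are ignored.
    """
    spec = alert.get("spec", {})
    alert_ns = alert.get("metadata", {}).get("namespace", "default")

    # With eventSeverity "error", info events are filtered out.
    if spec.get("eventSeverity", "info") == "error" and event.get("severity") != "error":
        return False

    obj = event.get("involvedObject", {})
    for source in spec.get("eventSources", []):
        if source.get("kind") != obj.get("kind"):
            continue
        # Assumption: a missing namespace means the Alert's namespace.
        if source.get("namespace", alert_ns) != obj.get("namespace"):
            continue
        # "*" selects every object of that kind in the namespace.
        if source.get("name") in ("*", obj.get("name")):
            return True
    return False


if __name__ == "__main__":
    alert = {
        "metadata": {"namespace": "flux-system"},
        "spec": {
            "eventSeverity": "error",
            "eventSources": [{"kind": "Kustomization", "name": "*"}],
        },
    }
    event = {
        "severity": "info",
        "involvedObject": {
            "kind": "Kustomization", "name": "apps", "namespace": "flux-system",
        },
    }
    print(alert_matches(alert, event))
```

A `False` result for an event the user expected to be delivered points at the Alert spec rather than at the Provider or the network.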

### Workflow 8: Kubernetes Logs Analysis

When analyzing logs for any workload:


This file was deleted.
