
Add Longhorn storage health probe for read-only mount failures and disk pressure#2706

Draft
Copilot wants to merge 2 commits into master from copilot/detect-readonly-mounted-storage

Conversation

Contributor

Copilot AI commented May 26, 2026

Pods can keep running while their Longhorn-backed mounts turn read-only after filesystem errors. Restarting the workload may recover the mount, but until now nothing flagged the condition proactively. This change adds cluster-side detection for read-only mount symptoms and Longhorn disk-capacity pressure.

  • What changed

    • Added cluster/core/longhorn-system/storage-health-check.yaml with:
      • ServiceAccount + scoped ClusterRole/ClusterRoleBinding
      • CronJob (longhorn-storage-health-check) on a 15-minute cadence
    • Wired the new manifest into cluster/core/longhorn-system/kustomization.yaml.
  • Detection coverage

    • Warning events containing read-only filesystem indicators (read-only file system, erofs, remount-ro)
    • Pod container waiting/termination messages containing read-only filesystem errors
    • Longhorn volumes in unhealthy states (e.g., faulted, or attached but not healthy)
    • Longhorn disk usage above threshold (currently >=85% used)
  • Operational behavior

    • Emits targeted findings to job logs.
    • Exits non-zero when any condition is detected, creating an explicit failure signal for visibility/alerting pipelines.
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: check
              command:
                - /bin/sh
                - -ec
                - |
                  # flags read-only indicators + Longhorn unhealthy volumes + >=85% disk usage
                  ...
                  if [ "${issues}" -ne 0 ]; then exit 1; fi
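
The probe script itself is elided above, so the following is only a minimal sketch of how the read-only indicator match and the `issues` counter could fit together in POSIX sh. The function names `has_readonly_indicator` and `count_readonly_findings` are hypothetical, and the sketch assumes the event and container messages have already been collected as one message per line.

```shell
#!/bin/sh
# Sketch only: not the script from this PR, whose body is elided ("...").
# Function names are hypothetical.

# Returns 0 when the text contains one of the read-only indicators
# listed under "Detection coverage" (matched case-insensitively).
has_readonly_indicator() {
  printf '%s\n' "$1" | grep -qiE 'read-only file system|erofs|remount-ro'
}

# Reads one message per line on stdin, logs each match to stderr,
# and prints the number of matches on stdout.
count_readonly_findings() {
  issues=0
  while IFS= read -r message; do
    if has_readonly_indicator "$message"; then
      echo "FINDING: $message" >&2
      issues=$((issues + 1))
    fi
  done
  echo "$issues"
}
```

A caller could capture the count with `issues=$(... | count_readonly_findings)` and then apply the same `if [ "${issues}" -ne 0 ]; then exit 1; fi` guard shown in the excerpt, so that any match fails the Job.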

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • charts.fairwinds.com
    • Triggering command: /usr/local/bin/helm helm repo add fairwinds-stable REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Agent-Logs-Url: https://github.com/jeffborg/home-network/sessions/44b758de-6369-4fa8-a65e-4042780c8d01

Co-authored-by: jeffborg <1595430+jeffborg@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add detection for read-only mounted storage in pods" to "Add Longhorn storage health probe for read-only mount failures and disk pressure" May 26, 2026
Copilot AI requested a review from jeffborg May 26, 2026 00:41
@jeffborg
Owner

@copilot is this disk usage within the longhorn volumes themselves?
also where do the notifications go to?

Contributor Author

Copilot AI commented May 26, 2026

> @copilot is this disk usage within the longhorn volumes themselves?
> also where do the notifications go to?

  1. No, not currently. The check does not look inside each mounted volume's filesystem. It reads diskStatus from nodes.longhorn.io (storageMaximum vs storageAvailable), so it measures Longhorn backing-disk capacity on each Longhorn node/disk.

  2. Right now this CronJob does not send external notifications by itself; it only fails the Job and logs findings. So visibility is in Job status/logs (and any cluster-wide alerting you already have on failed Jobs/CronJobs). The existing Flux Slack alert in this repo is for Flux resources in flux-system, not this longhorn-system CronJob.
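
To illustrate point 1, here is a rough sketch of how a used-percentage could be derived from `storageMaximum` and `storageAvailable` and compared against the 85% threshold. The helper names `used_percent` and `disk_over_threshold` are hypothetical, and the sketch assumes the two values have already been extracted as integer byte counts; the actual script in the manifest may compute this differently.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the code from the manifest.
# Inputs are assumed to be integer byte counts taken from a Longhorn
# node's diskStatus (storageMaximum and storageAvailable).

# used_percent MAX AVAILABLE -> integer percent used, rounded down.
used_percent() {
  max=$1
  avail=$2
  if [ "$max" -le 0 ]; then
    # Avoid dividing by zero when a disk reports no capacity.
    echo 0
    return
  fi
  echo $(( (max - avail) * 100 / max ))
}

# disk_over_threshold MAX AVAILABLE [THRESHOLD] -> exit status 0 when
# usage is at or above THRESHOLD (default 85, matching ">=85% used").
disk_over_threshold() {
  [ "$(used_percent "$1" "$2")" -ge "${3:-85}" ]
}
```

For example, `disk_over_threshold 1000 150` succeeds because 850 of 1000 units are in use, while `disk_over_threshold 1000 151` does not. Because the inputs come from the backing disks, a volume whose own filesystem is full would not trip this check.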


Development

Successfully merging this pull request may close these issues.

need to detect pods where the mounted storage becomes read only due to error
