Orphan blobs accumulate in S3 backend when cache_entries DB rows are deleted (cleanup leak) #236

@cyphercider

Summary

Running github-actions-cache-server@9.4.7 against a self-hosted MinIO S3 backend, we observe that blobs in the S3 bucket are not deleted when the corresponding cache_entries SQLite row is removed (e.g., key rotation, branch deletion, retention expiry). The bucket grows monotonically until it hits XMinioStorageFull, at which point the cache silently stops accepting writes and CI jobs hang on actions/cache@v4.

Environment

  • github-actions-cache-server: 9.4.7 (HelmRelease gha-cache-server v7, chart github-actions-cache-server@1.0.3)
  • Self-hosted MinIO backend in same k8s namespace (github-runner), bucket gha-cache
  • CACHE_CLEANUP_OLDER_THAN_DAYS=5 (env var set on the deployment)
  • ENABLE_DIRECT_DOWNLOADS=true
  • ARC runner-set workload — salespath repos, mixed Go + Node/pnpm test workflows
  • Runner cache keys: setup-go-Linux-x64-... (~462 MB each), node-cache-Linux-x64-pnpm-... (size varies)

Evidence — orphan accumulation over 6h post-purge

After a full manual purge per the recovery procedure (scale cache-server to 0, wipe gh-actions-cache/, wipe cache-server.db*, scale back up), the bucket regained 5 top-level entry directories within 6 hours, while the cache_entries table held only 2 rows over the same window. That is 2.5 entry directories per DB row, i.e. 3 of the 5 directories are orphans.

| Time | DB rows | MinIO entry dirs | Orphan blobs | Bucket size |
|---|---|---|---|---|
| Pre-purge | 29 (all >5d) | 55 | ~26 (~9.6 GB) | 19 GB / 100% (XMinioStorageFull) |
| T+0 (post-purge) | 0 | 0 | 0 | 1.1 GB (.minio.sys only) |
| T+6h | 2 (both today) | 5 | 3 (~900 MB) | 6.4 GB / 33% |

gha-cache-server logs over the 6h window: zero cleanup / delete / expire / purge activity logged. Either the cleanup tick is silent on success, or it isn't running.

At this orphan-creation rate (~12 orphans/day, ~4.5 GB/day of orphan growth), the 20 Gi PVC refills in ~5 days post-purge. We've experienced the resulting XMinioStorageFull outage twice in three weeks.

Expected behavior

When the cache-server cleanup tick runs:

  1. Identify cache_entries rows past CACHE_CLEANUP_OLDER_THAN_DAYS retention → DELETE the rows
  2. Also identify S3 objects whose corresponding cache_entries row no longer exists → DELETE the objects
  3. Log the cleanup activity (counts + bytes reclaimed) for observability
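
The three steps above can be sketched as follows. This is a minimal Python illustration, not the server's actual code: the cache_entries schema with id and updated_at columns, the assumption that each top-level bucket directory is named after a row id, the dict standing in for an S3 listing, and the function name sweep are all hypothetical.

```python
import sqlite3

DAY_SECONDS = 86_400


def sweep(db, bucket, retention_days, now):
    """Expire old rows, then find bucket entries with no matching row.

    db:     sqlite3 connection with a hypothetical cache_entries(id, updated_at) table
    bucket: dict of entry-directory name -> size in bytes (stand-in for an S3 listing)
    Returns (expired_row_count, orphan_names, orphan_bytes).
    """
    cutoff = now - retention_days * DAY_SECONDS

    # Step 1: delete rows past the retention window.
    cur = db.execute("DELETE FROM cache_entries WHERE updated_at < ?", (cutoff,))
    expired = cur.rowcount

    # Step 2: any bucket entry without a surviving row is an orphan.
    live = {row[0] for row in db.execute("SELECT id FROM cache_entries")}
    orphans = sorted(name for name in bucket if name not in live)
    orphan_bytes = sum(bucket[name] for name in orphans)

    # Step 3: log counts and bytes so the tick is observable even on success.
    print(f"cleanup: expired_rows={expired} orphans={len(orphans)} bytes={orphan_bytes}")
    return expired, orphans, orphan_bytes


# Example with made-up data: "a" is fresh, "b" is 6 days old, "c" never had a row.
now = 2_000_000_000
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cache_entries (id TEXT PRIMARY KEY, updated_at INTEGER)")
db.executemany(
    "INSERT INTO cache_entries VALUES (?, ?)",
    [("a", now - 1 * DAY_SECONDS), ("b", now - 6 * DAY_SECONDS)],
)
sweep(db, {"a": 100, "b": 200, "c": 300}, retention_days=5, now=now)
# prints: cleanup: expired_rows=1 orphans=2 bytes=500
```

A real sweep would presumably also need to skip very recent objects, so that an upload whose row has not been committed yet is not mistaken for an orphan.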

Actual behavior

Either the cleanup tick is not firing at all, OR it deletes DB rows without also deleting the S3 objects. Result: monotonic bucket growth until full.

Workaround we're deploying

A daily Kubernetes CronJob that runs mc rm --recursive --force --older-than 7d against the bucket. The 7d cutoff gives a 2d safety margin over the configured 5d retention. This is a band-aid; the cleanup tick should handle this internally.
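
The safety margin amounts to a simple age filter, sketched here in Python. The function name older_than and the in-memory listing are hypothetical; the sketch assumes the cutoff is compared against each object's last-modified time, and it does not reproduce how mc itself is implemented.

```python
from datetime import datetime, timedelta, timezone


def older_than(objects, now, retention_days=5, margin_days=2):
    """Return keys whose last-modified time is older than retention + margin.

    objects: dict of object key -> timezone-aware last-modified datetime
    """
    cutoff = now - timedelta(days=retention_days + margin_days)
    return sorted(key for key, modified in objects.items() if modified < cutoff)


# Example with made-up keys: only the 8-day-old object falls past the 7d cutoff.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
listing = {
    "fresh": now - timedelta(days=1),
    "inside-margin": now - timedelta(days=6),
    "stale": now - timedelta(days=8),
}
print(older_than(listing, now))
# prints: ['stale']
```

An age-only filter never consults cache_entries, so if the server measures retention from last access rather than upload time, it could remove a blob whose row is still live. That is one more reason to treat it as a stopgap.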

Ask

  1. Is this a known issue? If so, is there a target version for the fix?
  2. Is there a config option to enable verbose cleanup-tick logging so we can confirm whether it fires?
  3. Would a PR to add explicit orphan-blob sweeping to the cleanup tick be welcome?
