satellite: retry drbdadm secondary after mkfs on transient EBUSY #493

Open: IvanHunters wants to merge 1 commit into LINBIT:master from IvanHunters:fix/retry-secondary-after-mkfs

@IvanHunters

Problem

On systems with active block-device probing (Talos machined/block-controller,
udev, multipathd), the kernel uevent for a freshly promoted DRBD device causes
an external probe to open() the device for superblock inspection. If
drbdadm secondary runs while the probe still holds the fd, DRBD returns:

StorageException: Failed to become secondary again after creating filesystem
drbdsetup secondary <res>: State change failed: (-12) Device is held open by someone

The satellite never reports the final UpToDate event to the controller, so
the controller view stays stuck in SyncTarget/Inconsistent/DELETING permanently,
even though the DRBD kernel state on the satellite is Established/UpToDate. This
silently blocks cleanup of Released PVs, and the stuck resources accumulate with
each PVC creation that hits the race.

Reproduced in production on Talos 1.10 / LINSTOR 1.33.2: 4 occurrences in 24 hours
across 3 nodes for different PVCs.

Fixes #268.

Fix

Introduce secondaryAfterMkfs() in DrbdAdm, a wrapper that delegates to the
existing execAdmCommand infrastructure through an exponential-backoff retry executor:

  • 7 attempts, backoff 250 ms → 2 s (capped), total window ≈ 10 s
  • executeWithRetryOnDeviceHeldOpen retries only when stderr contains
    Device is held open by someone; any other failure still throws immediately,
    preserving the existing contract
  • Only DrbdPrimary.close() (the post-mkfs demotion path) calls the new wrapper;
    all other secondary() callers are untouched
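
The retry rule in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RetryOnHeldOpen`, `CommandFailedException`, `Command`, `Sleeper`, and this `executeWithRetry` signature are hypothetical stand-ins, and doubling the delay on each retry is an assumption (the description only says "exponential backoff").

```java
// Hypothetical sketch: retry only on the "held open" error, fail fast otherwise.
class RetryOnHeldOpen {
    static final String HELD_OPEN = "Device is held open by someone";

    /** Hypothetical stand-in for a failed external command; carries captured stderr. */
    static class CommandFailedException extends Exception {
        final String stderr;

        CommandFailedException(String stderr) {
            super(stderr);
            this.stderr = stderr;
        }
    }

    interface Command {
        void run() throws CommandFailedException;
    }

    /** Injected so that tests do not have to sleep for real. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /** Runs cmd up to maxAttempts times; returns the attempt number that succeeded. */
    static int executeWithRetry(
        Command cmd, int maxAttempts, long initialDelayMs, long maxDelayMs, Sleeper sleeper)
        throws CommandFailedException, InterruptedException {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                cmd.run();
                return attempt;
            } catch (CommandFailedException exc) {
                boolean heldOpen = exc.stderr != null && exc.stderr.contains(HELD_OPEN);
                // Any other failure, or an exhausted retry budget, propagates unchanged.
                if (!heldOpen || attempt >= maxAttempts) {
                    throw exc;
                }
                sleeper.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMs); // assumed doubling, capped
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated command: fails twice with the transient error, then succeeds.
        int attempts = executeWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new CommandFailedException("State change failed: (-12) " + HELD_OPEN);
            }
        }, 7, 250, 2000, ms -> { });
        System.out.println("succeeded on attempt " + attempts); // succeeded on attempt 3
    }
}
```

Passing the sleeper in as a parameter keeps the sketch testable without real delays; production code would presumably pass `Thread::sleep`.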

Probes release the fd within milliseconds in practice; the ~10 s window covers
worst-case timing.
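
As a rough check on the size of the retry window, the sleep schedule can be summed directly. The sketch below assumes the delay doubles from 250 ms and is capped at 2 s; the class and method names are hypothetical. Under that assumption, six sleeps (one between each pair of seven attempts) add up to 7.75 s, and seven sleeps add up to 9.75 s. The description does not say which count applies, and the time each drbdadm invocation takes comes on top of the sleeps.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sums a doubling, capped backoff schedule.
class BackoffSchedule {
    /** Returns the sleep durations, doubling from initialMs and capped at maxMs. */
    static List<Long> delays(int sleeps, long initialMs, long maxMs) {
        List<Long> out = new ArrayList<>();
        long delay = initialMs;
        for (int i = 0; i < sleeps; i++) {
            out.add(delay);
            delay = Math.min(delay * 2, maxMs);
        }
        return out;
    }

    static long total(List<Long> delays) {
        long sum = 0;
        for (long d : delays) {
            sum += d;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> six = delays(6, 250, 2000);
        List<Long> seven = delays(7, 250, 2000);
        System.out.println(six + " -> " + total(six) + " ms");     // [250, 500, 1000, 2000, 2000, 2000] -> 7750 ms
        System.out.println(seven + " -> " + total(seven) + " ms"); // [250, 500, 1000, 2000, 2000, 2000, 2000] -> 9750 ms
    }
}
```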

When LINSTOR promotes a DRBD device to Primary for mkfs and then
demotes it back to Secondary, a block-device probe running concurrently
(Talos machined/block-controller, udev, multipathd) may hold the device
fd open long enough for drbdadm secondary to receive EBUSY:

  StorageException: Failed to become secondary again after creating
  filesystem
  drbdsetup secondary <res>: State change failed: (-12) Device is held
  open by someone

The satellite never reports the final UpToDate event to the controller,
so the controller view stays in SyncTarget/Inconsistent/DELETING
permanently even though the DRBD kernel state is Established/UpToDate.
This silently blocks cleanup of Released PVs and accumulates with every
PVC creation.

Fix: introduce secondaryAfterMkfs() — a wrapper around the secondary
command that retries up to 7 times with exponential backoff (250 ms to
2 s) when the error is specifically Device is held open by someone. Any
other failure still fails fast. DrbdPrimary.close() is updated to call
this wrapper instead of secondary() so the retry applies only to the
post-mkfs demotion, leaving all other secondary() callers unchanged.

Probes release the fd within milliseconds in practice; the ~10 s retry
window covers the worst-case timing.

Fixes: LINBIT#268
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
IvanHunters marked this pull request as ready for review May 2, 2026 22:03
Development

Successfully merging this pull request may close these issues.

Some devices created as Inconsistent
