satellite: retry drbdadm secondary after mkfs on transient EBUSY by IvanHunters · Pull Request #493 · LINBIT/linstor-server

IvanHunters · 2026-05-02T22:02:59Z

Problem

On systems with active block-device probing (Talos machined/block-controller,
udev, multipathd) the kernel uevent for a freshly-promoted DRBD device causes
an external probe to open() the device for superblock inspection. If
drbdadm secondary fires while the probe still holds the fd, DRBD returns:

StorageException: Failed to become secondary again after creating filesystem
drbdsetup secondary <res>: State change failed: (-12) Device is held open by someone

The satellite never reports the final UpToDate event to the controller, so
the controller view stays stuck in SyncTarget/Inconsistent/DELETING permanently —
even though the DRBD kernel state on the satellite is Established/UpToDate. This
silently blocks cleanup of Released PVs and accumulates with every PVC creation.

Reproduced in production on Talos 1.10 / LINSTOR 1.33.2: 4 occurrences in 24 hours
across 3 nodes for different PVCs.

Fixes #268.

Fix

Introduce secondaryAfterMkfs() in DrbdAdm — a wrapper that delegates to the
existing execAdmCommand infrastructure with an exponential-backoff retry executor:

- 7 attempts, backoff 250 ms → 2 s (capped), total window ≈ 10 s
- executeWithRetryOnDeviceHeldOpen retries only when stderr contains Device is held open by someone; any other failure still throws immediately, preserving the existing contract
- Only DrbdPrimary.close() (the post-mkfs demotion path) calls the new wrapper; all other secondary() callers are untouched
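
The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class SecondaryRetrySketch, the Command and Sleeper interfaces, the CommandFailedException type, and the doubling factor are all assumptions made for the example. Only the marker string, the attempt count, and the 250 ms / 2 s bounds are taken from the description.

```java
/** Hypothetical sketch; none of these names come from linstor-server. */
final class SecondaryRetrySketch {
    static final String HELD_OPEN_MARKER = "Device is held open by someone";
    static final int MAX_ATTEMPTS = 7;
    static final long INITIAL_DELAY_MS = 250;
    static final long MAX_DELAY_MS = 2_000;

    /** Stand-in for a failed external command, carrying its stderr text. */
    static final class CommandFailedException extends Exception {
        final String stderr;

        CommandFailedException(String stderr) {
            super(stderr);
            this.stderr = stderr;
        }
    }

    /** Stand-in for running the secondary command once. */
    interface Command {
        void run() throws CommandFailedException;
    }

    /** Injected so the sketch can be exercised without real sleeping. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    static void executeWithRetryOnDeviceHeldOpen(Command cmd, Sleeper sleeper)
            throws CommandFailedException, InterruptedException {
        long delayMs = INITIAL_DELAY_MS;
        for (int attempt = 1; ; attempt++) {
            try {
                cmd.run();
                return; // demotion succeeded
            } catch (CommandFailedException exc) {
                boolean heldOpen =
                    exc.stderr != null && exc.stderr.contains(HELD_OPEN_MARKER);
                // Any other failure, or the last attempt, propagates unchanged.
                if (!heldOpen || attempt == MAX_ATTEMPTS) {
                    throw exc;
                }
                sleeper.sleep(delayMs);
                // Assumed doubling, capped at MAX_DELAY_MS.
                delayMs = Math.min(delayMs * 2, MAX_DELAY_MS);
            }
        }
    }
}
```

In production the sleeper would simply be Thread::sleep. Passing it in keeps the retry logic testable without waiting out the backoff.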

Probes release the fd within milliseconds in practice; the ~10 s window covers
worst-case timing.
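
To make the timing concrete, the wait schedule can be worked out under two assumptions that the description does not spell out: that the delay doubles after each failed attempt, and that no wait follows the final attempt. BackoffScheduleSketch and its methods are hypothetical names used only for this calculation.

```java
/** Hypothetical helper for reasoning about the backoff window. */
final class BackoffScheduleSketch {
    /** Waits between attempts: doubling from initialMs, capped at capMs. */
    static long[] delaysMs(int attempts, long initialMs, long capMs) {
        long[] delays = new long[Math.max(attempts - 1, 0)];
        long delay = initialMs;
        for (int i = 0; i < delays.length; i++) {
            delays[i] = delay;
            delay = Math.min(delay * 2, capMs);
        }
        return delays;
    }

    /** Sum of all waits, excluding the time the command itself takes. */
    static long totalMs(long[] delays) {
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        return total;
    }
}
```

With 7 attempts, 250 ms initial delay, and a 2 s cap, this yields waits of 250, 500, 1000, 2000, 2000, and 2000 ms, or 7.75 s of sleeping in total. The run time of each drbdadm invocation comes on top of that, so the overall window depends on how long the command takes to fail; the actual schedule in the PR may differ from this assumed one.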

When LINSTOR promotes a DRBD device to Primary for mkfs and then demotes it back to Secondary, a block-device probe running concurrently (Talos machined/block-controller, udev, multipathd) may hold the device fd open long enough for drbdadm secondary to receive EBUSY:

StorageException: Failed to become secondary again after creating filesystem
drbdsetup secondary <res>: State change failed: (-12) Device is held open by someone

The satellite never reports the final UpToDate event to the controller, so the controller view stays in SyncTarget/Inconsistent/DELETING permanently even though the DRBD kernel state is Established/UpToDate. This silently blocks cleanup of Released PVs and accumulates with every PVC creation.

Fix: introduce secondaryAfterMkfs(), a wrapper around the secondary command that retries up to 7 times with exponential backoff (250 ms to 2 s) when the error is specifically Device is held open by someone. Any other failure still fails fast. DrbdPrimary.close() is updated to call this wrapper instead of secondary() so the retry applies only to the post-mkfs demotion, leaving all other secondary() callers unchanged.

Probes release the fd within milliseconds in practice; the ~10 s retry window covers the worst-case timing.

Fixes: LINBIT#268
Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>

IvanHunters marked this pull request as ready for review May 2, 2026 22:03

IvanHunters wants to merge 1 commit into LINBIT:master from IvanHunters:fix/retry-secondary-after-mkfs
