Worker-pinning / affinity for VMs with worker-local host-dirs #443

@elmariachi111

Problem

--host-dirs mounts a path from the worker's local filesystem into the VM. This works great for persistent state, but it creates an implicit worker-affinity constraint that Orchard does not currently enforce.

When a worker reboots, --restart-policy OnFailure correctly restarts the VM — but if the worker is slow to reconnect (or another worker is idle first), Orchard may reschedule the VM onto a different worker where the host-dir path does not exist. The VM then fails with:

tart command failed: "Error Domain=VZErrorDomain Code=2 \"A directory sharing device configuration is invalid.\"
UserInfo={..., NSLocalizedFailureReason=A directory sharing device configuration is invalid.,
NSUnderlyingError=... {Error Domain=NSPOSIXErrorDomain Code=2 \"No such file or directory\"}}"

Because the path is worker-local, the VM is now stuck in a crash-loop on the wrong worker with no automated recovery path.

Impact

Any production use of --host-dirs for persistent state (agent workspaces, databases, checkpoints) is affected: the restart policy that is supposed to provide HA instead causes an outage, and potentially data loss, whenever a worker reboots and the VM is rescheduled onto a different worker.

Proposed solutions (pick one or combine)

  1. Worker affinity / node selector: allow orchard create vm --worker <name> (or a label selector) so a VM is always scheduled on — and only restarted on — a specific worker. This is the simplest fix and mirrors how Kubernetes nodeName / nodeSelector works.

  2. Sticky scheduling: when a VM has --host-dirs and was last running on worker X, treat worker X as the preferred (or required) restart target. Only reschedule elsewhere if worker X is permanently removed from the cluster.

  3. Cluster-wide host-dir paths: allow --host-dirs to reference a network-mounted path (NFS, SMB) that is identical across all workers, so worker identity doesn't matter. This is more of a documentation/integration story than a code change.
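To make options 1 and 2 concrete, here is a minimal sketch of the placement rule they imply. It is not Orchard's scheduler code: the `Worker` and `VM` types, the `pinned_worker` and `last_worker` fields, and the `pick_worker` function are all hypothetical names used for illustration, and the host-dir entries are shown as plain paths rather than in whatever format --host-dirs actually accepts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Worker:
    name: str
    online: bool = True


@dataclass
class VM:
    name: str
    # Worker-local paths mounted into the VM (illustrative format only).
    host_dirs: List[str] = field(default_factory=list)
    # Hypothetical: value of a proposed `--worker <name>` flag (option 1).
    pinned_worker: Optional[str] = None
    # Hypothetical: worker the VM last ran on (option 2).
    last_worker: Optional[str] = None


def pick_worker(vm: VM, workers: List[Worker]) -> Optional[Worker]:
    """Return the worker to (re)start `vm` on, or None to leave it pending."""
    registered = {w.name: w for w in workers}

    # Option 1: an explicit pin is always a hard constraint.
    required = vm.pinned_worker

    # Option 2: a VM with host-dirs sticks to its previous worker for as
    # long as that worker is still registered with the cluster.
    if required is None and vm.host_dirs and vm.last_worker in registered:
        required = vm.last_worker

    if required is not None:
        target = registered.get(required)
        if target is not None and target.online:
            return target
        # Wait for the required worker instead of crash-looping elsewhere.
        return None

    # Unconstrained VM: take the first online worker.
    for w in workers:
        if w.online:
            return w
    return None
```

With this rule, a VM that uses host-dirs stays pending while its worker is rebooting and is only placed elsewhere once that worker has been removed from the cluster. Whether "removed" should mean an explicit deletion or a timeout is a design choice this sketch leaves open.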

Current workaround

Operators must pause all other workers before creating a VM, wait until it shows as running on the intended worker, and then resume the paused workers. This is manual, error-prone, and offers no protection against post-creation rescheduling on OnFailure restarts.

Environment

  • Orchard controller + workers on Mac Mini (macOS)
  • --restart-policy OnFailure + --host-dirs on Linux guest VMs (Ubuntu 24.04 aarch64 via Virtualization.framework)
