ecluse down kills top-level pnpm wrapper but vite descendants survive as orphans; ecluse flush also misses them #30

@hefgi

Summary

ecluse down (and ecluse flush) only kill the top-level PIDs recorded in .ecluse/pids/<slug>/<service>.pid, which for pnpm-based stacks is the outer pnpm wrapper. The grandchild vite (or tsx/node) process that actually binds the dev-server port is left running as an orphan adopted by init/launchd. Over many up/down cycles these orphans accumulate, each holding multiple listening sockets (vite + Cloudflare workerd plugin grab 4-6+ ports per instance), causing silent cross-slot port collisions.

Version / environment

  • ecluse 0.3.1
  • process_manager = "tmux"
  • macOS 14, zsh
  • Stack: pnpm workspace with Cloudflare's @cloudflare/vite-plugin (the plugin opens additional internal sockets beyond --port)

Reproduction

# Repo has [[services]] entries like:
#   [[services]]
#   name = "backoffice-app"
#   base_port = 7300
#   port_env = "ECLUSE_BACKOFFICE_APP_PORT"
#   command = "pnpm --filter backoffice-app exec vite --port \$ECLUSE_BACKOFFICE_APP_PORT"

ecluse up feat-a
ecluse up feat-b
ecluse up feat-c
ecluse up feat-d

# Snapshot ecluse-tracked PIDs:
for s in feat-a feat-b feat-c feat-d; do
  echo "=== $s ==="
  cat .ecluse/pids/$s/backoffice-app.pid
done
# All point at the outer pnpm wrappers, e.g. PID 89891.

# Snapshot what actually binds the configured 73xx ports:
pgrep -fl "vite.js --port" | head
# Output shows:
#   90252 node ./node_modules/.bin/../vite/bin/vite.js --port 7301   <- slot 1's actual vite
#   92855 node ./node_modules/.bin/../vite/bin/vite.js --port 7302   <- slot 2's
#   ... etc
# Note: PIDs are very different from the recorded .pid values — the recorded
# PID is the outer pnpm; the vite PID is a grandchild.

# Tear down all four:
for s in feat-a feat-b feat-c feat-d; do ecluse down "$s" --keep-worktree; done

# Check what survived:
pgrep -fl "vite.js --port" | head
# Result: vite grandchildren may still be running, e.g.:
#   81906 node ./node_modules/.bin/../vite/bin/vite.js --port 7299
# (Note port 7299 — this one auto-bumped from a previous startup retry and
# is now holding 8 ports across multiple slots' ranges.)

# Try ecluse flush — even this leaves the orphan alive:
ecluse flush --yes
pgrep -fl "vite.js --port"
#   81906 node ./node_modules/.bin/../vite/bin/vite.js --port 7299   <- still alive!

# Only `kill -9 81906` actually removes it.

Why this matters

Each leftover vite/workerd instance holds multiple sockets, not just --port:

  • Vite's HMR port
  • Cloudflare workerd's inspector port (default 9229)
  • The proxy worker socket (random high port via 127.0.0.1:0)
  • The vite-plugin's secondary HMR port

A single orphan from an earlier run can hold 6-8 ports, and some of those land in the range that a future ecluse up configures for a different slot. That ecluse up still succeeds, because ecluse only checks the configured port (e.g. 7301), but vite's own auto-port-bumping silently moves the new instance to a port that overlaps with yet another slot. Net effect: parallel worktrees end up serving each other's content, the operator can't tell which URL maps to which worktree, and ecluse status reports "all healthy" because the configured PIDs are alive.

The bug only becomes visible when the user navigates to http://localhost:7301 expecting slot 1 and sees slot 4's branch instead — at which point they have no way to find out what's actually serving the response without lsof and manual port tracing.
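The manual tracing described above can be scripted. A minimal sketch, assuming lsof is installed; who_serves is a hypothetical helper name, not an ecluse command. It prints each PID listening on a TCP port together with that process's working directory, which for a dev server started inside a worktree should identify the worktree.

```shell
# who_serves PORT
# Print "<pid> <cwd>" for every process listening on TCP PORT.
# Hypothetical helper (not part of ecluse); requires lsof.
who_serves() {
  local port=$1 pid cwd
  # -t: print PIDs only; -nP: skip DNS and port-name lookups.
  for pid in $(lsof -nP -t -iTCP:"$port" -sTCP:LISTEN 2>/dev/null); do
    # -a ANDs the filters: only the cwd entry of this one PID.
    cwd=$(lsof -a -p "$pid" -d cwd -Fn 2>/dev/null | sed -n 's/^n//p')
    printf '%s %s\n' "$pid" "$cwd"
  done
}
```

Running who_serves 7301 and comparing the printed directory with the worktree that slot 1 is supposed to use shows whether the port is being answered by the wrong instance.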

Suggested fix

Option 1: Kill the process group, not just the PID.

When ecluse spawns a service via tmux send-keys, the command runs as a foreground job of the interactive shell in a fresh tmux window, and the shell's job control places that job in its own process group with the spawned process as leader. On teardown, instead of kill <pid>, use kill -- -<pgid> to signal the whole group. Under POSIX semantics a child inherits its parent's process group and stays in it until it calls setpgid() or setsid(), so for pnpm + vite chains all descendants should share the wrapper's PGID.

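// pseudo: instead of signalling only the recorded PID
use nix::sys::signal::{kill, killpg, Signal};
use nix::unistd::Pid;

kill(Pid::from_raw(pid), Signal::SIGTERM)?;
// signal the whole group (assumes the recorded PID is the group leader, i.e. pgid == pid)
killpg(Pid::from_raw(pid), Signal::SIGTERM)?;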
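The same idea can be tried from a shell before touching ecluse itself. A minimal sketch, assuming a ps that supports -o pgid= and -p (procps on Linux, the stock ps on macOS); stop_group is a hypothetical helper name. It looks up the group of the recorded PID rather than assuming pgid == pid, sends SIGTERM to the group, and falls back to SIGKILL after a grace period.

```shell
# stop_group PID [GRACE_SECONDS]
# Hypothetical helper (not part of ecluse): SIGTERM the whole process group
# that PID belongs to, then SIGKILL whatever is left after the grace period.
stop_group() {
  local pid=$1 grace=${2:-5} pgid self tries
  pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
  self=$(ps -o pgid= -p "$$" | tr -d ' ')
  # Refuse if PID is gone, or if it shares our own group (we would kill ourselves).
  [ -n "$pgid" ] && [ "$pgid" != "$self" ] || return 1
  # A negative target addresses the group; "--" ends option parsing.
  kill -s TERM -- "-$pgid" 2>/dev/null || return 0
  tries=$((grace * 10))
  while [ "$tries" -gt 0 ]; do
    # Signal 0 only probes: it fails once no process is left in the group.
    kill -s 0 -- "-$pgid" 2>/dev/null || return 0
    sleep 0.1
    tries=$((tries - 1))
  done
  kill -s KILL -- "-$pgid" 2>/dev/null
  return 0
}
```

For example, stop_group "$(cat .ecluse/pids/feat-a/backoffice-app.pid)" would signal the pnpm wrapper and every descendant still in its group. Descendants that moved to another group or session are not reached, which is where options 2 and 3 come in.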

Option 2: Walk the process tree.

Use pgrep -P <pid> recursively (or platform equivalent) to enumerate descendants before killing. Slower than killpg but works even if some descendant has called setsid().
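A minimal sketch of that walk, assuming pgrep is available (it is on macOS and in procps on Linux); descendants and stop_tree are hypothetical helper names. The tree is snapshotted first, because once the wrapper dies its children are re-parented and can no longer be found through it.

```shell
# descendants PID
# Hypothetical helper (not part of ecluse): print every descendant of PID,
# one per line; each process is listed after its own descendants.
descendants() {
  local child
  for child in $(pgrep -P "$1"); do
    descendants "$child"
    echo "$child"
  done
}

# stop_tree PID
# Snapshot the tree while the parent links are still intact, then signal
# the descendants and the root in one kill call.
stop_tree() {
  local pids
  pids=$(descendants "$1")
  # $pids is unquoted on purpose: one argument per PID.
  # Some processes may already have exited, so a failure here is ignored.
  kill -s TERM $pids "$1" 2>/dev/null || true
}
```

There is a small window between the snapshot and the kill in which a PID could exit and be reused, so a real implementation would probably want to re-check each PID's parent or start time before signalling.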

Option 3 (defensive, even with option 1 or 2): on ecluse down/ecluse flush, also enumerate every process whose cwd is inside the worktree directory and SIGKILL it. This catches orphans that have already detached. On macOS: lsof +D <worktree> (any process with a file open under the worktree) or lsof -a -p <pid> -d cwd (the cwd of a single process); on Linux: the /proc/*/cwd symlinks. This is what ecluse flush would ideally do: flush should leave zero processes referencing any of the worktree paths.
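The cwd scan could look roughly like this. It is a sketch only: pids_with_cwd_under is a hypothetical helper name, the Linux branch reads /proc directly, and the fallback branch assumes lsof is installed. It only sees processes the calling user is allowed to inspect.

```shell
# pids_with_cwd_under DIR
# Hypothetical helper (not part of ecluse): print the PID of every visible
# process whose working directory is DIR or a subdirectory of it.
pids_with_cwd_under() {
  local root link target pid
  root=$(cd "$1" && pwd -P) || return 1   # physical path, symlinks resolved
  if [ -d /proc/self/cwd ]; then
    # Linux: each /proc/<pid>/cwd is a symlink to that process's cwd.
    for link in /proc/[0-9]*/cwd; do
      target=$(readlink "$link" 2>/dev/null) || continue
      case $target in
        "$root" | "$root"/*) pid=${link#/proc/}; echo "${pid%/cwd}" ;;
      esac
    done
  else
    # macOS and others: lsof -F prints p<pid>, then n<path>, for each match.
    lsof -d cwd -Fpn 2>/dev/null | awk -v root="$root" '
      /^p/ { pid = substr($0, 2) }
      /^n/ { path = substr($0, 2)
             if (path == root || index(path, root "/") == 1) print pid }'
  fi
}
```

If the caller's own shell sits inside the worktree it will appear in the output too, so the result should be filtered against $$ before anything is signalled.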

Workaround in use today

Setting strict_port = true in .ecluse.toml and server.strictPort = true in vite configs makes the next ecluse up fail loudly instead of silently colliding, so at least the bug becomes visible. But the orphans still have to be removed by hand with kill -9. A teardown that actually cleans up would remove this manual step entirely.

Impact

  • Severity: high once you have 3+ parallel sessions and a few up/down cycles.
  • Hard to diagnose: ecluse status reports all green, the configured PID is alive, the configured port responds with HTTP 200 — but it's serving the wrong worktree's content.
  • Affects any stack that uses a wrapper script (pnpm/npm/yarn/bin/dev) whose actual server is a multi-level descendant.
