ecluse down kills top-level pnpm wrapper but vite descendants survive as orphans; ecluse flush also misses them

## Summary

`ecluse down` (and `ecluse flush`) only kill the top-level PIDs recorded in `.ecluse/pids/<slug>/<service>.pid`, which for pnpm-based stacks is the outer `pnpm` wrapper. The grandchild `vite` (or `tsx`/`node`) process that actually binds the dev-server port is left running as an orphan adopted by `init`/`launchd`. Over many `up`/`down` cycles these orphans accumulate, each holding multiple listening sockets (vite + Cloudflare workerd plugin grab 4-6+ ports per instance), causing silent cross-slot port collisions.

## Version / environment

- ecluse 0.3.1
- `process_manager = "tmux"`
- macOS 14, zsh
- Stack: pnpm workspace with Cloudflare's `@cloudflare/vite-plugin` (the plugin opens additional internal sockets beyond `--port`)

## Reproduction

```bash
# Repo has [[services]] entries like:
#   [[services]]
#   name = "backoffice-app"
#   base_port = 7300
#   port_env = "ECLUSE_BACKOFFICE_APP_PORT"
#   command = "pnpm --filter backoffice-app exec vite --port $ECLUSE_BACKOFFICE_APP_PORT"

ecluse up feat-a
ecluse up feat-b
ecluse up feat-c
ecluse up feat-d

# Snapshot ecluse-tracked PIDs:
for s in feat-a feat-b feat-c feat-d; do
  echo "=== $s ==="
  cat .ecluse/pids/$s/backoffice-app.pid
done
# All point at the outer pnpm wrappers, e.g. PID 89891.

# Snapshot what actually binds the configured 73xx ports:
pgrep -fl "vite.js --port" | head
# Output shows:
#   90252 node ./node_modules/.bin/../vite/bin/vite.js --port 7301   <- slot 1's actual vite
#   92855 node ./node_modules/.bin/../vite/bin/vite.js --port 7302   <- slot 2's
#   ... etc
# Note: PIDs are very different from the recorded .pid values — the recorded
# PID is the outer pnpm; the vite PID is a grandchild.

# Tear down all four:
for s in feat-a feat-b feat-c feat-d; do ecluse down "$s" --keep-worktree; done

# Check what survived:
pgrep -fl "vite.js --port" | head
# Result: vite grandchildren may still be running, e.g.:
#   81906 node ./node_modules/.bin/../vite/bin/vite.js --port 7299
# (Note port 7299 — this one auto-bumped from a previous startup retry and
# is now holding 8 ports across multiple slots' ranges.)

# Try ecluse flush — even this leaves the orphan alive:
ecluse flush --yes
pgrep -fl "vite.js --port"
#   81906 node ./node_modules/.bin/../vite/bin/vite.js --port 7299   <- still alive!

# Only `kill -9 81906` actually removes it.
```

## Why this matters

Each leftover vite/workerd instance holds **multiple** sockets, not just `--port`:
- Vite's HMR port
- Cloudflare workerd's inspector port (default 9229)
- The proxy worker socket (random high port via `127.0.0.1:0`)
- The vite-plugin's secondary HMR port

A single orphan from an earlier run can hold 6-8 ports. Some of those ports land in the range a future `ecluse up` configures for a different slot. That `ecluse up` still succeeds because ecluse only checks the *configured* port (e.g. 7301), and vite's own auto-port-bumping then silently moves the new instance to a port that overlaps with yet another slot. Net effect: parallel worktrees end up serving each other's content, the operator can't tell which URL maps to which worktree, and `ecluse status` reports "all healthy" because the recorded PIDs are alive.

The bug only becomes visible when the user navigates to `http://localhost:7301` expecting slot 1 and sees slot 4's branch instead — at which point they have no way to find out what's actually serving the response without `lsof` and manual port tracing.

## Suggested fix

**Option 1: Kill the process group, not just the PID.**

When ecluse spawns a service via `tmux send-keys`, the command runs as a foreground job of the pane's interactive shell, which places it in a new process group. On teardown, instead of `kill <pid>`, use `kill -- -<pgid>` to signal the whole group. A forked child inherits its parent's PGID and keeps it unless it calls `setpgid()` or `setsid()`, so for pnpm + vite chains all descendants should be in the same group. The PGID is best looked up from the recorded PID rather than assumed to equal it.

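```rust
use nix::{sys::signal::{killpg, Signal}, unistd::{getpgid, Pid}};
// Sketch. Instead of signalling only the recorded PID:
//   nix::sys::signal::kill(Pid::from_raw(pid), Signal::SIGTERM)?;
// resolve its process group and signal every member:
let pgid = getpgid(Some(Pid::from_raw(pid)))?;
killpg(pgid, Signal::SIGTERM)?;
```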
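The group-wide signal can also be tried by hand from a shell before touching ecluse itself. The sketch below is only an illustration: `kill_group` is a hypothetical helper, not an ecluse command, and it assumes a `ps` that supports `-o pgid=` (true of both macOS and procps on Linux).

```shell
#!/usr/bin/env bash
# kill_group PID [SIGNAL]: signal every process in the group PID belongs to.
# Hypothetical helper for illustration; not part of ecluse.
kill_group() {
  local pid=$1 sig=${2:-TERM} pgid own
  # Look up the group ID rather than assuming PID is the group leader.
  pgid=$(ps -o pgid= -p "$pid" | tr -d ' ')
  [ -n "$pgid" ] || { echo "no such process: $pid" >&2; return 1; }
  # Refuse to signal our own group, which would take the caller down too.
  own=$(ps -o pgid= -p $$ | tr -d ' ')
  if [ "$pgid" = "$own" ]; then
    echo "refusing to signal own process group $pgid" >&2
    return 1
  fi
  # A negative target means "every process in this group".
  kill -s "$sig" -- "-$pgid"
}
```

For example, `kill_group "$(cat .ecluse/pids/feat-a/backoffice-app.pid)"` would send SIGTERM to the pnpm wrapper and to any descendant still in its group.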

**Option 2: Walk the process tree.**

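Use `pgrep -P <pid>` recursively (or the platform equivalent) to enumerate descendants before killing. This is slower than `killpg` but still reaches descendants that have moved to a new group or session via `setpgid()`/`setsid()`, as long as their parent chain is intact. The tree has to be enumerated before any signal is sent, because once a parent exits its children are reparented and can no longer be reached from the recorded PID.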
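A minimal shell sketch of the tree walk, assuming `pgrep -P` is available (it is on macOS and on Linux with procps). `descendants` and `kill_tree` are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# descendants PID: print every descendant of PID, depth-first,
# each parent before its own children.
descendants() {
  local parent=$1 child
  for child in $(pgrep -P "$parent"); do
    echo "$child"
    descendants "$child"
  done
}

# kill_tree PID [SIGNAL]: snapshot the whole tree first, then signal it.
# Snapshotting first matters: once a parent dies, its children are
# reparented and `pgrep -P` no longer finds them under PID.
kill_tree() {
  local pid=$1 sig=${2:-TERM} targets
  targets=$(descendants "$pid")
  # $targets is intentionally unquoted so each PID becomes its own argument.
  # kill returns non-zero if any listed PID has already exited.
  kill -s "$sig" $targets "$pid" 2>/dev/null
}
```

Sending one signal to the full snapshot avoids the race where the wrapper exits first and the rest of the tree becomes unreachable.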

**Option 3 (defensive, even with option 1 or 2):** on `ecluse down`/`ecluse flush`, also enumerate every process whose `cwd` is inside the worktree directory and SIGKILL it. This catches orphans that have already detached. On macOS, filter the output of `lsof -d cwd` by the worktree path (`lsof +D <worktree>` is broader and slower: it reports every open file under the tree, not just working directories); on Linux, read the `/proc/*/cwd` symlinks. This is what `ecluse flush` would *ideally* do: flush should leave zero processes referencing any of the worktree paths.
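The cwd sweep could look roughly like the following shell sketch. `pids_with_cwd` is a hypothetical helper; the Linux branch reads `/proc`, and the fallback branch assumes `lsof` is installed and parses its `-F` field output (`p<pid>` and `n<path>` lines).

```shell
#!/usr/bin/env bash
# pids_with_cwd DIR: print PIDs whose working directory is DIR or below it.
# Hypothetical helper for illustration; not part of ecluse.
pids_with_cwd() {
  local dir link cwd pid
  # Resolve symlinks so the comparison uses the canonical path.
  dir=$(cd "$1" && pwd -P) || return 1
  if [ -d /proc/self/cwd ]; then
    # Linux: /proc/<pid>/cwd is a symlink to that process's cwd.
    for link in /proc/[0-9]*/cwd; do
      cwd=$(readlink "$link" 2>/dev/null) || continue
      case "$cwd" in
        "$dir"|"$dir"/*) pid=${link#/proc/}; echo "${pid%/cwd}" ;;
      esac
    done
  else
    # macOS and others: list cwd entries only, one field per line.
    lsof -d cwd -Fpn 2>/dev/null | awk -v dir="$dir" '
      /^p/ { pid = substr($0, 2) }
      /^n/ { path = substr($0, 2)
             if (path == dir || index(path, dir "/") == 1) print pid }'
  fi
}
```

A caller running from inside the worktree would need to exclude its own PID (`$$`) from the result before sending any signal.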

## Workaround in use today

Setting `strict_port = true` in `.ecluse.toml` and `server.strictPort = true` in the vite configs makes the next `ecluse up` fail loudly instead of silently colliding, so at least the bug becomes visible. The orphans still have to be removed by hand with `kill -9`, though. A teardown that actually cleans up would remove this manual step entirely.

## Impact

- Severity: high once you have 3+ parallel sessions and a few `up`/`down` cycles.
- Hard to diagnose: `ecluse status` reports all green, the configured PID is alive, the configured port responds with HTTP 200 — but it's serving the *wrong* worktree's content.
- Affects any stack that uses a wrapper script (pnpm, npm, yarn, `bin/dev`) whose actual server is a multi-level descendant.
