fix(jobs): kill/reaper lifecycle correctness + tests#2
Conversation
Each job's `sh` now leads its own process group (process_group(0) on unix), and kill signals the negative pgid via `kill -- -<pgid>`. Without this, killing only the shell's pid orphaned any children it spawned (e.g. `cmd &`), leaving them running after a kill. Test `kill_reaches_descendants` proves a backgrounded grandchild is reaped by the kill; verified it fails without the process-group change. Co-Authored-By: Claude <noreply@anthropic.com>
- kill() now SIGTERMs the process group, then SIGKILLs after a grace period if it hasn't exited (escalation via signal_group helper). - Return false when the job is not Running (unknown id or already finished) — nothing to signal; previously always returned true. - Wait out the grace event-driven on the job's completion watch flag (exited_within) instead of polling. - Tests: unknown id, finished job, and TERM-ignored escalation to KILL. Co-Authored-By: Claude <noreply@anthropic.com>
The hourly reaper evicted aged-out jobs (map.remove + log delete) but never signalled a still-`Running` job's process group, orphaning the whole process tree. Now it kills the group (TERM->KILL escalation) before eviction, reusing the same `kill_job` helper as `job(action="kill")`. mod.rs crossed 300 LOC, so per the module convention the reaper + signal helpers move to `src/jobs/reaper.rs`. `reap_once(jobs, retention)` is extracted with an injectable retention so the new test drives a real eviction with zero retention and asserts the descendant is reaped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- kill_terminates_child_process: bg sleep 1000, kill, assert state leaves Running - list_reports_all_jobs: two jobs, assert both appear sorted in list() - kill_unknown_id_returns_false already existed; no duplicate added Co-Authored-By: Claude <noreply@anthropic.com>
Adapted from the developerz-ai house template (db-mcp-gateway), scoped to mcp-ssh's modules and CLAUDE.md NEVER/Conventions rules. Beta features on (early_access), concise path instructions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThe PR adds CodeRabbit repository rules and updates job handling to use process groups, a separate reaper module, and completion signaling. Unix tests expand around kill behavior, stale-job reaping, and job listing/polling. ChangesRepository guidance
Job process-group lifecycle
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.coderabbit.yaml:
- Around line 132-136: The path_filters in the .coderabbit.yaml config are
excluding lockfiles, including Cargo.lock, from review. Remove the lockfile
ignore entries from the path_filters list so dependency changes in the Rust
binary are visible during review, while keeping the other target/** exclusions
intact.
In `@src/jobs/mod.rs`:
- Around line 32-38: Replace the bare `Option<u32>` process-group identifier in
`Job` with a private `ProcessGroupId` newtype and thread that type through the
job lifecycle instead of using the raw primitive. Update `Job::run`,
`Job::kill`, and the `reaper` integration so they construct, store, and pass
`ProcessGroupId` consistently, while keeping the underlying value encapsulated.
Make sure any matching logic or signaling in `src/jobs/reaper.rs` uses the
newtype rather than `u32`, and preserve the existing semantics for the “no pid”
case with `Option<ProcessGroupId>`.
- Around line 466-480: The list test currently claims to cover a backgrounded
job but both store.run calls in the job listing test use background=false, so
the background execution path is not exercised. Update the test around store.run
and store.list in src/jobs/mod.rs so one job is launched with background=true
and the assertions verify that list includes both the inline and backgrounded
commands, ensuring the new public behavior is actually covered.
In `@src/jobs/reaper.rs`:
- Around line 20-32: The kill_job flow currently reports success even when
signal_group fails to actually send TERM or KILL to the process group. Update
kill_job to check and propagate signaling failures from signal_group before
returning true, and only continue to the KILL escalation path if TERM was
successfully delivered; use the existing kill_job, signal_group, and
exited_within helpers to keep the TERM->KILL behavior while ensuring false is
returned on any signal error.
- Around line 70-87: In reap_once in reaper.rs, stale jobs are being removed
from the jobs map before kill_job() runs, which leaves still-running process
groups untracked during TERM grace/escalation. Change the ordering so each stale
Arc<Job> is passed to kill_job() first, then remove its entry from the
Mutex<HashMap<String, Arc<Job>>> afterward, and only then delete the log file;
keep the stale collection logic intact while ensuring running jobs remain
pollable/killable until termination completes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef757399-409e-427e-b0d2-90bd3e5ef6c6
📒 Files selected for processing (4)
.coderabbit.yamlCLAUDE.mdsrc/jobs/mod.rssrc/jobs/reaper.rs
- reaper: signal_group returns delivery status; kill_job reports false when TERM/KILL never reach the group - reaper: reap_once kills stale jobs before evicting them, so a still-Running group is never orphaned mid-grace - jobs: wrap pgid in a private ProcessGroupId newtype - test: list_reports_all_jobs now exercises the backgrounded path - coderabbit: stop filtering Cargo.lock/lockfiles out of review Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/jobs/reaper.rs`:
- Around line 95-106: The reaper currently removes every stale job from the jobs
map even when kill delivery fails in `kill_job`, which causes running jobs to be
evicted prematurely. Update `reap_once` in `src/jobs/reaper.rs` to inspect the
result of `kill_job(job).await` and only remove entries from `jobs.lock().await`
when the job is finished or when the running job’s group was successfully
signalled; keep failed running jobs tracked for a later reap attempt. Use the
existing `kill_job`, `reap_once`, and `jobs` map logic to separate successful
kills from failed ones before calling `remove`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 376058fd-c1c4-42af-86a5-3a9c241c2548
📒 Files selected for processing (3)
.coderabbit.yamlsrc/jobs/mod.rssrc/jobs/reaper.rs
💤 Files with no reviewable changes (1)
- .coderabbit.yaml
reap_once removed every stale job and deleted its log even when kill_job failed on a still-Running job, orphaning the live process group and deleting its log out from under it. Only evict jobs that finished or whose group was successfully signalled; keep failed running jobs tracked for a later reap pass. Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/jobs/mod.rs`:
- Around line 453-458: The test around reaper::reap_once is currently using
tokio::time::Instant::now(), which can skip the stale path when retention is
Duration::ZERO instead of deterministically exercising kill-failure handling.
Backdate the job’s started timestamp in the test setup within src/jobs/mod.rs so
duration_since(started) is definitely greater than the retention, and the reaper
branch that attempts to kill a stale job is always hit. Use the existing job
insertion and reaper::reap_once test case to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e2df70c-4af1-445f-9452-b5557bd74ecb
📒 Files selected for processing (2)
src/jobs/mod.rssrc/jobs/reaper.rs
- reaper_keeps_running_job_when_kill_fails used Instant::now(), letting the test pass without exercising the kill-failure path under ZERO retention - Backdate started by 1s so duration_since > retention is guaranteed Co-Authored-By: Claude <noreply@anthropic.com>
Summary
kill_terminates_child_process,kill_unknown_id_returns_false,list_reports_all_jobs(+ earlier:kill_reaches_descendants,kill_escalates_to_sigkill_when_term_ignored,reaper_kills_running_job_before_eviction)Tests
19 pass, 0 fail
Summary by CodeRabbit