Skip to content

fix(jobs): kill/reaper lifecycle correctness + tests#2

Merged
sebyx07 merged 8 commits into
mainfrom
fix/job-kill-reaper-correctness
Jun 26, 2026
Merged

fix(jobs): kill/reaper lifecycle correctness + tests#2
sebyx07 merged 8 commits into
mainfrom
fix/job-kill-reaper-correctness

Conversation

@sebyx07

@sebyx07 sebyx07 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Spawn each job in its own process group so kills reach descendants
  • Kill the process group with TERM→KILL escalation; return accurate bool status
  • Reaper kills still-Running jobs before eviction (was leaking processes)
  • Tests: kill_terminates_child_process, kill_unknown_id_returns_false, list_reports_all_jobs (+ earlier: kill_reaches_descendants, kill_escalates_to_sigkill_when_term_ignored, reaper_kills_running_job_before_eviction)

Tests

19 pass, 0 fail

Summary by CodeRabbit

  • Bug Fixes
    • Improved job termination to stop descendant processes via process-group handling, with TERM→SIGKILL escalation when a job doesn’t exit in time.
    • Refined job state transitions during kill/reaper flows to improve reliability and correctness.
  • New Features
    • Added an hourly job reaper that evicts jobs older than 24 hours, terminates their process groups, and removes associated logs.
  • Documentation
    • Updated internal guidance to reflect the split job engine and reaper structure.
  • Tests
    • Expanded Unix-gated test coverage for process-group killing, escalation behavior, and reaper eviction outcomes.
  • Chores
    • Updated automated review and linting/security configuration to enforce stricter standards.

sebyx07 and others added 5 commits June 26, 2026 16:59
Each job's `sh` now leads its own process group (process_group(0) on
unix), and kill signals the negative pgid via `kill -- -<pgid>`. Without
this, killing only the shell's pid orphaned any children it spawned
(e.g. `cmd &`), leaving them running after a kill.

Test `kill_reaches_descendants` proves a backgrounded grandchild is
reaped by the kill; verified it fails without the process-group change.

Co-Authored-By: Claude <noreply@anthropic.com>
- kill() now SIGTERMs the process group, then SIGKILLs after a grace
  period if it hasn't exited (escalation via signal_group helper).
- Return false when the job is not Running (unknown id or already
  finished) — nothing to signal; previously always returned true.
- Wait out the grace event-driven on the job's completion watch flag
  (exited_within) instead of polling.
- Tests: unknown id, finished job, and TERM-ignored escalation to KILL.

Co-Authored-By: Claude <noreply@anthropic.com>
The hourly reaper evicted aged-out jobs (map.remove + log delete) but
never signalled a still-`Running` job's process group, orphaning the whole
process tree. Now it kills the group (TERM->KILL escalation) before
eviction, reusing the same `kill_job` helper as `job(action="kill")`.

mod.rs crossed 300 LOC, so per the module convention the reaper + signal
helpers move to `src/jobs/reaper.rs`. `reap_once(jobs, retention)` is
extracted with an injectable retention so the new test drives a real
eviction with zero retention and asserts the descendant is reaped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- kill_terminates_child_process: bg sleep 1000, kill, assert state leaves Running
- list_reports_all_jobs: two jobs, assert both appear sorted in list()
- kill_unknown_id_returns_false already existed; no duplicate added

Co-Authored-By: Claude <noreply@anthropic.com>
Adapted from the developerz-ai house template (db-mcp-gateway), scoped to
mcp-ssh's modules and CLAUDE.md NEVER/Conventions rules. Beta features on
(early_access), concise path instructions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sebyx07 sebyx07 added the claudetm Created by Claude Task Master label Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4d212621-e8b3-4e10-b6a8-77dcf5e31a49

📥 Commits

Reviewing files that changed from the base of the PR and between 2c1ab88 and 525e009.

📒 Files selected for processing (1)
  • src/jobs/mod.rs

Walkthrough

The PR adds CodeRabbit repository rules and updates job handling to use process groups, a separate reaper module, and completion signaling. Unix tests expand around kill behavior, stale-job reaping, and job listing/polling.

Changes

Repository guidance

Layer / File(s) Summary
CodeRabbit configuration
.coderabbit.yaml
Top-level review settings, tool selection, path instructions, path filters, chat behavior, and knowledge-base settings are added.
CLAUDE references
CLAUDE.md
The job engine module map and lookup guidance now reference src/jobs/mod.rs and src/jobs/reaper.rs.

Job process-group lifecycle

Layer / File(s) Summary
Job wiring and state
src/jobs/mod.rs
The job store uses the reaper submodule, tracks process-group ids and completion state, and starts shells in their own process groups.
Kill and reaper logic
src/jobs/mod.rs, src/jobs/reaper.rs
kill delegates to reaper::kill_job, and the new reaper module removes stale jobs, kills their process groups, and deletes log files.
Job tests
src/jobs/mod.rs
Job behavior tests cover inline completion, backgrounding, descendant termination, TERM-to-KILL escalation, reaper eviction, retention on kill failure, list ordering, and poll pagination.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main job kill/reaper lifecycle fix and mentions the added tests.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.coderabbit.yaml:
- Around line 132-136: The path_filters in the .coderabbit.yaml config are
excluding lockfiles, including Cargo.lock, from review. Remove the lockfile
ignore entries from the path_filters list so dependency changes in the Rust
binary are visible during review, while keeping the other target/** exclusions
intact.

In `@src/jobs/mod.rs`:
- Around line 32-38: Replace the bare `Option<u32>` process-group identifier in
`Job` with a private `ProcessGroupId` newtype and thread that type through the
job lifecycle instead of using the raw primitive. Update `Job::run`,
`Job::kill`, and the `reaper` integration so they construct, store, and pass
`ProcessGroupId` consistently, while keeping the underlying value encapsulated.
Make sure any matching logic or signaling in `src/jobs/reaper.rs` uses the
newtype rather than `u32`, and preserve the existing semantics for the “no pid”
case with `Option<ProcessGroupId>`.
- Around line 466-480: The list test currently claims to cover a backgrounded
job but both store.run calls in the job listing test use background=false, so
the background execution path is not exercised. Update the test around store.run
and store.list in src/jobs/mod.rs so one job is launched with background=true
and the assertions verify that list includes both the inline and backgrounded
commands, ensuring the new public behavior is actually covered.

In `@src/jobs/reaper.rs`:
- Around line 20-32: The kill_job flow currently reports success even when
signal_group fails to actually send TERM or KILL to the process group. Update
kill_job to check and propagate signaling failures from signal_group before
returning true, and only continue to the KILL escalation path if TERM was
successfully delivered; use the existing kill_job, signal_group, and
exited_within helpers to keep the TERM->KILL behavior while ensuring false is
returned on any signal error.
- Around line 70-87: In reap_once in reaper.rs, stale jobs are being removed
from the jobs map before kill_job() runs, which leaves still-running process
groups untracked during TERM grace/escalation. Change the ordering so each stale
Arc<Job> is passed to kill_job() first, then remove its entry from the
Mutex<HashMap<String, Arc<Job>>> afterward, and only then delete the log file;
keep the stale collection logic intact while ensuring running jobs remain
pollable/killable until termination completes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef757399-409e-427e-b0d2-90bd3e5ef6c6

📥 Commits

Reviewing files that changed from the base of the PR and between 0dda8fc and 8d30efc.

📒 Files selected for processing (4)
  • .coderabbit.yaml
  • CLAUDE.md
  • src/jobs/mod.rs
  • src/jobs/reaper.rs

Comment thread .coderabbit.yaml Outdated
Comment thread src/jobs/mod.rs
Comment thread src/jobs/mod.rs Outdated
Comment thread src/jobs/reaper.rs
Comment thread src/jobs/reaper.rs
- reaper: signal_group returns delivery status; kill_job reports
  false when TERM/KILL never reach the group
- reaper: reap_once kills stale jobs before evicting them, so a
  still-Running group is never orphaned mid-grace
- jobs: wrap pgid in a private ProcessGroupId newtype
- test: list_reports_all_jobs now exercises the backgrounded path
- coderabbit: stop filtering Cargo.lock/lockfiles out of review

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/jobs/reaper.rs`:
- Around line 95-106: The reaper currently removes every stale job from the jobs
map even when kill delivery fails in `kill_job`, which causes running jobs to be
evicted prematurely. Update `reap_once` in `src/jobs/reaper.rs` to inspect the
result of `kill_job(job).await` and only remove entries from `jobs.lock().await`
when the job is finished or when the running job’s group was successfully
signalled; keep failed running jobs tracked for a later reap attempt. Use the
existing `kill_job`, `reap_once`, and `jobs` map logic to separate successful
kills from failed ones before calling `remove`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 376058fd-c1c4-42af-86a5-3a9c241c2548

📥 Commits

Reviewing files that changed from the base of the PR and between 8d30efc and 09dcbfe.

📒 Files selected for processing (3)
  • .coderabbit.yaml
  • src/jobs/mod.rs
  • src/jobs/reaper.rs
💤 Files with no reviewable changes (1)
  • .coderabbit.yaml

Comment thread src/jobs/reaper.rs Outdated
reap_once removed every stale job and deleted its log even when
kill_job failed on a still-Running job, orphaning the live process
group and deleting its log out from under it. Only evict jobs that
finished or whose group was successfully signalled; keep failed
running jobs tracked for a later reap pass.

Co-Authored-By: Claude <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/jobs/mod.rs`:
- Around line 453-458: The test around reaper::reap_once is currently using
tokio::time::Instant::now(), which can skip the stale path when retention is
Duration::ZERO instead of deterministically exercising kill-failure handling.
Backdate the job’s started timestamp in the test setup within src/jobs/mod.rs so
duration_since(started) is definitely greater than the retention, and the reaper
branch that attempts to kill a stale job is always hit. Use the existing job
insertion and reaper::reap_once test case to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e2df70c-4af1-445f-9452-b5557bd74ecb

📥 Commits

Reviewing files that changed from the base of the PR and between 09dcbfe and 2c1ab88.

📒 Files selected for processing (2)
  • src/jobs/mod.rs
  • src/jobs/reaper.rs

Comment thread src/jobs/mod.rs Outdated
- reaper_keeps_running_job_when_kill_fails used Instant::now(), letting
  the test pass without exercising the kill-failure path under ZERO retention
- Backdate started by 1s so duration_since > retention is guaranteed

Co-Authored-By: Claude <noreply@anthropic.com>
@sebyx07 sebyx07 merged commit a7621dd into main Jun 26, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

claudetm Created by Claude Task Master

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant