Make timeout e2e test reliably exercise the watchdog by gnovak · Pull Request #625 · gnovak/remote-dev-bot

gnovak · 2026-05-23T05:59:56Z

Why

The most recent full-suite run failed the (now-tightened) timeout assertion because the agent scope-reduced the vague refactor task and finished in 50s instead of being killed by the 1-minute watchdog:

```
Tool: finish({'success': True, 'explanation': 'Created utils.py with type hints
and docstrings, and test_utils.py with 14 passing unit tests.'})
```

The test had been silently coasting for months on the lax "no comment found = PASS" assertion that PR #624 replaced. The tightened assertion correctly fails, but the underlying issue is that the test no longer exercises the watchdog: the task is too soft and the timeout too long, so the agent's improved wrapup/scoping just routes around it.

Fix

Two changes, both needed (defense in depth):

- **Concrete uncompressible task**: replace "refactor every file to follow best practices..." with "create exactly 50 Python files named `day_01.py`..`day_50.py` with this exact body". The agent can no longer satisfy the requirement by writing one demonstrative file; it's all 50 or explicit failure.
- **Shorter watchdog**: cut `timeout_minutes` from 1 to 0.5. Even scope-reduced work doesn't fit in 30s. This required adding fractional support: `timeout_minutes` is now `float` in `lib/config.py`'s `ALLOWED_ARGS`, and the 5 workflow watchdog steps change from `$((TIMEOUT_MINUTES*60))` (bash integer math, which cannot handle fractional values like 0.5) to `awk -v m="$TIMEOUT_MINUTES" 'BEGIN{print int(m*60)}'` (handles int and float, rounds down to whole seconds).
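The minutes-to-seconds conversion described in the second item can be illustrated outside the workflow. What follows is a minimal Python sketch, not the workflow's code: `timeout_seconds` is a hypothetical helper that mirrors the awk conversion (multiply by 60, truncate to whole seconds), and `integer_only_seconds` is a hypothetical stand-in for integer-only arithmetic, which rejects a value like `0.5`.

```python
def timeout_seconds(minutes: str) -> int:
    """Convert a minutes value (possibly fractional) to whole seconds.

    Mirrors the awk conversion: multiply by 60, then truncate.
    """
    return int(float(minutes) * 60)


def integer_only_seconds(minutes: str) -> int:
    """Integer-only conversion; raises ValueError on fractional input such as "0.5"."""
    return int(minutes) * 60


if __name__ == "__main__":
    # Same inputs as the spot check in the test plan.
    for value in ("0.5", "1", "60"):
        print(value, "->", timeout_seconds(value))
```

For the inputs 0.5, 1, and 60 this prints 30, 60, and 3600 seconds, matching the awk spot check listed in the test plan.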

The float type also leaves the door open for users who want `timeout_minutes=1.5` in production configs.
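One way a float-typed `timeout_minutes` could be validated is sketched below. This is an illustration under stated assumptions: the real shape of `ALLOWED_ARGS` in `lib/config.py` is not shown in this PR, so the dictionary layout, the `coerce_arg` helper, and the positivity check are all hypothetical.

```python
import math

# Hypothetical layout: argument name -> type used to coerce the raw string.
# The actual ALLOWED_ARGS in lib/config.py may be structured differently.
ALLOWED_ARGS = {"timeout_minutes": float}


def coerce_arg(name: str, raw: str):
    """Coerce a raw string argument to its declared type, rejecting unknown names."""
    if name not in ALLOWED_ARGS:
        raise ValueError(f"unknown argument: {name}")
    value = ALLOWED_ARGS[name](raw)  # float("abc") raises ValueError on its own
    # Assumed sanity check: a timeout must be a finite, positive number.
    if isinstance(value, float) and not (math.isfinite(value) and value > 0):
        raise ValueError(f"{name} must be a positive number, got {raw!r}")
    return value
```

Under this sketch, `coerce_arg("timeout_minutes", "1.5")` returns `1.5`, and integer strings such as `"1"` still parse (to `1.0`), so whole-minute configs would keep working.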

Test plan

- Pytest: 655 passed
- `bash -n tests/e2e.sh`
- Awk math spot-checked for inputs 0.5, 1, 60 → 30, 60, 3600
- Next full-test-suite run should show `timeout: PASS (watchdog fired + comment posted)`

🤖 Generated with Claude Code

gnovak merged commit a12967e into dev May 23, 2026

gnovak deleted the fix-timeout-e2e-test branch June 13, 2026 03:43
