Make timeout e2e test reliably exercise the watchdog #625
Merged
The most recent full-suite run failed the (now-tightened) timeout
assertion because the agent scope-reduced the vague refactor task and
finished in 50s instead of being killed by the 1-minute watchdog:
Tool: finish({'success': True, 'explanation': 'Created utils.py with
type hints and docstrings, and test_utils.py with 14 passing
unit tests.'})
The test had been silently coasting on the lax "no comment found =
PASS" assertion. PR #624 tightened it to FAIL, which surfaced this:
the watchdog isn't actually being exercised because the task is too
soft and the timeout too long.
## Fix
Two changes, both needed:
- **Concrete, incompressible task**: replace "refactor every file to
follow best practices..." with "create exactly 50 Python files
named day_01.py..day_50.py with this exact body." The agent can
no longer satisfy the requirement by writing one demonstrative
file; it's all 50 or explicit failure.
- **Shorter watchdog**: cut `timeout_minutes` from 1 to 0.5. Even
scope-reduced work doesn't fit in 30s. This required adding fractional
support: `timeout_minutes` is now `float` in lib/config.py's
ALLOWED_ARGS, and the arithmetic in the workflow's 5
watchdog steps changes from `$((TIMEOUT_MINUTES*60))` (bash
integer arithmetic, which fails with a syntax error on a fractional
value such as 0.5) to
`awk -v m="$TIMEOUT_MINUTES" 'BEGIN{print int(m*60)}'`
(handles int and float, truncating to whole seconds).
The float type also leaves the door open for users who want
timeout_minutes=1.5 or similar in production configs.
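
As a rough illustration of the conversion the watchdog steps now perform, here is a minimal Python sketch that mirrors the `awk` expression `int(m*60)`. The function name `timeout_seconds` is hypothetical and is not taken from lib/config.py; the sketch only assumes that the value may arrive as a string (as it would from an environment variable) or as a number.

```python
def timeout_seconds(timeout_minutes) -> int:
    """Convert a timeout in minutes (int, float, or numeric string)
    to whole seconds, truncating any fractional second.

    Hypothetical helper: mirrors awk's int(m*60), not the project's code.
    """
    minutes = float(timeout_minutes)  # accepts "1", "0.5", 1, 1.5, ...
    return int(minutes * 60)          # int() truncates toward zero


if __name__ == "__main__":
    print(timeout_seconds("0.5"))  # 30
    print(timeout_seconds(1))      # 60
    print(timeout_seconds("1.5"))  # 90
```

Parsing through `float` first is what lets the same code path accept both the old integer values and the new fractional ones.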
655 unit tests still pass.
Why
For months, the test had been silently coasting on the lax "no comment found = PASS" assertion that PR #624 replaced. The tightened assertion correctly fails, but the underlying issue is that the test no longer exercises the watchdog: the agent's improved wrap-up and scoping behavior simply routes around it.
🤖 Generated with Claude Code