
Make timeout e2e test reliably exercise the watchdog #625

Merged
gnovak merged 1 commit into `dev` from `fix-timeout-e2e-test` on May 23, 2026

gnovak (Owner) commented on May 23, 2026

Why

The most recent full-suite run failed the (now-tightened) timeout assertion because the agent scope-reduced the vague refactor task and finished in 50s instead of being killed by the 1-minute watchdog:

```
Tool: finish({'success': True, 'explanation': 'Created utils.py with type hints
and docstrings, and test_utils.py with 14 passing unit tests.'})
```

The test had been silently coasting for months on the lax "no comment found = PASS" assertion that preceded PR #624. The tightened assertion correctly fails, but the underlying issue is that the test no longer exercises the watchdog: the agent's improved wrapup and scoping simply route around it.
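
To make the failure mode concrete, here is a minimal sketch of a watchdog. It assumes GNU coreutils `timeout` as a stand-in for the workflow's real watchdog steps, which this PR does not show. `run_with_watchdog` is a hypothetical helper, not something in the repository.

```shell
# Hypothetical stand-in for the workflow watchdog; the real steps are not shown in this PR.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is reached.
run_with_watchdog() {
  limit_seconds="$1"
  shift
  status=0
  timeout "$limit_seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "watchdog fired"
  else
    echo "finished on its own (exit $status)"
  fi
}

# A task that outlasts the limit is killed by the watchdog.
run_with_watchdog 1 sleep 5

# A task that finishes early never triggers it, so nothing about the watchdog is tested.
run_with_watchdog 5 sleep 1
```

The second call mirrors the failing run: a 50s task under a 60s limit exits cleanly, so a test that expects the watchdog to fire has nothing to observe.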

Fix

Two changes, both needed (defense in depth):

  • Concrete uncompressible task: replace "refactor every file to follow best practices..." with "create exactly 50 Python files named `day_01.py`..`day_50.py` with this exact body". The agent can no longer satisfy the requirement by writing one demonstrative file — it's all 50 or explicit failure.
  • Shorter watchdog: cut `timeout_minutes` from 1 to 0.5. Even scope-reduced work doesn't fit in 30s. This required adding fractional support: `timeout_minutes` is now `float` in `lib/config.py`'s `ALLOWED_ARGS`, and the 5 workflow watchdog steps change from `$((TIMEOUT_MINUTES*60))` (bash integer math, which does not accept fractional values) to `awk -v m=… 'BEGIN{print int(m*60)}'`.
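
The "all 50 or explicit failure" property of the first bullet can be sketched as a completeness check. `all_days_present` is a hypothetical helper for illustration only. Because the required file body is not quoted in this PR, the sketch checks file names and leaves the contents unverified.

```shell
# Hypothetical check: succeed only if day_01.py through day_50.py all exist in directory $1.
# File contents are not inspected because the required body is not given in the PR.
all_days_present() {
  dir="$1"
  for i in $(seq 1 50); do
    name=$(printf 'day_%02d.py' "$i")
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name"
      return 1
    fi
  done
  echo "all 50 files present"
}
```

A single demonstrative file fails this check immediately at `day_02.py`, which is why the task cannot be scope-reduced the way the vague refactor prompt was.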

The float type also leaves the door open for users who want `timeout_minutes=1.5` in production configs.
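
As a rough illustration of the minutes-to-seconds conversion, the sketch below wraps the awk expression from the commit message in a hypothetical `to_seconds` helper. The workflow itself inlines the expression; the helper name exists only for this example. Bash's `$(( ))` handles integers only, so the fractional case is delegated to awk.

```shell
# Hypothetical wrapper around the awk expression quoted in the commit message.
# int() truncates, so the result is always a whole number of seconds.
to_seconds() {
  awk -v m="$1" 'BEGIN{print int(m*60)}'
}

for minutes in 0.5 1 1.5 60; do
  echo "$minutes min -> $(to_seconds "$minutes") s"
done
# 0.5 min -> 30 s
# 1 min -> 60 s
# 1.5 min -> 90 s
# 60 min -> 3600 s
```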

Test plan

  • Pytest: 655 passed
  • `bash -n tests/e2e.sh`
  • Awk math spot-checked for inputs 0.5, 1, 60 → 30, 60, 3600
  • Next full-test-suite run should show `timeout: PASS (watchdog fired + comment posted)`

🤖 Generated with Claude Code

The most recent full-suite run failed the (now-tightened) timeout
assertion because the agent scope-reduced the vague refactor task and
finished in 50s instead of being killed by the 1-minute watchdog:

  Tool: finish({'success': True, 'explanation': 'Created utils.py with
         type hints and docstrings, and test_utils.py with 14 passing
         unit tests.'})

The test had been silently coasting on the lax "no comment found =
PASS" assertion. PR #624 tightened it to FAIL, which surfaced this:
the watchdog isn't actually being exercised because the task is too
soft and the timeout too long.

## Fix

Two changes, both needed:

- **Concrete uncompressible task**: replace "refactor every file to
  follow best practices..." with "create exactly 50 Python files
  named day_01.py..day_50.py with this exact body." The agent can
  no longer satisfy the requirement by writing one demonstrative
  file; it's all 50 or explicit failure.

- **Shorter watchdog**: cut timeout_minutes from 1 to 0.5. Even
  scope-reduced work doesn't fit in 30s. Required adding fractional
  support: `timeout_minutes` is now `float` in lib/config.py's
  ALLOWED_ARGS, and the bash arithmetic in the workflow's 5
  watchdog steps changes from `$((TIMEOUT_MINUTES*60))` (bash
  integer math, which does not accept fractional values) to
  `awk -v m="$TIMEOUT_MINUTES" 'BEGIN{print int(m*60)}'`
  (handles int and float, rounds down to whole seconds).

The float type also leaves the door open for users who want
timeout_minutes=1.5 or similar in production configs.

655 unit tests still pass.
gnovak merged commit a12967e into dev on May 23, 2026
gnovak deleted the fix-timeout-e2e-test branch on June 13, 2026 03:43