fix(eval): stop eval CI hanging past the timeout ceiling #4

Merged
jasonodoom merged 3 commits into main from fix/eval-ci-timeout
Jun 9, 2026

@jasonodoom jasonodoom commented Jun 9, 2026

Contributor

The weekly-eval and monthly-vision-eval jobs have been getting killed at their timeout (60 min weekly, 120 min monthly) rather than completing. Every recent scheduled run shows the same shape: the eval-live step starts, emits no output, and is cancelled at the exact ceiling.

Cause

Two compounding problems.

  1. The run cannot fit in the window. eval-live generates 5 models x 2 conditions x 30 samples = 300 requests, and --sample-concurrency defaults to 1 (fully serial). Measured generation times against Cloudflare Workers AI vary by model (llama-scout ~10s, gpt-oss ~15s, qwen3 ~26s) and average roughly 15-25s per request. At 15-25s each, 300 serial requests take 75-125 minutes, so the 60-minute weekly job can never finish.

  2. The OpenAI-compatible runner issued fetch with no timeout and no AbortSignal. A single stalled upstream connection therefore blocked the whole run indefinitely, with no output, until the CI ceiling killed it. That is why the logs are silent for the full hour.
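
The arithmetic in the first point can be checked with a few lines of TypeScript. This is only a back-of-the-envelope sketch: the constant and function names are illustrative and do not come from the repository, and the model ignores any overhead other than generation time.

```typescript
// Illustrative names only; none of these identifiers exist in the repository.
const MODELS = 5;
const CONDITIONS = 2;
const SAMPLES_PER_CELL = 30;
const WEEKLY_CEILING_MINUTES = 60;

// Total requests issued by one run.
const totalRequests = MODELS * CONDITIONS * SAMPLES_PER_CELL; // 300

// Wall-clock minutes when every request runs strictly one after another.
function serialMinutes(requests: number, secondsPerRequest: number): number {
  return (requests * secondsPerRequest) / 60;
}

const bestCase = serialMinutes(totalRequests, 15);
const worstCase = serialMinutes(totalRequests, 25);

console.log(bestCase, worstCase); // 75 125
console.log(bestCase > WEEKLY_CEILING_MINUTES); // true
```

Even the optimistic end of the measured range exceeds the 60-minute weekly ceiling, which matches the behaviour seen in the scheduled runs.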

Changes

  • Pass --sample-concurrency 6 in both eval workflows. Models and conditions stay serial, so this caps in-flight requests at 6. Even at the fastest measured latency (~10s per request) that is about 36 requests per minute, comfortably under the Workers AI 300 req/min text-generation limit.
  • Add a 120s per-request AbortSignal.timeout in src/eval/runners/openai.ts. A stall now surfaces as a caught error; the per-sample handler in live.ts writes an .error.txt and the run continues instead of hanging.
  • Replace the stale default model @cf/google/gemma-3-12b-it, which was removed from the Workers AI catalog, with @cf/google/gemma-4-26b-a4b-it. All nine model slugs used across both workflows were verified against the live catalog; this was the only dead one, and it was reachable only through the no-args default, not through the workflows themselves.
  • Correct stale call-count comments (the runs are 300 samples, not 600).
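
For readers unfamiliar with the two runtime changes, the pattern looks roughly like the sketch below. It is not the code from src/eval/runners/openai.ts or live.ts: fetchWithTimeout, isTimeoutError, mapWithConcurrency and the Outcome type are hypothetical names, and the real handler writes an .error.txt file where this sketch simply records the error. The only platform APIs used are the standard fetch and AbortSignal.timeout.

```typescript
// Hypothetical names throughout; this is not the code in openai.ts or live.ts.
const REQUEST_TIMEOUT_MS = 120_000; // the 120s per-request ceiling

type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

// fetch wrapper: a stalled connection now rejects instead of waiting forever.
async function fetchWithTimeout(
  url: string,
  init: RequestInit = {},
  timeoutMs: number = REQUEST_TIMEOUT_MS,
): Promise<Response> {
  // AbortSignal.timeout() returns a signal that aborts after timeoutMs with a
  // DOMException named "TimeoutError"; fetch rejects once the signal aborts.
  return fetch(url, { ...init, signal: AbortSignal.timeout(timeoutMs) });
}

// True when an error came from an AbortSignal.timeout() expiry.
function isTimeoutError(error: unknown): boolean {
  return (
    typeof error === "object" &&
    error !== null &&
    (error as { name?: unknown }).name === "TimeoutError"
  );
}

// Run `worker` over `items` with at most `limit` calls in flight. Each failure
// is recorded for its own item, so one bad sample does not stop the others.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<Outcome<R>[]> {
  const results: Outcome<R>[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await worker(items[i], i) };
      } catch (error) {
        results[i] = { ok: false, error };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

In this sketch a sample runner would call mapWithConcurrency(samples, 6, ...) with a worker that uses fetchWithTimeout, then inspect each Outcome: a timed-out request shows up as { ok: false } with an error for which isTimeoutError returns true, while the remaining samples still complete.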

Verification

  • Full test suite: 298 passed, 13 skipped. Pre-commit gate (tsc, tests, validate-tokens) green.
  • End-to-end run against Workers AI with the two slowest models, n=3, --sample-concurrency 6: completed in 108s with zero errors and all samples produced. Extrapolated to the full roster, the run finishes well inside the ceiling.
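
One rough way to sanity-check the extrapolation is an idealized "wave" model. It assumes that models and conditions run serially, that the 30 samples in each model/condition cell run 6 at a time, and that every request takes the same time. The function name and the model itself are assumptions made for illustration; they are not the method used for the measurement above.

```typescript
// Idealized estimate; estimateRunMinutes is a hypothetical helper, not repo code.
function estimateRunMinutes(
  models: number,
  conditions: number,
  samplesPerCell: number,
  concurrency: number,
  secondsPerRequest: number,
): number {
  const cells = models * conditions; // cells are processed one after another
  const wavesPerCell = Math.ceil(samplesPerCell / concurrency);
  return (cells * wavesPerCell * secondsPerRequest) / 60;
}

// Full roster with --sample-concurrency 6 at the measured 15-25s per request.
const fastEstimate = estimateRunMinutes(5, 2, 30, 6, 15);
const slowEstimate = estimateRunMinutes(5, 2, 30, 6, 25);

console.log(fastEstimate.toFixed(1), slowEstimate.toFixed(1)); // 12.5 20.8
```

Under those assumptions the full run lands at roughly 12-21 minutes, compared with 75-125 minutes when fully serial.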

The weekly and monthly eval jobs ran 300 requests strictly serially
(--sample-concurrency defaults to 1), averaging ~15-25s each, so the
run could never finish inside the 60-minute CI ceiling and was killed
mid-run. Compounding it, the OpenAI-compatible runner issued fetch
with no timeout, so a single stalled upstream connection hung the
whole job silently until the kill.

- Pass --sample-concurrency 6 in both eval workflows (CF-only, safe)
- Add a 120s per-request AbortSignal timeout to the runner; a stall
  now surfaces as a caught error and the run continues
- Refresh the stale @cf/google/gemma-3-12b-it default to gemma-4
  (removed from the CF catalog)
@jasonodoom jasonodoom enabled auto-merge (squash) June 9, 2026 04:47
@jasonodoom jasonodoom merged commit b680995 into main Jun 9, 2026
3 checks passed
@jasonodoom jasonodoom deleted the fix/eval-ci-timeout branch June 9, 2026 04:47