fix(eval): stop eval CI hanging past the timeout ceiling by jasonodoom · Pull Request #4 · Ad-Astra-Computing/ahd

jasonodoom · 2026-06-09T04:45:21Z

The weekly-eval and monthly-vision-eval jobs have been getting killed at their timeout (60 min weekly, 120 min monthly) rather than completing. Every recent scheduled run shows the same shape: the eval-live step starts, emits no output, and is cancelled at the exact ceiling.

Cause

Two compounding problems.

The run cannot fit in the window. eval-live generates 5 models x 2 conditions x 30 samples = 300 requests, and --sample-concurrency defaults to 1 (fully serial). Measured generation times against Cloudflare Workers AI are roughly 15-25s each (gpt-oss ~15s, qwen3 ~26s, llama-scout ~10s). 300 serial requests at that rate is 75-125 minutes, so the 60-minute weekly job can never finish.
The OpenAI-compatible runner issued fetch with no timeout and no AbortSignal. A single stalled upstream connection therefore blocked the whole run indefinitely, with no output, until the CI ceiling killed it. That is why the logs are silent for the full hour.
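The arithmetic behind the first point can be checked with a short sketch. The constant and function names below are illustrative and do not come from the repository; the inputs are the figures quoted above.

```typescript
// Illustrative only: names are hypothetical, figures come from the PR description.
const MODELS = 5;
const CONDITIONS = 2;
const SAMPLES_PER_CELL = 30;

/** Total wall-clock minutes when every request runs one after another. */
function serialMinutes(requests: number, secondsPerRequest: number): number {
  return (requests * secondsPerRequest) / 60;
}

const totalRequests = MODELS * CONDITIONS * SAMPLES_PER_CELL; // 300
const bestCase = serialMinutes(totalRequests, 15); // 75
const worstCase = serialMinutes(totalRequests, 25); // 125

console.log(`${totalRequests} requests: ${bestCase}-${worstCase} min serial`);
```

Even the best case overshoots the 60-minute weekly ceiling, before any stalled connection is taken into account.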

Changes

Pass --sample-concurrency 6 in both eval workflows. Models and conditions stay serial, so this caps in-flight requests at 6, comfortably under the Workers AI 300 req/min text-generation limit.
Add a 120s per-request AbortSignal.timeout in src/eval/runners/openai.ts. A stall now surfaces as a caught error; the per-sample handler in live.ts writes an .error.txt and the run continues instead of hanging.
Refresh the stale default model from @cf/google/gemma-3-12b-it to @cf/google/gemma-4-26b-a4b-it; the old slug was removed from the Workers AI catalog. All nine model slugs used across both workflows were verified against the live catalog. The gemma-3 slug was the only dead one, and it was reachable only through the no-args default, not through the workflows themselves.
Correct stale call-count comments (the runs are 300 samples, not 600).
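The first two changes combine a bounded worker pool with a per-request timeout. The following is a minimal sketch of that pattern, not the code in src/eval/runners/openai.ts or live.ts: `withTimeout`, `runPool`, `Request`, and `Outcome` are hypothetical names, and the request function is injected so the sketch does not assume any particular endpoint.

```typescript
// Hypothetical sketch: none of these names are taken from the repository.
type Request<T> = (signal: AbortSignal) => Promise<T>;
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string };

/** Run one request; a stall becomes a caught error instead of a hang. */
async function withTimeout<T>(
  request: Request<T>,
  timeoutMs: number,
): Promise<Outcome<T>> {
  try {
    // AbortSignal.timeout() returns a signal that aborts after timeoutMs.
    const value = await request(AbortSignal.timeout(timeoutMs));
    return { ok: true, value };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

/** Run tasks with at most `limit` in flight; results keep input order. */
async function runPool<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage with fetch (url and prompts are placeholders):
// const tasks = prompts.map((body) => () =>
//   withTimeout((signal) => fetch(url, { method: "POST", body, signal }), 120_000));
// const outcomes = await runPool(tasks, 6);
```

Because `withTimeout` never rejects, one failed or stalled sample cannot take down the pool; the caller can inspect each `Outcome` and record the failures separately, which mirrors the behaviour described for the per-sample handler.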

Verification

Full test suite: 298 passed, 13 skipped. Pre-commit gate (tsc, tests, validate-tokens) green.
End-to-end run against Workers AI with the two slowest models, n=3, --sample-concurrency 6: completed in 108s with zero errors and all samples produced. Extrapolated to the full roster, the run is well inside the ceiling.
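A rough way to sanity-check that extrapolation is sketched below. It assumes the 30 samples in each model/condition cell run in waves of 6 and that per-request latency does not degrade under concurrency; neither assumption is stated in the PR, and `estimatedMinutes` is a hypothetical helper.

```typescript
// Back-of-envelope estimate under the assumptions named above; not measured data.
function estimatedMinutes(
  cells: number,
  samplesPerCell: number,
  concurrency: number,
  secondsPerRequest: number,
): number {
  // Cells (model x condition) stay serial; samples inside a cell run in waves.
  const waves = Math.ceil(samplesPerCell / concurrency);
  return (cells * waves * secondsPerRequest) / 60;
}

const cellCount = 5 * 2; // 5 models x 2 conditions
const fastEstimate = estimatedMinutes(cellCount, 30, 6, 15);
const slowEstimate = estimatedMinutes(cellCount, 30, 6, 26); // slowest measured model

console.log(`${fastEstimate.toFixed(1)}-${slowEstimate.toFixed(1)} min`);
// prints "12.5-21.7 min"
```

Under these assumptions the full roster lands between roughly 12 and 22 minutes when every request succeeds, which is consistent with the claim that the run fits inside the 60-minute weekly ceiling.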

The weekly and monthly eval jobs ran 300 requests strictly serially (--sample-concurrency defaults to 1), averaging ~15-25s each, so the run could never finish inside the 60-minute CI ceiling and was killed mid-run. Compounding it, the OpenAI-compatible runner issued fetch with no timeout, so a single stalled upstream connection hung the whole job silently until the kill.

- Pass --sample-concurrency 6 in both eval workflows (CF-only, safe)
- Add a 120s per-request AbortSignal timeout to the runner; a stall now surfaces as a caught error and the run continues
- Refresh the stale @cf/google/gemma-3-12b-it default to gemma-4 (removed from the CF catalog)

jasonodoom added 3 commits May 17, 2026 02:10

docs(artwork): add conference stickers

0908930

Merge branch 'main' into fix/eval-ci-timeout

0a96312

jasonodoom enabled auto-merge (squash) June 9, 2026 04:47

jasonodoom merged commit b680995 into main Jun 9, 2026
3 checks passed

jasonodoom deleted the fix/eval-ci-timeout branch June 9, 2026 04:47
