fix(eval): stop eval CI hanging past the timeout ceiling#4
Merged
Conversation
The weekly and monthly eval jobs ran 300 requests strictly serially (--sample-concurrency defaults to 1), averaging ~15-25s each, so the run could never finish inside the 60-minute CI ceiling and was killed mid-run. Compounding it, the OpenAI-compatible runner issued fetch with no timeout, so a single stalled upstream connection hung the whole job silently until the kill. - Pass --sample-concurrency 6 in both eval workflows (CF-only, safe) - Add a 120s per-request AbortSignal timeout to the runner; a stall now surfaces as a caught error and the run continues - Refresh the stale @cf/google/gemma-3-12b-it default to gemma-4 (removed from the CF catalog)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The weekly-eval and monthly-vision-eval jobs have been getting killed at their timeout (60 min weekly, 120 min monthly) rather than completing. Every recent scheduled run shows the same shape: the eval-live step starts, emits no output, and is cancelled at the exact ceiling.
Cause
Two compounding problems.
The run cannot fit in the window. eval-live generates 5 models x 2 conditions x 30 samples = 300 requests, and --sample-concurrency defaults to 1 (fully serial). Measured generation times against Cloudflare Workers AI are roughly 15-25s each (gpt-oss ~15s, qwen3 ~26s, llama-scout ~10s). 300 serial requests at that rate is 75-125 minutes, so the 60-minute weekly job can never finish.
The OpenAI-compatible runner issued fetch with no timeout and no AbortSignal. A single stalled upstream connection therefore blocked the whole run indefinitely, with no output, until the CI ceiling killed it. That is why the logs are silent for the full hour.
Changes
Verification