Add daily benchmark CI workflow #54
Draft
borisnieuwenhuis wants to merge 9 commits into
Daily GitHub Actions workflow to run all three benchmarks (standard, tournament, trading) with Claude Code CLI, configurable model, change detection, and Slack notifications.
…etup
- Rewrite agent prompts to enforce "implement first, deploy last" ordering
- Accept _predict() override pattern in verify checks (all 3 benchmarks)
- Fix Dockerfile marker matching to use content-based detection
- Add tournament workspace setup: local coordinator copy, Dockerfile patching, compose override, webapp clone, CRUNCH_ID isolation
- Use Pydantic model coercion in tournament scoring verification
- Search both crunch_config.py and scoring.py for type definitions
Summary
Runs all three benchmarks (standard, tournament, trading) daily in GitHub Actions with Claude Code CLI.
Changes
- .github/workflows/benchmark.yml: daily cron (2am UTC) + manual trigger with configurable model (haiku/sonnet/opus) and benchmark selection
- continue-on-error per benchmark
- benchmark-tournament, benchmark-trading, benchmark-all Makefile targets
Secrets needed
- ANTHROPIC_API_KEY (required)
- SLACK_WEBHOOK_URL (optional)
Test plan
- ANTHROPIC_API_KEY repository secret
- SLACK_WEBHOOK_URL and verify failure notification
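
For reviewers who want a quick picture of the workflow shape, here is a rough sketch of how the pieces listed under Changes could fit together. It is not the contents of benchmark.yml. The input names (model, benchmark), the sonnet default, the step ids, the MODEL environment variable, and the Slack message text are all assumptions made for illustration. The standard benchmark step is left out because its Makefile target name is not stated here.

```yaml
# Illustrative sketch only; names marked "assumed" do not come from the PR.
name: Daily benchmarks

on:
  schedule:
    - cron: "0 2 * * *"          # daily at 02:00 UTC
  workflow_dispatch:
    inputs:
      model:                      # assumed input name
        description: Claude model to run with
        type: choice
        options: [haiku, sonnet, opus]
        default: sonnet           # assumed default
      benchmark:                  # assumed input name
        description: Benchmark to run
        type: choice
        options: [all, tournament, trading]
        default: all

jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
      # inputs are empty on scheduled runs, so fall back to a default.
      # Assumes the Makefile reads MODEL from the environment.
      MODEL: ${{ inputs.model || 'sonnet' }}
    steps:
      - uses: actions/checkout@v4

      - name: Install Claude Code CLI
        run: npm install -g @anthropic-ai/claude-code

      - name: Tournament benchmark
        id: tournament
        if: ${{ github.event_name == 'schedule' || inputs.benchmark == 'all' || inputs.benchmark == 'tournament' }}
        continue-on-error: true   # a failure here must not block the next benchmark
        run: make benchmark-tournament

      - name: Trading benchmark
        id: trading
        if: ${{ github.event_name == 'schedule' || inputs.benchmark == 'all' || inputs.benchmark == 'trading' }}
        continue-on-error: true
        run: make benchmark-trading

      - name: Notify Slack on failure
        # secrets cannot be read in `if`, so test the env copy instead;
        # `outcome` holds the result before continue-on-error is applied.
        if: ${{ env.SLACK_WEBHOOK_URL != '' && (steps.tournament.outcome == 'failure' || steps.trading.outcome == 'failure') }}
        run: |
          curl -sS -X POST -H 'Content-Type: application/json' \
            --data '{"text":"Daily benchmark failed: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}' \
            "$SLACK_WEBHOOK_URL"
```

The notification step checks steps.<id>.outcome, not conclusion, because continue-on-error turns a failed step's conclusion into success. Guarding on the env value means the step is skipped when SLACK_WEBHOOK_URL is not configured, which matches the secret being optional.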