
Add daily benchmark CI workflow#54

Draft
borisnieuwenhuis wants to merge 9 commits into main from feature/benchmark-ci



@borisnieuwenhuis

Contributor

Summary

Runs all three benchmarks (standard, tournament, trading) daily in GitHub Actions with Claude Code CLI.

Changes

  • Add .github/workflows/benchmark.yml — daily cron (2am UTC) + manual trigger with configurable model (haiku/sonnet/opus) and benchmark selection
  • Change detection: skips scheduled runs when main has no new commits
  • Sequential execution with continue-on-error per benchmark
  • Results uploaded as artifacts (90-day retention)
  • Optional Slack webhook notification on failure
  • Add benchmark-tournament, benchmark-trading, benchmark-all Makefile targets
  • Document CI setup in architecture.md
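
The workflow itself lives in YAML, but the two policies listed above (skip scheduled runs when main has no new commits, and keep going when one benchmark fails) can be modelled in a few lines of Python. This is only a sketch of the logic, not the workflow's actual implementation: `should_run`, `run_sequentially`, `Result`, and the idea of comparing against a stored "last benchmarked" SHA are assumptions made for illustration. The event names `schedule` and `workflow_dispatch` are the standard GitHub Actions event names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def should_run(event: str, head_sha: str, last_run_sha: Optional[str]) -> bool:
    """Decide whether to run the benchmarks (hypothetical helper).

    Manual triggers always run. Scheduled runs are skipped when the head of
    main is the same commit that was benchmarked last time.
    """
    if event != "schedule":
        return True
    return head_sha != last_run_sha


@dataclass
class Result:
    name: str
    ok: bool
    error: str = ""


def run_sequentially(benchmarks: Dict[str, Callable[[], None]]) -> List[Result]:
    """Run benchmarks one after another, in insertion order.

    A failing benchmark is recorded and the loop continues, which mirrors
    what continue-on-error does for a workflow step.
    """
    results: List[Result] = []
    for name, run in benchmarks.items():
        try:
            run()
            results.append(Result(name, True))
        except Exception as exc:
            results.append(Result(name, False, str(exc)))
    return results
```

Under this model, a failure in one benchmark still leaves a result entry for each of the three, so whatever consumes the results can report exactly which ones failed.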

Secrets needed

  • ANTHROPIC_API_KEY (required)
  • SLACK_WEBHOOK_URL (optional)
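
Because SLACK_WEBHOOK_URL is optional, the failure notification has to become a no-op when the secret is missing. A minimal sketch of that behaviour, assuming a standard Slack incoming webhook (which accepts a JSON body with a `text` field); the function names, message wording, and `run_url` parameter are hypothetical and not taken from the workflow in this PR:

```python
import json
import os
import urllib.request
from typing import Dict, List, Optional


def build_failure_payload(failed: List[str], run_url: str) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming webhook (hypothetical wording)."""
    text = "Benchmarks failed: " + ", ".join(failed) + "\n" + run_url
    return {"text": text}


def notify_slack(failed: List[str], run_url: str,
                 webhook_url: Optional[str] = None) -> bool:
    """Post a failure message; return False without sending when there is
    nothing to report or no webhook is configured."""
    webhook_url = webhook_url or os.environ.get("SLACK_WEBHOOK_URL")
    if not failed or not webhook_url:
        return False
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(build_failure_payload(failed, run_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Keeping the payload builder separate from the network call makes the message format checkable without a real webhook.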

Test plan

  • Configure ANTHROPIC_API_KEY repository secret
  • Trigger manual run via Actions → Benchmarks → Run workflow
  • Verify Claude CLI installs and benchmarks execute
  • Verify artifacts are uploaded
  • Optionally configure SLACK_WEBHOOK_URL and verify failure notification
  • Verify scheduled run skips when no new commits

Daily GitHub Actions workflow to run all three benchmarks
(standard, tournament, trading) with Claude Code CLI,
configurable model, change detection, and Slack notifications.
…etup

- Rewrite agent prompts to enforce "implement first, deploy last" ordering
- Accept _predict() override pattern in verify checks (all 3 benchmarks)
- Fix Dockerfile marker matching to use content-based detection
- Add tournament workspace setup: local coordinator copy, Dockerfile patching,
  compose override, webapp clone, CRUNCH_ID isolation
- Use Pydantic model coercion in tournament scoring verification
- Search both crunch_config.py and scoring.py for type definitions