feat: Add CI integration for balance validation (Issue #68) #69

Merged: SorraTheOrc merged 7 commits into main from copilot/ci-integration-balance-validation on Dec 5, 2025

Conversation

Copilot AI commented Dec 5, 2025

  • Understand the requirements of issue #68, "11.5.1 CI Integration for Balance Validation (M11.5)"
  • Create GitHub Actions workflow .github/workflows/balance-validation.yml
    • Schedule nightly runs and manual triggers
    • Run on relevant file changes (simulation, content, AI)
    • Reduced parameter grid for CI speed
    • Baseline comparison and regression detection
    • Publish artifacts and PR comments
    • Configurable thresholds
  • Create CI sweep configuration content/config/ci_balance_sweeps.yml
  • Create initial baseline content/baselines/balance_baseline.json
  • Create baseline management script scripts/manage_balance_baseline.py
  • Add comprehensive tests tests/scripts/test_balance_validation.py (32 tests)
  • Run tests to verify implementation (32/32 passing)
  • Fix linting issues
  • Run code review
  • Address code review feedback
  • Run security scan (no issues found)
  • Fix dependency vulnerability (actions/download-artifact v4 -> v4.1.8)
  • Fix argument order for aggregate_sweep_results.py (--database must come before subcommand)
Original prompt

"Issue 68 - CI Integration for Balance Validation" to devops agent

Created from Copilot CLI via the copilot delegate command.



Copilot AI and others added 4 commits December 5, 2025 06:37
Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add CI integration for balance validation" to "feat: Add CI integration for balance validation (Issue #68)" on Dec 5, 2025
Copilot AI requested a review from SorraTheOrc December 5, 2025 06:49
@SorraTheOrc

Member

@copilot When running the baseline-validation workflow in this PR, the following error occurs:

Run python scripts/aggregate_sweep_results.py ingest build/ci_sweeps
usage: aggregate_sweep_results.py [-h] [--database DATABASE]
{ingest,query,stats,runs} ...
aggregate_sweep_results.py: error: unrecognized arguments: --database build/ci_sweep_results.db

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

Copilot AI commented Dec 5, 2025

Author

> @copilot When running the baseline-validation workflow in this PR, the following error occurs:
>
> Run python scripts/aggregate_sweep_results.py ingest build/ci_sweeps
> ...

Fixed in 507d0f5. The --database argument must come before the subcommand (ingest) because it's defined on the parent parser. Changed the order from:

aggregate_sweep_results.py ingest ... --database ...

to:

aggregate_sweep_results.py --database ... ingest ...
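
The ordering rule described above is standard argparse behaviour: an option registered on the parent parser is only recognised before the subcommand token. A minimal sketch, assuming a parser shaped like the usage line in the CI error; the `directory` positional and the default database path are hypothetical and not taken from aggregate_sweep_results.py:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser shaped like the usage line in the CI error (assumed layout)."""
    parser = argparse.ArgumentParser(prog="aggregate_sweep_results.py")
    # --database lives on the parent parser, so it must precede the subcommand.
    parser.add_argument("--database", default="sweep_results.db")  # hypothetical default
    subparsers = parser.add_subparsers(dest="command", required=True)
    ingest = subparsers.add_parser("ingest")
    ingest.add_argument("directory")  # hypothetical positional name
    return parser


def parses_ok(argv: list[str]) -> bool:
    """Return True if argv parses, False if argparse rejects it."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:  # argparse exits with status 2 on usage errors
        return False


if __name__ == "__main__":
    # Option before the subcommand: accepted.
    print(parses_ok(["--database", "build/ci_sweep_results.db", "ingest", "build/ci_sweeps"]))
    # Option after the subcommand: rejected as an unrecognized argument.
    print(parses_ok(["ingest", "build/ci_sweeps", "--database", "build/ci_sweep_results.db"]))
```

In the second call the `ingest` subparser receives `--database`, does not know it, and the top-level parser reports it as unrecognized, which matches the error in the workflow log.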

@SorraTheOrc SorraTheOrc marked this pull request as ready for review December 5, 2025 10:11
Copilot AI review requested due to automatic review settings December 5, 2025 10:11
@SorraTheOrc SorraTheOrc merged commit 2280fd4 into main Dec 5, 2025
8 checks passed
@SorraTheOrc SorraTheOrc deleted the copilot/ci-integration-balance-validation branch December 5, 2025 10:12

Copilot AI left a comment

Pull request overview

This PR implements CI integration for balance validation to detect game balance regressions through automated testing. The implementation adds a GitHub Actions workflow that runs nightly parameter sweeps, compares results against a baseline, and alerts on significant regressions in strategy stability or win rates.

Key changes include:

  • GitHub Actions workflow with scheduled runs, manual triggers, and automatic baseline updates
  • Python script for baseline management with compare, create, update, and visualization commands
  • Comprehensive test suite with 32 tests covering baseline management and regression detection
  • CI-optimized sweep configuration with reduced parameter grid for fast validation
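
The baseline comparison summarised above could look roughly like the sketch below. It is an illustration only: the per-strategy `avg_stability` layout, the function name `find_regressions`, the strategy names, and the default thresholds are all assumptions, not the code in scripts/manage_balance_baseline.py.

```python
from typing import Any


def find_regressions(
    baseline: dict[str, dict[str, Any]],
    current: dict[str, dict[str, Any]],
    warning_delta: float = 0.05,  # placeholder threshold
    failure_delta: float = 0.10,  # placeholder threshold
) -> list[dict[str, Any]]:
    """Flag strategies whose avg_stability dropped versus the baseline."""
    regressions = []
    for name, base_stats in baseline.items():
        if name not in current:
            continue  # strategies missing from the sweep are ignored in this sketch
        drop = base_stats["avg_stability"] - current[name]["avg_stability"]
        if drop >= failure_delta:
            severity = "failure"
        elif drop >= warning_delta:
            severity = "warning"
        else:
            continue  # within tolerance, or an improvement
        regressions.append({"strategy": name, "drop": drop, "severity": severity})
    return regressions


if __name__ == "__main__":
    # Made-up numbers for two hypothetical strategies.
    baseline = {"balanced": {"avg_stability": 0.75}, "aggressive": {"avg_stability": 0.5}}
    current = {"balanced": {"avg_stability": 0.5}, "aggressive": {"avg_stability": 0.5}}
    for item in find_regressions(baseline, current):
        print(item)
```

Separating the warning and failure thresholds lets a workflow comment on small drifts while failing only on larger drops.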

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| .github/workflows/balance-validation.yml | GitHub Actions workflow for automated balance validation with baseline comparison and PR comments |
| scripts/manage_balance_baseline.py | Baseline management script with CLI commands for comparing, updating, and visualizing balance data |
| tests/scripts/test_balance_validation.py | Comprehensive test suite covering baseline creation, regression detection, and threshold configuration |
| content/config/ci_balance_sweeps.yml | CI-optimized sweep configuration with reduced parameter grid (45 combinations) |
| content/baselines/balance_baseline.json | Initial placeholder baseline data to be replaced by first CI run |

Comment on lines +406 to +411
# Get threshold config from baseline or use defaults
thresholds = baseline.get("thresholds", {})
config = BaselineConfig.from_dict(thresholds)
# Override with CLI threshold if provided
config.stability_delta_warning = stability_threshold
config.stability_delta_failure = stability_threshold * 2

Copilot AI Dec 5, 2025

The comparison logic in compare_against_baseline overrides the threshold configuration from the baseline file (lines 409-411). This means that custom thresholds stored in the baseline's thresholds section will be ignored in favor of the CLI parameter. This could lead to unexpected behavior where a baseline has specific thresholds configured, but they get overridden every time.

Consider either:

  1. Only override if the CLI parameter differs from the default value
  2. Document this behavior clearly in the function docstring
  3. Add a warning when overriding baseline thresholds
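
One way to combine options 1 and 3 is to treat the CLI threshold as optional and override the baseline only when it was explicitly supplied. A minimal sketch under stated assumptions: `ThresholdConfig` is a hypothetical stand-in for `BaselineConfig` reduced to the two fields shown in the snippet, `resolve_thresholds` is an invented name, and the defaults are placeholders.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)

# Placeholder defaults; the real values are defined in manage_balance_baseline.py.
DEFAULT_WARNING = 0.05
DEFAULT_FAILURE = 0.10


@dataclass
class ThresholdConfig:
    """Hypothetical stand-in for BaselineConfig, reduced to two fields."""

    stability_delta_warning: float = DEFAULT_WARNING
    stability_delta_failure: float = DEFAULT_FAILURE

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ThresholdConfig":
        return cls(
            stability_delta_warning=float(data.get("stability_delta_warning", DEFAULT_WARNING)),
            stability_delta_failure=float(data.get("stability_delta_failure", DEFAULT_FAILURE)),
        )


def resolve_thresholds(
    baseline: dict[str, Any], cli_stability_threshold: Optional[float] = None
) -> ThresholdConfig:
    """Use the baseline's thresholds unless a CLI value was explicitly given."""
    config = ThresholdConfig.from_dict(baseline.get("thresholds", {}))
    if cli_stability_threshold is not None:
        logger.warning(
            "Overriding baseline stability thresholds with CLI value %s",
            cli_stability_threshold,
        )
        config.stability_delta_warning = cli_stability_threshold
        config.stability_delta_failure = cli_stability_threshold * 2
    return config
```

For this to work, the CLI option would need to default to `None` rather than to a number, so that "not provided" is distinguishable from "provided with the default value".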

Comment on lines +204 to +228
def compute_win_rate(strategy_stats: dict[str, Any]) -> float:
    """Compute win rate from strategy statistics.

    Win is defined as stability >= 0.5.

    Parameters
    ----------
    strategy_stats
        Statistics for a single strategy.

    Returns
    -------
    float
        Win rate as a fraction (0.0 to 1.0).
    """
    avg_stability = strategy_stats.get("avg_stability", 0.0)
    # Approximate win rate based on average stability
    # In actual sweep data, we'd count individual games
    if avg_stability >= 0.5:
        # Scale win rate based on how far above threshold
        return min(1.0, 0.5 + (avg_stability - 0.5))
    else:
        # Scale win rate based on how far below threshold
        return max(0.0, avg_stability)

Copilot AI Dec 5, 2025

The win rate calculation logic in compute_win_rate uses a simple approximation based on average stability, but this doesn't accurately reflect the actual win rate definition stated in the docstring ("Win is defined as stability >= 0.5"). The current implementation:

  • For stability >= 0.5: returns min(1.0, 0.5 + (avg_stability - 0.5)), which simplifies to avg_stability capped at 1.0 (0.5 for stability=0.5, 1.0 for stability=1.0)
  • For stability < 0.5: returns max(0.0, avg_stability), which gives values between 0 and 0.5

This approximation assumes a linear relationship between average stability and win rate, but if a strategy has high variance (some runs well above 0.5, some well below), the approximation will be inaccurate. Consider either:

  1. Documenting this as an approximation used when individual run data isn't available
  2. Storing actual win counts in the sweep summary instead of approximating
  3. Computing win rate from min/max stability bounds if available
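
To illustrate the variance concern, the sketch below compares the approximation with a counted win rate. The per-run stability list is made-up data, and `counted_win_rate` is a hypothetical helper that assumes individual run results are available; `approximate_win_rate` mirrors the reviewed logic up to floating-point rounding.

```python
from statistics import mean


def approximate_win_rate(avg_stability: float) -> float:
    """Approximation under review: effectively avg_stability clamped to [0, 1]."""
    return min(1.0, max(0.0, avg_stability))


def counted_win_rate(stabilities: list[float], threshold: float = 0.5) -> float:
    """Fraction of runs whose stability meets the win threshold."""
    if not stabilities:
        return 0.0
    wins = sum(1 for s in stabilities if s >= threshold)
    return wins / len(stabilities)


if __name__ == "__main__":
    runs = [0.5, 0.5, 0.5, 0.0]  # made-up per-run stabilities
    print(approximate_win_rate(mean(runs)))  # estimate from the average
    print(counted_win_rate(runs))            # wins counted per run
```

With these numbers three of four runs count as wins, yet the average stability sits below 0.5, so the two measures disagree noticeably.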

Comment on lines +460 to +462
sweep = load_sweep_summary(sweep_path)
if not sweep:
    raise ValueError(f"Could not load sweep data from {sweep_path}")

Copilot AI Dec 5, 2025

The error message "Could not load sweep data from {sweep_path}" could be more helpful by including the underlying reason for the failure. Currently, load_sweep_summary returns None if the file doesn't exist, but there could be other reasons (permission denied, invalid JSON, etc.). Consider catching the specific exception and including it in the error message for better debugging.

Suggested change
- sweep = load_sweep_summary(sweep_path)
- if not sweep:
-     raise ValueError(f"Could not load sweep data from {sweep_path}")
+ try:
+     sweep = load_sweep_summary(sweep_path)
+ except Exception as e:
+     raise ValueError(f"Could not load sweep data from {sweep_path}: {type(e).__name__}: {e}") from e
+ if not sweep:
+     if not sweep_path.exists():
+         reason = "file does not exist"
+     else:
+         reason = "unknown reason (file exists but could not be loaded)"
+     raise ValueError(f"Could not load sweep data from {sweep_path} ({reason})")
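
An alternative to inspecting the path after the fact is to have the loader itself raise with the underlying cause. The sketch below is hypothetical: `load_sweep_summary_strict` is not part of the PR, and it assumes the sweep summary is a JSON object stored in a single file.

```python
import json
from pathlib import Path
from typing import Any


def load_sweep_summary_strict(path: Path) -> dict[str, Any]:
    """Hypothetical loader that raises with the cause instead of returning None."""
    try:
        text = path.read_text(encoding="utf-8")
    except OSError as exc:  # covers a missing file, permission denied, etc.
        raise ValueError(f"Could not read sweep data from {path}: {exc}") from exc
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid JSON in sweep data {path}: {exc}") from exc
    if not isinstance(data, dict) or not data:
        raise ValueError(f"Sweep data in {path} is empty or not a JSON object")
    return data
```

Callers then need no separate `if not sweep` branch, and the message always names the actual failure.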

Comment on lines +516 to +520
sweep = load_sweep_summary(sweep_path)

if not sweep:
    raise ValueError(f"Could not load sweep data from {sweep_path}")

Copilot AI Dec 5, 2025

Same issue: the error message "Could not load sweep data from {sweep_path}" doesn't provide the underlying reason. This appears in the update_baseline function as well.

Suggested change
- sweep = load_sweep_summary(sweep_path)
- if not sweep:
-     raise ValueError(f"Could not load sweep data from {sweep_path}")
+ try:
+     sweep = load_sweep_summary(sweep_path)
+ except Exception as e:
+     raise ValueError(f"Could not load sweep data from {sweep_path}: {e}") from e
+ if not sweep:
+     if not Path(sweep_path).exists():
+         raise ValueError(f"Could not load sweep data from {sweep_path}: file does not exist")
+     else:
+         raise ValueError(f"Could not load sweep data from {sweep_path}: file is empty or invalid")

Comment on lines +799 to +803
def test_compute_win_rate_explicit_value(self) -> None:
    """Use explicit win_rate when available."""
    stats = {"avg_stability": 0.4, "win_rate": 0.8}
    # The compare function checks for explicit win_rate first
    # but compute_win_rate approximates from avg_stability

Copilot AI Dec 5, 2025

The test test_compute_win_rate_explicit_value has a misleading name and incorrect implementation. The test expects compute_win_rate to use the explicit win_rate value from the stats dictionary, but the actual implementation of compute_win_rate (lines 204-228 in manage_balance_baseline.py) only uses avg_stability and doesn't check for an explicit win_rate field. The comment on line 802 acknowledges this inconsistency. Either:

  1. Rename the test to reflect what it actually tests (approximation from avg_stability)
  2. Update compute_win_rate to use explicit win_rate when available
  3. Remove this test as it's testing incorrect behavior

The function compare_strategy_stats does check for explicit win_rate (lines 300-307), but compute_win_rate itself does not.

Suggested change
- def test_compute_win_rate_explicit_value(self) -> None:
-     """Use explicit win_rate when available."""
-     stats = {"avg_stability": 0.4, "win_rate": 0.8}
-     # The compare function checks for explicit win_rate first
-     # but compute_win_rate approximates from avg_stability
+ def test_compute_win_rate_ignores_explicit_value(self) -> None:
+     """compute_win_rate ignores explicit win_rate and uses avg_stability only."""
+     stats = {"avg_stability": 0.4, "win_rate": 0.8}
+     # Even though win_rate is present, compute_win_rate ignores it
+     # and approximates from avg_stability only.
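
Option 2 from the list above (preferring an explicit `win_rate`) could be sketched as follows. `win_rate_or_estimate` is a hypothetical name, the fallback is the clamped-average approximation, and the tests are written pytest-style as plain functions rather than as methods of the PR's test class.

```python
from typing import Any


def win_rate_or_estimate(strategy_stats: dict[str, Any]) -> float:
    """Prefer a recorded win_rate; otherwise estimate from avg_stability."""
    explicit = strategy_stats.get("win_rate")
    if explicit is not None:
        return float(explicit)
    avg_stability = float(strategy_stats.get("avg_stability", 0.0))
    return min(1.0, max(0.0, avg_stability))


def test_explicit_win_rate_takes_precedence() -> None:
    """An explicit win_rate wins over the avg_stability estimate."""
    assert win_rate_or_estimate({"avg_stability": 0.4, "win_rate": 0.8}) == 0.8


def test_falls_back_to_avg_stability() -> None:
    """Without win_rate, the estimate comes from avg_stability."""
    assert win_rate_or_estimate({"avg_stability": 0.4}) == 0.4
```

Unlike the test under review, each of these tests ends in an assertion, so a change in precedence would actually fail the suite.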

Comment on lines +117 to +120
REGRESSION_COUNT=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(len(d.get('regressions', [])))")
if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
  echo "regression_detected=true" >> $GITHUB_OUTPUT
  SUMMARY=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(d.get('summary', 'Regressions detected'))")

Copilot AI Dec 5, 2025

The Python inline commands used to extract JSON values in the bash script are brittle and hard to maintain. Consider using jq for JSON parsing in bash, which is more robust and readable. For example:

REGRESSION_COUNT=$(jq '.regressions | length' build/regression_summary.json)
SUMMARY=$(jq -r '.summary // "Regressions detected"' build/regression_summary.json)

Alternatively, create a small Python helper script that handles all the output formatting to keep the logic in Python rather than mixing languages.

Suggested change
- REGRESSION_COUNT=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(len(d.get('regressions', [])))")
- if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
-   echo "regression_detected=true" >> $GITHUB_OUTPUT
-   SUMMARY=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(d.get('summary', 'Regressions detected'))")
+ REGRESSION_COUNT=$(jq '.regressions | length' build/regression_summary.json)
+ if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
+   echo "regression_detected=true" >> $GITHUB_OUTPUT
+   SUMMARY=$(jq -r '.summary // "Regressions detected"' build/regression_summary.json)
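
The Python-helper alternative mentioned above might look like the sketch below. The function name `write_regression_outputs`, the `summary` output name, and the `false` value written when no regressions exist are assumptions; only `regression_detected=true` and the JSON keys appear in the reviewed snippet.

```python
import json
import os
from pathlib import Path


def write_regression_outputs(summary_path: Path, output_path: Path) -> int:
    """Append regression outputs in name=value form and return the regression count."""
    data = json.loads(summary_path.read_text(encoding="utf-8"))
    count = len(data.get("regressions", []))
    lines = [f"regression_detected={'true' if count > 0 else 'false'}"]
    if count > 0:
        # Keep the value on one line; multi-line values need the delimiter syntax.
        summary = str(data.get("summary", "Regressions detected")).replace("\n", " ")
        lines.append(f"summary={summary}")  # hypothetical output name
    with output_path.open("a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return count


if __name__ == "__main__":
    # GITHUB_OUTPUT is set by the Actions runner; fall back to a local file otherwise.
    target = Path(os.environ.get("GITHUB_OUTPUT", "github_output.txt"))
    write_regression_outputs(Path("build/regression_summary.json"), target)
```

Keeping the parsing in one script means the JSON file is read once and the logic can be unit-tested outside the workflow.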

SorraTheOrc added a commit that referenced this pull request Dec 5, 2025
- Updated task 11.5.1 status from not-started to COMPLETED
- PR #69 merged on 2025-12-05 with CI balance validation workflow
- Added .github/workflows/balance-validation.yml (285 lines)
- Added scripts/manage_balance_baseline.py (948 lines)
- Added tests/scripts/test_balance_validation.py (803 lines)
- Baseline storage and versioning system implemented
- Regression detection with configurable thresholds
- Phase 11 progress: 4/6 tasks complete (67% complete)
- Core balance workflow complete; 11.4.1 and 11.6.1 remain as optional
- Updated next steps: prioritization decision needed for Phase 11 completion vs. Phase 12 start