feat: Add CI integration for balance validation (Issue #68) by Copilot · Pull Request #69 · TheWizardsCode/GEngine

Copilot · 2025-12-05T06:23:35Z

Original prompt

"Issue 68 - CI Integration for Balance Validation" to devops agent

Created from Copilot CLI via the copilot delegate command.


Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

SorraTheOrc · 2025-12-05T09:42:25Z

@copilot When running the baseline-validation workflow in this PR, the following error occurs:

Run python scripts/aggregate_sweep_results.py ingest build/ci_sweeps
usage: aggregate_sweep_results.py [-h] [--database DATABASE]
{ingest,query,stats,runs} ...
aggregate_sweep_results.py: error: unrecognized arguments: --database build/ci_sweep_results.db

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

Copilot · 2025-12-05T09:44:19Z

@copilot When running the baseline-validation workflow in this PR, the following error occurs:

Run python scripts/aggregate_sweep_results.py ingest build/ci_sweeps
...

Fixed in 507d0f5. The --database argument must come before the subcommand (ingest) because it's defined on the parent parser. Changed the order from:

aggregate_sweep_results.py ingest ... --database ...

to:

aggregate_sweep_results.py --database ... ingest ...
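The ordering rule can be reproduced with a minimal `argparse` sketch. The parser below is an assumption modeled on the usage line in the error output, not the actual contents of `aggregate_sweep_results.py`; the `path` positional, the default database name, and the `try_parse` helper are hypothetical.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where --database lives on the parent, not the subcommand."""
    parser = argparse.ArgumentParser(prog="aggregate_sweep_results.py")
    # Defined on the parent parser, so it is only recognized BEFORE the subcommand.
    parser.add_argument("--database", default="sweep_results.db")  # default is assumed
    subparsers = parser.add_subparsers(dest="command", required=True)
    ingest = subparsers.add_parser("ingest")
    ingest.add_argument("path")  # hypothetical positional for the sweep directory
    return parser


def try_parse(argv: Sequence[str]) -> Optional[argparse.Namespace]:
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        return build_parser().parse_args(list(argv))
    except SystemExit:
        # argparse prints the usage message to stderr and exits on a parse error.
        return None


# Accepted: the parent option precedes the subcommand.
accepted = try_parse(["--database", "build/ci_sweep_results.db", "ingest", "build/ci_sweeps"])

# Rejected: once "ingest" is seen, the remaining arguments go to the sub-parser,
# which does not define --database.
rejected = try_parse(["ingest", "build/ci_sweeps", "--database", "build/ci_sweep_results.db"])
```

If the option should be accepted in either position, one alternative is to also declare it on each sub-parser, at the cost of some duplication.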

Copilot

Pull request overview

This PR implements CI integration for balance validation to detect game balance regressions through automated testing. The implementation adds a GitHub Actions workflow that runs nightly parameter sweeps, compares results against a baseline, and alerts on significant regressions in strategy stability or win rates.

Key changes include:

- GitHub Actions workflow with scheduled runs, manual triggers, and automatic baseline updates
- Python script for baseline management with compare, create, update, and visualization commands
- Comprehensive test suite with 32 tests covering baseline management and regression detection
- CI-optimized sweep configuration with reduced parameter grid for fast validation

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| `.github/workflows/balance-validation.yml` | GitHub Actions workflow for automated balance validation with baseline comparison and PR comments |
| `scripts/manage_balance_baseline.py` | Baseline management script with CLI commands for comparing, updating, and visualizing balance data |
| `tests/scripts/test_balance_validation.py` | Comprehensive test suite covering baseline creation, regression detection, and threshold configuration |
| `content/config/ci_balance_sweeps.yml` | CI-optimized sweep configuration with reduced parameter grid (45 combinations) |
| `content/baselines/balance_baseline.json` | Initial placeholder baseline data to be replaced by first CI run |


Copilot · 2025-12-05T10:16:14Z

+    # Get threshold config from baseline or use defaults
+    thresholds = baseline.get("thresholds", {})
+    config = BaselineConfig.from_dict(thresholds)
+    # Override with CLI threshold if provided
+    config.stability_delta_warning = stability_threshold
+    config.stability_delta_failure = stability_threshold * 2


The comparison logic in compare_against_baseline overrides the threshold configuration from the baseline file (lines 409-411). This means that custom thresholds stored in the baseline's thresholds section will be ignored in favor of the CLI parameter. This could lead to unexpected behavior where a baseline has specific thresholds configured, but they get overridden every time.

Consider either:

- Only override if the CLI parameter differs from the default value
- Document this behavior clearly in the function docstring
- Add a warning when overriding baseline thresholds
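One way to make the override explicit is to treat the CLI value as optional and apply it only when the caller actually supplied one. The sketch below is a minimal illustration under assumptions: `BaselineConfig` here is a stand-in with assumed default values, and `resolve_thresholds` is a hypothetical helper, not a function from the PR.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BaselineConfig:
    """Hypothetical stand-in for the PR's BaselineConfig; defaults are assumed."""

    stability_delta_warning: float = 5.0
    stability_delta_failure: float = 10.0

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "BaselineConfig":
        # Ignore unknown keys so older baseline files still load.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)


def resolve_thresholds(
    baseline: dict[str, Any], stability_threshold: Optional[float] = None
) -> BaselineConfig:
    """Prefer thresholds stored in the baseline; override only on explicit request."""
    config = BaselineConfig.from_dict(baseline.get("thresholds", {}))
    if stability_threshold is not None:
        # Same 2x relationship between warning and failure as the reviewed code.
        config.stability_delta_warning = stability_threshold
        config.stability_delta_failure = stability_threshold * 2
    return config
```

With this shape the CLI option would default to `None` rather than a number, so a baseline that stores its own thresholds keeps them unless the user passes a value on the command line.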

Copilot · 2025-12-05T10:16:14Z

+def compute_win_rate(strategy_stats: dict[str, Any]) -> float:
+    """Compute win rate from strategy statistics.
+
+    Win is defined as stability >= 0.5.
+
+    Parameters
+    ----------
+    strategy_stats
+        Statistics for a single strategy.
+
+    Returns
+    -------
+    float
+        Win rate as a fraction (0.0 to 1.0).
+    """
+    avg_stability = strategy_stats.get("avg_stability", 0.0)
+    # Approximate win rate based on average stability
+    # In actual sweep data, we'd count individual games
+    if avg_stability >= 0.5:
+        # Scale win rate based on how far above threshold
+        return min(1.0, 0.5 + (avg_stability - 0.5))
+    else:
+        # Scale win rate based on how far below threshold
+        return max(0.0, avg_stability)
+


The win rate calculation logic in compute_win_rate uses a simple approximation based on average stability, but this doesn't accurately reflect the actual win rate definition stated in the docstring ("Win is defined as stability >= 0.5"). The current implementation:

- For stability >= 0.5: returns 0.5 + (avg_stability - 0.5), which gives 0.5 for stability=0.5, 1.0 for stability=1.0
- For stability < 0.5: returns avg_stability, which gives values between 0 and 0.5

This approximation assumes a linear relationship between average stability and win rate, but if a strategy has high variance (some runs well above 0.5, some well below), the approximation will be inaccurate. Consider either:

- Documenting this as an approximation used when individual run data isn't available
- Storing actual win counts in the sweep summary instead of approximating
- Computing win rate from min/max stability bounds if available
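The gap between the approximation and the stated definition is easiest to see on a high-variance example. In the sketch below, `approximate_win_rate` mirrors the reviewed branches (algebraically, both reduce to clamping `avg_stability` to the range 0 to 1), while `exact_win_rate` is a hypothetical helper that counts individual runs; the per-run stability list is illustrative data, not output from a real sweep.

```python
from typing import Sequence

WIN_THRESHOLD = 0.5  # "Win is defined as stability >= 0.5"


def approximate_win_rate(avg_stability: float) -> float:
    """Mirror of the reviewed approximation based on average stability."""
    if avg_stability >= WIN_THRESHOLD:
        return min(1.0, 0.5 + (avg_stability - 0.5))
    return max(0.0, avg_stability)


def exact_win_rate(stabilities: Sequence[float]) -> float:
    """Fraction of individual runs whose stability meets the win threshold."""
    if not stabilities:
        return 0.0
    wins = sum(1 for s in stabilities if s >= WIN_THRESHOLD)
    return wins / len(stabilities)


# Illustrative runs: three wins and one collapse.
runs = [0.6, 0.6, 0.6, 0.0]
avg = sum(runs) / len(runs)

approx = approximate_win_rate(avg)  # about 0.45, because the average is below 0.5
exact = exact_win_rate(runs)        # 0.75, because three of four runs win
```

Storing the win count (or the per-run stabilities) in the sweep summary would let the comparison use the exact value and fall back to the approximation only for older data.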

Copilot · 2025-12-05T10:16:14Z

+    sweep = load_sweep_summary(sweep_path)
+    if not sweep:
+        raise ValueError(f"Could not load sweep data from {sweep_path}")


The error message "Could not load sweep data from {sweep_path}" could be more helpful by including the underlying reason for the failure. Currently, load_sweep_summary returns None if the file doesn't exist, but there could be other reasons (permission denied, invalid JSON, etc.). Consider catching the specific exception and including it in the error message for better debugging.

Suggested change

-    sweep = load_sweep_summary(sweep_path)
-    if not sweep:
-        raise ValueError(f"Could not load sweep data from {sweep_path}")
+    try:
+        sweep = load_sweep_summary(sweep_path)
+    except Exception as e:
+        raise ValueError(f"Could not load sweep data from {sweep_path}: {type(e).__name__}: {e}") from e
+    if not sweep:
+        if not sweep_path.exists():
+            reason = "file does not exist"
+        else:
+            reason = "unknown reason (file exists but could not be loaded)"
+        raise ValueError(f"Could not load sweep data from {sweep_path} ({reason})")

Copilot · 2025-12-05T10:16:15Z

+    sweep = load_sweep_summary(sweep_path)
+
+    if not sweep:
+        raise ValueError(f"Could not load sweep data from {sweep_path}")
+


Same issue: the error message "Could not load sweep data from {sweep_path}" doesn't provide the underlying reason. This appears in the update_baseline function as well.

Suggested change

-    sweep = load_sweep_summary(sweep_path)
-
-    if not sweep:
-        raise ValueError(f"Could not load sweep data from {sweep_path}")
+    try:
+        sweep = load_sweep_summary(sweep_path)
+    except Exception as e:
+        raise ValueError(f"Could not load sweep data from {sweep_path}: {e}") from e
+    if not sweep:
+        if not Path(sweep_path).exists():
+            raise ValueError(f"Could not load sweep data from {sweep_path}: file does not exist")
+        else:
+            raise ValueError(f"Could not load sweep data from {sweep_path}: file is empty or invalid")
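Because the same check appears at two call sites, the error handling could live in one shared helper. The sketch below is an assumption-heavy illustration: `load_sweep_or_raise` is a hypothetical name, and it reads the summary as a JSON file directly because the internals of `load_sweep_summary` are not shown in this PR.

```python
import json
from pathlib import Path
from typing import Any, Union


def load_sweep_or_raise(sweep_path: Union[str, Path]) -> dict[str, Any]:
    """Load a sweep summary, raising ValueError with the specific failure reason."""
    path = Path(sweep_path)  # accept both str and Path callers
    if not path.exists():
        raise ValueError(f"Could not load sweep data from {path}: file does not exist")
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as e:
        # Covers permission errors and malformed or empty JSON files.
        raise ValueError(
            f"Could not load sweep data from {path}: {type(e).__name__}: {e}"
        ) from e
    if not data:
        raise ValueError(f"Could not load sweep data from {path}: summary is empty")
    return data
```

Both functions could then call the helper and drop their own `if not sweep` checks, which keeps the two error messages from drifting apart.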

Copilot · 2025-12-05T10:16:15Z

+    def test_compute_win_rate_explicit_value(self) -> None:
+        """Use explicit win_rate when available."""
+        stats = {"avg_stability": 0.4, "win_rate": 0.8}
+        # The compare function checks for explicit win_rate first
+        # but compute_win_rate approximates from avg_stability


The test test_compute_win_rate_explicit_value has a misleading name and incorrect implementation. The test expects compute_win_rate to use the explicit win_rate value from the stats dictionary, but the actual implementation of compute_win_rate (lines 204-228 in manage_balance_baseline.py) only uses avg_stability and doesn't check for an explicit win_rate field. The comment on line 802 acknowledges this inconsistency. Either:

- Rename the test to reflect what it actually tests (approximation from avg_stability)
- Update compute_win_rate to use explicit win_rate when available
- Remove this test as it's testing incorrect behavior

The function compare_strategy_stats does check for explicit win_rate (lines 300-307), but compute_win_rate itself does not.

Suggested change

-    def test_compute_win_rate_explicit_value(self) -> None:
-        """Use explicit win_rate when available."""
-        stats = {"avg_stability": 0.4, "win_rate": 0.8}
-        # The compare function checks for explicit win_rate first
-        # but compute_win_rate approximates from avg_stability
+    def test_compute_win_rate_ignores_explicit_value(self) -> None:
+        """compute_win_rate ignores explicit win_rate and uses avg_stability only."""
+        stats = {"avg_stability": 0.4, "win_rate": 0.8}
+        # Even though win_rate is present, compute_win_rate ignores it
+        # and approximates from avg_stability only.
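The second option in the review, having `compute_win_rate` honor an explicit `win_rate`, might look like the sketch below. This is not the merged implementation: clamping the explicit value to the range 0 to 1 is an added assumption, and the fallback clamps `avg_stability`, which is algebraically what the reviewed branches compute.

```python
from typing import Any


def compute_win_rate(strategy_stats: dict[str, Any]) -> float:
    """Return the explicit win_rate when present, else approximate from avg_stability."""
    explicit = strategy_stats.get("win_rate")
    if explicit is not None:
        # Assumption: guard against out-of-range values in hand-edited baselines.
        return min(1.0, max(0.0, float(explicit)))
    avg_stability = float(strategy_stats.get("avg_stability", 0.0))
    return min(1.0, max(0.0, avg_stability))


# With the explicit field present, the original test name would describe the behavior.
stats = {"avg_stability": 0.4, "win_rate": 0.8}
result = compute_win_rate(stats)  # uses the explicit 0.8 rather than the 0.4 average
```

Under this variant the original test name `test_compute_win_rate_explicit_value` would be accurate, and the test could assert the explicit value directly.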

Copilot · 2025-12-05T10:16:16Z

+            REGRESSION_COUNT=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(len(d.get('regressions', [])))")
+            if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
+              echo "regression_detected=true" >> $GITHUB_OUTPUT
+              SUMMARY=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(d.get('summary', 'Regressions detected'))")


The Python inline commands used to extract JSON values in the bash script are brittle and hard to maintain. Consider using jq for JSON parsing in bash, which is more robust and readable. For example:

REGRESSION_COUNT=$(jq '.regressions | length' build/regression_summary.json)
SUMMARY=$(jq -r '.summary // "Regressions detected"' build/regression_summary.json)

Alternatively, create a small Python helper script that handles all the output formatting to keep the logic in Python rather than mixing languages.

Suggested change

-            REGRESSION_COUNT=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(len(d.get('regressions', [])))")
-            if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
-              echo "regression_detected=true" >> $GITHUB_OUTPUT
-              SUMMARY=$(python -c "import json; d=json.load(open('build/regression_summary.json')); print(d.get('summary', 'Regressions detected'))")
+            REGRESSION_COUNT=$(jq '.regressions | length' build/regression_summary.json)
+            if [[ "$REGRESSION_COUNT" -gt 0 ]]; then
+              echo "regression_detected=true" >> $GITHUB_OUTPUT
+              SUMMARY=$(jq -r '.summary // "Regressions detected"' build/regression_summary.json)
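The alternative mentioned in this comment, keeping the logic in a small Python helper, could be sketched as follows. The function name, the `summary` output key, and writing `regression_detected=false` when nothing is found are assumptions for illustration; the hunk above only shows `regression_detected=true` being written.

```python
import json
from pathlib import Path
from typing import Union


def write_regression_outputs(
    summary_path: Union[str, Path], output_path: Union[str, Path]
) -> int:
    """Append step outputs for GitHub Actions and return the regression count."""
    data = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    count = len(data.get("regressions", []))

    lines = [f"regression_detected={'true' if count > 0 else 'false'}"]
    if count > 0:
        # Collapse whitespace so the value fits the single-line name=value form.
        summary = " ".join(str(data.get("summary", "Regressions detected")).split())
        lines.append(f"summary={summary}")  # output name is hypothetical

    # Step outputs are appended to the file named by the GITHUB_OUTPUT variable.
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return count
```

In the workflow step, the helper would be called with `build/regression_summary.json` and the path from the `GITHUB_OUTPUT` environment variable, replacing both inline `python -c` commands.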

- Updated task 11.5.1 status from not-started to COMPLETED
- PR #69 merged on 2025-12-05 with CI balance validation workflow
- Added .github/workflows/balance-validation.yml (285 lines)
- Added scripts/manage_balance_baseline.py (948 lines)
- Added tests/scripts/test_balance_validation.py (803 lines)
- Baseline storage and versioning system implemented
- Regression detection with configurable thresholds
- Phase 11 progress: 4/6 tasks complete (67% complete)
- Core balance workflow complete; 11.4.1 and 11.6.1 remain as optional
- Updated next steps: prioritization decision needed for Phase 11 completion vs. Phase 12 start

Initial plan

71e4c79

Copilot AI assigned Copilot and SorraTheOrc Dec 5, 2025

Copilot started work on behalf of SorraTheOrc December 5, 2025 06:24 View session

Copilot AI and others added 4 commits December 5, 2025 06:37

feat: Add CI integration for balance validation (Issue #68)

3f18f70

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

fix: Address linting issues in balance validation

1ecee58

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

fix: Address code review feedback for balance validation

bc519ab

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

fix: Update download-artifact to v4.1.8 to address CVE vulnerability

b3e931c

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add CI integration for balance validation~~ feat: Add CI integration for balance validation (Issue #68) Dec 5, 2025

Copilot AI requested a review from SorraTheOrc December 5, 2025 06:49

Copilot finished work on behalf of SorraTheOrc December 5, 2025 06:49

Copilot started work on behalf of SorraTheOrc December 5, 2025 09:42 View session

fix: Correct argument order for aggregate_sweep_results.py in workflow

507d0f5

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>

Copilot finished work on behalf of SorraTheOrc December 5, 2025 09:45

fix report generation command

d306d2e

SorraTheOrc marked this pull request as ready for review December 5, 2025 10:11

Copilot AI review requested due to automatic review settings December 5, 2025 10:11

Copilot started reviewing on behalf of SorraTheOrc December 5, 2025 10:11 View session

SorraTheOrc merged commit 2280fd4 into main Dec 5, 2025
8 checks passed

SorraTheOrc deleted the copilot/ci-integration-balance-validation branch December 5, 2025 10:12

Copilot finished reviewing on behalf of SorraTheOrc December 5, 2025 10:13

SorraTheOrc merged 7 commits into main from copilot/ci-integration-balance-validation

3 participants
