feat(ruleset_strategy): add round_score "rank" (per-episode placement scoring)#48

Merged
daveey merged 1 commit into main from daveey/round-score-rank-by-episode
Jun 22, 2026

@daveey daveey commented Jun 22, 2026

Contributor

What

Adds an opt-in rank-by-episode round scoring mode to the ruleset_strategy commissioner. Requested for the agricogla league (the metta-side config + image bump is a follow-up).

scoring.round_score: "rank"

ScoringConfig.round_score previously only accepted "mean" (round score = mean of a policy's per-episode scores). Adds "rank":

  • Within each episode, policies are ranked by score and earn N..1 rank points — the winner of an N-policy episode gets N, last gets 1, ties share the better placement (computed as N - (#strictly-higher)).
  • A policy's round score is the mean of its per-episode rank points across the episodes it played. Margins of victory are discarded — only placement each game matters.
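A minimal sketch of the rule above, not the code in this PR: `episode_rank_points` and `round_rank_scores` are hypothetical helper names, and each episode is assumed to be a plain mapping of policy name to episode score.

```python
from collections import defaultdict
from statistics import mean


def episode_rank_points(scores: dict[str, float]) -> dict[str, int]:
    """Rank points for one episode: N - (number of strictly higher scores)."""
    n = len(scores)
    return {
        policy: n - sum(1 for other in scores.values() if other > score)
        for policy, score in scores.items()
    }


def round_rank_scores(episodes: list[dict[str, float]]) -> dict[str, float]:
    """Mean rank points per policy, over only the episodes that policy played."""
    points: dict[str, list[int]] = defaultdict(list)
    for episode in episodes:
        for policy, pts in episode_rank_points(episode).items():
            points[policy].append(pts)
    return {policy: float(mean(pts)) for policy, pts in points.items()}


episodes = [
    {"a": 10.0, "b": 3.0, "c": 3.0},  # a earns 3; b and c tie and earn 2 each
    {"a": 7.0, "b": 5.0},             # a earns 2; b earns 1
]
print(round_rank_scores(episodes))  # {'a': 2.5, 'b': 1.5, 'c': 2.0}
```

Because ties share the better placement, a tie for last place in an N-policy episode earns more than 1 point: in the first episode above nobody receives 1.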

How

  • complete_round now delegates per-policy round scoring to an overridable _round_scores_by_policy(entries, episode_results) -> (scores, ranked_counts). The base BaselineCommissioner keeps mean scoring; RulesetStrategyCommissioner switches to rank points when scoring.round_score == "rank". The ranking/metadata assembly is unchanged.
  • Rank rounds are tagged with a distinct score_kind (rank_episode_round_score) via RankingConfig.result_metadata/filter_metadata, so switching a league from mean to rank filters the now-incomparable prior-regime round results off the commissioner leaderboard instead of blending two score scales.
  • scoring_mechanics describes the rank scheme for the division description.

Default stays "mean", so every existing config behaves exactly as before.
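The shape of the change can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `BaselineSketch` and `RulesetStrategySketch` are stand-ins for the real commissioner classes, episode results are assumed to be policy-to-score mappings, `ranked_counts` is assumed to mean "episodes counted per policy", and the `"mean_round_score"` tag and `comparable_rounds` helper are hypothetical. Only `_round_scores_by_policy`, `complete_round`, and `rank_episode_round_score` come from the description above.

```python
from statistics import mean


def _mean_per_policy(entries, per_episode_points):
    """Average each entered policy's points over the episodes it appears in."""
    collected = {policy: [] for policy in entries}
    for points in per_episode_points:
        for policy, value in points.items():
            if policy in collected:
                collected[policy].append(value)
    scores = {p: float(mean(v)) for p, v in collected.items() if v}
    ranked_counts = {p: len(v) for p, v in collected.items() if v}  # assumed meaning
    return scores, ranked_counts


class BaselineSketch:
    score_kind = "mean_round_score"  # hypothetical tag name for mean rounds

    def _round_scores_by_policy(self, entries, episode_results):
        return _mean_per_policy(entries, episode_results)

    def complete_round(self, entries, episode_results):
        # Scoring is delegated to the hook; ranking/metadata assembly stays shared.
        scores, ranked_counts = self._round_scores_by_policy(entries, episode_results)
        return {
            "ranking": sorted(scores, key=scores.get, reverse=True),
            "scores": scores,
            "ranked_counts": ranked_counts,
            "metadata": {"score_kind": self.score_kind},
        }


class RulesetStrategySketch(BaselineSketch):
    def __init__(self, round_score="mean"):
        self.round_score = round_score
        if round_score == "rank":
            self.score_kind = "rank_episode_round_score"

    def _round_scores_by_policy(self, entries, episode_results):
        if self.round_score != "rank":
            return super()._round_scores_by_policy(entries, episode_results)
        # Rank points per episode: N - (number of strictly higher scores).
        rank_points = [
            {p: len(ep) - sum(1 for s in ep.values() if s > score) for p, score in ep.items()}
            for ep in episode_results
        ]
        return _mean_per_policy(entries, rank_points)


def comparable_rounds(round_results, score_kind):
    """Keep only round results produced under the given scoring regime."""
    return [r for r in round_results if r["metadata"].get("score_kind") == score_kind]


entries = ["a", "b"]
episodes = [{"a": 10.0, "b": 1.0}, {"a": 2.0, "b": 3.0}]
old = BaselineSketch().complete_round(entries, episodes)
new = RulesetStrategySketch("rank").complete_round(entries, episodes)
print(old["scores"], new["scores"])  # {'a': 6.0, 'b': 2.0} {'a': 1.5, 'b': 1.5}
print(len(comparable_rounds([old, new], "rank_episode_round_score")))  # 1
```

The same two episodes give 6.0 vs 2.0 under mean scoring and 1.5 vs 1.5 under rank scoring, which is why results from the two regimes are filtered apart by tag rather than combined.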

Tests

  • New test_ruleset_strategy_rank_round_score_uses_per_episode_placement: same episode inputs as the existing mean test, asserts round scores become the mean per-episode rank points and the rank score_kind tag is applied.
  • Full suite: 91 passing. Changed files are ruff-clean (two pre-existing unused-import warnings in utils.py are untouched).

Follow-up (not in this PR)

  • metta: set agricogla's agricogla-commissioner.yaml to scoring: {round_score: rank} and bump the commissioners-default image digest once this is merged + the image is rebuilt/published.
  • The app-backend leaderboard needs no change — it aggregates the commissioner's round score, which becomes mean per-episode rank points.

🤖 Generated with Claude Code

… scoring)

The ruleset_strategy commissioner scored every round by the mean of each policy's
per-episode scores. Add an opt-in `scoring.round_score: "rank"` mode: within each
episode policies are ranked by score and earn N..1 rank points (winner of an
N-policy episode gets N, last gets 1, ties share the better place), and a policy's
round score is the mean of those rank points across the episodes it played. Margins
of victory are discarded — only placement matters.

complete_round now delegates per-policy round scoring to an overridable
_round_scores_by_policy; the base keeps mean scoring, RulesetStrategyCommissioner
switches to rank points when configured. Rank rounds are tagged with a distinct
score_kind so switching a league from mean to rank filters the now-incomparable
prior-regime results off the leaderboard instead of blending score scales. Default
stays "mean", so existing configs are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
"Rounds rank policies by placement within each episode rather than by raw score: in an episode with N "
"policies the highest-scoring policy earns N points and the lowest earns 1 (ties share the better place), and "
"a policy's round score is the average of those rank points across the episodes it played. Margins of victory "
"are discarded — only who beat whom each game matters. The division leaderboard combines completed rounds with "

@KyleHerndon note that right now, with division leaderboard computation and commissioner leaderboard computation split the way we do, and with our UI only reflecting commissioner-reported description, we force the commissioner to abstraction-leak by describing how its roundresults get managed by app-backend

Sorry, more plainly: ideally we wouldn't need commissioners to say that their roundresults get 2h-ewma'd; they shouldn't need to know about or speak about it, and can't enforce it.

@daveey daveey merged commit 3db3b57 into main Jun 22, 2026
7 checks passed
@daveey daveey deleted the daveey/round-score-rank-by-episode branch June 22, 2026 20:10
2 participants