feat(ruleset_strategy): add round_score "rank" (per-episode placement scoring) by daveey · Pull Request #48 · Metta-AI/commissioners

daveey · 2026-06-22T19:52:27Z

What

Adds an opt-in rank-by-episode round scoring mode to the ruleset_strategy commissioner. Requested for the agricogla league (the metta-side config + image bump is a follow-up).

`scoring.round_score: "rank"`

ScoringConfig.round_score previously only accepted "mean" (round score = mean of a policy's per-episode scores). Adds "rank":

- Within each episode, policies are ranked by score and earn N..1 rank points: the winner of an N-policy episode gets N, last gets 1, and ties share the better placement (computed as N - (#strictly-higher)).
- A policy's round score is the mean of its per-episode rank points across the episodes it played. Margins of victory are discarded; only placement in each game matters.
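
To make the arithmetic concrete, here is a minimal sketch of the scheme described above. The function names and the `{policy_name: score}` episode shape are assumptions for illustration, not the code in this PR.

```python
from collections import defaultdict
from statistics import fmean


def episode_rank_points(scores: dict[str, float]) -> dict[str, int]:
    """Rank points for one episode: N - (number of strictly higher scores)."""
    n = len(scores)
    return {
        policy: n - sum(1 for other in scores.values() if other > score)
        for policy, score in scores.items()
    }


def round_rank_scores(episodes: list[dict[str, float]]) -> dict[str, float]:
    """Mean rank points per policy over the episodes that policy played."""
    points: dict[str, list[int]] = defaultdict(list)
    for scores in episodes:
        for policy, pts in episode_rank_points(scores).items():
            points[policy].append(pts)
    return {policy: fmean(pts) for policy, pts in points.items()}


episodes = [
    {"a": 10.0, "b": 3.0, "c": 3.0},  # a wins; b and c tie and share 2 points
    {"a": 2.0, "b": 1.0, "c": 50.0},  # c wins by a wide margin, still worth only 3
]
print(round_rank_scores(episodes))  # {'a': 2.5, 'b': 1.5, 'c': 2.5}
```

In this example `c` wins the second episode by 48 points but ends the round level with `a`, because only placement counts.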

How

- `complete_round` now delegates per-policy round scoring to an overridable `_round_scores_by_policy(entries, episode_results) -> (scores, ranked_counts)`. The base `BaselineCommissioner` keeps mean scoring; `RulesetStrategyCommissioner` switches to rank points when `scoring.round_score == "rank"`. The ranking/metadata assembly is unchanged.
- Rank rounds are tagged with a distinct `score_kind` (`rank_episode_round_score`) via `RankingConfig.result_metadata`/`filter_metadata`, so switching a league from mean → rank filters the now-incomparable prior-regime round results off the commissioner leaderboard instead of blending two score scales.
- `scoring_mechanics` describes the rank scheme for the division description.
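
The shape of that hook and the `score_kind` filter could look roughly like the sketch below. Only the class names, the hook name, and the `rank_episode_round_score` tag come from this PR; the data shapes, the constructor argument, the result dict layout, the mean-regime tag, and the `comparable_results` helper are hypothetical stand-ins.

```python
from statistics import fmean

# Assumed shapes (not shown in this PR): `entries` is a list of policy names
# and each episode result is a {policy_name: score} dict.


class BaselineCommissioner:
    # Placeholder tag for the mean regime; the real value is not given here.
    score_kind = "mean_round_score_placeholder"

    def _round_scores_by_policy(self, entries, episode_results):
        """Return (mean score per policy, episodes counted per policy)."""
        played = {p: [ep[p] for ep in episode_results if p in ep] for p in entries}
        scores = {p: fmean(vals) for p, vals in played.items() if vals}
        counts = {p: len(vals) for p, vals in played.items()}
        return scores, counts

    def complete_round(self, entries, episode_results):
        # Ranking and metadata assembly stay in one place; only the
        # per-policy scoring step is delegated to the hook.
        scores, counts = self._round_scores_by_policy(entries, episode_results)
        ranking = sorted(scores, key=scores.get, reverse=True)
        return {
            "scores": scores,
            "ranked_counts": counts,
            "ranking": ranking,
            "metadata": {"score_kind": self.score_kind},
        }


class RulesetStrategyCommissioner(BaselineCommissioner):
    def __init__(self, round_score="mean"):
        self.round_score = round_score
        if round_score == "rank":
            self.score_kind = "rank_episode_round_score"

    def _round_scores_by_policy(self, entries, episode_results):
        if self.round_score == "rank":
            # Convert each episode to rank points, then reuse the mean logic.
            episode_results = [
                {p: len(ep) - sum(o > s for o in ep.values()) for p, s in ep.items()}
                for ep in episode_results
            ]
        return super()._round_scores_by_policy(entries, episode_results)


def comparable_results(results, score_kind):
    """Keep only round results produced under the given scoring regime."""
    return [r for r in results if r["metadata"].get("score_kind") == score_kind]


episodes = [{"a": 1.0, "b": 5.0}, {"a": 2.0, "b": 0.0}]
old = RulesetStrategyCommissioner().complete_round(["a", "b"], episodes)
new = RulesetStrategyCommissioner("rank").complete_round(["a", "b"], episodes)
visible = comparable_results([old, new], "rank_episode_round_score")
```

Here `old` carries mean scores (`a` 1.5, `b` 2.5) and `new` carries rank points (1.5 each), so filtering on the tag keeps the two scales from sharing a leaderboard.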

Default stays "mean", so every existing config behaves exactly as before.
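
As a rough illustration of the opt-in default, a config class along these lines would leave existing configs on mean scoring. This is a plain-dataclass sketch under assumptions; the real `ScoringConfig` may be defined and validated differently.

```python
from dataclasses import dataclass
from typing import Literal, get_args

RoundScore = Literal["mean", "rank"]


@dataclass(frozen=True)
class ScoringConfig:
    # Hypothetical stand-in for the real config class; only the field name
    # and its two accepted values come from the PR description.
    round_score: RoundScore = "mean"

    def __post_init__(self) -> None:
        if self.round_score not in get_args(RoundScore):
            raise ValueError(f"unsupported round_score: {self.round_score!r}")


legacy = ScoringConfig()                  # config that never mentions round_score
ranked = ScoringConfig(round_score="rank")
```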

Tests

- New `test_ruleset_strategy_rank_round_score_uses_per_episode_placement`: same episode inputs as the existing mean test, asserts round scores become the mean per-episode rank points and the rank `score_kind` tag is applied.
- Full suite: 91 passing. Changed files are ruff-clean (two pre-existing unused-import warnings in `utils.py` are untouched).

Follow-up (not in this PR)

- metta: set agricogla's `agricogla-commissioner.yaml` to `scoring: {round_score: rank}` and bump the commissioners-default image digest once this is merged + the image is rebuilt/published.
- The app-backend leaderboard needs no change: it aggregates the commissioner's round score, which becomes mean per-episode rank points.

🤖 Generated with Claude Code

… scoring)

The ruleset_strategy commissioner scored every round by the mean of each policy's per-episode scores. Add an opt-in `scoring.round_score: "rank"` mode: within each episode policies are ranked by score and earn N..1 rank points (winner of an N-policy episode gets N, last gets 1, ties share the better place), and a policy's round score is the mean of those rank points across the episodes it played. Margins of victory are discarded; only placement matters.

complete_round now delegates per-policy round scoring to an overridable _round_scores_by_policy; the base keeps mean scoring, RulesetStrategyCommissioner switches to rank points when configured. Rank rounds are tagged with a distinct score_kind so switching a league from mean to rank filters the now-incomparable prior-regime results off the leaderboard instead of blending score scales. Default stays "mean", so existing configs are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

nishu-builder · 2026-06-22T20:01:35Z

+    "Rounds rank policies by placement within each episode rather than by raw score: in an episode with N "
+    "policies the highest-scoring policy earns N points and the lowest earns 1 (ties share the better place), and "
+    "a policy's round score is the average of those rank points across the episodes it played. Margins of victory "
+    "are discarded — only who beat whom each game matters. The division leaderboard combines completed rounds with "


@KyleHerndon note that right now, with division leaderboard computation and commissioner leaderboard computation split the way they are, and with our UI only showing the commissioner-reported description, we force the commissioner to leak an abstraction: it has to describe how its round results get managed by app-backend

sorry, more plainly: ideally we wouldn't need commissioners to say that their round results get 2h-EWMA'd; they shouldn't need to know about it or talk about it, and they can't enforce it

daveey assigned nishu-builder Jun 22, 2026

daveey requested a review from nishu-builder June 22, 2026 19:58

nishu-builder reviewed Jun 22, 2026


nishu-builder approved these changes Jun 22, 2026


daveey merged commit 3db3b57 into main Jun 22, 2026
7 checks passed

daveey deleted the daveey/round-score-rank-by-episode branch June 22, 2026 20:10

daveey mentioned this pull request Jun 22, 2026

feat(configs): bundle agricogla commissioner config + publish its image #49

Merged

daveey merged 1 commit into main from daveey/round-score-rank-by-episode
