feat: generic ablation/comparison across runs

`gym eval run` collects rollouts for exactly **one** configuration. To compare variants of any knob — skills (`+skills.path=`), prompts (`+prompt_config=`), models (`--model`), sampling params (`+responses_create_params.temperature=`), etc. — you run it N times changing a single Hydra override and then diff/group the resulting JSONL artifacts by hand. There is no first-class way to ablate one variable across runs and get a grouped comparison report.
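For illustration, the manual workflow today amounts to something like the sketch below: one `gym eval run` invocation per variant, each differing in a single Hydra override. The helper name `sweep_commands` is hypothetical, and a real invocation would carry whatever other arguments the run needs; this only builds and prints the command lines, it does not execute them.

```python
import shlex


def sweep_commands(override_key, values, base=("gym", "eval", "run")):
    """Build one command line per variant, varying a single Hydra override.

    `override_key` is an override name taken from the issue text, e.g.
    "+skills.path" or "+responses_create_params.temperature".
    Any other arguments a real run needs are deliberately omitted here.
    """
    return [[*base, f"{override_key}={value}"] for value in values]


if __name__ == "__main__":
    # Three temperature variants -> three separate runs to launch by hand.
    for cmd in sweep_commands("+responses_create_params.temperature", [0.0, 0.7, 1.0]):
        print(shlex.join(cmd))
```

Each printed command produces its own artifact, which is what leaves the grouping and comparison step to the user.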

This surfaced in #1605 (skill evaluation): skills are now a sweepable run-level variable stamped with a content-hashed `skills_ref` grouping key, so per-variant attribution already exists — but the actual "compare A vs B" step is still manual. The gap is **generic**, not skills-specific (the same is true for prompts, models, and sampling).
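As a rough sketch of what the "compare A vs B" step could look like, the snippet below groups rollout records by a grouping key and reports a per-variant count and mean. Only `skills_ref` comes from #1605; the `reward` field name, the record layout, and the function names are assumptions made for illustration, not the actual artifact schema.

```python
import json
from collections import defaultdict
from statistics import mean


def load_rollouts(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def compare_by(records, group_key="skills_ref", metric="reward"):
    """Group records by `group_key` and summarize `metric` per group.

    `metric` ("reward") is a hypothetical field name; substitute whatever
    numeric field the rollout artifacts actually carry.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(group_key, "<missing>")].append(float(rec[metric]))
    return {key: {"n": len(vals), "mean": mean(vals)} for key, vals in groups.items()}


if __name__ == "__main__":
    sample = [
        '{"skills_ref": "abc123", "reward": 1.0}',
        '{"skills_ref": "abc123", "reward": 0.0}',
        '{"skills_ref": "def456", "reward": 1.0}',
    ]
    for variant, stats in compare_by(load_rollouts(sample)).items():
        print(variant, stats)
```

Because the grouping is keyed on a field rather than on a pair of files, the same shape extends to any number of variants and, given a per-dimension key, to prompts, models, or sampling params.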

Filing to track. The design — how to express the swept variable, provenance/grouping per dimension, where it lives in the CLI, and what the comparison report should show — needs holistic thinking before committing to an approach.

Note: we should likely support N-wise ablations, not just pairwise comparisons.

Refs: #1605, #1256, epic #1494.

feat: generic ablation/comparison across runs #1747