`gym eval run` collects rollouts for exactly one configuration. To compare variants of any knob, such as skills (`+skills.path=`), prompts (`+prompt_config=`), models (`--model`), or sampling params (`+responses_create_params.temperature=`), you run it N times, changing a single Hydra override each time, and then diff/group the resulting JSONL artifacts by hand. There is no first-class way to ablate one variable across runs and get a grouped comparison report.
This surfaced in #1605 (skill evaluation): skills are now a sweepable run-level variable stamped with a content-hashed `skills_ref` grouping key, so per-variant attribution already exists, but the actual "compare A vs B" step is still manual. The gap is generic, not skills-specific: the same is true for prompts, models, and sampling.
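For reference, the manual grouping step today looks roughly like the sketch below. It is illustrative only: it assumes each JSONL record carries the `skills_ref` key mentioned above plus a numeric `reward` field, and the `reward` name and the `group_rollouts` helper are hypothetical, not part of the current CLI or artifact schema.

```python
import json
from collections import defaultdict
from statistics import mean


def group_rollouts(lines, key="skills_ref", metric="reward"):
    """Group JSONL rollout records by a variant key and summarize one metric.

    `key` and `metric` are assumed field names; adjust to the real schema.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Records without the grouping key go into an explicit "unknown" bucket
        groups[record.get(key, "unknown")].append(float(record[metric]))
    # One summary row per variant, so this handles N variants, not just two
    return {
        variant: {"n": len(values), "mean": mean(values)}
        for variant, values in sorted(groups.items())
    }


if __name__ == "__main__":
    sample = [
        '{"skills_ref": "a", "reward": 1.0}',
        '{"skills_ref": "b", "reward": 0.0}',
        '{"skills_ref": "a", "reward": 0.0}',
    ]
    for variant, stats in group_rollouts(sample).items():
        print(variant, stats)
```

Because the grouping key is a parameter, the same helper would work for any other per-dimension key (prompt, model, sampling) once such keys are stamped on the records, which is part of what the design needs to settle.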
Filing to track. The design — how to express the swept variable, provenance/grouping per dimension, where it lives in the CLI, and what the comparison report should show — needs holistic thinking before committing to an approach.
Note: we should likely consider N-wise ablations, not just pairwise ones.
Refs: #1605, #1256, epic #1494.