feat: generic ablation/comparison across runs #1747

Description

@cwing-nvidia

gym eval run collects rollouts for exactly one configuration. To compare variants of any knob (skills via +skills.path=, prompts via +prompt_config=, models via --model, sampling params via +responses_create_params.temperature=, etc.), you run it N times, changing a single Hydra override each time, and then diff/group the resulting JSONL artifacts by hand. There is no first-class way to ablate one variable across runs and get a grouped comparison report.

This surfaced in #1605 (skill evaluation): skills are now a sweepable run-level variable stamped with a content-hashed skills_ref grouping key, so per-variant attribution already exists — but the actual "compare A vs B" step is still manual. The gap is generic, not skills-specific (the same is true for prompts, models, and sampling).
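For illustration, the manual "group and compare" step might look roughly like the sketch below. It assumes each JSONL line is one rollout record carrying the skills_ref grouping key plus a numeric score; the field name `reward`, the helper `group_rollouts`, and the sample records are hypothetical and not taken from the actual artifact schema.

```python
import json
from collections import defaultdict
from statistics import mean


def group_rollouts(lines, key="skills_ref", metric="reward"):
    """Group JSONL rollout records by `key` and average `metric` per group.

    `key` and `metric` are assumed field names, not a confirmed schema.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the artifact
        record = json.loads(line)
        # Records without the grouping key land in one explicit bucket.
        groups[record.get(key, "<missing>")].append(float(record[metric]))
    return {
        variant: {"n": len(values), "mean": mean(values)}
        for variant, values in groups.items()
    }


# Hypothetical records standing in for the merged JSONL output of several runs.
sample = [
    '{"skills_ref": "a1b2", "reward": 1.0}',
    '{"skills_ref": "a1b2", "reward": 0.0}',
    '{"skills_ref": "c3d4", "reward": 1.0}',
]
report = group_rollouts(sample)
```

Because the grouping key is a parameter, the same function would cover prompts, models, or sampling params, provided each run stamps an equivalent provenance field on its records. That stamping is the part that does not exist generically today.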

Filing to track. The design — how to express the swept variable, provenance/grouping per dimension, where it lives in the CLI, and what the comparison report should show — needs holistic thinking before committing to an approach.

Note: we should likely consider N-wise ablations, not just pairwise comparisons.
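To make the N-wise point concrete, here is a minimal sketch of two ways a report could summarize N variants: deltas against a chosen baseline (N-1 rows) versus all pairwise deltas (N(N-1)/2 rows). The function names, variant labels, and per-variant means are hypothetical; this is not a proposed report format.

```python
from itertools import combinations


def deltas_vs_baseline(means, baseline):
    """Delta of each variant's mean metric relative to one baseline variant."""
    base = means[baseline]
    return {v: m - base for v, m in means.items() if v != baseline}


def pairwise_deltas(means):
    """Delta for every unordered pair of variants (grows quadratically in N)."""
    return {(a, b): means[b] - means[a] for a, b in combinations(sorted(means), 2)}


# Hypothetical per-variant means, e.g. produced by grouping rollouts by variant.
means = {"base": 0.5, "v1": 0.75, "v2": 0.25}
vs_base = deltas_vs_baseline(means, "base")
all_pairs = pairwise_deltas(means)
```

With three variants both views are small, but the pairwise table grows quickly, which is one reason a baseline-relative view may scale better for N-wise ablations.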

Refs: #1605, #1256, epic #1494.
