[Discussion] Optimizer-strength fairness in Table 1: SkillOpt uses GPT-5.5 as optimizer while several baselines (GEPA, Trace2Skill, EvoSkill) are self-contained by design #12

@Seanium

Summary

First, thank you for the careful, well-documented release — the deep-learning-style
framing of skill optimization is genuinely useful, and the artifact (best_skill.md
plus the bounded-edit loop) is a clean abstraction.

I'd like to raise a question about the optimizer-strength assumption underlying the
headline result ("52/52 cells best or tied-best") in Table 1, and propose two
clarifications / additional experiments that I think would make the comparison
substantially more convincing.

Observation 1 — The paper aligns the target model across baselines, but not the optimizer

Section 4 ("Baselines") states:

"All baselines use the same target model, the same held-out test split, and the
same scorer for every benchmark."

This isolates the target/scorer/test-set factors, which is great. However, the
optimizer / teacher / reflection model used by each baseline is not specified in
either Section 4 or Appendix C. The only optimizer-related disclosure I could find
is that the one-shot LLM skill is generated by GPT-5.5, and that SkillOpt itself
defaults to GPT-5.5 as the optimizer.

This matters because the baselines' original papers differ on whether they
introduce a stronger optimizer at all:

| Method | Original-paper default |
|---|---|
| TextGrad | "improve weaker model using feedback from stronger models" |
| GEPA | "optimized entirely for (and using) the weaker Qwen3-8B" |
| Trace2Skill | "a single LLM…with no external teacher" |
| EvoSkill | "The underlying model remains frozen throughout" |

If SkillOpt uses GPT-5.5 as optimizer while GEPA / Trace2Skill / EvoSkill follow
their original self-contained design, the headline comparison conflates two
distinct factors: (i) the value of the disciplined loop and (ii) the value of an
additional, stronger optimizer model. It would help readers a lot to know which
optimizer/teacher each baseline actually used in the SkillOpt re-implementation.

Could you confirm, for each of TextGrad / GEPA / Trace2Skill / EvoSkill,
the exact model used for the reflection / backward / distillation / evolution step
in Table 1?

Observation 2 — Table 5 ablates SkillOpt's optimizer but does not re-run the baselines under the same constraint

Table 5 shows that swapping GPT-5.5 for a target-matched optimizer keeps 56–74%
of SkillOpt's gain. That establishes "SkillOpt is not pure distillation" — good.

But Table 5 does not include any baseline column. If I take the target-matched
SkillOpt numbers from Table 5 and drop them back into the corresponding cells of
Table 1, the comparison shifts noticeably:

| Cell | Best baselines in Table 1 | SkillOpt (Strong, GPT-5.5) | SkillOpt (target-matched) |
|---|---|---|---|
| SpreadsheetBench, GPT-5.4-mini | Human 42.9 / GEPA 42.5 | 47.5 | 43.2 (still #1, +0.3) |
| SpreadsheetBench, GPT-5.4-nano | Human 41.8 / GEPA 37.2 | 42.5 | 35.4 (#4–5, below Human & GEPA) |
| SearchQA, GPT-5.4-mini | GEPA 79.4 / Trace2Skill 78.6 | 80.2 | 78.3 (#4, below GEPA & Trace2Skill) |
| SearchQA, GPT-5.4-nano | TextGrad 73.4 / GEPA 73.2 | 74.8 | 69.9 (#5, below TextGrad & GEPA) |

(Baseline numbers taken from Table 1 rows; target-matched SkillOpt numbers taken
from Table 5. Please correct me if I've misread any cell.)

In 3 of these 4 cells, SkillOpt under a target-matched optimizer no longer leads
the strongest baseline in the same row; it trails by 1.1 to 6.4 points. This
doesn't invalidate the loop's contribution, but it does suggest that part of the
52/52 dominance is driven by an optimizer-strength asymmetry, not solely by the
disciplined loop.
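For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the margins from the numbers quoted in this issue. The scores are copied from the table above (which only lists the two strongest baselines per cell, so full ranks cannot be recomputed here); the names `CELLS` and `margins` are mine, not from the paper or the repo.

```python
# Scores copied from the table in this issue:
# baselines from Table 1, target-matched SkillOpt from Table 5.
CELLS = [
    # (benchmark, target model, {baseline: score}, SkillOpt strong, SkillOpt target-matched)
    ("SpreadsheetBench", "GPT-5.4-mini", {"Human": 42.9, "GEPA": 42.5}, 47.5, 43.2),
    ("SpreadsheetBench", "GPT-5.4-nano", {"Human": 41.8, "GEPA": 37.2}, 42.5, 35.4),
    ("SearchQA", "GPT-5.4-mini", {"GEPA": 79.4, "Trace2Skill": 78.6}, 80.2, 78.3),
    ("SearchQA", "GPT-5.4-nano", {"TextGrad": 73.4, "GEPA": 73.2}, 74.8, 69.9),
]


def margins(cells):
    """Return SkillOpt's margin over the strongest listed baseline, per cell."""
    rows = []
    for bench, target, baselines, strong, matched in cells:
        best_name, best_score = max(baselines.items(), key=lambda kv: kv[1])
        rows.append({
            "cell": f"{bench}, {target}",
            "best_baseline": best_name,
            # Positive margin: SkillOpt leads; negative: it trails.
            "strong_margin": round(strong - best_score, 1),
            "matched_margin": round(matched - best_score, 1),
        })
    return rows


if __name__ == "__main__":
    for row in margins(CELLS):
        print(f"{row['cell']:32s} vs {row['best_baseline']:9s} "
              f"strong {row['strong_margin']:+.1f}  matched {row['matched_margin']:+.1f}")
```

With GPT-5.5 as optimizer, every margin is positive; with the target-matched optimizer, three of the four turn negative. If I have misread a cell, correcting the number in `CELLS` is enough to re-check the conclusion.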

What would substantially strengthen the paper

I think two relatively small additions would resolve this ambiguity:

  1. Disclose the optimizer/teacher model used by every baseline in Table 1, ideally as a
    per-method footnote (e.g. "GEPA reflection = target model, per original paper"
    or "TextGrad backward engine = GPT-5.5"). This is the cheapest fix and on its
    own would already let readers interpret the gap correctly.

  2. Extend Table 5 (or add a new table) with a baseline column for at least
    the four cells already evaluated — i.e. re-run TextGrad / GEPA / Trace2Skill /
    EvoSkill under the same target-matched optimizer constraint as SkillOpt-TM,
    and report whether SkillOpt-TM still wins. This is the single most informative
    experiment for isolating the contribution of the SkillOpt loop.

A more ambitious version would also report training-time token cost per method, so
the comparison is compute-matched as well as optimizer-matched — but I recognize
that's a larger ask.
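To make the size of the second request concrete, here is a rough sketch of the run matrix it implies, assuming the four cells in my table above are the ones Table 5 covers. `RunSpec`, `target_matched_grid`, and the `train_tokens` field are hypothetical names for illustration only; they are not part of the SkillOpt codebase.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Baselines named in this issue; SkillOpt-TM is already reported in Table 5.
METHODS = ["TextGrad", "GEPA", "Trace2Skill", "EvoSkill"]

# Assumption: the four (benchmark, target model) cells quoted in this issue.
CELLS = [
    ("SpreadsheetBench", "GPT-5.4-mini"),
    ("SpreadsheetBench", "GPT-5.4-nano"),
    ("SearchQA", "GPT-5.4-mini"),
    ("SearchQA", "GPT-5.4-nano"),
]


@dataclass(frozen=True)
class RunSpec:
    method: str
    benchmark: str
    target_model: str
    optimizer_model: str                 # the model doing reflection / backward / evolution
    train_tokens: Optional[int] = None   # hypothetical slot for compute-matched reporting


def target_matched_grid(methods=METHODS, cells=CELLS):
    """One run per (method, cell), with the optimizer set equal to the target model."""
    return [
        RunSpec(method, bench, target, optimizer_model=target)
        for method, (bench, target) in product(methods, cells)
    ]


if __name__ == "__main__":
    grid = target_matched_grid()
    print(f"{len(grid)} additional runs")
    for spec in grid:
        print(spec)
```

That comes to 16 additional runs (4 baselines × 4 cells), and the same `optimizer_model` field could double as the per-method disclosure asked for in the first item.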

To be clear

None of the above implies the loop is not contributing real value — the gate,
edit budget, rejected-edit buffer, and slow/meta update are sensible mechanisms,
and several of the headline numbers (e.g. SpreadsheetBench +38.9 on GPT-5.5,
ALFWorld 2× on GPT-5.4-nano) are large enough that an optimizer-asymmetry
explanation alone seems unlikely to account for all of them. The request is just
to make the optimizer-side experimental conditions explicit and symmetric so the
"52/52" claim is unambiguous.

Happy to help by sharing the re-derived table above in a more structured form, or
by reviewing a draft of the additional experiment if useful. Thanks again for the
work.
