Summary
First, thank you for the careful, well-documented release — the deep-learning-style
framing of skill optimization is genuinely useful, and the artifact (best_skill.md
plus the bounded-edit loop) is a clean abstraction.
I'd like to raise a question about the optimizer-strength assumption underlying the
headline result ("52/52 cells best or tied-best") in Table 1, and propose two
clarifications / additional experiments that I think would make the comparison
substantially more convincing.
Observation 1 — The paper aligns the target model across baselines, but not the optimizer
Section 4 ("Baselines") states:
"All baselines use the same target model, the same held-out test split, and the
same scorer for every benchmark."
This isolates the target/scorer/test-set factors, which is great. However, the
optimizer / teacher / reflection model used by each baseline is not specified in
either Section 4 or Appendix C. The only optimizer-related disclosure I could find
is that the one-shot LLM skill is generated by GPT-5.5, and that SkillOpt itself
defaults to GPT-5.5 as the optimizer.
This matters because the baselines' original papers differ on whether they
introduce a stronger optimizer at all:
| Method | Original-paper default |
| --- | --- |
| TextGrad | "improve weaker model using feedback from stronger models" |
| GEPA | "optimized entirely for (and using) the weaker Qwen3-8B" |
| Trace2Skill | "a single LLM…with no external teacher" |
| EvoSkill | "The underlying model remains frozen throughout" |
If SkillOpt uses GPT-5.5 as optimizer while GEPA / Trace2Skill / EvoSkill follow
their original self-contained design, the headline comparison conflates two
distinct factors: (i) the value of the disciplined loop and (ii) the value of an
additional, stronger optimizer model. It would help readers a lot to know which
optimizer/teacher each baseline actually used in the SkillOpt re-implementation.
Could you confirm, for each of TextGrad / GEPA / Trace2Skill / EvoSkill,
the exact model used for the reflection / backward / distillation / evolution step
in Table 1?
Observation 2 — Table 5 ablates SkillOpt's optimizer but does not re-run the baselines under the same constraint
Table 5 shows that swapping GPT-5.5 for a target-matched optimizer keeps 56–74%
of SkillOpt's gain. That establishes "SkillOpt is not pure distillation" — good.
But Table 5 does not include any baseline column. If I take the target-matched
SkillOpt numbers from Table 5 and drop them back into the corresponding cells of
Table 1, the comparison shifts noticeably:
| Cell | Best baseline in Table 1 | SkillOpt (Strong, GPT-5.5) | SkillOpt (target-matched) |
| --- | --- | --- | --- |
| SpreadsheetBench, GPT-5.4-mini | Human 42.9 / GEPA 42.5 | 47.5 | 43.2 (still #1, +0.3) |
| SpreadsheetBench, GPT-5.4-nano | Human 41.8 / GEPA 37.2 | 42.5 | 35.4 (#4–5, below Human & GEPA) |
| SearchQA, GPT-5.4-mini | GEPA 79.4 / Trace2Skill 78.6 | 80.2 | 78.3 (#4, below GEPA & Trace2Skill) |
| SearchQA, GPT-5.4-nano | TextGrad 73.4 / GEPA 73.2 | 74.8 | 69.9 (#5, below TextGrad & GEPA) |
(Baseline numbers taken from Table 1 rows; target-matched SkillOpt numbers taken
from Table 5. Please correct me if I've misread any cell.)
In 3 of these 4 cells, SkillOpt under a target-matched optimizer no longer leads
the strongest baseline in the same row. This doesn't invalidate the loop's
contribution, but it does suggest that part of the 52/52 dominance is driven by
an optimizer-strength asymmetry, not solely by the disciplined loop.
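For anyone who wants to check the arithmetic, here is a minimal sketch of the comparison. It uses only the numbers quoted in the table above; the dictionary layout and the function name `margins` are my own illustration, and only the two baselines quoted per cell are included, not the full Table 1 rows.

```python
# Scores copied from the re-derived table above:
# baselines from Table 1, target-matched SkillOpt from Table 5.
# Only the two baselines quoted per cell are listed (assumption: these
# are the strongest ones in each Table 1 row, as stated in the table).
CELLS = {
    ("SpreadsheetBench", "GPT-5.4-mini"): {
        "baselines": {"Human": 42.9, "GEPA": 42.5}, "skillopt_tm": 43.2},
    ("SpreadsheetBench", "GPT-5.4-nano"): {
        "baselines": {"Human": 41.8, "GEPA": 37.2}, "skillopt_tm": 35.4},
    ("SearchQA", "GPT-5.4-mini"): {
        "baselines": {"GEPA": 79.4, "Trace2Skill": 78.6}, "skillopt_tm": 78.3},
    ("SearchQA", "GPT-5.4-nano"): {
        "baselines": {"TextGrad": 73.4, "GEPA": 73.2}, "skillopt_tm": 69.9},
}


def margins(cells):
    """Return {cell: (best_baseline_name, skillopt_tm - best_baseline_score)}."""
    out = {}
    for cell, row in cells.items():
        name, score = max(row["baselines"].items(), key=lambda kv: kv[1])
        # Round to one decimal to match the precision reported in the tables.
        out[cell] = (name, round(row["skillopt_tm"] - score, 1))
    return out


if __name__ == "__main__":
    result = margins(CELLS)
    for (bench, target), (name, delta) in result.items():
        print(f"{bench} / {target}: vs {name} {delta:+.1f}")
    behind = sum(1 for _, delta in result.values() if delta < 0)
    print(f"target-matched SkillOpt trails the best baseline in {behind} of {len(result)} cells")
```

A negative margin means target-matched SkillOpt is below the strongest quoted baseline in that cell.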
What would substantially strengthen the paper
I think two relatively small additions would resolve this ambiguity:
- Disclose the optimizer/teacher model used by every baseline in Table 1, ideally as a
  per-method footnote (e.g. "GEPA reflection = target model, per original paper"
  or "TextGrad backward engine = GPT-5.5"). This is the cheapest fix and on its
  own would already let readers interpret the gap correctly.
- Extend Table 5 (or add a new table) with a baseline column for at least
  the four cells already evaluated, i.e. re-run TextGrad / GEPA / Trace2Skill /
  EvoSkill under the same target-matched optimizer constraint as SkillOpt-TM,
  and report whether SkillOpt-TM still wins. This is the single most informative
  experiment for isolating the contribution of the SkillOpt loop.
A more ambitious version would also report training-time token cost per method, so
the comparison is compute-matched as well as optimizer-matched — but I recognize
that's a larger ask.
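To make the size of the second request concrete, here is a rough sketch of the run grid it implies. The field names and the function `target_matched_runs` are hypothetical and not taken from the SkillOpt codebase; the only assumption about the setup is that "target-matched" means the optimizer model is set equal to the target model.

```python
from itertools import product

# The four baselines and the four (benchmark, target) cells named in this review.
BASELINES = ["TextGrad", "GEPA", "Trace2Skill", "EvoSkill"]
CELLS = [
    ("SpreadsheetBench", "GPT-5.4-mini"),
    ("SpreadsheetBench", "GPT-5.4-nano"),
    ("SearchQA", "GPT-5.4-mini"),
    ("SearchQA", "GPT-5.4-nano"),
]


def target_matched_runs(methods=BASELINES, cells=CELLS):
    """Enumerate one run per (method, cell) with optimizer == target model."""
    return [
        {"method": m, "benchmark": b, "target": t, "optimizer": t}
        for m, (b, t) in product(methods, cells)
    ]


if __name__ == "__main__":
    runs = target_matched_runs()
    print(f"{len(runs)} additional runs")
    for run in runs:
        print(run)
```

Under these assumptions the request amounts to 16 additional optimization runs, since the SkillOpt-TM numbers for the same cells already exist in Table 5.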
To be clear
None of the above implies the loop is not contributing real value — the gate,
edit budget, rejected-edit buffer, and slow/meta update are sensible mechanisms,
and several of the headline numbers (e.g. SpreadsheetBench +38.9 on GPT-5.5,
ALFWorld 2× on GPT-5.4-nano) are large enough that an optimizer-asymmetry
explanation alone seems unlikely to account for all of them. The request is just
to make the optimizer-side experimental conditions explicit and symmetric so the
"52/52" claim is unambiguous.
Happy to help by sharing the re-derived table above in a more structured form, or
by reviewing a draft of the additional experiment if useful. Thanks again for the
work.