[Discussion] Optimizer-strength fairness in Table 1: SkillOpt uses GPT-5.5 as optimizer while several baselines (GEPA, Trace2Skill, EvoSkill) are self-contained by design #12

## Summary

First, thank you for the careful, well-documented release — the deep-learning-style
framing of skill optimization is genuinely useful, and the artifact (`best_skill.md`
plus the bounded-edit loop) is a clean abstraction.

I'd like to raise a question about the optimizer-strength assumption underlying the
headline result ("52/52 cells best or tied-best") in Table 1, and propose two
clarifications / additional experiments that I think would make the comparison
substantially more convincing.

## Observation 1 — The paper aligns the *target* model across baselines, but not the *optimizer*

Section 4 ("Baselines") states:

> "All baselines use the same target model, the same held-out test split, and the
>  same scorer for every benchmark."

This isolates the target/scorer/test-set factors, which is great. However, the
*optimizer / teacher / reflection model* used by each baseline is not specified in
either Section 4 or Appendix C. The only optimizer-related disclosure I could find
is that the one-shot LLM skill is generated by GPT-5.5, and that SkillOpt itself
defaults to GPT-5.5 as the optimizer.

This matters because the baselines' original papers differ on whether they
introduce a stronger optimizer at all:

| Method      | Original-paper default                                     |
|-------------|------------------------------------------------------------|
| TextGrad    | "improve weaker model using feedback from stronger models" |
| GEPA        | "optimized entirely for (and using) the weaker Qwen3-8B"   |
| Trace2Skill | "a single LLM…with no external teacher"                    |
| EvoSkill    | "The underlying model remains frozen throughout"           |

If SkillOpt uses GPT-5.5 as optimizer while GEPA / Trace2Skill / EvoSkill follow
their original self-contained design, the headline comparison conflates two
distinct factors: (i) the value of the disciplined loop and (ii) the value of an
additional, stronger optimizer model. It would help readers a lot to know which
optimizer/teacher each baseline actually used in the SkillOpt re-implementation.

**Could you confirm, for each of TextGrad / GEPA / Trace2Skill / EvoSkill,
the exact model used for the reflection / backward / distillation / evolution step
in Table 1?**

## Observation 2 — Table 5 ablates SkillOpt's optimizer but does not re-run the baselines under the same constraint

Table 5 shows that swapping GPT-5.5 for a target-matched optimizer keeps 56–74%
of SkillOpt's gain. That establishes "SkillOpt is not pure distillation" — good.

But Table 5 does not include any baseline column. If I take the target-matched
SkillOpt numbers from Table 5 and drop them back into the corresponding cells of
Table 1, the comparison shifts noticeably:

| Cell                                  | Best baseline in Table 1 | SkillOpt (Strong, GPT-5.5) | SkillOpt (target-matched) |
|---------------------------------------|--------------------------|----------------------------|---------------------------|
| SpreadsheetBench, GPT-5.4-mini        | Human 42.9 / GEPA 42.5   | **47.5**                   | 43.2 (still #1, +0.3)     |
| SpreadsheetBench, GPT-5.4-nano        | Human 41.8 / GEPA 37.2   | **42.5**                   | 35.4 (#4–5, below Human & GEPA) |
| SearchQA, GPT-5.4-mini                | GEPA 79.4 / Trace2Skill 78.6 | **80.2**                | 78.3 (#4, below GEPA & Trace2Skill) |
| SearchQA, GPT-5.4-nano                | TextGrad 73.4 / GEPA 73.2    | **74.8**                | 69.9 (#5, below TextGrad & GEPA) |

(Baseline numbers taken from Table 1 rows; target-matched SkillOpt numbers taken
from Table 5. Please correct me if I've misread any cell.)

In 3 of these 4 cells, SkillOpt under a target-matched optimizer no longer leads
the *strongest* baseline in the same row. This doesn't invalidate the loop's
contribution, but it does suggest that part of the 52/52 dominance is driven by
an optimizer-strength asymmetry, not solely by the disciplined loop.
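For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the margins from the table above. It uses only the two strongest baselines I listed per cell, so it can tell whether target-matched SkillOpt still leads but cannot reproduce the full rank positions. The variable and function names are my own and are not taken from the SkillOpt codebase.

```python
# Scores copied from the table above: baselines from Table 1,
# target-matched SkillOpt from Table 5. Only the two strongest
# baselines per cell are included, so full ranks are not recoverable here.
CELLS = [
    # (benchmark, target model, {baseline: score}, SkillOpt strong, SkillOpt target-matched)
    ("SpreadsheetBench", "GPT-5.4-mini", {"Human": 42.9, "GEPA": 42.5}, 47.5, 43.2),
    ("SpreadsheetBench", "GPT-5.4-nano", {"Human": 41.8, "GEPA": 37.2}, 42.5, 35.4),
    ("SearchQA", "GPT-5.4-mini", {"GEPA": 79.4, "Trace2Skill": 78.6}, 80.2, 78.3),
    ("SearchQA", "GPT-5.4-nano", {"TextGrad": 73.4, "GEPA": 73.2}, 74.8, 69.9),
]


def margins(cells):
    """Return SkillOpt's margin over the best listed baseline for each cell."""
    rows = []
    for bench, target, baselines, strong, matched in cells:
        best_name, best_score = max(baselines.items(), key=lambda kv: kv[1])
        rows.append({
            "cell": f"{bench}, {target}",
            "best_baseline": best_name,
            "strong_margin": round(strong - best_score, 1),
            "matched_margin": round(matched - best_score, 1),
        })
    return rows


ROWS = margins(CELLS)
# Cells where target-matched SkillOpt falls below the best listed baseline.
LOST = [r["cell"] for r in ROWS if r["matched_margin"] < 0]

for r in ROWS:
    print(f'{r["cell"]}: strong {r["strong_margin"]:+.1f}, '
          f'target-matched {r["matched_margin"]:+.1f} vs {r["best_baseline"]}')
print(f"{len(LOST)} of {len(ROWS)} cells lose the lead under a target-matched optimizer")
```

With the numbers as I transcribed them, the GPT-5.5 optimizer leads by +0.7 to +4.6 points, while the target-matched margins are +0.3, -6.4, -1.1 and -3.5.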

## What would substantially strengthen the paper

I think two relatively small additions would resolve the ambiguity:

1. **Disclose the optimizer/teacher model used by every baseline in Table 1**, ideally as a
   per-method footnote (e.g. "GEPA reflection = target model, per original paper"
   or "TextGrad backward engine = GPT-5.5"). This is the cheapest fix and on its
   own would already let readers interpret the gap correctly.

2. **Extend Table 5 (or add a new table) with a baseline column** for at least
   the four cells already evaluated — i.e. re-run TextGrad / GEPA / Trace2Skill /
   EvoSkill *under the same target-matched optimizer constraint as SkillOpt-TM*,
   and report whether SkillOpt-TM still wins. This is the single most informative
   experiment for isolating the contribution of the SkillOpt loop.

A more ambitious version would also report training-time token cost per method, so
the comparison is compute-matched as well as optimizer-matched — but I recognize
that's a larger ask.
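To make the size of request 2 concrete, the run grid could be enumerated roughly as follows. This is only a sketch: the dictionary keys and the idea of a single `optimizer` field per run are my assumptions about how such an experiment might be configured, not something taken from the paper or the repository.

```python
from itertools import product

# Methods to compare under the same constraint (SkillOpt included as SkillOpt-TM).
METHODS = ["TextGrad", "GEPA", "Trace2Skill", "EvoSkill", "SkillOpt"]

# The four (benchmark, target model) cells discussed in Observation 2.
CELLS = [
    ("SpreadsheetBench", "GPT-5.4-mini"),
    ("SpreadsheetBench", "GPT-5.4-nano"),
    ("SearchQA", "GPT-5.4-mini"),
    ("SearchQA", "GPT-5.4-nano"),
]


def target_matched_grid(methods, cells):
    """Build one run per (method, cell) with the optimizer set to the target model."""
    return [
        {"method": method, "benchmark": bench, "target": target, "optimizer": target}
        for method, (bench, target) in product(methods, cells)
    ]


GRID = target_matched_grid(METHODS, CELLS)
print(f"{len(GRID)} runs in total")
for run in GRID[:3]:
    print(run)
```

That is 20 runs for the minimal version, four of which (the SkillOpt rows) already exist in Table 5.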

## To be clear

None of the above implies the loop is not contributing real value — the gate,
edit budget, rejected-edit buffer, and slow/meta update are sensible mechanisms,
and several of the headline numbers (e.g. SpreadsheetBench +38.9 on GPT-5.5,
ALFWorld 2× on GPT-5.4-nano) are large enough that an optimizer-asymmetry
explanation alone seems unlikely to account for all of them. The request is just
to make the optimizer-side experimental conditions explicit and symmetric so the
"52/52" claim is unambiguous.

Happy to help by sharing the re-derived table above in a more structured form, or
by reviewing a draft of the additional experiment if useful. Thanks again for the
work.
