Distillation: tighten prompt, fix metric, model-dependent Tier-1 limit#626

Merged: gnovak merged 1 commit into dev from fix-distillation-prompt-and-metrics on May 23, 2026

@gnovak gnovak commented May 23, 2026

Three independent fixes that came out of investigating bridge-analysis #392, where distillation compressed `leaderboard.py` down to 972 tokens (entire file → snippets only) and the status log reported "5.6K → 972 tokens, ~$0.05 saved". Both numbers were misleading.

1. Tighten the distill prompt against over-excerpting

When an issue references specific line numbers ("lines 312-313, 459-460, 534-535"), the distill LLM was extracting tiny snippets and discarding the surrounding file. The agent then kept saying "I need to read the full leaderboard.py" across iterations.

The prompt now has a hard rule: files to be modified are returned in FULL, regardless of how specific the task is. The user-message instructions and system prompt are consistent on this (no more "[full content or relevant excerpt]" escape hatch).
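A regression check for this rule could look roughly like the sketch below. It is not the test added in this PR: the function name, the sample prompt strings, and the choice to look for the phrase "in full" are all hypothetical, picked only to illustrate checking that both prompts state the rule and that neither still contains the old escape hatch.

```python
# Hypothetical sketch: the real prompt text and test names are not shown in this PR.
ESCAPE_HATCH = "[full content or relevant excerpt]"


def prompt_violations(system_prompt: str, user_instructions: str) -> list[str]:
    """Return human-readable problems found in the two distill prompts."""
    problems = []
    for name, text in (("system", system_prompt), ("user", user_instructions)):
        # The old wording let the LLM choose an excerpt instead of the file.
        if ESCAPE_HATCH in text:
            problems.append(f"{name} prompt still offers the excerpt escape hatch")
        # Assumed marker phrase for the full-file rule; the real wording may differ.
        if "in full" not in text.lower():
            problems.append(f"{name} prompt does not state the full-file rule")
    return problems


# Illustrative prompt strings, not the project's actual prompts.
good_system = "Files to be modified are returned in FULL, even if the task cites line numbers."
good_user = "Return every file to be modified in full."
bad_user = "For each file: [full content or relevant excerpt]"

print(prompt_violations(good_system, good_user))
print(prompt_violations(good_system, bad_user))
```

Checking both prompts in one pass is the point: the failure described above came from the two prompts disagreeing.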

2. Fix the metric

Status log read:
```
Distillation: 5.6K → 972 tokens (4.6K saved/iter × 30 iters = ~$0.05 saved)
```

That 5.6K was the size of EXTRA_FILES (README, AGENTS.md, etc.), NOT what distillation processed. The codebase that distillation actually scanned was 368K tokens. `maybe_distill()` now returns the codebase total as a 6th tuple element so `resolve.py` can pass the correct value into `build_distillation_summary()`.

For bridge-analysis #392:

| | Pre-token | Reported savings |
|---|---|---|
| Before | 5.6K (wrong) | ~$0.05 |
| After | 368K (correct) | ~$4.30 |
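The arithmetic behind the status line can be sketched as follows. This is an illustration, not the code in `build_distillation_summary()`: the function name is hypothetical, and the price of 0.39 USD per million input tokens is an assumed placeholder that only roughly reproduces the reported figures. The PR does not state the actual rate.

```python
# Hypothetical sketch of the savings arithmetic; not the project's implementation.
def estimate_savings(
    pre_tokens: int, post_tokens: int, iters: int, usd_per_million: float
) -> tuple[int, float]:
    """Return (tokens saved per iteration, estimated USD saved over all iterations)."""
    saved_per_iter = max(pre_tokens - post_tokens, 0)
    total_saved = saved_per_iter * iters
    return saved_per_iter, total_saved / 1_000_000 * usd_per_million


ASSUMED_PRICE = 0.39  # USD per million input tokens; placeholder, not from the PR

# Before the fix: pre_tokens was the size of EXTRA_FILES.
old_per_iter, old_usd = estimate_savings(5_600, 972, 30, ASSUMED_PRICE)
# After the fix: pre_tokens is the codebase total that distillation scanned.
new_per_iter, new_usd = estimate_savings(368_000, 972, 30, ASSUMED_PRICE)

print(f"old: {old_per_iter} tokens/iter, ~${old_usd:.2f}")
print(f"new: {new_per_iter} tokens/iter, ~${new_usd:.2f}")
```

Only `pre_tokens` changes between the two calls, which is why the reported savings move by roughly two orders of magnitude while the distilled size stays at 972 tokens.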

3. Model-dependent Tier-1-vs-Tier-2 cutoff

`DISTILL_SMALL_REPO_LIMIT` was a hardcoded 100K, set when 200K-window models were typical. The modern models we configure (claude-sonnet-4-6, claude-opus-4-7, gpt-5.5, gemini-2.5-flash) all have roughly 1M-token input windows, so much larger codebases fit in a single Tier-1 LLM call. Added `_small_repo_limit(model)`, which scales the cutoff to 25% of the model's input window, clamped to [50K, 500K]. The old 100K constant is the fallback for models LiteLLM doesn't know.

Effective Tier-1 ceiling per model:

| Model | Window | Tier-1 cutoff |
|---|---|---|
| claude-sonnet-4-6 / opus-4-7 | 1M | 250K |
| gpt-5.5 | 1.05M | 262K |
| gemini-2.5-flash | 1.05M | 262K |
| claude-sonnet-4-5 | 200K | 50K (floor) |
| unknown / not in LiteLLM | | 100K (legacy default) |
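The scaling rule in the table can be sketched like this. It is a minimal illustration under stated assumptions, not the merged `_small_repo_limit(model)`: the real function takes only the model name and presumably looks the window up via LiteLLM, whereas this sketch takes a hypothetical `windows` mapping so it runs on its own. The window sizes below are the rounded values from the table.

```python
from typing import Mapping

LEGACY_SMALL_REPO_LIMIT = 100_000  # old hardcoded value, kept as the fallback
MIN_LIMIT = 50_000                 # floor of the clamp
MAX_LIMIT = 500_000                # ceiling of the clamp
WINDOW_FRACTION = 0.25             # share of the input window given to Tier-1


def small_repo_limit_sketch(model: str, windows: Mapping[str, int]) -> int:
    """Scale the Tier-1 cutoff to 25% of the model's input window, clamped."""
    window = windows.get(model)
    if not window:
        # Unknown model: fall back to the legacy constant.
        return LEGACY_SMALL_REPO_LIMIT
    scaled = int(window * WINDOW_FRACTION)
    return max(MIN_LIMIT, min(MAX_LIMIT, scaled))


# Hypothetical lookup table using the rounded window sizes from the table above.
WINDOWS = {
    "claude-sonnet-4-6": 1_000_000,
    "gpt-5.5": 1_050_000,
    "claude-sonnet-4-5": 200_000,
}

for name in ("claude-sonnet-4-6", "gpt-5.5", "claude-sonnet-4-5", "some-unknown-model"):
    print(name, small_repo_limit_sketch(name, WINDOWS))
```

With a 200K window the scaled value lands exactly on the 50K floor, and any window above 2M tokens would be capped at 500K.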

Test plan

🤖 Generated with Claude Code

Three independent fixes investigated for bridge-analysis #392, where
distillation compressed leaderboard.py down to 972 tokens (entire
file → ~snippets) and the status log misleadingly reported "5.6K →
972 tokens, ~$0.05 saved".

## 1. Tighten the distill prompt against over-excerpting

When an issue references specific line numbers (e.g., "lines 312-313")
the distill LLM was extracting tiny snippets and discarding the
surrounding file. Prompt now has a hard rule: files to be modified
are included in FULL, regardless of how specific the task is. Also
addressed the specific line-number-excerpting failure mode in the
prompt text. The user-message instructions and system prompt are now
consistent on this (no more "[full content or relevant excerpt]"
escape hatch).

## 2. Fix the metric (pre_tokens reflects what distillation saw)

The status log read:
    "Distillation: 5.6K → 972 tokens (4.6K saved/iter × 30 iters)"

That 5.6K was the size of EXTRA_FILES (README, AGENTS.md, etc.) — NOT
what distillation processed. The codebase distillation actually
scanned was 368K tokens. maybe_distill() now returns the codebase
total as a 6th tuple element so resolve.py can pass the correct value
into build_distillation_summary().

Numbers for bridge-analysis #392, before and after this fix:
    OLD: 5.6K → 972 tokens, ~$0.05 saved
    NEW: 368K → 972 tokens, ~$4.30 saved

## 3. Model-dependent Tier-1-vs-Tier-2 cutoff

DISTILL_SMALL_REPO_LIMIT was a hardcoded 100K, set when 200K-window
models were typical. Modern models we configure (claude-sonnet-4-6,
claude-opus-4-7, gpt-5.5, gemini-2.5-flash) all have 1M-token input
windows, so we can comfortably send much larger codebases to a single
Tier-1 LLM call. Added _small_repo_limit(model) which scales the
cutoff to 25% of the model's input window, clamped to [50K, 500K].
Old 100K constant is the fallback for models LiteLLM doesn't know.

Effective Tier-1 ceiling per model:
- claude-sonnet-4-6, opus-4-7:  250K (1M × 25%)
- gpt-5.5:                       262K (1.05M × 25%)
- gemini-2.5-flash:              262K
- claude-sonnet-4-5 (200K win):   50K (floored)
- unknown / not in LiteLLM:      100K (legacy default)

662 tests pass (was 655; +7 for _small_repo_limit + prompt content).
@gnovak gnovak merged commit a2374ba into dev May 23, 2026
@gnovak gnovak deleted the fix-distillation-prompt-and-metrics branch June 13, 2026 03:43