Distillation: tighten prompt, fix metric, model-dependent Tier-1 limit #626
Merged
Three independent fixes from investigating bridge-analysis #392, where distillation compressed `leaderboard.py` down to 972 tokens (the entire file reduced to snippets) and the status log reported a misleading "5.6K → 972 tokens, ~$0.05 saved".
## 1. Tighten the distill prompt against over-excerpting
When an issue references specific line numbers ("lines 312-313, 459-460, 534-535"), the distill LLM was extracting tiny snippets and discarding the surrounding file. The agent then kept saying "I need to read the full leaderboard.py" across iterations.
The prompt now has a hard rule: files to be modified are returned in FULL, regardless of how specific the task is, and it calls out the line-number excerpting failure mode explicitly. The user-message instructions and the system prompt now agree on this (no more "[full content or relevant excerpt]" escape hatch).
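As a hedged illustration of a prompt-content regression check (the PR mentions new prompt-content tests but does not show them or the real prompt text), the sketch below encodes the hard rule in a hypothetical `DISTILL_SYSTEM_PROMPT`, pairs it with a hypothetical `DISTILL_USER_TEMPLATE`, and flags any lingering excerpt escape hatch. All three names and the wording are assumptions.

```python
# Hypothetical prompt text: the PR describes the rule but does not quote the real prompt.
DISTILL_SYSTEM_PROMPT = """\
You are preparing context for a coding agent.
HARD RULE: every file the task will modify must be included in FULL.
Do not excerpt a file to be modified, even if the issue cites specific
line numbers (e.g. "lines 312-313"); the agent needs the surrounding code.
Files that are only referenced may be summarized.
"""

# Hypothetical user-message template; it offers only "[full content]" for modified files.
DISTILL_USER_TEMPLATE = """\
Task:
{task}

For each file to be modified, output:
### <path>
[full content]
"""

ESCAPE_HATCH = "relevant excerpt"


def prompt_problems(system_prompt: str, user_template: str) -> list[str]:
    """Return inconsistencies between the two prompt parts (an empty list means OK)."""
    problems = []
    if "in FULL" not in system_prompt:
        problems.append("system prompt lacks the in-FULL rule")
    if "line numbers" not in system_prompt:
        problems.append("system prompt does not address line-number excerpting")
    # The escape hatch must be gone from BOTH parts, or they contradict each other.
    for name, text in (("system", system_prompt), ("user", user_template)):
        if ESCAPE_HATCH in text.lower():
            problems.append(f"{name} prompt still offers an excerpt escape hatch")
    return problems


if __name__ == "__main__":
    print(prompt_problems(DISTILL_SYSTEM_PROMPT, DISTILL_USER_TEMPLATE))  # → []
```

Checking both parts together is the point: a strict system prompt paired with a permissive user template is exactly the inconsistency described above.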
## 2. Fix the metric (`pre_tokens` reflects what distillation saw)
Status log read:
```
Distillation: 5.6K → 972 tokens (4.6K saved/iter × 30 iters = ~$0.05 saved)
```
That 5.6K was the size of `EXTRA_FILES` (README, AGENTS.md, etc.), NOT what distillation processed. The codebase that distillation actually scanned was 368K tokens. `maybe_distill()` now returns the codebase total as a 6th tuple element so `resolve.py` can pass the correct value into `build_distillation_summary()`.
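The log's savings figure is (pre − post) × iterations, so whichever value is passed as `pre` dominates the result. A minimal sketch of that arithmetic, using a hypothetical `distillation_savings()` helper and a hypothetical price per million input tokens (the PR does not state the price or the signature that the real `build_distillation_summary()` uses):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Savings:
    tokens_per_iter: int
    total_tokens: int
    usd: float


def distillation_savings(pre_tokens: int, post_tokens: int, iters: int,
                         usd_per_mtok: float) -> Savings:
    """Tokens saved per agent iteration and over the run, priced per million tokens.

    pre_tokens must be what distillation actually processed (the codebase total
    that maybe_distill() now returns), not the size of EXTRA_FILES.
    """
    per_iter = max(pre_tokens - post_tokens, 0)  # never report negative savings
    total = per_iter * iters
    return Savings(per_iter, total, total * usd_per_mtok / 1_000_000)


PRICE = 0.39  # hypothetical $/1M input tokens, for illustration only

old = distillation_savings(5_600, 972, 30, PRICE)    # pre = EXTRA_FILES size (wrong)
new = distillation_savings(368_000, 972, 30, PRICE)  # pre = scanned codebase (right)
print(f"old: {old.tokens_per_iter:,}/iter, ${old.usd:.2f}")  # → old: 4,628/iter, $0.05
print(f"new: {new.tokens_per_iter:,}/iter, ${new.usd:.2f}")  # → new: 367,028/iter, $4.29
```

At any fixed price the two estimates differ by (368,000 − 972) / (5,600 − 972), roughly 79×, which is why the corrected log looks so different.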
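For bridge-analysis #392, same run, before and after this fix:
- OLD: 5.6K → 972 tokens, ~$0.05 saved
- NEW: 368K → 972 tokens, ~$4.30 saved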
## 3. Model-dependent Tier-1 vs. Tier-2 cutoff
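`DISTILL_SMALL_REPO_LIMIT` was a hardcoded 100K, set when 200K-window models were typical. The models we currently configure (claude-sonnet-4-6, claude-opus-4-7, gpt-5.5, gemini-2.5-flash) all have input windows of roughly 1M tokens, so much larger codebases fit comfortably in a single Tier-1 LLM call. Added `_small_repo_limit(model)`, which sets the cutoff to 25% of the model's input window, clamped to [50K, 500K]. The old 100K constant is the fallback for models LiteLLM doesn't know.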
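Effective Tier-1 ceiling per model:
- claude-sonnet-4-6, claude-opus-4-7: 250K (1M × 25%)
- gpt-5.5: 262K (1.05M × 25%)
- gemini-2.5-flash: 262K
- claude-sonnet-4-5 (200K window): 50K (200K × 25%, also the floor)
- unknown / not in LiteLLM: 100K (legacy default)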
## Test plan

662 tests pass (was 655; +7 for `_small_repo_limit` and prompt content).
🤖 Generated with Claude Code