Prepares and validates training samples for a coding-agent LLM. Each sample is a
single agent turn: given the chat history of a coding task, produce a good
THOUGHT + one bash command.
Two responses are generated per turn and pitted against each other by a panel of LLM judges (a "duel"):
- king — a baseline answer produced blind (no knowledge of how the PR was resolved).
- challenger — an answer produced with the hidden PR information, but instructed to
behave as if it were solving the task honestly (see the blind-agent rule in
ch.py).
The challenger is the candidate we want to keep when it reliably beats the king.
Run the scripts in order; each consumes the previous stage's output directory.
prompt.py → king.py → ch.py → validate.py → report.py
│ │ │ │ │
prompt/ answer_*/ answer_*/ score/ report/report.md
commit/ (king field) (challenger) _duel.json final/ refined/ defeat/
- Samples turns from the HF dataset
AlienKevin/SWE-ZERO-12M-trajectories(random offset windows over the full split). - For each sampled turn, keeps the chat history up to that point plus the reference
answer(the dataset's own next turn). - Mirrors the source PR record (issue, files, commit, description) for each
instance_idfromnebius/SWE-rebench-V2-PRsvia duckdb. - Writes:
prompt/prompts_2000_*.json(≤200 prompts/file) andcommit/<instance_id>.json. - Run:
python prompt.py(requests 2000 prompts)
- Reads every prompt across all
prompt/*.jsonfiles (no per-instance cap). - Skips instance ids in
_EXCLUDE_IDS(e.g. oversized PRs — seereport/hugePR.md). - Calls the model with no PR information to get the king reply.
format_ok()rejects malformed outputs (those starting with<|tool_calls_section_begin|>); those are not saved.- Resumable: skips a prompt whose answer file already exists.
- Writes:
answer/<source>_<order>.jsonwith fields{instance_id, king, challenger:"", assistant, messages}. (source= prompt file stem,order= 1-based position within that file.) - Run:
python king.py
- Reads each
answer/*.json, loads the matchingcommit/<instance_id>.json(hidden PR info). - Prompts the model with the PR info + chat history to produce a plausible, honest-looking
next move (
THOUGHT+bash) — constrained by the blind-agent rule so it doesn't leak knowledge it couldn't have observed. - Writes: fills the
challengerfield in eachanswer/*.jsonin place. - Run:
python ch.py
- For each
answer/*.json, runs a pairwise comparison (king vs challenger) across a panel of judge models, scoring five dimensions: correctness, grounding, progress, protocol, efficiency. - Aggregates per-judge / per-metric scores and runs an acceptance gate (win margin, bootstrap LCB > 0, minimum fraction of turns parsed).
DUEL_MODE=1compares the challenger against the referenceassistantinstead of the king (DUEL_MODE=2, default).- Writes:
score/<answer_file>.jsonper turn andscore/_duel.json(overall verdict). - Run:
python validate.py
- Reads every per-turn score in
score/and pairs it with the matching answer file, looked up by stem across the answer dirs (answer_0/,answer_2/— see_find_answer). - Writes
report/report.md: header stats + a per-turn score table (score, file, action style, the five metric columns). - Buckets each parsed turn by its duel score and copies the answer (enriched with
score, per-metricmetrics, and judgereason) into one of three result dirs:export_final()— score ≥ 80 →final/(the accepted dataset)export_refined()— 66 ≤ score < 80 →refined/(borderline, worth refining)export_defeat()— score < 66 →defeat/(the challenger lost)- turns that failed to parse are counted as
parse-failand not exported. Thresholds are_FINAL_MIN = 80/_DEFEAT_MIN = 66. Exports copy (do not remove); re-running overwrites.
- Run:
python report.py
| Dir | Produced by | Contents |
|---|---|---|
prompt/ |
prompt.py | sampled prompt turns |
commit/ |
prompt.py | source PR records, keyed by instance_id |
answer_0/, answer_1/, answer_2/ |
king.py / ch.py | per-turn king + challenger replies (pending), sharded across answer dirs |
score/ |
validate.py | per-turn judge scores + _duel.json |
report/ |
report.py | report.md (and hugePR.md exclusion notes) |
final/ |
report.py | accepted answers, score ≥ 80 (from score/ + answer_*/) |
refined/ |
report.py | borderline answers, 66 ≤ score < 80 (from score/ + answer_*/) |
defeat/ |
report.py | losing answers, score < 66 (from score/ + answer_*/) |
.env(loaded by each script):CHUTES_API_KEY(required for the LLM calls), optionalCHUTES_BASE_URL, andHF_TOKEN/HUGGINGFACE_API_KEYforprompt.py.chain.toml(optional, read byvalidate.py):[judge]models / token limits and[duel]gate thresholds. Falls back to built-in defaults if absent.- Models: generation uses
moonshotai/Kimi-K2.6-TEE; judging uses a multi-model panel (DeepSeek, Qwen3-Thinking, Kimi) defined invalidate.py/chain.toml.