Training-Data Pipeline

Prepares and validates training samples for a coding-agent LLM. Each sample is a single agent turn: given the chat history of a coding task, produce a good THOUGHT + one bash command.

Two responses are generated per turn and pitted against each other by a panel of LLM judges (a "duel"):

king — a baseline answer produced blind (no knowledge of how the PR was resolved).
challenger — an answer produced with the hidden PR information, but instructed to behave as if it were solving the task honestly (see the blind-agent rule in ch.py).

The challenger is the candidate we want to keep when it reliably beats the king.

Pipeline

Run the scripts in order; each consumes the previous stage's output directory.

prompt.py  →  king.py  →  ch.py  →  validate.py  →  report.py
   │            │           │           │              │
 prompt/      answer_*/   answer_*/    score/        report/report.md
 commit/    (king field) (challenger)  _duel.json    final/ refined/ defeat/

1. `prompt.py` — build the prompt set

Samples turns from the HF dataset AlienKevin/SWE-ZERO-12M-trajectories (random offset windows over the full split).
For each sampled turn, keeps the chat history up to that point plus the reference answer (the dataset's own next turn).
Mirrors the source PR record (issue, files, commit, description) for each instance_id from nebius/SWE-rebench-V2-PRs via duckdb.
Writes: prompt/prompts_2000_*.json (≤200 prompts/file) and commit/<instance_id>.json.
Run: python prompt.py (requests 2000 prompts)
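
The per-file cap described above can be pictured with a minimal sketch. It assumes the prompts are already held in a list; the helper names `chunk` and `write_prompt_files` and the numeric file suffix are hypothetical, since the README only gives the pattern prompts_2000_*.json and the limit of 200 prompts per file.

```python
import json
from pathlib import Path

MAX_PER_FILE = 200  # from the README: at most 200 prompts per file


def chunk(items, size):
    """Yield consecutive slices holding at most `size` items each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_prompt_files(prompts, out_dir, prefix="prompts_2000_"):
    """Write prompts into numbered JSON files (the numbering is an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, part in enumerate(chunk(prompts, MAX_PER_FILE)):
        path = out_dir / f"{prefix}{index}.json"
        path.write_text(json.dumps(part, ensure_ascii=False, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

With 2000 prompts this would give ten files of 200 prompts each; a shorter final file appears only when the total is not a multiple of 200.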

2. `king.py` — generate king (blind) answers

Reads every prompt across all prompt/*.json files (no per-instance cap).
Skips instance ids in _EXCLUDE_IDS (e.g. oversized PRs — see report/hugePR.md).
Calls the model with no PR information to get the king reply.
format_ok() rejects malformed outputs, i.e. replies starting with <|tool_calls_section_begin|>; rejected replies are not saved.
Resumable: skips a prompt whose answer file already exists.
Writes: answer/<source>_<order>.json with fields {instance_id, king, challenger:"", assistant, messages}. (source = prompt file stem, order = 1-based position within that file.)
Run: python king.py
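
A rough sketch of how the format check and the resume logic could fit together. Only the rejected prefix, the file naming, and the answer fields come from the README; `process_prompt`, the `generate` callable, and the shape of the `prompt` dict are assumptions made for illustration, not the actual code in king.py.

```python
import json
from pathlib import Path

BAD_PREFIX = "<|tool_calls_section_begin|>"


def format_ok(reply: str) -> bool:
    """Reject replies that start with the raw tool-call marker."""
    return not reply.startswith(BAD_PREFIX)


def process_prompt(answer_dir, source, order, prompt, generate):
    """Return 'skipped', 'rejected', or 'saved' for one prompt.

    `generate` is a hypothetical callable: messages -> reply text.
    `prompt` is assumed to carry instance_id, assistant, and messages.
    """
    path = Path(answer_dir) / f"{source}_{order}.json"
    if path.exists():
        return "skipped"  # resumable: never call the model twice for a saved answer
    reply = generate(prompt["messages"])
    if not format_ok(reply):
        return "rejected"  # malformed replies are not written to disk
    record = {
        "instance_id": prompt["instance_id"],
        "king": reply,
        "challenger": "",  # filled in later by ch.py
        "assistant": prompt["assistant"],
        "messages": prompt["messages"],
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return "saved"
```

Checking for the file before calling the model is what makes a re-run cheap: an interrupted job only pays for the prompts that have no answer yet.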

3. `ch.py` — generate challenger answers

Reads each answer/*.json, loads the matching commit/<instance_id>.json (hidden PR info).
Prompts the model with the PR info + chat history to produce a plausible, honest-looking next move (THOUGHT + bash) — constrained by the blind-agent rule so it doesn't leak knowledge it couldn't have observed.
Writes: fills the challenger field in each answer/*.json in place.
Run: python ch.py
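
The in-place update could look roughly like the sketch below. The function name `fill_challengers` and the `generate(pr_info, messages)` callable are hypothetical, and skipping records whose challenger field is already filled is an assumption; the README only states that the field is written back into each answer file.

```python
import json
from pathlib import Path


def fill_challengers(answer_dir, commit_dir, generate):
    """Fill the empty `challenger` field of each answer file; return the update count.

    `generate` is a hypothetical callable: (pr_info, messages) -> reply text.
    """
    updated = 0
    for path in sorted(Path(answer_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record.get("challenger"):
            continue  # assumption: an already-filled answer is left alone
        commit_path = Path(commit_dir) / f"{record['instance_id']}.json"
        if not commit_path.exists():
            continue  # no hidden PR info available for this instance
        pr_info = json.loads(commit_path.read_text(encoding="utf-8"))
        record["challenger"] = generate(pr_info, record["messages"])
        path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
        updated += 1
    return updated
```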

4. `validate.py` — judge the duel

For each answer/*.json, runs a pairwise comparison (king vs challenger) across a panel of judge models, scoring five dimensions: correctness, grounding, progress, protocol, efficiency.
Aggregates per-judge / per-metric scores and runs an acceptance gate (win margin, bootstrap LCB > 0, minimum fraction of turns parsed).
DUEL_MODE=2 (the default) compares the challenger against the king; DUEL_MODE=1 compares it against the reference assistant turn from the dataset instead.
Writes: score/<answer_file>.json per turn and score/_duel.json (overall verdict).
Run: python validate.py
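
To make the acceptance gate concrete, here is a minimal sketch of a bootstrap lower confidence bound (LCB) check. It assumes each parsed turn yields one margin (challenger score minus king score); the function names, the 5% level, the resample count, and the default thresholds are illustrative assumptions, not the values used in validate.py or chain.toml.

```python
import random
from statistics import mean


def bootstrap_lcb(margins, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap lower bound on the mean margin."""
    rng = random.Random(seed)
    n = len(margins)
    # Resample the margins with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(margins, k=n)) for _ in range(n_resamples))
    return means[int(alpha * n_resamples)]


def acceptance_gate(margins, n_turns, min_margin=0.0, min_parsed_frac=0.8):
    """Apply the three checks: win margin, LCB > 0, fraction of turns parsed."""
    parsed_frac = len(margins) / n_turns if n_turns else 0.0
    if not margins:
        return {"mean_margin": None, "lcb": None,
                "parsed_frac": parsed_frac, "accepted": False}
    avg = mean(margins)
    lcb = bootstrap_lcb(margins)
    accepted = avg > min_margin and lcb > 0 and parsed_frac >= min_parsed_frac
    return {"mean_margin": avg, "lcb": lcb,
            "parsed_frac": parsed_frac, "accepted": accepted}
```

Requiring the lower bound, rather than just the mean, to be positive guards against a challenger that wins on average only because of a handful of lucky turns.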

5. `report.py` — report + export

Reads every per-turn score in score/ and pairs it with the matching answer file, looked up by stem across the answer dirs (answer_0/, answer_2/ — see _find_answer).
Writes report/report.md: header stats + a per-turn score table (score, file, action style, the five metric columns).
Buckets each parsed turn by its duel score and copies the answer (enriched with the score, the per-metric scores, and the judge's reason) into one of three result dirs:
- export_final() — score ≥ 80 → final/ (the accepted dataset)
- export_refined() — 66 ≤ score < 80 → refined/ (borderline, worth refining)
- export_defeat() — score < 66 → defeat/ (the challenger lost)
- Turns that failed to parse are counted as parse-fail and are not exported.
Thresholds are _FINAL_MIN = 80 and _DEFEAT_MIN = 66. Exports copy the answer files (the originals are not removed); re-running overwrites earlier exports.
Run: python report.py
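
The bucketing rule reduces to two threshold comparisons, sketched below. The constants match the README; the helper names `bucket_for` and `tally`, and the use of `None` to mark a turn that failed to parse, are assumptions for illustration.

```python
from collections import Counter

_FINAL_MIN = 80   # score >= 80 goes to final/
_DEFEAT_MIN = 66  # score < 66 goes to defeat/


def bucket_for(score):
    """Map a duel score to its result dir name; None means parse-fail (assumption)."""
    if score is None:
        return None
    if score >= _FINAL_MIN:
        return "final"
    if score >= _DEFEAT_MIN:
        return "refined"
    return "defeat"


def tally(scores):
    """Count turns per bucket, with unparsed turns counted as 'parse-fail'."""
    return Counter(bucket_for(s) or "parse-fail" for s in scores)
```

For example, `tally([92, 80, 71, 40, None])` counts two turns for final, one each for refined and defeat, and one parse-fail.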

Directories

| Dir | Produced by | Contents |
| --- | --- | --- |
| `prompt/` | prompt.py | sampled prompt turns |
| `commit/` | prompt.py | source PR records, keyed by `instance_id` |
| `answer_0/`, `answer_1/`, `answer_2/` | king.py / ch.py | per-turn king + challenger replies (pending), sharded across answer dirs |
| `score/` | validate.py | per-turn judge scores + `_duel.json` |
| `report/` | report.py | `report.md` (and `hugePR.md` exclusion notes) |
| `final/` | report.py | accepted answers, score ≥ 80 (from `score/` + `answer_*/`) |
| `refined/` | report.py | borderline answers, 66 ≤ score < 80 (from `score/` + `answer_*/`) |
| `defeat/` | report.py | losing answers, score < 66 (from `score/` + `answer_*/`) |

Configuration

.env (loaded by each script): CHUTES_API_KEY (required for the LLM calls), optional CHUTES_BASE_URL, and HF_TOKEN / HUGGINGFACE_API_KEY for prompt.py.
chain.toml (optional, read by validate.py): [judge] models / token limits and [duel] gate thresholds. Falls back to built-in defaults if absent.
Models: generation uses moonshotai/Kimi-K2.6-TEE; judging uses a multi-model panel (DeepSeek, Qwen3-Thinking, Kimi) defined in validate.py / chain.toml.
