nightfairy5831/train_data

Training-Data Pipeline

Prepares and validates training samples for a coding-agent LLM. Each sample is a single agent turn: given the chat history of a coding task, produce a good THOUGHT + one bash command.

Two responses are generated per turn and pitted against each other by a panel of LLM judges (a "duel"):

  • king — a baseline answer produced blind (no knowledge of how the PR was resolved).
  • challenger — an answer produced with the hidden PR information, but instructed to behave as if it were solving the task honestly (see the blind-agent rule in ch.py).

The challenger is the candidate we want to keep when it reliably beats the king.

Pipeline

Run the scripts in order; each consumes the previous stage's output directory.

prompt.py  →  king.py  →  ch.py  →  validate.py  →  report.py
   │            │           │           │              │
 prompt/      answer_*/   answer_*/    score/        report/report.md
 commit/    (king field) (challenger)  _duel.json    final/ refined/ defeat/

1. prompt.py — build the prompt set

  • Samples turns from the HF dataset AlienKevin/SWE-ZERO-12M-trajectories (random offset windows over the full split).
  • For each sampled turn, keeps the chat history up to that point plus the reference answer (the dataset's own next turn).
  • Mirrors the source PR record (issue, files, commit, description) for each instance_id from nebius/SWE-rebench-V2-PRs via duckdb.
  • Writes: prompt/prompts_2000_*.json (≤200 prompts/file) and commit/<instance_id>.json.
  • Run: python prompt.py (requests 2000 prompts)
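
The ≤200-prompts-per-file sharding can be sketched as follows. This is a minimal illustration rather than the code in prompt.py: the helper name `shard_prompts` is hypothetical, and the numeric file suffix is an assumption, since what replaces the `*` in `prompts_2000_*.json` is not specified here.

```python
import json
from pathlib import Path


def shard_prompts(prompts, out_dir, total=2000, max_per_file=200):
    """Write prompts into JSON files holding at most max_per_file entries each.

    Hypothetical helper; the numeric suffix is an assumed naming scheme.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(prompts), max_per_file):
        chunk = prompts[start:start + max_per_file]
        path = out / f"prompts_{total}_{start // max_per_file}.json"
        path.write_text(json.dumps(chunk, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

With 2000 prompts and the default limit, this would produce ten files of 200 prompts each.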

2. king.py — generate king (blind) answers

  • Reads every prompt across all prompt/*.json files (no per-instance cap).
  • Skips instance ids in _EXCLUDE_IDS (e.g. oversized PRs — see report/hugePR.md).
  • Calls the model with no PR information to get the king reply.
  • format_ok() rejects malformed outputs (those starting with <|tool_calls_section_begin|>); those are not saved.
  • Resumable: skips a prompt whose answer file already exists.
  • Writes: answer/<source>_<order>.json with fields {instance_id, king, challenger:"", assistant, messages}. (source = prompt file stem, order = 1-based position within that file.)
  • Run: python king.py
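
The format check and the resume behavior described above might look roughly like this. It is a sketch under stated assumptions: `format_ok` here only implements the one rejection rule named above, and `save_king_answer` is a hypothetical helper, not a function known to exist in king.py.

```python
import json
from pathlib import Path

BAD_PREFIX = "<|tool_calls_section_begin|>"


def format_ok(reply: str) -> bool:
    """Reject replies that start with the raw tool-call marker."""
    return not reply.startswith(BAD_PREFIX)


def save_king_answer(answer_dir, source, order, instance_id, king, assistant, messages):
    """Write <source>_<order>.json unless it already exists or the reply is malformed.

    Hypothetical helper. Returns True only when a new file was written.
    """
    path = Path(answer_dir) / f"{source}_{order}.json"
    if path.exists():        # resumable: never redo a saved prompt
        return False
    if not format_ok(king):  # malformed replies are not saved
        return False
    record = {
        "instance_id": instance_id,
        "king": king,
        "challenger": "",    # filled in later by ch.py
        "assistant": assistant,
        "messages": messages,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return True
```

Checking for the file before calling the model is what makes an interrupted run cheap to restart.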

3. ch.py — generate challenger answers

  • Reads each answer/*.json, loads the matching commit/<instance_id>.json (hidden PR info).
  • Prompts the model with the PR info + chat history to produce a plausible, honest-looking next move (THOUGHT + bash) — constrained by the blind-agent rule so it doesn't leak knowledge it couldn't have observed.
  • Writes: fills the challenger field in each answer/*.json in place.
  • Run: python ch.py
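
The in-place update can be sketched as below. The function name `fill_challengers` and the `generate` callable are hypothetical stand-ins (the real prompt construction and model call live in ch.py), and skipping turns with no commit record is an assumption.

```python
import json
from pathlib import Path


def fill_challengers(answer_dir, commit_dir, generate):
    """Fill the challenger field of each answer file in place.

    `generate(pr_info, messages)` is a hypothetical callable standing in
    for the model call; it returns the challenger reply as a string.
    """
    updated = 0
    for path in sorted(Path(answer_dir).glob("*.json")):
        answer = json.loads(path.read_text(encoding="utf-8"))
        commit_path = Path(commit_dir) / f"{answer['instance_id']}.json"
        if not commit_path.exists():  # assumption: skip turns with no PR record
            continue
        pr_info = json.loads(commit_path.read_text(encoding="utf-8"))
        answer["challenger"] = generate(pr_info, answer["messages"])
        path.write_text(json.dumps(answer, ensure_ascii=False, indent=2), encoding="utf-8")
        updated += 1
    return updated
```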

4. validate.py — judge the duel

  • For each answer/*.json, runs a pairwise comparison (king vs challenger) across a panel of judge models, scoring five dimensions: correctness, grounding, progress, protocol, efficiency.
  • Aggregates per-judge / per-metric scores and runs an acceptance gate (win margin, bootstrap LCB > 0, minimum fraction of turns parsed).
  • DUEL_MODE=1 compares the challenger against the reference assistant answer instead of the king. The default, DUEL_MODE=2, compares it against the king.
  • Writes: score/<answer_file>.json per turn and score/_duel.json (overall verdict).
  • Run: python validate.py
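
As an illustration of the acceptance gate, here is a minimal percentile-bootstrap sketch. It assumes a per-turn margin defined as challenger score minus king score; the function names, the 5% level, and the threshold values are all assumptions for illustration, not the defaults in validate.py or chain.toml.

```python
import random
from statistics import mean


def bootstrap_lcb(margins, n_boot=2000, alpha=0.05, seed=0):
    """Lower confidence bound on the mean margin via a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(margins)
    # Resample the margins with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(margins, k=n)) for _ in range(n_boot))
    return means[int(alpha * n_boot)]


def accept(margins, n_turns, min_margin=0.0, min_parsed=0.8):
    """Apply the three-part gate; the thresholds here are illustrative only.

    `margins` holds one value per successfully parsed turn;
    `n_turns` counts all turns, parsed or not.
    """
    if n_turns == 0 or not margins:
        return False
    parsed_frac = len(margins) / n_turns
    return (
        mean(margins) > min_margin
        and bootstrap_lcb(margins) > 0
        and parsed_frac >= min_parsed
    )
```

Requiring the lower bound to exceed zero, rather than the mean alone, guards against accepting a challenger whose apparent win rests on a handful of noisy turns.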

5. report.py — report + export

  • Reads every per-turn score in score/ and pairs it with the matching answer file, looked up by stem across the answer dirs (answer_0/, answer_2/ — see _find_answer).
  • Writes report/report.md: header stats + a per-turn score table (score, file, action style, the five metric columns).
  • Buckets each parsed turn by its duel score and copies the answer (enriched with score, per-metric metrics, and judge reason) into one of three result dirs:
    • export_final(): score ≥ 80 → final/ (the accepted dataset)
    • export_refined(): 66 ≤ score < 80 → refined/ (borderline, worth refining)
    • export_defeat(): score < 66 → defeat/ (the challenger lost)
    • Turns that failed to parse are counted as parse-fail and are not exported. The thresholds are _FINAL_MIN = 80 and _DEFEAT_MIN = 66. Exports copy the answer files rather than moving them; re-running overwrites earlier exports.
  • Run: python report.py
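
The bucketing rule reduces to two threshold comparisons. The sketch below uses the threshold names and values given above; the `bucket` function itself and the use of `None` to mark a parse failure are assumptions for illustration.

```python
_FINAL_MIN = 80
_DEFEAT_MIN = 66


def bucket(score):
    """Map a duel score to its result directory name.

    A score of None stands for a turn that failed to parse (assumed
    representation); such turns are counted but not exported.
    """
    if score is None:
        return "parse-fail"
    if score >= _FINAL_MIN:
        return "final"
    if score >= _DEFEAT_MIN:
        return "refined"
    return "defeat"
```

Both boundaries are inclusive on the lower side, so a score of exactly 80 lands in final/ and exactly 66 lands in refined/.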

Directories

| Dir | Produced by | Contents |
|---|---|---|
| prompt/ | prompt.py | sampled prompt turns |
| commit/ | prompt.py | source PR records, keyed by instance_id |
| answer_0/, answer_1/, answer_2/ | king.py / ch.py | per-turn king + challenger replies (pending), sharded across answer dirs |
| score/ | validate.py | per-turn judge scores + _duel.json |
| report/ | report.py | report.md (and hugePR.md exclusion notes) |
| final/ | report.py | accepted answers, score ≥ 80 (from score/ + answer_*/) |
| refined/ | report.py | borderline answers, 66 ≤ score < 80 (from score/ + answer_*/) |
| defeat/ | report.py | losing answers, score < 66 (from score/ + answer_*/) |

Configuration

  • .env (loaded by each script): CHUTES_API_KEY (required for the LLM calls), optional CHUTES_BASE_URL, and HF_TOKEN / HUGGINGFACE_API_KEY for prompt.py.
  • chain.toml (optional, read by validate.py): [judge] models / token limits and [duel] gate thresholds. Falls back to built-in defaults if absent.
  • Models: generation uses moonshotai/Kimi-K2.6-TEE; judging uses a multi-model panel (DeepSeek, Qwen3-Thinking, Kimi) defined in validate.py / chain.toml.
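
A minimal sketch of how a script might read these settings once .env has been loaded into the process environment. The `load_settings` function and its return keys are hypothetical, and preferring HF_TOKEN over HUGGINGFACE_API_KEY when both are set is an assumption.

```python
import os


def load_settings(env=None):
    """Collect the environment settings listed above.

    Assumes .env has already been loaded into the environment.
    Hypothetical helper; the return keys are illustrative.
    """
    env = os.environ if env is None else env
    api_key = env.get("CHUTES_API_KEY")
    if not api_key:
        raise RuntimeError("CHUTES_API_KEY is required for the LLM calls")
    return {
        "api_key": api_key,
        "base_url": env.get("CHUTES_BASE_URL"),  # optional, None if unset
        # Either variable may carry the Hugging Face token (used by prompt.py).
        "hf_token": env.get("HF_TOKEN") or env.get("HUGGINGFACE_API_KEY"),
    }
```

Failing early on a missing key avoids discovering the problem partway through a long generation run.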

About

Prepares model training datasets for various training methods, such as SFT and distillation.
