Merged
6 changes: 3 additions & 3 deletions src/eval/general/data_eng_prompt.txt
@@ -24,7 +24,7 @@ Treat the run as a chain of falsifiable experiments — each iteration must buil
 - \`python dataset_audit.py --data-path <jsonl>\` — **mandatory** hard gate. Fails on test-set contamination or low diversity.
 - \`python train_sft.py --data-path <jsonl> --output-dir experiments/exp_<N>/final_model\` — locked recipe. Whitelisted args only: \`--data-path\`, \`--output-dir\`, \`--max-steps\`, \`--seed\`. (\`$MODEL_TO_TRAIN\` is fixed; no \`--base-model\` flag.)
 - \`python publish_experiment.py --exp-dir experiments/exp_<N>/ [--data-sources "..."] [--promoted | --audit-failed]\` — appends one row to \`$SHARED_LOG_CSV\` and to \`experiments/index.csv\`.
-- \`python evaluate.py --model-path <path> --limit 5 --max-tokens 512\` — small qualitative probe. **Do not use this to optimize against the test set.** Delete the resulting metrics file before publishing so it never enters the archive.
+- \`bash eval.sh --model-path <path> --limit 5 --max-tokens 512\` — small qualitative probe. **Always invoke the scorer via \`bash eval.sh\` (a thin wrapper that forwards all args to \`evaluate.py\`); calling \`python evaluate.py\` directly loses \`/opt/env/local/bin\` from PATH under codex's \`bash -lc\` and you'll hit \`vllm: command not found\`.** **Do not use this to optimize against the test set.** Delete the resulting metrics file before publishing so it never enters the archive.

 ## Teacher vLLM endpoint (for synthetic data)
 An OpenAI-compatible endpoint is available:
@@ -81,8 +81,8 @@ For each experiment exp_<N>:
 5. **TRAIN.** \`python train_sft.py --data-path experiments/exp_<N>/data.jsonl --output-dir experiments/exp_<N>/final_model\`.

 6. **PROBE & WRITE CONCLUSION.** Two-tier probe:
-- **Format probe (n=5, ~30s):** \`python evaluate.py --model-path experiments/exp_<N>/final_model --limit 5 --max-tokens 512 --json-output-file /tmp/format_probe.json\`. Read the inspect_ai log to inspect generation length and whether it stops cleanly after \`ANSWER: X\`.
-- **Quality probe (n=30, ~3 min):** ONLY if format probe shows clean stopping. \`python evaluate.py --model-path experiments/exp_<N>/final_model --limit 30 --max-tokens 1500 --json-output-file /tmp/quality_probe.json\`. Read the inspect_ai per-sample completions to see if the model is reasoning or guessing. n=30 is the smallest sample that lets you observe behavioral differences reliably; n=5 is essentially noise.
+- **Format probe (n=5, ~30s):** \`bash eval.sh --model-path experiments/exp_<N>/final_model --limit 5 --max-tokens 512 --json-output-file /tmp/format_probe.json\`. Read the inspect_ai log to inspect generation length and whether it stops cleanly after \`ANSWER: X\`.
+- **Quality probe (n=30, ~3 min):** ONLY if format probe shows clean stopping. \`bash eval.sh --model-path experiments/exp_<N>/final_model --limit 30 --max-tokens 1500 --json-output-file /tmp/quality_probe.json\`. Read the inspect_ai per-sample completions to see if the model is reasoning or guessing. n=30 is the smallest sample that lets you observe behavioral differences reliably; n=5 is essentially noise.
 Update \`## Conclusion\` in notes.md based on what you observed. **Do not write numeric scores** — describe behavior. Delete both probe files (\`rm -f /tmp/format_probe.json /tmp/quality_probe.json\`) and any inspect_ai logs they produced so they don't leak into the archive.

 7. **PUBLISH.** \`python publish_experiment.py --exp-dir experiments/exp_<N>/ [--data-sources "<hf_id1>,<hf_id2>,synthetic"] [--promoted | --audit-failed]\`. Promoted means you've copied this exp's final_model as the run's final_model: \`rm -rf final_model && cp -r experiments/exp_<N>/final_model final_model\`. (Do NOT use \`ln -sfn\` — the wrapper expects \`final_model/\` to be a real directory.)
17 changes: 17 additions & 0 deletions src/eval/general/eval.sh
@@ -0,0 +1,17 @@
#!/bin/bash
# Wrapper around evaluate.py for data-engineering agent runs.
#
# Why this exists: the bind-mounted Python env at /opt/env contains the
# `vllm` CLI binary at /opt/env/local/bin/vllm, and run_task.sh injects
# that directory into PATH via `apptainer exec --env PATH=...`. However,
# the codex CLI runs every shell command through `bash -lc "..."` (login
# shell), which sources /etc/profile + ~/.bashrc and *overwrites* PATH
# with the container's defaults — stripping out /opt/env/local/bin. As a
# result the agent sees `vllm: command not found` and inspect_ai cannot
# spawn its local vLLM server.
#
# This wrapper re-asserts the bind-mounted env on PATH and forwards all
# arguments to evaluate.py. Agents should call `bash eval.sh ...` instead
# of `python3 evaluate.py ...` for self-evals.
export PATH="/opt/env/local/bin:/opt/env/bin:${PATH}"
exec python3 /home/ben/task/evaluate.py "$@"
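
Review note: the wrapper prepends both directories unconditionally, so nested invocations (or a shell where PATH is already correct) end up with duplicate entries. That is harmless, but if it ever matters, a duplicate-free prepend could look roughly like the sketch below. `prepend_path` and `demo_path` are hypothetical names used only for illustration and are not part of this PR; the sketch works on a plain string so it can be tried without touching the real PATH.

```shell
#!/bin/bash
# Hypothetical helper (not part of this PR): prepend a directory to a
# colon-separated PATH-style string only if it is not already present.
prepend_path() {
    local dir="$1" path="$2"
    case ":${path}:" in
        # Already listed: return the string unchanged.
        *":${dir}:"*) printf '%s\n' "${path}" ;;
        # Not listed: put it first; skip the colon if the string is empty.
        *) printf '%s\n' "${dir}${path:+:${path}}" ;;
    esac
}

# Demonstration on a stand-in value rather than the real PATH.
demo_path="/usr/bin:/bin"
demo_path="$(prepend_path /opt/env/bin "${demo_path}")"
demo_path="$(prepend_path /opt/env/local/bin "${demo_path}")"
# Repeating the call does not add a second copy.
demo_path="$(prepend_path /opt/env/local/bin "${demo_path}")"
echo "${demo_path}"
```

The resulting order matches the wrapper's `export PATH=...` line, with `/opt/env/local/bin` first so the bind-mounted `vllm` binary is found before any container default.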
7 changes: 7 additions & 0 deletions src/run_task.sh
@@ -73,6 +73,13 @@ if [ "$POST_TRAIN_BENCH_PROMPT" = "data_eng_prompt" ]; then
 cp src/eval/general/train_sft.py "${JOB_DIR}/task/"
 cp src/eval/general/dataset_audit.py "${JOB_DIR}/task/"
 cp src/eval/general/publish_experiment.py "${JOB_DIR}/task/"
+# eval.sh wrapper: codex's `bash -lc` overwrites PATH and strips
+# /opt/env/local/bin, so calling `python3 evaluate.py` directly fails
+# to find the bind-mounted `vllm` CLI. This wrapper re-asserts PATH
+# before exec'ing evaluate.py. Agents should `bash eval.sh ...` for
+# self-evals.
+cp src/eval/general/eval.sh "${JOB_DIR}/task/"
+chmod +x "${JOB_DIR}/task/eval.sh"
 mkdir -p "${JOB_DIR}/task/experiments"
 fi
