LLM360 · shauryr · Jun 1, 2026 · May 29, 2026 · Jun 1, 2026
diff --git a/src/eval/general/data_eng_prompt.txt b/src/eval/general/data_eng_prompt.txt
@@ -24,7 +24,7 @@ Treat the run as a chain of falsifiable experiments — each iteration must buil
 - \`python dataset_audit.py --data-path <jsonl>\` — **mandatory** hard gate. Fails on test-set contamination or low diversity.
 - \`python train_sft.py --data-path <jsonl> --output-dir experiments/exp_<N>/final_model\` — locked recipe. Whitelisted args only: \`--data-path\`, \`--output-dir\`, \`--max-steps\`, \`--seed\`. (\`$MODEL_TO_TRAIN\` is fixed; no \`--base-model\` flag.)
 - \`python publish_experiment.py --exp-dir experiments/exp_<N>/ [--data-sources "..."] [--promoted | --audit-failed]\` — appends one row to \`$SHARED_LOG_CSV\` and to \`experiments/index.csv\`.
-- \`python evaluate.py --model-path <path> --limit 5 --max-tokens 512\` — small qualitative probe. **Do not use this to optimize against the test set.** Delete the resulting metrics file before publishing so it never enters the archive.
+- \`bash eval.sh --model-path <path> --limit 5 --max-tokens 512\` — small qualitative probe. **Always invoke the scorer via \`bash eval.sh\`, a thin wrapper that puts \`/opt/env/local/bin\` back on PATH and forwards all args to \`evaluate.py\`. codex's \`bash -lc\` resets PATH, so calling \`python evaluate.py\` directly fails with \`vllm: command not found\`.** **Do not use this to optimize against the test set.** Delete the resulting metrics file before publishing so it never enters the archive.
 
 ## Teacher vLLM endpoint (for synthetic data)
 An OpenAI-compatible endpoint is available:
@@ -81,8 +81,8 @@ For each experiment exp_<N>:
 5. **TRAIN.** \`python train_sft.py --data-path experiments/exp_<N>/data.jsonl --output-dir experiments/exp_<N>/final_model\`.
 
 6. **PROBE & WRITE CONCLUSION.** Two-tier probe:
-   - **Format probe (n=5, ~30s):** \`python evaluate.py --model-path experiments/exp_<N>/final_model --limit 5 --max-tokens 512 --json-output-file /tmp/format_probe.json\`. Read the inspect_ai log to inspect generation length and whether it stops cleanly after \`ANSWER: X\`.
-   - **Quality probe (n=30, ~3 min):** ONLY if format probe shows clean stopping. \`python evaluate.py --model-path experiments/exp_<N>/final_model --limit 30 --max-tokens 1500 --json-output-file /tmp/quality_probe.json\`. Read the inspect_ai per-sample completions to see if the model is reasoning or guessing. n=30 is the smallest sample that lets you observe behavioral differences reliably; n=5 is essentially noise.
+   - **Format probe (n=5, ~30s):** \`bash eval.sh --model-path experiments/exp_<N>/final_model --limit 5 --max-tokens 512 --json-output-file /tmp/format_probe.json\`. Read the inspect_ai log to inspect generation length and whether it stops cleanly after \`ANSWER: X\`.
+   - **Quality probe (n=30, ~3 min):** ONLY if format probe shows clean stopping. \`bash eval.sh --model-path experiments/exp_<N>/final_model --limit 30 --max-tokens 1500 --json-output-file /tmp/quality_probe.json\`. Read the inspect_ai per-sample completions to see if the model is reasoning or guessing. n=30 is the smallest sample that lets you observe behavioral differences reliably; n=5 is essentially noise.
    Update \`## Conclusion\` in notes.md based on what you observed. **Do not write numeric scores** — describe behavior. Delete both probe files (\`rm -f /tmp/format_probe.json /tmp/quality_probe.json\`) and any inspect_ai logs they produced so they don't leak into the archive.
 
 7. **PUBLISH.** \`python publish_experiment.py --exp-dir experiments/exp_<N>/ [--data-sources "<hf_id1>,<hf_id2>,synthetic"] [--promoted | --audit-failed]\`. Promoted means you've copied this exp's final_model as the run's final_model: \`rm -rf final_model && cp -r experiments/exp_<N>/final_model final_model\`. (Do NOT use \`ln -sfn\` — the wrapper expects \`final_model/\` to be a real directory.)
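
The "stops cleanly after `ANSWER: X`" check in step 6 is described as a manual read of the inspect_ai log. As a rough illustration only, the gate between the two probes could be sketched as below. The helper names are hypothetical, and the sketch assumes that `X` is a single capital letter and that every format-probe completion must stop cleanly. Neither assumption is stated in the prompt.

```python
import re

# Assumption: the answer label is a single capital letter, e.g. "ANSWER: B".
_FINAL_ANSWER = re.compile(r"ANSWER:\s*[A-Z]\s*$")


def stops_cleanly(completion: str) -> bool:
    """True if the completion ends right after an ANSWER line, with no trailing text."""
    return _FINAL_ANSWER.search(completion) is not None


def should_run_quality_probe(completions: list[str]) -> bool:
    """Gate the n=30 probe on every format-probe completion stopping cleanly.

    Hypothetical policy: an empty probe or a single runaway generation blocks the gate.
    """
    return bool(completions) and all(stops_cleanly(c) for c in completions)
```

The completions would have to be pulled out of the inspect_ai log by hand or by a separate loader, which this sketch does not attempt.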

diff --git a/src/eval/general/eval.sh b/src/eval/general/eval.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+# Wrapper around evaluate.py for data-engineering agent runs.
+#
+# Why this exists: the bind-mounted Python env at /opt/env contains the
+# `vllm` CLI binary at /opt/env/local/bin/vllm, and run_task.sh injects
+# that directory into PATH via `apptainer exec --env PATH=...`. However,
+# the codex CLI runs every shell command through `bash -lc "..."` (login
+# shell), which sources /etc/profile and the user's login profile (often
+# ~/.bashrc in turn) and *overwrites* PATH with the container's defaults,
+# stripping out /opt/env/local/bin. As a result the agent sees
+# `vllm: command not found` and inspect_ai cannot spawn its local vLLM server.
+#
+# This wrapper re-asserts the bind-mounted env on PATH and forwards all
+# arguments to evaluate.py. Agents should call `bash eval.sh ...` instead
+# of `python3 evaluate.py ...` for self-evals.
+export PATH="/opt/env/local/bin:/opt/env/bin:${PATH}"
+exec python3 /home/ben/task/evaluate.py "$@"
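
To make the failure mode concrete, here is a minimal Python sketch of what the `export PATH=...` line does and how one might confirm it before launching a probe. The two directories come from `eval.sh`. The function names are hypothetical, and the sketch is a diagnostic aid, not part of this change.

```python
import os
import shutil

# Directories taken from eval.sh; everything else here is illustrative.
ENV_DIRS = ("/opt/env/local/bin", "/opt/env/bin")


def reassert_env_path(current_path, env_dirs=ENV_DIRS):
    """Return current_path with env_dirs prepended, mirroring the export line in eval.sh."""
    return ":".join([*env_dirs, current_path])


def find_vllm(path_value):
    """Resolve the vllm CLI against an explicit PATH string; None if it is not found."""
    return shutil.which("vllm", path=path_value)


if __name__ == "__main__":
    login_shell_path = os.environ.get("PATH", "")
    repaired = reassert_env_path(login_shell_path)
    # Under `bash -lc` the first lookup is expected to come back as None.
    print("vllm before:", find_vllm(login_shell_path))
    print("vllm after: ", find_vllm(repaired))
```

Run inside the container under `bash -lc`, a `None` on the first line and a path under `/opt/env/local/bin` on the second would match the diagnosis in the header comment.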
diff --git a/src/run_task.sh b/src/run_task.sh
@@ -73,6 +73,13 @@ if [ "$POST_TRAIN_BENCH_PROMPT" = "data_eng_prompt" ]; then
     cp src/eval/general/train_sft.py "${JOB_DIR}/task/"
     cp src/eval/general/dataset_audit.py "${JOB_DIR}/task/"
     cp src/eval/general/publish_experiment.py "${JOB_DIR}/task/"
+    # eval.sh wrapper: codex's `bash -lc` overwrites PATH and strips
+    # /opt/env/local/bin, so calling `python3 evaluate.py` directly fails
+    # to find the bind-mounted `vllm` CLI. This wrapper re-asserts PATH
+    # before exec'ing evaluate.py. Agents should `bash eval.sh ...` for
+    # self-evals.
+    cp src/eval/general/eval.sh "${JOB_DIR}/task/"
+    chmod +x "${JOB_DIR}/task/eval.sh"
     mkdir -p "${JOB_DIR}/task/experiments"
 fi