LoopX treats benchmark execution as a developer workflow, not only as a
research activity. A benchmark runner should be something a contributor can
inspect, dry-run, diagnose, and improve without reading maintainer .local
state or raw benchmark trajectories.
This document is the stable product entry point for benchmark work. Research
packets and dated route notes still live under
docs/research/long-horizon-agent-benchmarks/, but reusable runner behavior
belongs in loopx/, examples/, and this guide.
The benchmark workflow has four layers:
- Select a benchmark family, task, and arm without exposing private task text or reward leakage.
- Launch through an explicit route contract. The default route is now an exclusive cloud benchmark host where Codex CLI, the benchmark runner, Docker or compatible container runtime, task data, and compact reduction all run in one isolated environment. Split-control routes remain useful for constrained hosts or route research, but they are not the first choice when a dedicated cloud host is available.
- Observe the run through compact handles: pid or job state, readiness re-check, materialization, result or blocker, and cleanup state.
- Ingest only public-safe evidence into LoopX history, ledger, and case analysis.
The user-facing product promise is simple: a developer should be able to tell what ran, why it was allowed, what blocked it, and what can be tried next, without seeing credentials, raw logs, raw trajectories, or local machine paths.
From a fresh checkout:
python3 -m py_compile loopx/*.py loopx/benchmark_core/*.py
python3 examples/benchmark-split-control-remote-executor-smoke.py
loopx benchmark --helpFor a real benchmark slice, use this sequence:
- Run a source and boundary preflight for the target benchmark.
- Prepare or select the benchmark host route:
- prefer the default cloud Codex route when the host is dedicated, has enough CPU/memory/disk, and can run Codex CLI plus containers directly;
- use a split-control route only when host credentials, local policy, or runner constraints make a single cloud host unsuitable.
- Prove the comparison baseline before any treatment claim. The preferred
baseline is real Codex Goal mode, not a hand-rolled polling loop and not a
prompt that merely starts with
/goal. If the runner cannot prove that Codex entered a persistent goal state through a supported Codex surface, park the A/B baseline and continue only readiness, runner, or blocker work. - Produce a launch plan or runner batch only after a fresh readiness re-check.
- Build benchmark-specific command-adapter facts when the route still needs a
LoopX adapter, such as
loopx benchmark terminal-bench-command-adapter terminal-bench. When Terminal-Bench uses a remote executor, first reduce the local-driver request plus private remote launch result through the launch adapter:loopx benchmark terminal-bench-remote-launch-adapter terminal-bench --request-json <private-json> --launch-result-json <private-json>. The launch adapter emits only field presence and compact blocker state; it never executes SSH, Docker, Codex, model calls, uploads, or submits. If a lower-level private runner already produced remote-executor handles, reduce them through a materializer such asloopx benchmark terminal-bench-remote-materializer terminal-bench --handle-manifest-json <private-json>. The materializer emits only handle field presence, never handle values. For Terminal-Bench, handle presence is still not enough: the payload must prove that a local Codex driver owns agent/model/auth and that the remote executor does not require agent or Codex runtime. Then build the execution seam from those facts. The seam should expose both alocal_driver_contractand aremote_sandbox_contract; treat missing command adapters, missing launch-adapter results, missing local-driver materializers, missing sandbox contracts, remote-agent-runtime requirements, or compact reducers as blockers instead of launching a private script. - Run the smallest no-upload dry-run or mini-pair that can answer the current product question.
- Ingest a compact result or precise blocker.
- Update LoopX todo/state so the next developer sees the current route.
Do not start from a raw shell command hidden in a local note. If a benchmark cannot be launched through a documented route, the next product task is to build that route, not to keep a one-off script alive.
Do not wait for a benchmark family to be fully solved before documenting how it runs. Each real run should improve the developer workflow in the same batch as the result or blocker:
- Before launch, write down the intended route, boundary, command shape, expected compact artifacts, and stop conditions.
- During launch, preserve only observable handles that another developer can use: pid or job basename, readiness state, poll command, cleanup state, and compact artifact refs.
- After launch, update the workflow or adapter notes with what changed: product-path pass, precise blocker, cleanup rule, or stale assumption.
- If the run required a private local script, turn the reusable part into a public command, fixture, or adapter contract before relying on it again.
The goal is a living runner guide. Repeated benchmark attempts should make the next attempt easier to launch and debug, not only add more private evidence.
Use the shared snapshot helper for routine polling instead of ad hoc SSH commands:
python3 scripts/benchmark_run_status_snapshot.py \
--run-root <cloud-run-root> \
--label <terminal-bench-run-label> \
--label <skillsbench-run-label> \
--label <swe-marathon-run-label> \
--record-rollout-event \
--goal-id loopx-meta \
--agent-id codex-main-control \
--pattern Working \
--pattern timed\ out \
--prettyThe snapshot reports status.env, pid-file liveness, compact result summaries,
standard artifact presence, and optional keyword booleans for tmux captures. It
does not run ps, read process cmdline/argv, or emit task text, trajectories,
raw logs, or capture content. With
--record-rollout-event, it also appends one aggregate benchmark_status
event to the rollout log so the control plane can see that a poll happened
without seeing host paths or capture text.
For local launchd or similar scheduler jobs, treat benchmark case launches as
one-shot observable runs. Do not use KeepAlive to rerun a case label
automatically: a successful compact closeout or a terminal compact failure must
unload/disable the scheduler label before any rerun is considered. The snapshot
adds an observable_handle_policy per run with these public-safe decisions:
monitor_poll_allowed=true: keep observing the pid/job handle.cleanup_required=true: unload/disable the one-shot label, then ingest the compact closeout or write the precise blocker named bynext_action.blocker_required_before_rerun=true: the observable handle ended or vanished without a compact closeout; write that blocker before any rerun.
This keeps reruns explicit and countable while preserving enough handle state for another developer or heartbeat to resume polling without chat memory.
Each benchmark case should leave a compact LoopX rollout trail, separate from raw Codex sessions or benchmark runner logs. Use the rollout event log when you need to explain the lifecycle of a case or an agent workflow:
quota_should_run: the controller allowed a bounded slice.todo_claim: an agent claimed the work item.benchmark_launch/benchmark_status: a case was launched or polled.validation: a smoke, reducer, or official verifier finished.compact_case_result/compact_blocker: public-safe case outcome.refresh_state/quota_spend: LoopX writeback and spend.codex_session_observed: a private Codex session source exists, but raw session contents and file paths are not recorded.
Core LoopX CLI lifecycle commands append compact events automatically:
todo transitions, refresh-state, and quota should-run /
quota monitor-poll / quota spend-slot / quota void-slot all write to the
rollout event log when they run through the CLI. This makes the log closer to a
Codex session ledger for LoopX itself: agents do not need to remember to
record routine GH control-plane transitions by hand.
Compact benchmark history writes also append automatically when they are
executed through the CLI. loopx history append-benchmark-run --execute, loopx history append-benchmark-result --execute, and the
loopx benchmark ... --execute fixture/ingest path write
compact_case_result or compact_blocker events derived only from the compact
benchmark_run_v0 / benchmark_result_v0 payload. This is the default path for
case-level observation: launch/poll scripts may keep their own private raw
artifacts, but LoopX should observe the public-safe compact writeback.
Use the script for external benchmark transitions, backfills, or operator-side facts that happen outside those core CLI paths:
python3 scripts/goal_rollout_event_log.py append \
--goal-id loopx-meta \
--event-kind compact_case_result \
--agent-id codex-main-control \
--todo-id todo_406bb256efd8 \
--benchmark-id terminal-bench@2.0 \
--case-id build-cython-ext \
--status precise_blocker \
--summary "Official verifier failed before a countable agent result." \
--artifact-ref docs/research/long-horizon-agent-benchmarks/benchmark-case-analysis.jsonSummarize the current trail without reading private sources:
python3 scripts/goal_rollout_event_log.py summarize \
--goal-id loopx-meta \
--limit 12 \
--prettyCodex session JSONL files can help local debugging, but treat them as private source material. Record only their existence/count, never transcript bodies, paths, prompts, tool output, or token-bearing content:
python3 scripts/goal_rollout_event_log.py observe-codex-sessions \
--goal-id loopx-meta \
--agent-id codex-main-controlThe canonical log lives under the LoopX runtime root for the goal, for
example goals/<goal-id>/rollout-event-log.jsonl. It is a local control-plane
artifact, not a raw evidence file to commit. Public docs and ledgers may cite
compact event ids, case ids, run ids, and artifact refs, but must keep raw task
text, logs, trajectories, Codex session transcripts, credentials, and absolute
host paths out.
Use the cloud ECS benchmark host route as the default for Terminal-Bench, SkillsBench, ALE, and other Docker-heavy benchmark families when a dedicated ECS-style cloud VM is available. This is a developer operations pattern: put Codex CLI, benchmark source, container runtime, task data, raw artifacts, and compact reducers on one isolated cloud host, then publish only public-safe control-plane evidence back to LoopX.
| Owner | Responsibility |
|---|---|
| Cloud ECS benchmark host | Codex CLI, benchmark source checkout, runner dependencies, container runtime, task-data staging, no-upload run execution, compact result reduction, and private raw artifacts. |
| LoopX repo | Public-safe route contracts, reducer schemas, benchmark ledger ingestion, todo/state writeback, public docs, and focused smokes. |
| Operator | Codex login on the cloud host, benchmark data gates, upload/leaderboard decisions, and any private-material or credential approval. |
The route is intentionally simpler than split-control: SSH reaches the ECS host, then Codex CLI runs there like a normal developer would. LoopX should not need to understand SSH internals, jump hosts, or remote file bridges in the hot path. It should only record compact route readiness, result handles, blockers, and no-upload boundaries.
Treat remote benchmark host fixes as product assets only after they become one of three reusable surfaces:
- a documented SOP step that another developer can repeat;
- a script or CLI entrypoint that emits compact JSON;
- a reducer that turns private runner state into a public-safe blocker or ingest action.
Runtime-only tweaks such as Docker registry mirrors, loopback proxy sessions, cached base images, source tarballs, dependency prewarm, and run directories are useful operator substrate. Do not let them become hidden LoopX truth. Record only the compact fact: ready, blocked, or needs operator setup. The concrete mirror URL, proxy port, shell history, raw logs, and local host paths stay outside public evidence.
Temporary patches to an upstream benchmark checkout are allowed during route bring-up only when they are explicit, reversible, and recorded as route substrate. Keep the upstream checkout clean enough to rebase: prefer a small patch file, wrapper script, or sidecar module over editing scorer, task truth, or prompts in place. After patching a remote checkout, record the compact metadata another developer needs: upstream repo/ref, patch purpose, files touched by category, whether scoring/task truth changed, validation command, and rollback command. Do not publish the raw patched checkout, task text, raw logs, private paths, or internal hostnames. Once a patch repeats, promote it to one of three durable surfaces: an upstreamable PR, a LoopX wrapper, or a documented benchmark-host SOP.
When the host has both a system disk and a data disk, move every large runner
cache onto the data disk before calling the host ready. Docker data-root
alone is not enough: containerd snapshot state can still fill the system disk
while Docker reports the expected data-root. Verify both paths after setup:
docker info --format '{{.DockerRootDir}}'
df -h / /data /var/lib/docker /var/lib/containerdIf /var/lib/containerd is already large on the system disk, stop Docker and
containerd, copy it to the data disk, replace the original directory with a
symlink or bind mount, then restart both services. Keep the pre-migration copy
only until docker images, docker ps, and a tiny runner smoke prove the
runtime still sees existing images.
Recommended cloud host layout:
loopx-bench/
sources/
runs/
cache/
artifacts/public/
artifacts/private/
Run the bootstrap probe on the cloud host before a benchmark slice:
python3 scripts/benchmark_ecs_bootstrap.py \
--workspace ~/loopx-bench \
--min-free-gib 80 \
--create-dirs \
--prettyAfter the host is ready, decide the benchmark-specific agent runtime layer before launching an official case. The common rule is: agent runtime is preinstalled as a stable layer; the case container only runs the task and the benchmark scorer. Generate the public-safe profile plan:
python3 scripts/benchmark_agent_runtime_layer.py \
--benchmark all \
--workspace ~/loopx-bench \
--prettyTerminal-Bench and SWE-Marathon are Harbor-family profiles. They share the
harbor_codex_cli_tools layer mounted at /opt/harbor-agent-tools.
SkillsBench is a BenchFlow-family profile. It needs a separate
benchflow_js_agent_runtime layer for Node.js and codex-acp, mounted at
/opt/benchflow. Treat verifier dependency prewarm as a separate oracle
concern; the runtime layer is only about making the agent process start without
per-case downloads.
For Harbor-based SWE or Terminal-style runners, avoid downloading nvm, npm packages, or Codex inside every task container. Materialize a host-side preinstalled tools bundle and mount it read-only at Harbor's conventional agent-tools path:
python3 scripts/harbor_agent_tools_bundle.py \
--output ~/loopx-bench/harbor-agent-tools \
--prettyThen add the mount to the Harbor job config or CLI launch:
environment:
type: docker
mounts:
- type: bind
source: ~/loopx-bench/harbor-agent-tools
target: /opt/harbor-agent-tools
read_only: trueFor a Harbor CLI launch, pass the same mount explicitly:
MOUNTS='[{"type":"bind","source":"<workspace>/harbor-agent-tools","target":"/opt/harbor-agent-tools","read_only":true}]'
UV_LINK_MODE=copy uv run --no-default-groups harbor run \
--env docker \
--agent codex-api-key-no-search \
--mounts "$MOUNTS" \
--jobs-dir <run-dir>/jobs \
-p <task-dir>When running SWE-Marathon on a constrained cloud host, prefer invoking Harbor
from an already materialized local Harbor checkout and pointing -p at the
SWE-Marathon task directory. Running uv run harbor from the SWE-Marathon
checkout can fetch Harbor from GitHub and may install default cloud/GPU extras
before the benchmark case starts. That is dependency materialization noise, not
case progress. Use uv run --no-default-groups harbor run for the runner layer
unless the selected environment backend explicitly needs extra Harbor groups.
After any Harbor-family job finishes, reduce the job directory with the generic Harbor reducer and pass the benchmark id explicitly:
python3 scripts/harbor_job_result_reducer.py \
--job-dir <jobs-dir>/<job-name> \
--benchmark-id swe-marathon \
--output-json <jobs-dir>/<job-name>/loopx_harbor_result.compact.json \
--prettyDo not rely on the older Terminal-Bench-named Harbor ingest path for
SWE-Marathon or other path-based Harbor tasks; when Harbor is launched with
-p <task-dir> rather than --dataset, the job lock may not carry enough
benchmark identity to infer the right benchmark_id.
When Harbor's preinstalled in-container Codex surface fails before a real solution attempt, switch to the host Codex Goal custom agent instead of continuing to rebuild or reinstall Codex in every case container:
export PYTHONPATH=<loopx-checkout>:<loopx-checkout>/scripts:${PYTHONPATH:-}
UV_LINK_MODE=copy uv run --no-default-groups harbor run \
--env docker \
--agent-import-path harbor_host_codex_goal_agent:HarborHostCodexGoalAgent \
--agent-kwarg goal_timeout_sec=21600 \
--agent-kwarg task_workdir=/app \
--jobs-dir <run-dir>/jobs \
-p <task-dir>Use the same long timeout envelope for base and treatment arms while measuring
capability ceilings. The host Goal agents default to 21600 seconds; pass an
explicit shorter value only for a timeout-cost experiment and record that tier
in the compact result. LoopX prompt-polling treatments use the same
envelope for each observed round by default, so the controller does not cut off
a still-running Codex Goal turn at the older 900s official timeout before
continuation evidence exists.
For the SkillsBench main-table product-mode comparison, treat the pair contract
as executable input, not prose memory: baseline is raw-codex-autonomous-max5,
treatment is loopx-product-mode, both use the same case and max-5/no-feedback
budget, and the treatment must show LoopX state/todo/replan/CLI lifecycle in
compact counters. Use
loopx.benchmark_core.classify_product_mode_main_table_pair before promoting a
base/test pair into the public comparison table; shallow packet-only or
which-goal-only rows stay analysis assets, not main-table evidence.
When wrapping that launch in tmux, launchd, or a generated run.sh, put the
same PYTHONPATH export inside the wrapper script, not only in the interactive
shell that creates it. Otherwise Harbor can create a job shell but fail before
trial execution with No module named 'harbor_host_codex_goal_agent'. A reusable
wrapper should look like:
#!/usr/bin/env bash
set -euo pipefail
export PYTHONPATH=<loopx-checkout>:<loopx-checkout>/scripts:${PYTHONPATH:-}
python3 - <<'PY'
import importlib
importlib.import_module("harbor_host_codex_goal_agent")
importlib.import_module("loopx.benchmark_core.loop_protocol")
PY
cd <harbor-checkout>
set +e
UV_LINK_MODE=copy uv run --no-default-groups harbor run \
--config <run-dir>/config.json \
> <run-dir>/harbor.run.log 2>&1
status=$?
set -e
printf "exit_status=%s\n" "$status" > <run-dir>/status.env
if [ -d <run-dir>/jobs/<job-name> ]; then
python3 <loopx-checkout>/scripts/harbor_job_result_reducer.py \
--job-dir <run-dir>/jobs/<job-name> \
--benchmark-id swe-marathon \
--output-json <run-dir>/jobs/<job-name>/harbor_job_result.compact.json \
--pretty
fi
exit "$status"Treat a wrapper-level import failure as a launcher blocker, not a benchmark case failure: no trial ran, no verifier reward exists, and the next step is to fix the host-agent import surface before spending another case attempt.
The agent starts Codex native Goal mode on the benchmark host and exposes a
host command named harbor-env-exec. Codex is instructed to call
harbor-env-exec --cwd <task_workdir> -- <command>; set task_workdir per
benchmark family instead of hardcoding /app in runner patches. Commands issued
through that bridge are forwarded to Harbor's environment.exec(), so the
benchmark environment remains the task/scoring surface while Codex login, model
access, tmux, and runtime state stay on the stable host layer.
The Harbor app-server agent drains app-server turn events while the async
runner loop continues to serve harbor-env-exec bridge requests. Its compact
turn file records bridge request count, active-todo exit state,
turn/completed observation, and public-safe solution phase counters without
raw task text, raw logs, raw commands, raw diffs, or raw trajectories. Do not
make turn/completed the benchmark success gate for Harbor-family runs: the
official environment result is the scoring path, the case-local LoopX todo state
is the host exit/closeout source, and app-server completion events are
diagnostic only.
For the current SWE-Marathon experiment axis, keep the same task, model, timeout, environment, jobs directory shape, and no-upload boundary across arms:
- baseline: native Codex app-server Goal mode;
- test: LoopX prompt-driven polling with scheduled continuations.
A single LoopX access-packet run is not the test arm by itself. It adds planning/checkpoint context to the same host app-server Goal worker, but remains packet-only observation unless an outer polling controller records the scheduled rounds:
export PYTHONPATH=<loopx-checkout>:<loopx-checkout>/scripts:${PYTHONPATH:-}
UV_LINK_MODE=copy uv run --no-default-groups harbor run \
--env docker \
--agent-import-path harbor_host_codex_goal_agent:HarborHostCodexGoalAgent \
--agent-kwarg goal_surface=app_server \
--agent-kwarg reasoning_effort=high \
--agent-kwarg app_server_wait_for_completion=true \
--agent-kwarg app_server_response_timeout_sec=90 \
--agent-kwarg goal_timeout_sec=<seconds> \
--agent-kwarg task_workdir=/app \
--agent-kwarg loopx_mode=codex_loopx \
--agent-kwarg loopx_access_packet_mode=compact \
--agent-kwarg loopx_cli_bridge_enabled=true \
--agent-kwarg loopx_goal_id=<goal-id> \
--agent-kwarg loopx_registry_arg=<registry.global.json> \
--agent-kwarg loopx_runtime_root_arg=<runtime-root> \
--agent-kwarg loopx_scan_path=<public-scan-path> \
--agent-kwarg loopx_classification=<public-classification> \
--agent-kwarg loopx_experiment_protocol=packet_only_observation \
--agent-kwarg loopx_max_rounds=5 \
--jobs-dir <run-dir>/jobs \
--job-name <matched-treatment-job-name> \
-p <task-dir>The packet is intentionally lightweight: it gives the host Codex worker LoopX
planning/checkpoint commands and boundary reminders, while the task
solution still goes through harbor-env-exec and the official Harbor verifier
remains authoritative. It is useful as route-safety evidence, but on its own it
must be labeled packet_only_observation, not the prompt-driven test arm.
The prompt-driven test contract is shared in
loopx.benchmark_core.loop_protocol so SkillsBench historical rows,
SWE-Marathon tests, and future Terminal-Bench tests use one semantics instead of
parallel old/new definitions. The contract requires:
max_rounds_budget=5;official_feedback_forwarded=false;- scheduled continuation prompts must not reveal reward, pass/fail, verifier errors, or verifier output;
- the compact result must include public-safe controller/round evidence such as
round_rewards,first_success_round,official_feedback_blinded_count, and the loop contract; - if a Harbor/app-server route cannot provide that controller trace, classify it as packet-only route-safety evidence and do not compare it as test.
best_score is only an executable final-selection policy when the runner also
captures a compact per-round artifact snapshot. Use
loopx.benchmark_core.build_round_artifact_restore_plan to combine
round_rewards with public-safe snapshot handles. If the best scored round is
not the final round and has no restore-ready snapshot handle, keep the run as
offline analysis evidence and record missing_snapshot_for_best_round instead
of claiming the runner can submit or verify the best round.
When launching the actual test arm, the outer controller should set
loopx_experiment_protocol=max5_blind_loop_no_feedback, inject a fresh
LoopX packet before each scheduled prompt, keep official feedback
blinded, and record the public-safe controller trace. The route name for new
SWE-Marathon/Terminal-Bench work is loopx-prompt-polling-test; the old
SkillsBench route name loopx-blind-loop-treatment is a backward-
compatible alias for the same no-feedback polling semantics.
For Harbor-family runners, the host agent enables this controller when the experiment protocol is explicit:
--agent-kwarg loopx_mode=codex_loopx \
--agent-kwarg loopx_access_packet_mode=compact \
--agent-kwarg loopx_experiment_protocol=max5_blind_loop_no_feedback \
--agent-kwarg loopx_max_rounds=5 \
--agent-kwarg loopx_prompt_polling_rounds=5 \
--agent-kwarg loopx_prompt_polling_round_timeout_sec=21600That path starts native Codex app-server Goal once, then uses follow-up
turn/start calls in the same thread for scheduled continuation prompts. It
does not expose official reward, pass/fail status, verifier errors, or verifier
output to the worker. The per-round timeout is separate from the full job
timeout: if a single app-server turn does not hand control back, the controller
must close out with a compact
harbor_prompt_polling_round_timeout_before_completion blocker instead of
waiting for the whole job timeout. If these controller fields are missing from
compact evidence, classify the run as packet-only observation.
The treatment path must also initialize the official case-local LoopX product
lifecycle before the worker starts. Harbor's host agent installs or reuses the
real loopx CLI at /app/.local/bin/loopx, bootstraps a case-local registry
under /app/.loopx/, registers codex-benchmark-agent, seeds one open case
todo through loopx todo add, and records public-safe rollout events. The
host must not claim or complete the case todo for the worker. The product-path
treatment proof is prompt-driven: before planning or editing, the worker should
call the case-local CLI through harbor-env-exec for quota should-run and
todo claim/todo update. The controller may still run the same case-local
CLI as deterministic preflight, scheduler, and closeout fallback:
doctor, status, quota should-run, refresh-state, and
quota spend-slot. That scheduler route is not sufficient for a strict
treatment claim by itself.
Compact evidence must distinguish both surfaces:
loopx_prompt_driven_case_cli_call_count,
loopx_prompt_driven_event_counts,
loopx_prompt_driven_lifecycle_observed, and
loopx_prompt_driven_trace.public.json for worker self-calls, plus
loopx_case_scheduler_command_count,
loopx_case_rollout_event_counts, and
loopx_case_rollout_trace.public.json for controller fallback. If the
prompt-driven lifecycle is absent, classify the run with
prompt_driven_loopx_lifecycle_absent instead of claiming uplift.
SWE-Marathon closeouts should also expose loopx_solution_phase_counters:
coarse edit/build/test/verify command counts, self-declared-done count, and
final active-todo count only, never raw commands, diffs, logs, task text, or
verifier output.
Global LoopX commands are optional context only; they must not select
todos for the benchmark case. This keeps parallel cases isolated and prevents
the main project goal or side-agent lane from leaking into benchmark treatment
control.
This is not a submit/upload path and should still be reduced to compact public evidence before ledger ingestion.
The Harbor bundle requires codex and rg. curl is intentionally optional:
host-copied dynamic curl binaries can fail inside Ubuntu task images because of
shared-library differences. Use the task image's curl, a static curl, or
--include-curl only when a runner explicitly depends on it. Before an
official attempt, run a container-local preflight equivalent to:
PATH=/opt/harbor-agent-tools/bin:$PATH \
command -v codex >/dev/null && codex --version >/dev/null && \
command -v rg >/dev/null && rg --version >/dev/nullPrefer Harbor's preinstalled Codex agent variant when available, for example
codex-api-key-no-search, because it prefixes /opt/harbor-agent-tools/bin
during both setup and execution. A plain codex agent may pass setup if it
finds the bundle, but still fail execution if the runner shell resets PATH.
If the task container cannot reach the model endpoint after this, classify that
as agent egress/proxy readiness, not as nvm/npm dependency materialization.
For SkillsBench, do not let every task container download Node.js from
nodejs.org and then npm install the ACP agent. Prewarm a BenchFlow-family
runtime layer once, mount it at /opt/benchflow, prefix
/opt/benchflow/bin:/opt/benchflow/js-agents/bin:/opt/benchflow/node/bin, and
run the codex-acp launch preflight before a real case. Until that preflight is
green, classify SkillsBench as agent-runtime readiness blocked rather than
spending more official attempts.
Use the SkillsBench materializer with host-side cached sources:
python3 scripts/skillsbench_agent_runtime_layer.py \
--output ~/loopx-bench/benchflow-agent-runtime \
--node-root ~/loopx-bench/cache/node-v22.20.0-linux-x64 \
--codex-acp-bin ~/loopx-bench/cache/codex-acp \
--prettyIf the host has no cached codex-acp binary but may use network outside the
case container, use --use-default-codex-acp-package once during host
bootstrap. Record that as host dependency materialization, not as a benchmark
case step.
The probe checks command presence, Docker server reachability, disk budget, and the standard workspace shape. It intentionally emits only command names, version first lines, booleans, counts, and the workspace basename.
Benchmark source materialization should stay close to upstream:
- prefer a real git checkout or fork when the benchmark source must be patched or rebased;
- if a source tree is only a materialized copy, add a
.loopx-upstreammarker with upstream repo and commit, and do not treat it as a fork branch; - keep wrapper scripts, reducer sidecars, and runbooks in this repository unless the change is clearly upstreamable;
- never mix temporary runner probes, raw evidence, local auth setup, or private benchmark artifacts into upstream benchmark source trees.
Some benchmark checkouts need a small remote-host patch before they are usable as a repeated developer runner. Treat that as a first-class, replayable checkout patch, not as a hidden shell edit:
- Keep an
upstream-cleancheckout or source archive with a recorded upstream repo and commit. Do not patch task text, prompts, scorers, hidden tests, or official result parsing. - Apply the patch only to a run-work checkout or a clearly marked remote checkout. The patch must be generated by a LoopX script or a compact patch artifact that another developer can re-run after a fresh checkout.
- Record the patch command, upstream commit, patch purpose, and validation smoke in this repo. Record only compact handles in public docs; raw panes, logs, task text, trajectories, verifier output, and host paths stay private.
- Classify the patch as one of:
runner_startup_patch,dependency_source_patch,task_image_bootstrap_patch, ortemporary_upstream_candidate_patch. Anything outside those classes needs separate review before it enters the benchmark route. - After upstream updates, recreate the run-work checkout, replay the patch command, rerun the focused smoke, and update the active case-status note. Do not keep layering manual edits on a stale remote tree.
For Terminal-Bench, the current reusable startup patch is the no-rebuild guard:
python3 scripts/terminal_bench_no_rebuild_guard.py \
--terminal-bench-root <terminal-bench-checkout> \
--apply --prettyThis patch teaches older checkout shapes to run Compose with --no-build when
the operator has explicitly selected --no-rebuild. It is allowed because it
only changes runner startup behavior around prewarmed images; it does not
change task contents, scoring, tests, prompts, or result reduction. Pair it
with examples/terminal-bench-no-rebuild-guard-smoke.py before relying on it
for a real case.
For cloud hosts that cannot reliably fetch public Git sources, stage sources as archives from the operator machine. Prefer one compressed archive over many small-file transfers, and on macOS disable copyfile metadata and xattrs:
COPYFILE_DISABLE=1 tar --no-xattrs -C /tmp -czf benchmark-source.tgz upstream-checkout
scp benchmark-source.tgz "$BENCHMARK_HOST_ALIAS":~/loopx-bench/cache/After extraction, verify the upstream commit and clean status. If the staged
archive includes .git and Git reports a dubious ownership boundary, add a
bounded remote safe.directory entry for that checkout or restage it as a
source-only archive with a .loopx-upstream marker. Do not commit the
host-specific path or exception.
When a benchmark pins a runner dependency to a public Git repository and the
cloud host cannot fetch that dependency, stage the dependency source separately
and patch only a temporary run-work copy to use a local path source. Keep the
upstream-clean checkout unchanged, record the dependency commit, and classify
the result as dependency readiness or a precise dependency-fetch blocker. Do
not patch official scorer, task, prompt, or runner behavior merely to work
around network fetch.
Avoid putting uvx --from git+https://... or equivalent Git dependency fetches
in the hot path of repeated benchmark case launches. A wrapper can return
success while the detached runner is still stuck in dependency acquisition, so
the public signal must include job-root/result materialization, not only
process start. Prefer a pre-materialized runner checkout with a local virtual
environment, or stage the dependency source from the operator machine and run
from that checkout. If the process tree shows the runner blocked in Git fetch
before job materialization, classify it as dependency-fetch readiness, not a
benchmark score result.
Remote bootstrap snippets should assume a small base image: use grep, find,
git, python, and the benchmark runner itself unless the bootstrap probe has
confirmed extra tools such as rg.
For Terminal-Bench, the first product-path launcher should be no-upload and probe-only:
python3 scripts/terminal_bench_no_upload_smoke.py \
--task-id hello-world \
--jobs-dir ~/loopx-bench/runs/terminal-bench/jobs \
--run-root ~/loopx-bench/runs/terminal-bench/no-upload-smoke \
--prettyThat command is a dry-run by default. Add --execute only after Codex auth,
network, Docker, source, and task-data readiness are known. It emits command
shape and boundary facts, not argv values or raw runner output.
For direct tb run smoke runs on the cloud host, keep the runner invocation
boring and Docker-compose-safe:
-
verify the task directory before launching. Some materialized Terminal-Bench checkouts store tasks under
original-tasks/, not the CLI defaulttasks/. For those checkouts pass--dataset-path original-tasks; a correct task id with the wrong dataset path fails before the agent reaches the case; -
use an all-lowercase run id with only letters, digits, hyphens, or underscores; Docker Compose rejects project names with uppercase timestamp separators such as
T. Generate it with the public-safe guard instead of hand-formatting timestamps:python3 scripts/terminal_bench_safe_run_id.py \ --prefix <task-id>-host-codex-goal \ --pretty
-
pass
--output-pathas the parent runs directory and let Terminal-Bench create the run-id subdirectory itself; pre-creating that directory can make the runner think it is resuming a run and fail before execution; -
keep
--no-upload-resultsexplicit for every developer smoke; -
start with
--no-rebuildonly after the task image already exists, otherwise record the image/build blocker instead of hiding it behind a score result.
Classify pre-agent failures before comparing model or LoopX behavior:
- invalid or unsafe run ids are launch-shape blockers. Normalize them with
scripts/terminal_bench_safe_run_id.py; do not hand-format timestamps with uppercase separators that Docker Compose can reject; --no-rebuildis a runner-startup optimization, not a readiness proof. Use it only after the task image exists locally and the no-rebuild guard has made Compose use--no-build; otherwise classify the attempt as task-image or runner materialization blocked;- missing job roots, job locks, trial directories, or compact result files are job-materialization blockers. Do not report them as agent failures, benchmark score failures, or verifier failures until the runner has created a job and a task environment that the agent actually reached.
Before trusting --no-rebuild, guard the local Terminal-Bench checkout:
python3 scripts/terminal_bench_no_rebuild_guard.py \
--terminal-bench-root <terminal-bench-checkout> \
--apply --prettyTerminal-Bench skips the explicit docker compose build step when
--no-rebuild is set, but older checkout shapes still call
docker compose up -d without --no-build. Because task compose files usually
include build:, Compose can silently rebuild anyway and strand a case in
BuildKit. The guard is a local runner-startup patch only: it does not change
task text, scoring, verifier behavior, or official result parsing.
If compose up --no-build starts the prewarmed image but Terminal-Bench reports
that runner utilities such as tmux or asciinema are missing, derive a
task-image bootstrap layer once and retag it to the Terminal-Bench client image
name:
python3 scripts/terminal_bench_task_image_bootstrap.py \
--source-image <prebuilt-task-image> \
--target-image tb__<task-id>__client \
--work-dir <workspace>/image-bootstrap/<task-id> \
--network-host \
--execute --prettyUse a bounded timeout and a known mirror for apt-based images. This is still a task-image startup prerequisite, not per-case agent runtime installation and not a scoring or verifier change.
When the task image is ready but Codex auth and runtime should stay on the benchmark host, run Terminal-Bench through the host Codex Goal custom agent:
export PYTHONPATH=<loopx-checkout>/scripts:${PYTHONPATH:-}
RUN_ID=$(python3 <loopx-checkout>/scripts/terminal_bench_safe_run_id.py \
--prefix <task-id>-host-codex-goal | python3 -c 'import json,sys; print(json.load(sys.stdin)["safe_run_id"])')
tb run \
--dataset-path <tasks-dir> \
--task-id <task-id> \
--output-path <run-parent> \
--run-id "$RUN_ID" \
--no-upload-results \
--no-rebuild \
--agent-import-path terminal_bench_host_codex_goal_agent:HostCodexGoalAgent \
--agent-kwarg goal_surface=app_server \
--agent-kwarg goal_timeout_sec=21600This uses Codex native Goal mode on the host through the app-server Goal API
and instructs it to operate on the task container through docker exec. It
keeps Codex login state, model access, and agent runtime outside the benchmark
case container while the container remains responsible for task files and
official tests. The TUI /goal surface is a manual fallback only; do not count
it as the default baseline when app-server thread/goal/set,
thread/goal/get, and turn/start are available.
For a LoopX prompt-polling treatment arm, keep the same host agent and explicitly request the case lifecycle packet:
tb run \
... \
--agent-import-path terminal_bench_host_codex_goal_agent:HostCodexGoalAgent \
--agent-kwarg goal_surface=app_server \
--agent-kwarg goal_timeout_sec=21600 \
--agent-kwarg loopx_mode=codex_loopx \
--agent-kwarg loopx_access_packet_mode=compact \
--agent-kwarg loopx_case_id=<task-id> \
--agent-kwarg loopx_arm_id=loopx_prompt_polling_test \
--agent-kwarg loopx_max_rounds=5This injects the shared benchmark_case_lifecycle_contract into the worker
prompt and compact app-server metadata. A Terminal-Bench treatment run remains
incomplete evidence until the per-case LoopX lifecycle can be observed:
quota_should_run, todo_claim_or_update, bounded work/continuation, official
case result or validation, refresh_state, and quota_spend.
The app-server host agent treats the case-local LoopX active todo state as the
treatment completion source of truth. The agent should mark the case todo done
when the task is complete; the host exits only after it confirms that no
case-local active todo remains. It drains app-server events opportunistically
and writes a compact turn file with turn_completed_observed,
assistant-message counters, and completion_source_of_truth. The official
verifier remains the score authority; do not add a second completion file or
hidden marker for the agent to maintain.
When a Terminal-Bench launch produces only startup or materialization state, reduce it before writing LoopX evidence:
python3 scripts/terminal_bench_compose_startup_reducer.py \
--post-launch-json ~/loopx-bench/runs/terminal-bench/no-upload-smoke/post_launch_summary.public.json \
--prettyThe reducer classifies compact startup blockers such as missing jobs directory, missing job root, missing job lock, ended worker without trial result, or stale active job without trial result. It does not read raw logs, task text, trajectories, credentials, or command argv. If a blocker repeats, improve the SOP or script in the same batch instead of preserving a private one-off shell fragment.
When Terminal-Bench reaches official closeout, reduce the official result before
writing the run ledger. The official results.json may include trial-level
fields such as task instruction, parser details, and recording paths, so use the
metadata-only route by default:
python3 scripts/terminal_bench_official_result_reducer.py \
--metadata-only \
--run-metadata-json <terminal-bench-run>/run_metadata.json \
--mode terminal_bench_host_codex_app_server_goal \
--prettyThe reducer emits both terminal_bench_official_result_reducer_v0 and a
compact benchmark_run_v0 projection suitable for
loopx benchmark run-ledger-upsert. If a run needs results.json, pass
it explicitly and rely only on the reducer's top-level summary / allowlisted
trial counters; never publish trial instruction, parser output, recording
paths, raw logs, task text, trajectories, or command argv.
For Harbor-backed Terminal-Bench launches on a cloud host, keep the operator loop explicit:
- Launch inside
tmuxwith a lowercase job name and private runner log. - Write
status.envand a bounded public file list even on non-zero exit; useset +earound the runner so the status file is not skipped. - Treat wrapper exit code, process state, job root, job lock, and compact
result as separate facts.
rc=0for the wrapper only means the wrapper completed; it does not prove the benchmark case reached the agent. - Validate that the task filter matches the selected dataset before spending an agent attempt. A filter that matches zero tasks is a launch-shape blocker, not model or benchmark performance.
- If
tasks/<name>is passed as a relative path, run Harbor from the checkout that actually contains thattasks/tree. A wrong current directory can fail before Docker or Codex start and should be fixed locally, not reported upstream.
For SkillsBench, prove the verifier dependency substrate before claiming a
no-upload task result. A timeout-looking failure can actually be a missing
verifier launcher surface: the verifier may need minimal Python, pip, curl,
certificates, uv, and uvx before it can run the official oracle sanity path.
Do not repair that first by globally extending timeouts.
For the Codex baseline arm, use the native app-server Goal route instead of the
older slash-prefix experiment. The SkillsBench route name is
codex-app-server-goal-baseline; it requires host Codex app-server Goal
methods thread/start, thread/goal/set, thread/goal/get, and turn/start.
Generate the public-safe plan first:
python3 scripts/skillsbench_automation_loop.py \
--task-id llm-prefix-cache-replay \
--route codex-app-server-goal-baseline \
--plan-onlyThe plan must show
agent_execution_mode=host_codex_app_server_goal_worker,
codex_app_server_goal_worker_turn_start_required=true, and
codex_app_server_goal_worker_remote_command_file_bridge_required=true.
By default it should also show
codex_app_server_goal_worker_remote_command_file_bridge_ready=false and
codex_app_server_goal_worker_runner_integration_ready=false.
There are two distinct gates for a real scored launch:
- Materialize a bounded command/file bridge so the host Codex app-server Goal worker can operate on the BenchFlow sandbox where task files and edits live. A host cwd that cannot see the sandbox is not a valid fallback.
- Wire that host worker through BenchFlow's ACP transport so official task staging and verification still run through BenchFlow.
The second gate is implemented through the host-local ACP relay. In
codex-app-server-goal-baseline, pass --host-local-acp-launch only after the
bridge probe is green; the launcher then starts
scripts/skillsbench_local_acp_relay.py --app-server-goal-worker, and the
relay delegates each ACP session/prompt to
scripts/skillsbench_host_codex_goal_worker.py. This preserves BenchFlow as
the official task stager/verifier while Codex runs through native app-server
Goal methods on the host:
python3 scripts/skillsbench_automation_loop.py \
--task-id llm-prefix-cache-replay \
--route codex-app-server-goal-baseline \
--remote-command-file-bridge-ready \
--host-local-acp-launch \
--plan-onlyWith both flags, the plan should show
codex_app_server_goal_worker_remote_command_file_bridge_ready=true and
codex_app_server_goal_worker_runner_integration_ready=true. Without
--host-local-acp-launch, a real launch must still fail closed as
SkillsBenchNativeGoalWorkerIntegrationPending.
A real remote launch also has a single-checkout invariant: the wrapper must use
the same LoopX checkout for cd, PYTHONPATH, and the executable
script path. Do not rely on PYTHONPATH alone to override a relative
scripts/skillsbench_automation_loop.py from an older current directory; that
can produce official results while silently dropping the public controller and
worker traces. Prefer an immutable tool snapshot and an absolute script path:
#!/usr/bin/env bash
set -euo pipefail
TOOL_ROOT=<loopx-tool-snapshot>
cd "$TOOL_ROOT"
export PYTHONPATH="$TOOL_ROOT:$TOOL_ROOT/scripts:${PYTHONPATH:-}"
python3 "$TOOL_ROOT/scripts/skillsbench_automation_loop.py" \
--task-id llm-prefix-cache-replay \
--route codex-app-server-goal-baseline \
--remote-command-file-bridge-ready \
--host-local-acp-launch \
--jobs-dir <run-dir>/jobs \
--job-name <job-name> \
--app-server-reasoning-effort highBefore comparing scores, confirm that the public closeout contains
loopx_controller_trace.public.json and
app_server_goal_worker_traces/*.compact.json. A valid native app-server
Goal baseline must show at least goal_get_present=true and
turn_id_present=true in a public worker trace, or it must close with a precise
worker-trace blocker such as worker_prompt_received_no_turn_trace. If the
official result exists but those trace files are absent, classify it as a
tool-snapshot or launcher-persistence problem, not solver-quality evidence.
A full launch of this route must fail closed rather than falling back to
codex-acp, a slash-prefix /goal prompt, or a host-only workspace. The
host-side worker surface is:
python3 scripts/skillsbench_host_codex_goal_worker.py \
--task-id <task-id> \
--contract-onlyWhen the worker is launched from a benchmark host wrapper rather than from the LoopX checkout, set the import path explicitly so shared driver modules are resolved from the shipped checkout:
PYTHONPATH=<loopx-checkout>:<loopx-checkout>/scripts \
python3 <loopx-checkout>/scripts/skillsbench_host_codex_goal_worker.py \
--task-id <task-id> \
--contract-onlyWhen used for a private case, the same worker reads the private prompt file and
workspace path on the benchmark host, invokes Codex app-server Goal mode, waits
for turn/completed, and writes the assistant response only to a private
response file for the surrounding runner. The public JSON records compact turn
proof, assistant-message hash/size, and method counters only. Do not copy raw
task text, raw assistant response, raw trajectory, raw logs, LoopX state,
credentials, or host paths into the compact result. Keep
codex-goal-mode-baseline for historical slash-prefix probes only; it is not a
scored Codex Goal baseline.
For a scored SkillsBench route, the worker should be called with an explicit private output target:
python3 scripts/skillsbench_host_codex_goal_worker.py \
--task-id <task-id> \
--work-dir <private-case-workdir> \
--prompt-file <private-prompt-file> \
--response-text-file <private-agent-response-file> \
--output-json <private-compact-worker-json>For the LoopX treatment arm, pass the same private files plus the per-case/arm lifecycle packet parameters. The packet is public-safe control context only: it names the isolated case goal, required lifecycle events, and round budget, while the official SkillsBench verifier remains authoritative and hidden from the agent loop:
python3 scripts/skillsbench_host_codex_goal_worker.py \
--task-id <task-id> \
--work-dir <private-case-workdir> \
--prompt-file <private-prompt-file> \
--response-text-file <private-agent-response-file> \
--output-json <private-compact-worker-json> \
--loopx-mode codex_loopx \
--loopx-access-packet-mode compact \
--loopx-case-id <task-id> \
--loopx-arm-id loopx_prompt_polling_test \
--loopx-max-rounds 5The compact worker JSON must then show
loopx_case_lifecycle_packet_injected=true and a
benchmark_case_lifecycle_contract with
case_isolation_scope=per_benchmark_case_arm. A baseline run should keep
loopx_access_packet_mode=none and should not include LoopX
lifecycle state.
The compact worker JSON is safe to inspect for lifecycle debugging, but the response text file is private task execution material and must stay out of public docs, ledgers, rollout logs, and PRs.
For the host-local ACP relay, materialize public trace as early as the relay
lifecycle, not only after a completed worker turn. The relay should write
compact relay_lifecycle traces for initialize, session_new, and
prompt_received when --worker-public-trace-dir is configured. These traces
are pure observation: they record stage names and boundary booleans only, never
task text, ACP payloads, stdout/stderr, session ids, response text, or host
paths. Reducers must not treat lifecycle-only traces as solver evidence. A
status such as worker_connected_no_prompt_trace or
worker_prompt_received_no_turn_trace remains a failed native-worker evidence
check; it only narrows attribution from "trace directory missing" to "BenchFlow
connected but did not reach a countable Goal worker turn."
Use --remote-command-file-bridge-ready only after a public-safe bridge probe
has passed. That flag updates only the plan's bridge readiness fields;
codex_app_server_goal_worker_runner_integration_ready must remain false
until the BenchFlow transport is requested through --host-local-acp-launch.
If the full route exits with
SkillsBenchNativeGoalWorkerIntegrationPending, the next fix belongs in the
host-worker-to-ACP transport, not in verifier timeout, Docker setup, or model
behavior.
Preview the public-safe prewarm plan:
python3 scripts/skillsbench_verifier_prewarm_plan.py \
--task-id hello-world \
--prettyThe plan is deliberately not an upstream patch. Apply it only to a temporary
task copy, wrapper layer, or derived sandbox image, then run a one-attempt
oracle no-upload sanity task. Claim SkillsBench case readiness only after the
oracle run reaches reward 1.0 with verifier errors cleared. If the sanity run
still times out after the dependency substrate is present, classify it as a
real verifier timeout and consider a bounded timeout increase for that tier.
SkillsBench runner exceptions need a second look before they become blockers.
BenchFlow can sometimes write official result.json/timing.json before a
later runner exception. In that case, reduce the official compact result and
mark the runner exception as recovery metadata instead of losing the result.
If no official result exists, close out with a compact runner-error blocker and
do not infer verifier or model behavior from raw logs.
Keep these boundaries:
- do not modify official task truth, scorer, prompt, or leaderboard behavior;
- do not publish raw verifier output, task text, trajectories, local paths, or remote run directories;
- record only compact fields such as dependency-prewarm ready/blocked, oracle
sanity pass/fail, and the next blocker label
skillsbench_verifier_dependency_prewarm_required.
Open an upstream benchmark issue only after ruling out local route mistakes: wrong current directory, wrong config schema, missing data-root migration, missing runner dependency prewarm, stale LoopX tool copy, missing Codex auth, or a launcher that failed to write status. The issue should include a compact reproduction command shape, upstream commit, runner version, no-upload boundary, and sanitized blocker label. Do not paste raw task text, raw logs, trajectories, verifier output, credentials, private hostnames, or local paths.
Good issue candidates are repeated upstream-close failures where the command schema and working directory match the README, dependencies are pre-materialized or publicly reachable, and the runner still fails before a compact result for reasons the benchmark maintainers own.
When the benchmark host is reached through a jump host, GSSAPI, or another access path with expensive handshakes, do not make every probe open a fresh SSH session. Keep one SSH multiplexed master warm for the benchmark slice and run remote commands through that connection. This is an operator workflow convention, not a LoopX protocol requirement.
For repeated benchmark work, prefer a host-local SSH config stanza instead of spelling the multiplexing flags on every command:
Host benchmark-host-alias
ControlMaster auto
ControlPath ~/.ssh/cm/%C
ControlPersist 8h
ServerAliveInterval 30
ServerAliveCountMax 6
BatchMode yes
LogLevel ERRORUse a hashed ControlPath such as %C so the socket does not expose host
names and is unlikely to exceed path-length limits. BatchMode yes keeps
automation from hanging on an interactive auth prompt; LogLevel ERROR avoids
known-host chatter in compact run logs when the operator intentionally uses an
ephemeral known-host policy. Keep the real host name, jump path, identity file,
and control socket directory in private operator config.
BENCHMARK_HOST_ALIAS=<your-ssh-config-alias>
mkdir -p ~/.ssh/cm
chmod 700 ~/.ssh/cm
ssh -MNf "$BENCHMARK_HOST_ALIAS" || ssh -O check "$BENCHMARK_HOST_ALIAS"
ssh -O check "$BENCHMARK_HOST_ALIAS"
ssh "$BENCHMARK_HOST_ALIAS" 'hostname && docker --version && codex --version'Keep commands through the master connection mostly serial when the access path
is sensitive to concurrent authentication. Do not commit SSH aliases, host
names, private keys, jump-host details, raw shell history, or local control-path
values into public benchmark evidence. Public docs should preserve the shape:
create or reuse one master, route bounded probes and launch commands through it,
then let ControlPersist expire or close it explicitly with ssh -O exit when
the benchmark slice is done.
Long-running benchmark jobs should not live in a foreground SSH session. Start a
stable remote tmux session and send benchmark launch commands into it, then
poll with capture-pane or compact artifact files:
BENCHMARK_TMUX_SESSION=gh-bench
ssh "$BENCHMARK_HOST_ALIAS" \
'tmux has-session -t gh-bench 2>/dev/null || tmux new-session -d -s gh-bench -c "$HOME"'
ssh "$BENCHMARK_HOST_ALIAS" \
'tmux send-keys -t gh-bench "cd benchmark-work && ./run-no-upload-smoke.sh" C-m'
ssh "$BENCHMARK_HOST_ALIAS" \
'tmux capture-pane -pt gh-bench -S -120'This gives the operator a durable remote workspace even if the local Codex app,
laptop network, or SSH master connection restarts. Treat tmux as benchmark
host bootstrap tooling: installing it on the host is an operations step, not a
benchmark result. Public evidence may say that a long run used a persistent
remote session; raw panes, host paths, task text, verifier output, and command
history still stay private.
The primary comparison target is Codex Goal mode running on the benchmark host. A benchmark route may call itself a Codex Goal baseline only when it has evidence for all of the following:
- the installed Codex build exposes
features.goals=trueor an equivalent enabled Goal feature; - the runner starts Goal mode through a supported Codex surface. Prefer the
Codex app-server goal API for automation: initialize with
capabilities.experimentalApi=true,thread/startthe benchmark workspace, then callthread/goal/setwithobjective,status: active, and an optionaltokenBudget. The interactive CLI slash command/goalremains the manual fallback, not the preferred benchmark automation seam; - the run evidence shows a persistent goal attached to the active thread, not
only a prompt string whose first token is
/goal; - the route does not add LoopX state, access packets, reward feedback, or polling semantics to the baseline arm.
The comparable LoopX test arm is not the native Codex Goal baseline with one extra packet. It is a prompt-driven polling route: an outer controller injects LoopX context, schedules bounded continuation prompts, withholds official reward/pass-fail/verifier output from the agent, and records a public-safe controller trace.
Use the shared protocol in loopx.benchmark_core.loop_protocol across
benchmarks:
| Benchmark family | Baseline arm | Test arm | Shared surface | Benchmark-specific glue |
|---|---|---|---|---|
| SkillsBench | codex-app-server-goal-baseline for native Goal, or historical codex-acp-blind-loop-baseline for old ACP studies |
loopx-prompt-polling-test (loopx-blind-loop-treatment is a historical alias) |
max5_blind_loop_no_feedback, round_rewards, official_feedback_blinded_count, controller trace |
BenchFlow BaseUser schedules continuation prompts and observes verifier reward only outside the agent-facing prompt |
| SWE-Marathon | host Codex app-server Goal baseline through Harbor | LoopX prompt-polling test through the same Harbor task/workdir/no-upload boundary | same protocol id, max-round budget, packet-only blocker classification | host/Harbor controller must restart or continue app-server turns and re-inject prompts without exposing official verifier feedback |
| Terminal-Bench | host Codex app-server Goal baseline or official no-upload runner baseline | LoopX prompt-polling test through the same official result/reducer path | same protocol id, max-round budget, compact official result fields | terminal runner glue must use official scorer/reducer after each attempt and keep raw panes/logs private |
The shared layer is intentionally small: route ids, max-round budget, feedback
blinding fields, packet-only classification, and public trace counters. Do not
force every benchmark into the same runner implementation when its upstream
surface differs. Do force every benchmark to use the same labels before a result
is compared: baseline is native Goal mode; test is prompt-driven polling; a
single access packet without scheduled controller trace is only
packet_only_observation.
For the test arm, also require the shared per-case lifecycle contract from
loopx.benchmark_case_state. Each benchmark/case/arm must have an
isolated benchmark_case_lifecycle_contract with
case_isolation_scope=per_benchmark_case_arm, a canonical
/app/.codex/goals/<case-arm>/ACTIVE_GOAL_STATE.md state path, and the public
lifecycle sequence quota_should_run -> todo_claim_or_update -> bounded_agent_turn -> validation_or_case_result -> refresh_state -> quota_spend. Harbor-family agents inject this contract into the LoopX
access packet and compact metadata; other adapters should reuse the same
contract rather than inventing benchmark-specific state markers. A runner that
only performs internal prompt polling without this lifecycle remains
packet_only_observation or incomplete treatment evidence.
For strict loopx-product-mode, "the agent touched LoopX" is not enough. The
test arm is countable only when compact evidence shows both task-facing solver
activity and an agent-side case closeout: todo complete, refresh-state, and
quota spend-slot --source adapter --execute for the isolated case goal. Driver
orchestrated checkpoints may prove the control plane is reachable, but they do
not substitute for the agent completing and spending the case turn. Prompt
packets and reducer checks must use the same closeout contract; if the prompt
only suggests spend-slot as optional, treat the run as a product-path mismatch
until the prompt and compact reducer are aligned.
codex exec is still useful as a tiny connectivity smoke on the cloud host,
but a successful codex exec run is not by itself a Codex Goal baseline. Do not
rename a polling loop, resume loop, or prompt-prefixed /goal experiment into a
Goal baseline without thread/goal/get or equivalent persistent-goal evidence.
If these facts are not available, classify the result as a runner/readiness probe or unverified slash-goal prompt experiment, not as a Codex Goal baseline. In that state, do not launch matched LoopX treatment for uplift claims; instead record the exact trigger gap and keep working on cloud host, runner, task-data, or compact-result readiness.
For Terminal-Bench launcher work, use the fail-closed app-server Goal surface when validating this boundary:
python3 -m loopx.cli --format json benchmark launch-terminal-bench-run \
terminal-bench \
--mode codex-app-server-goal \
--include-task-name hello-world \
--jobs-dir '<private-jobs-dir>' \
--run-root terminal-bench-app-server-goal-probe \
--job-name terminal_bench_app_server_goal_probe \
--wait-seconds 0 \
--materialization-wait-seconds 0The app-server Goal launcher now exposes a public worker contract for
thread/goal/set, thread/goal/get, and turn/start. Until a real
Terminal-Bench case launch returns compact turn/start proof plus no-upload
case proof, this mode must still return execution_ready=false,
first_blocker=terminal_bench_app_server_goal_turn_start_proof_missing, and
codex_goal_mode_baseline_claim_allowed=false. The older codex-goal-mode
launcher remains a slash-command fallback and must not be used as a scored
Codex Goal baseline.
Default cloud ECS host readiness:
- SSH access works through the operator's approved access path.
- Codex CLI is installed on the host; auth is completed by the operator on that host and is not copied from another machine.
git, Python,uv, Node/npm when required, and Docker or a Docker-compatible runtime are available.- Container image pulls use a documented reachable registry or mirror.
- The benchmark workspace is dedicated and private enough for raw artifacts.
- The first task is a no-upload dry-run or mini-pair that writes compact
benchmark_run_v0/benchmark_result_v0evidence before any score claim.
Cloud-host Codex connectivity is its own preflight, separate from benchmark runner readiness. A host can have a valid Codex login and still fail model calls because its network egress cannot reach the provider endpoints. Before blaming the benchmark runner, prove three layers in order:
- Auth:
codex login statusor equivalent reports an authenticated local user on the benchmark host. - Network: a bounded model-provider probe reaches the endpoint through the operator-approved route. If direct egress is unavailable, use an approved loopback-only proxy or tunnel instead of copying credentials or embedding proxy details in public docs.
- Execution:
codex execcan complete a tiny read-only smoke in a scratch directory and write a compact last-message or exit-code artifact.
The reusable trick is the shape, not the private wiring: keep the concrete SSH
jump path, local ports, auth-cache handling, and proxy process command in a
local-private runbook, then expose only these public-safe facts to LoopX:
auth ready, network route ready or blocked, codex exec smoke result, and the
next benchmark-family blocker. Prefer per-command tunnels or short-lived
operator-managed proxy sessions for benchmark slices; long-running unattended
network bridges should have an explicit owner and cleanup rule.
Keep upstream benchmark sources clean:
- Use upstream
mainor a pinned upstream commit for official runner code. - Keep any internal convenience changes on a tiny, rebased adapter branch.
- Prefer wrapper scripts, environment files, and reducer sidecars over editing upstream benchmark logic.
- Fork only when we need to preserve a small reusable patch set; keep the fork close enough that upstream pulls remain routine.
- Do not mix LoopX runner experiments, local bridge probes, raw logs, or credential setup into benchmark forks.
The split-control route is now a fallback and research route, not the default when a dedicated cloud host exists.
Use it when Codex auth cannot live on the execution host, when the host is shared, or when the product question is specifically about a local LoopX controller using a separate Docker substrate.
| Owner | Responsibility |
|---|---|
| Local agent | Codex CLI, auth, model invocation, planning, patch generation, LoopX state, quota, todo, and evidence filtering. |
| Remote executor | Docker runtime, runner dependencies, task-data or image staging, bounded command/file execution, and compact result reduction. |
The remote executor is not an agent-auth environment. Missing remote Codex, Codex ACP, or model credentials is not a benchmark blocker. Real blockers are things like missing split-control adapter, missing runner tooling, missing task data or images, missing remote node runtime when a specific runner requires it, or a failed cleanup/readiness check.
Historical split-control work is still useful: it records which boundaries matter when credentials cannot move, and it produced adapter/reducer seams that can be reused for compact evidence. Do not continue adding split-control bridge layers when a cloud-host route can answer the benchmark question directly.
Treat split-control assets as a retained research branch, not a live default:
- keep durable contracts, reducers, and boundary smokes that still protect public behavior;
- do not add new bridge layers unless a cloud-host run is blocked by a concrete auth, policy, or host gate;
- move future local-Codex / remote-executor experiments to an explicitly named experimental branch or research issue;
- remove or defer mainline split-control code once the cloud-host route has equivalent compact evidence for the same benchmark family.
See
benchmark-split-control-remote-executor-v0.md
for the current machine contract, and
benchmark-route-transition-retrospective-20260619.md
for the split-control retention, branch-hygiene, and retirement runbook.
After one benchmark family has a live cloud-host smoke, do not copy its private shell history into the next family. Copy the workflow shape:
- Bootstrap the host with
scripts/benchmark_ecs_bootstrap.py. - Select the shared runtime profile with
scripts/benchmark_agent_runtime_layer.py. - Run the family-specific public-safe readiness surface.
- Launch at most one no-upload case or task-free worker proof.
- Reduce the outcome into a compact ready/blocker packet before writing LoopX state or ledger evidence.
The family-specific surface should stay close to existing product code:
- Terminal-Bench uses the no-upload launcher plus
scripts/terminal_bench_compose_startup_reducer.pyfor startup blockers and the official-result reducer for countable closeout. - SkillsBench uses
scripts/skillsbench_agent_runtime_layer.py,scripts/skillsbench_automation_loop.py --plan-only,--local-driver-worker-handshake-preflight,--host-local-acp-codex-exec-preflight,--host-local-acp-launch,--require-preinstalled-benchflow-agent-runtime, and--remote-command-file-bridge-probeto prove BenchFlow worker, bridge, runtime, and canonicalloopx-product-modelifecycle readiness before spending scored attempts. - Agents' Last Exam uses the compact builders in
loopx/benchmark_adapters/agents_last_exam.py, especiallybuild_agents_last_exam_local_source_readiness,build_agents_last_exam_task_material_readiness,build_agents_last_exam_host_codex_cua_no_task_smoke, andbuild_agents_last_exam_validation_run_gate, until those surfaces are wrapped by a CLI entrypoint.
The public packet shape is the contract, not the family internals. It should
contain the benchmark family, route, ready, first_blocker, compact lifecycle
counters when applicable, and boundary booleans proving that raw logs, task
text, trajectories, verifier output, credentials, command argv, and private
paths were not published. SkillsBench product-mode evidence is not countable
unless compact counters such as
remote_command_file_bridge_driver_lifecycle_loopx_cli_call_count,
remote_command_file_bridge_driver_lifecycle_loopx_state_read_count, and
remote_command_file_bridge_driver_lifecycle_loopx_state_write_count are
nonzero or the closeout names a precise pre-agent blocker.
Prefer wrappers, reducers, and adapter-side compact builders over patching an upstream benchmark runner. If a runner patch is unavoidable, it must follow the Remote Checkout Patch Protocol above and prove that scorer, task truth, prompts, and official result parsing were not changed. Do not expand to more cases in a family until one compact no-upload cloud-host result or blocker exists and the selected reducer can be rerun without private material.
| Family | Product-path target | Current maturity |
|---|---|---|
| Terminal-Bench | Cloud Codex CLI runs the task on a dedicated benchmark host; LoopX ingests compact no-upload evidence. | Prior split-control adapters remain useful reducers, but the next run should prefer direct cloud-host Codex plus container runtime. |
| SkillsBench | Cloud Codex CLI and BenchFlow run on the same dedicated host; LoopX records compact base/test mini-pair evidence. | Prior host-local ACP relay work is historical route-repair evidence. Do not add more bridge layers before trying the cloud-host path. |
| Agents' Last Exam | Cloud Codex CLI drives the local-Docker-capable ALE route on the dedicated host; LoopX ingests compact no-upload evidence. | Formal task runs still need task-data and public-claim gates, but Docker/Codex colocation should replace the earlier local-host split-control assumption. |
This table is intentionally about runner maturity, not leaderboard score. Score claims require separate public-safe result ingestion and review.
This preflight is retained for historical split-control debugging and for shared-host environments where Codex auth cannot live on the runner host. It is not the default route when a dedicated cloud benchmark host is available.
SkillsBench currently uses BenchFlow's ACP stdio worker protocol for Codex-like agents. For split-control runs, Codex auth, model invocation, and goal state stay local. Before launching a split-control mini-pair, run:
python3 scripts/skillsbench_automation_loop.py \
--local-driver-worker-handshake-preflight \
--local-codex-cli-participant-ready \
--local-acp-relay-probe \
--host-local-acp-transport-probeThe preflight is successful only when BenchFlow is importable, the default
Codex agent is registered as ACP, the local Codex CLI participant was already
materialized, the local ACP relay completes initialize, session/new,
session/set_model, and session/prompt, BenchFlow's own ACPClient can
drive that relay over host-local stdio, and a bounded remote command/file
bridge exists for the sandbox side. The default relay and transport probes are
dry-run: they do not invoke Codex, read task text, copy credentials, record raw
logs, or launch a benchmark task.
Do not treat a successful relay probe as mini-pair readiness. It only proves
the local ACP server shape. The host-local transport probe proves BenchFlow can
talk to that local server without ContainerTransport. A no-upload mini-pair
is product-path evidence only after the remote bridge is also materialized, so
the preflight may legitimately return skillsbench_remote_command_file_bridge_missing
after both local probes pass.
For the remote bridge, prefer a machine-verifiable probe over a manual readiness flag:
python3 scripts/skillsbench_automation_loop.py \
--local-driver-worker-handshake-preflight \
--local-codex-cli-participant-ready \
--local-acp-relay-probe \
--host-local-acp-transport-probe \
--remote-command-file-bridge-probe \
--remote-command-file-bridge-probe-command '<private-remote-bridge-command>'The bridge command reads a fixed JSON request from stdin and writes compact JSON
to stdout. It must prove four bounded operations: exec, write_file,
read_file, and cleanup. Its public result records only operation kinds,
statuses, and boundary flags; it must not return raw commands, stdout, stderr,
task text, paths, credentials, logs, trajectories, uploads, or submissions.
scripts/skillsbench_remote_command_file_bridge.py --serve-probe is only a
local fake bridge for smoke tests and adapter development. It is not evidence
that a real remote executor is ready.
For a scored split-control launch, the readiness probe is not the solver
bridge. Pass a separate private --remote-command-file-bridge-solver-command
only when that command can operate on the scored BenchFlow sandbox. The runner
must fail closed when only a probe command is configured, and it must reject the
repo fake probe helper as a solver command.
Benchmark evidence may include:
- benchmark id, task id or public-safe case id;
- arm or mode label;
- readiness gate result;
- process or job handle basename;
- compact result fields such as
score,best_score,final_score,first_success_round,duration_s, andblocker; - cleanup state;
- links to public docs or compact JSON/Markdown artifacts.
Benchmark evidence must not include:
- raw task text, hidden task files, verifier body output, or solution material;
- raw trajectories, transcripts, screenshots, stdout, stderr, or shell argv;
- credentials, tokens, local absolute paths, remote absolute paths, or private hostnames;
- uploads, submit paths, or leaderboard claims unless a specific public release gate has approved them.
Before a PR that changes benchmark behavior:
- Name which layer changed: selection, launch, observe, ingest, scoring, or docs.
- Keep benchmark-specific runner details inside the adapter.
- Preserve the split-control boundary when a remote executor is involved.
- Add or update a focused smoke for the durable contract.
- Run
loopx check --scan-path <changed-public-path>for public docs or examples. - Do not commit
.local, raw logs, private run directories, active state, or local runner configs.
Near-term work should make the benchmark workflow feel like a small product:
- expose a single developer-facing command path for readiness and runner batch planning;
- add observable launch handles so long runs can be polled without chat memory;
- align Terminal-Bench, SkillsBench, and Agents' Last Exam on the same launch/observe/ingest lifecycle;
- document the no-upload dry-run path before chasing broad score matrices;
- make compact blockers first-class, so a failed launch still teaches the next developer exactly what to repair.