Distributed run pipeline#22
Conversation
…o feature/run-pipeline
…into feature/run-pipeline # Conflicts: # ogbench
# Conflicts: # interface/agents.py # pyproject.toml
PR #18 added last_usage telemetry to the old single-file interface/agents.py, which no longer exists (now an interface/agents/ package). Port the pattern: ClaudeAnthropicAgent captures usage from the Anthropic response via normalize_token_usage; Qwen35VLAgent records prompt/output token counts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the canonical pipeline stages over the interface/ runner (Stack A) and the scorer/ package into a single inspectable orchestrator: - pipeline/run_stage3.py: run one live-model episode -> episode.json - pipeline/episode_metrics.py: derive path_choice (test2), mechanism_interaction_order + failure_point (test3), token totals, and the Appendix A.3 episode_runs.jsonl row; enrich runs for the scorer - pipeline/reports.py: scoring_calibration_summary / complexity_distance_summary / mechanism_ordering_pairs aggregators - scripts/run_pipeline.py: Stage 1->5 CLI (multinet-run-pipeline) - scripts/validate_fixtures.py: validate fixtures + derive test2 route cells - gridworld/fixtures/: manifest + test2 shortcut maze + test3 ordering pairs (test1 reuses the existing validation_10 set) - tests for episode metrics, reports, and an end-to-end pipeline run Baselines (BFS/greedy) stay Stage-2 difficulty/canonical-path generators via the scorer; Stage-3 episodes are live-model-only. No DAG runner (kept sequential). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…m prompt template
Add a run-config layer that maps each model to its own task selection and provider/params, keeping the manifest as a separate metadata catalog: - scripts/run_pipeline.py: load_run_config + resolve_task_rows (entries may be task-file paths, catalog task_ids, or experiment keywords; catalog metadata is attached by path so test2/test3 signals survive); run_from_config drives multiple models, scoring the union suite once and aggregating one episode_runs.jsonl + report set. _build_agent_from_spec constructs claude/qwen agents from the model entry (provider/model/temperature/max_tokens). - CLI: --run-config is the primary path; --agent/--experiment remain a single-model fallback. - gridworld/fixtures/run_config.example.json: sample config. - tests for task resolution and a config-driven multi-model run (stub factory). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cached artifacts are now reused only when their inputs hash still matches,
instead of skipping purely on file existence:
- Stage 2: reuse scored_static.json/canonical_paths.json only when the stored
inputs_hash equals the hash recomputed from the current task spec + scorer
config; otherwise regenerate the bundle. _expected_static_hash mirrors the
scorer recipe (guarded by a parity test).
- Stage 3 (model calls, the expensive stage): stamp each episode with a sidecar
run_inputs.json carrying an inputs_hash over {task spec, model_id, seed, prompt
config, backend, pipeline_version}; reuse the cached episode only on a hash
match. Scorer-config changes intentionally do NOT invalidate the episode.
- Stage 4 (cheap, deterministic): always re-score from the cached/fresh episode,
so scorer-config / static / canonical changes propagate to run_score.json.
- canonical_paths.json now carries its own inputs_hash (scorer/artifacts.py +
solvers.py), closing the last unhashed scorer artifact.
Tests: hash parity with the scorer, episode cache hit on unchanged re-run, task
edit invalidating both static and episode, and scorer-config change re-scoring
without re-running the model.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Prompts are not versioned yet while we iterate, so the Stage-3 run-inputs hash no longer includes the ExperimentConfig; the prompt variant still separates runs via the <condition> directory. Left a TODO to fold backend/adapter code versions into the run hash at v1 release. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Apply /code-review max findings on the run pipeline: - reports: fix test-3 expected_order_match_rate to compare only the expected mechanisms' relative order, not the full interaction order (which also carries downstream doors/gates, so it was always 0). - episode_metrics: lift final_state.reward into the scorer-facing dict so run_score.json matches the jsonl row; guard an explicit final_state=null; align optimality_ratio with the scorer's step_ratio for optimal_steps==0. - run_pipeline: require canonical_paths.json for the Stage-2 cache hit; raise a clear error for an unknown --conditions name; derive episode metrics once per run instead of twice. Separate the prompt-variant axis from the manifest condition (F2): catalog rows always define `condition`, so the old setdefault collapsed prompt variants and collided composite keys. Thread a distinct `prompt_variant` field through build_run_row, the composites key, and reports (_run_key, success_rate_by_prompt_variant, test-2 grouping). Add regression tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Moving ExperimentConfig and run_episode imports inside their respective functions allows Stage 1-2 (manifest + solvers + static score) to be imported without pulling in the heavy interface stack. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A schema-valid task that Stage 2 marks is_beatable=false is ineligible: skip its Stage 3/4 work (model/API calls + scoring) in _run_one_model instead of spending runs on it, and surface the ineligible set via scoring_calibration_summary's new ineligible_tasks field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The installed multinet-run-pipeline CLI defaults to gridworld/fixtures/manifest.json, but package-data only shipped gridworld/tasks, so the default manifest and its test2/test3 fixture task files were omitted from the wheel. Add fixtures/**/*.json. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ature/run-pipeline
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| _write_json_atomic(state_file, local_state) | ||
| stop_heartbeat = threading.Event() | ||
|
|
||
| def heartbeat_loop() -> None: |
There was a problem hiding this comment.
No try/except inside the loop - If client.heartbeat raises (e.g., one HTTP request times out because the coordinator is briefly busy doing tar extraction), the exception propagates out and the daemon thread dies. After 5 minutes (stale_after_seconds), the coordinator marks the unit "stale" and reassigns to another worker. The original worker is still running the episode though
| _write_json_atomic(state_file, local_state) | ||
| if once: | ||
| return True | ||
| except Exception as exc: |
There was a problem hiding this comment.
After calling fail, the exception propagates out of run_worker_loop and the worker process exits.
Realistic scenarios:
- Anthropic returns a 503 (happens occasionally) - episode raises → worker dies → all remaining Claude work on that worker is stalled
- Moonshot rate-limits → same
- Qwen has a transient CUDA OOM → same, and now a GPU VM costing $X/hour sits idle
This is a likely failure mode and one unlucky API call kills the worker.
| local_state["worker_id"] = worker_id | ||
| _write_json_atomic(state_file, local_state) | ||
|
|
||
| while True: |
There was a problem hiding this comment.
No try/except around client.assign. If the coordinator restarts (or a single HTTP call fails), the worker exits.
For our setup, the coordinator is on its own VM. A VM restart, a momentary OOM, even just Python's GC pausing the HTTP server for a half-second - any of these can fail one assign request and the worker dies.
This is just a commit for the final run pipeline before we kick things off.
After the pipeline lands this is configuration and management changes.