Distributed run pipeline by seanrivera · Pull Request #22 · ManifoldRG/MultiNet-v2.0

seanrivera · 2026-06-18T21:03:23Z

This is just a commit for the final run pipeline before we kick things off.

Once the pipeline lands, the remaining work is configuration and management changes.

PR #18 added last_usage telemetry to the old single-file interface/agents.py, which no longer exists (now an interface/agents/ package). Port the pattern: ClaudeAnthropicAgent captures usage from the Anthropic response via normalize_token_usage; Qwen35VLAgent records prompt/output token counts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Wire the canonical pipeline stages over the interface/ runner (Stack A) and the scorer/ package into a single inspectable orchestrator:

- pipeline/run_stage3.py: run one live-model episode -> episode.json
- pipeline/episode_metrics.py: derive path_choice (test2), mechanism_interaction_order + failure_point (test3), token totals, and the Appendix A.3 episode_runs.jsonl row; enrich runs for the scorer
- pipeline/reports.py: scoring_calibration_summary / complexity_distance_summary / mechanism_ordering_pairs aggregators
- scripts/run_pipeline.py: Stage 1->5 CLI (multinet-run-pipeline)
- scripts/validate_fixtures.py: validate fixtures + derive test2 route cells
- gridworld/fixtures/: manifest + test2 shortcut maze + test3 ordering pairs (test1 reuses the existing validation_10 set)
- tests for episode metrics, reports, and an end-to-end pipeline run

Baselines (BFS/greedy) stay Stage-2 difficulty/canonical-path generators via the scorer; Stage-3 episodes are live-model-only. No DAG runner (kept sequential).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add a run-config layer that maps each model to its own task selection and provider/params, keeping the manifest as a separate metadata catalog:

- scripts/run_pipeline.py: load_run_config + resolve_task_rows (entries may be task-file paths, catalog task_ids, or experiment keywords; catalog metadata is attached by path so test2/test3 signals survive); run_from_config drives multiple models, scoring the union suite once and aggregating one episode_runs.jsonl + report set. _build_agent_from_spec constructs claude/qwen agents from the model entry (provider/model/temperature/max_tokens).
- CLI: --run-config is the primary path; --agent/--experiment remain a single-model fallback.
- gridworld/fixtures/run_config.example.json: sample config.
- tests for task resolution and a config-driven multi-model run (stub factory).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Cached artifacts are now reused only when their inputs hash still matches, instead of skipping purely on file existence:

- Stage 2: reuse scored_static.json/canonical_paths.json only when the stored inputs_hash equals the hash recomputed from the current task spec + scorer config; otherwise regenerate the bundle. _expected_static_hash mirrors the scorer recipe (guarded by a parity test).
- Stage 3 (model calls, the expensive stage): stamp each episode with a sidecar run_inputs.json carrying an inputs_hash over {task spec, model_id, seed, prompt config, backend, pipeline_version}; reuse the cached episode only on a hash match. Scorer-config changes intentionally do NOT invalidate the episode.
- Stage 4 (cheap, deterministic): always re-score from the cached/fresh episode, so scorer-config / static / canonical changes propagate to run_score.json.
- canonical_paths.json now carries its own inputs_hash (scorer/artifacts.py + solvers.py), closing the last unhashed scorer artifact.

Tests: hash parity with the scorer, episode cache hit on unchanged re-run, task edit invalidating both static and episode, and scorer-config change re-scoring without re-running the model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
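
The hash-gated reuse described in this commit can be sketched roughly as follows. This is an illustration only, not the code in scripts/run_pipeline.py: the helper names (compute_inputs_hash, store_episode, load_cached_episode), the SHA-256 choice, and the sorted-key JSON canonicalisation are assumptions. Only the file names run_inputs.json and episode.json and the inputs_hash key come from the commit messages, and the field set passed in is up to the caller.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def compute_inputs_hash(inputs: dict[str, Any]) -> str:
    """Hash run inputs via canonical (sorted-key) JSON. Hypothetical helper."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def store_episode(run_dir: Path, inputs: dict[str, Any], episode: dict) -> None:
    """Write the episode plus a sidecar recording the hash of its inputs."""
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "episode.json").write_text(json.dumps(episode))
    sidecar = {"inputs_hash": compute_inputs_hash(inputs)}
    (run_dir / "run_inputs.json").write_text(json.dumps(sidecar))


def load_cached_episode(run_dir: Path, inputs: dict[str, Any]) -> Optional[dict]:
    """Return the cached episode only when the stored hash matches the inputs."""
    sidecar_path = run_dir / "run_inputs.json"
    episode_path = run_dir / "episode.json"
    if not (sidecar_path.exists() and episode_path.exists()):
        return None  # nothing cached yet
    stored = json.loads(sidecar_path.read_text())
    if stored.get("inputs_hash") != compute_inputs_hash(inputs):
        return None  # inputs changed, so the caller must re-run the model
    return json.loads(episode_path.read_text())


# Usage: a cache hit on unchanged inputs, a miss after the seed changes.
demo_dir = Path(tempfile.mkdtemp())
demo_inputs = {"task_spec": {"id": "t1"}, "model_id": "m", "seed": 0}
store_episode(demo_dir, demo_inputs, {"success": True})
hit = load_cached_episode(demo_dir, demo_inputs)
miss = load_cached_episode(demo_dir, dict(demo_inputs, seed=1))
```

Keeping the scorer config out of the hashed inputs is what lets a scorer change trigger re-scoring without a new model call.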

Prompts are not versioned yet while we iterate, so the Stage-3 run-inputs hash no longer includes the ExperimentConfig; the prompt variant still separates runs via the <condition> directory. Left a TODO to fold backend/adapter code versions into the run hash at v1 release.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Apply /code-review max findings on the run pipeline:

- reports: fix test-3 expected_order_match_rate to compare only the expected mechanisms' relative order, not the full interaction order (which also carries downstream doors/gates, so it was always 0).
- episode_metrics: lift final_state.reward into the scorer-facing dict so run_score.json matches the jsonl row; guard an explicit final_state=null; align optimality_ratio with the scorer's step_ratio for optimal_steps==0.
- run_pipeline: require canonical_paths.json for the Stage-2 cache hit; raise a clear error for an unknown --conditions name; derive episode metrics once per run instead of twice.

Separate the prompt-variant axis from the manifest condition (F2): catalog rows always define `condition`, so the old setdefault collapsed prompt variants and collided composite keys. Thread a distinct `prompt_variant` field through build_run_row, the composites key, and reports (_run_key, success_rate_by_prompt_variant, test-2 grouping). Add regression tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
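
To illustrate the expected_order_match_rate fix, here is a small sketch of comparing only the expected mechanisms' relative order. The function name, the mechanism labels, and the choice to count only the first interaction with each mechanism are assumptions made for the example; the actual logic in pipeline/reports.py may differ.

```python
def expected_order_matches(interaction_order: list[str], expected: list[str]) -> bool:
    """True when the expected mechanisms occur in the expected relative order.

    Other entries in interaction_order (for example downstream doors or
    gates) are ignored. Assumes `expected` has no duplicate entries.
    """
    wanted = set(expected)
    filtered: list[str] = []
    for mechanism in interaction_order:
        # Keep only expected mechanisms, and only their first interaction.
        if mechanism in wanted and mechanism not in filtered:
            filtered.append(mechanism)
    return filtered == list(expected)


# Hypothetical episode: the doors sit between the two expected mechanisms.
observed = ["lever_a", "door_1", "button_b", "gate_2"]
expected = ["lever_a", "button_b"]

full_order_match = observed == expected           # the pre-fix style comparison
relative_order_match = expected_order_matches(observed, expected)
```

Comparing the full list fails whenever any extra door or gate appears, which matches the "always 0" symptom described in the commit message.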

Moving ExperimentConfig and run_episode imports inside their respective functions allows Stage 1-2 (manifest + solvers + static score) to be imported without pulling in the heavy interface stack.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A schema-valid task that Stage 2 marks is_beatable=false is ineligible: skip its Stage 3/4 work (model/API calls + scoring) in _run_one_model instead of spending runs on it, and surface the ineligible set via scoring_calibration_summary's new ineligible_tasks field.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
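
A rough sketch of the eligibility gate described above. The shape of the Stage-2 output (a mapping from task_id to a dict with an is_beatable flag) and the function name are assumptions for illustration. Only is_beatable and the ineligible_tasks field name appear in the commit message. Treating a missing flag as eligible is also an assumption.

```python
from typing import Any


def split_by_eligibility(static_scores: dict[str, dict[str, Any]]) -> tuple[list[str], list[str]]:
    """Partition task ids into (eligible, ineligible) using the is_beatable flag."""
    eligible: list[str] = []
    ineligible: list[str] = []
    for task_id, static in static_scores.items():
        # Only an explicit is_beatable == False marks a task ineligible.
        if static.get("is_beatable") is False:
            ineligible.append(task_id)
        else:
            eligible.append(task_id)
    return eligible, ineligible


# Hypothetical Stage-2 results for three tasks.
stage2 = {
    "maze_01": {"is_beatable": True},
    "maze_02": {"is_beatable": False},
    "maze_03": {"is_beatable": True},
}
to_run, skipped = split_by_eligibility(stage2)
summary = {"ineligible_tasks": skipped}  # surfaced in the report, no model calls spent
```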

The installed multinet-run-pipeline CLI defaults to gridworld/fixtures/manifest.json, but package-data only shipped gridworld/tasks, so the default manifest and its test2/test3 fixture task files were omitted from the wheel. Add fixtures/**/*.json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pranavguru · 2026-06-19T23:38:32Z

+        _write_json_atomic(state_file, local_state)
+        stop_heartbeat = threading.Event()
+
+        def heartbeat_loop() -> None:


There is no try/except inside the loop. If client.heartbeat raises (e.g., one HTTP request times out because the coordinator is briefly busy doing tar extraction), the exception propagates out and the daemon thread dies. After 5 minutes (stale_after_seconds), the coordinator marks the unit "stale" and reassigns it to another worker, even though the original worker is still running the episode.
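
A minimal sketch of the kind of guard this comment asks for, assuming the heartbeat is a zero-argument callable and the loop is paced by the stop event. The names start_heartbeat, send_heartbeat, and interval_seconds are hypothetical; the real loop in this PR calls client.heartbeat and may be structured differently.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("worker.heartbeat")


def start_heartbeat(
    send_heartbeat: Callable[[], None],
    stop: threading.Event,
    interval_seconds: float,
) -> threading.Thread:
    """Start a daemon thread that keeps heartbeating even if single calls fail."""

    def heartbeat_loop() -> None:
        # Event.wait returns True once stop is set, which ends the loop.
        while not stop.wait(interval_seconds):
            try:
                send_heartbeat()
            except Exception:
                # A failed heartbeat is logged and retried on the next tick,
                # so one timeout does not kill the thread.
                logger.warning("heartbeat failed; will retry", exc_info=True)

    thread = threading.Thread(target=heartbeat_loop, name="heartbeat", daemon=True)
    thread.start()
    return thread
```

With this shape, the coordinator only sees the unit go stale if heartbeats keep failing for the whole stale_after_seconds window, not after a single dropped request.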

pranavguru · 2026-06-19T23:45:49Z

+            _write_json_atomic(state_file, local_state)
+            if once:
+                return True
+        except Exception as exc:


After calling fail, the exception propagates out of run_worker_loop and the worker process exits.

Realistic scenarios:

Anthropic returns a 503 (happens occasionally) - episode raises → worker dies → all remaining Claude work on that worker is stalled

Moonshot rate-limits → same

Qwen has a transient CUDA OOM → same, and now a GPU VM costing $X/hour sits idle

This is a likely failure mode: one unlucky API call kills the worker.
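
One possible shape for a loop that survives a failed episode, sketched under assumptions: the CoordinatorClient protocol, the unit_id key, the complete method, and the max_consecutive_failures cap are all hypothetical and only stand in for whatever run_worker_loop and the real client expose. The idea is to report the failure, move on to the next unit, and exit only when failures repeat back to back.

```python
from typing import Any, Callable, Optional, Protocol


class CoordinatorClient(Protocol):
    """Hypothetical client surface used by this sketch."""

    def assign(self) -> Optional[dict[str, Any]]: ...
    def complete(self, unit_id: str, result: Any) -> None: ...
    def fail(self, unit_id: str, error: str) -> None: ...


def run_units(
    client: CoordinatorClient,
    run_episode: Callable[[dict[str, Any]], Any],
    max_consecutive_failures: int = 5,
) -> int:
    """Process units until none remain; return the number completed."""
    completed = 0
    consecutive_failures = 0
    while True:
        unit = client.assign()
        if unit is None:
            return completed  # no more work
        try:
            result = run_episode(unit)
        except Exception as exc:
            # Report the failure, then keep the worker alive for the next unit.
            client.fail(unit["unit_id"], repr(exc))
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                # Repeated failures suggest a broken worker, so stop here.
                raise RuntimeError("too many consecutive episode failures") from exc
            continue
        client.complete(unit["unit_id"], result)
        consecutive_failures = 0
        completed += 1
```

The consecutive-failure cap is a design choice: it keeps a transient 503 or rate limit from killing the worker while still stopping a worker whose environment is genuinely broken.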

pranavguru · 2026-06-19T23:49:52Z

+    local_state["worker_id"] = worker_id
+    _write_json_atomic(state_file, local_state)
+
+    while True:


No try/except around client.assign. If the coordinator restarts (or a single HTTP call fails), the worker exits.

For our setup, the coordinator is on its own VM. A VM restart, a momentary OOM, even just Python's GC pausing the HTTP server for a half-second - any of these can fail one assign request and the worker dies.
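
A sketch of retrying a coordinator call with exponential backoff, which is one way to address this. The helper name call_with_retry, its defaults, and the choice to retry on OSError (the base class of ConnectionError and TimeoutError) are assumptions; which exception types the real HTTP client raises would need to be checked before adopting anything like this.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple[type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on transient errors with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads out retries from many workers hitting one coordinator.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

In the worker this would wrap the assign call, for example `call_with_retry(client.assign)`, so a coordinator restart of a few seconds costs a short wait instead of a dead worker.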

helenlu66 and others added 30 commits May 29, 2026 01:27

made sure exp 3 prompts interface with the existing model interface m…

b47e2bc

…odules

added prompting_experiments

9ed266f

removed cardinal direction condition

410f5f7

removed the part of the prompt that tells the agent the number of ste…

314baf2

…ps left

got rid of the minimal prompt condition

5567780

Merge remote-tracking branch 'origin/codex/add-ogbench-submodule' int…

c5c589a

…o feature/run-pipeline

Merge remote-tracking branch 'origin/interface-prompts-consolidated' …

d6a5c4b

…into feature/run-pipeline # Conflicts: # ogbench

Merge remote-tracking branch 'origin/scorer' into feature/run-pipeline

ea109ea

# Conflicts: # interface/agents.py # pyproject.toml

moved the prose about the direction the agent's facing into the syste…

15dba61

…m prompt template

removed initial maze desc from standard prompts

6c7e54a

removed unnecessary NL desc from system prompt

a3516c2

take 3 steps in the maze when previewing prompts to see context_windo…

9d74919

…w=3 prompt

fixed the prompt for full sequence of actions

652d686

added support for subgoal planning, moved output format cue into user…

9222d40

… prompt

renamed many conditions to standard

fcedf3c

added a description of the inventory to every prompt

f6aa8d7

Make scorer import interface-free via lazy telemetry import

5ae5e7c

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Make episode_metrics import interface-free via lazy telemetry import

b550216

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add per-model report aggregator (model_report)

5c53f59

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Write a machine-readable per-model report per run set

91c6d9b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Skip None steps in model_report steps_mean (review nit)

ddf4275

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

seanrivera and others added 8 commits June 9, 2026 21:29

Making sure we're up to date with the current 'scorer' branch into fe…

b17c502

…ature/run-pipeline

Pulling scorer forward

8cf30a3

Fixing the comments. Mostly reliability changes, some new tests and s…

9703d8e

…ome cleanup

Fixed the comments, added more reliability to the run pipeline, and n…

5bf1c0e

…ew tests

only test for RULEs in the verbose condition

3632980

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Merge interface prompts consolidated into run pipeline

b9c053a

Address PR20 review comments

96794f2

Add distributed coordinator pipeline

cf0b972

seanrivera requested a review from pranavguru June 18, 2026 21:03

Merge origin/main into distributed pipeline

37252f2

pranavguru requested changes Jun 20, 2026

seanrivera wants to merge 39 commits into main from Distributed-run-pipeline


3 participants
