Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source by jkbennitt · Pull Request #20 · AppSprout-dev/RLE

jkbennitt · 2026-05-23T23:38:58Z

Summary

Phase A of the observability-first restructuring plan. Eight commits, +963 lines of source, +56 lines of cost-tracker docs/tests. Goal: the next live benchmark run produces a fully reviewable, replayable artifact.

Stacked on #19 (fix/post-live-test-findings). Merge after #19.

| # | Commit | What |
|---|---|---|
| A1 | `ee52eaf` | Restore per-scenario `*_deliberations.jsonl` export, which was in docker-benchmark (Apr 12) but lost in pr17-live. New `RLEGameLoop.deliberation_log` public property; both `run_benchmark.py` and `run_scenario.py` now write the JSONL. |
| A2 | `c958dba` | Expand `EventType.DELIBERATION` payload: events.jsonl now carries the full actions array + summary + reason text (was only latency/confidence/num_actions). Truncation constants `_ACTION_REASON_CHARS=200`, `_PLAN_SUMMARY_CHARS=300`, `_PARSE_FAILURE_RAW_CHARS=500` consolidated at module top. |
| A3 | `9e5dae5` | `_last_raw_output` exported via `PROVIDER_CALL` events (4 KB head-truncated). Means a weird deliberation is grep-able without keeping the process alive. |
| A6 | `45cb9fb` + `da5e745` | Replay-grade metadata: `SCORING_VERSION="1.0"`, `file_sha256()`, RIMAPI DLL hash via Workshop path lookup, RIMAPI fork commit lookup, `--seed` CLI flag. Plus `save_sha256` on every scenario YAML (all 6 now pinned) + loader validation. `scripts/hash_saves.py` helper. |
| A7 | `e2d6695` | Per-task LLM timeout via `asyncio.wait_for`. Hung deliberation emits `deliberation_timeout` ERROR event and the tick proceeds. Default 60s (~8× the docker-benchmark 7s avg). All three deliberation paths covered (parallel, sequential, MapAnalyst). |
| A8 | `83586bf` | `run_scenario.py` now appends to `results/benchmark_history.jsonl` with `run_type: "scenario"` so single-scenario runs are distinguishable from full batteries. Skipped when no LLM was actually called (smoke tests). |
| A9 | `ed2720d` | Cost tracker surfaces `pricing_source` (`openrouter_api` / `override` / `unknown`) on every snapshot, plus the actual prices used. New `--prompt-price-per-mtok` / `--completion-price-per-mtok` CLI overrides for when OpenRouter's /models pricing diverges from billed cost. |

Test results

pytest — 398 passed (+13 new tests covering deliberation log, raw-output event, save SHA validation, metadata, timeout, pricing source, overrides). ruff + mypy strict clean.

Why this matters for the benchmark

The original review found that pr17-live (the post-PR-#17 live run) had:

No persisted deliberation JSONLs (regressed between PR #15, "HeadlessRim Docker + real benchmarks + leaderboard infrastructure (#13)", and PR #18, "Full RIMAPI integration: fix data gaps, alerts, incident control, scenario pipeline")
No raw LLM output anywhere on disk
No way to know which RIMAPI DLL the run used
Silent timeout-blocking (one hung LLM call stalled the entire tick)
1-byte benchmark_history.jsonl despite dozens of runs persisted to subdirs

After this PR, a fresh single-scenario run produces:

<scenario>.csv (per-tick time series)
<scenario>_deliberations.jsonl (full agent rationale per tick)
<scenario>_summary.json (metadata + outcome + cost + event summary)
events.jsonl (every tick event, with actions/reasons/raw LLM output inline)
An entry in results/benchmark_history.jsonl tagged run_type: "scenario"

That's the artifact format Phase B (baseline calibration + first real long run) will produce.

What's still open in Phase A

A10: re-run the live smoke test on this branch (Crashlanded, 10 ticks, Nemotron 120B) to measure whether the action-error feedback loop from #19 ("Surface RIMAPI action errors to agents + fix load-settle race") moves the mood (0.408) / research (0.226) bottleneck metrics. Cost ~$0.17. Cannot run from CI: needs live RimWorld + RIMAPI.
A11: upstream RIMAPI PR to tighten ResponceBuilder.DetermineStatusCode. Low urgency since the RLE-side ActionOutcome workaround from #19 is in place.
A12: wire pawn-social endpoints (opinions / relations / interactions) into SocialOverseer, filtered to mood<0.4. Should be its own PR.

Test plan

pytest (398 pass)
ruff check src/ tests/ scripts/ (clean)
mypy src/ (clean, 42 source files)
python scripts/hash_saves.py --print (round-trips all 6 scenario saves)
python scripts/run_scenario.py --help shows --seed, --prompt-price-per-mtok, --completion-price-per-mtok
Live run (A10) — pending operator

🤖 Generated with Claude Code

Plan item A1. docker-benchmark (Apr 12) wrote *_deliberations.jsonl per scenario via run_benchmark.py; pr17-live (run via run_scenario.py) silently dropped the export, so post-PR-#17 runs lost the per-tick agent rationale (actions, reasons, summary). Re-add it via a public deliberation_log property on RLEGameLoop so both scripts use the same contract.

- RLEGameLoop.deliberation_log: public property returning a shallow copy of the per-tick records (tick, agent, status, plus actions/summary/confidence for success and raw/reason for failures).
- run_scenario.py: writes <scenario>_deliberations.jsonl alongside the CSV when --output is given, off-thread via asyncio.to_thread.
- run_benchmark.py: switched to the public property (was reading the private _deliberation_log attribute).
- Integration test asserts record count and shallow-copy semantics.

Tests: 383 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
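
A minimal sketch of the contract this commit describes: a public property that returns a shallow copy of the per-tick records, and a JSONL write pushed off the event loop with `asyncio.to_thread`. The class `GameLoopSketch`, the helpers `write_jsonl` and `export_deliberations`, and the sample record are illustrative assumptions, not the actual RLE code.

```python
import asyncio
import json
import tempfile
from pathlib import Path
from typing import Any


class GameLoopSketch:
    """Hypothetical stand-in for RLEGameLoop; only the log contract is modelled."""

    def __init__(self) -> None:
        self._deliberation_log: list[dict[str, Any]] = []

    @property
    def deliberation_log(self) -> list[dict[str, Any]]:
        # Shallow copy: callers get a fresh list, but the record dicts are shared.
        return list(self._deliberation_log)


def write_jsonl(path: Path, records: list[dict[str, Any]]) -> None:
    """Write one JSON object per line (blocking file I/O)."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


async def export_deliberations(loop: GameLoopSketch, path: Path) -> None:
    # Run the blocking write in a worker thread so the event loop stays free.
    await asyncio.to_thread(write_jsonl, path, loop.deliberation_log)


def demo() -> list[dict[str, Any]]:
    """Round-trip one made-up record through a temporary JSONL file."""
    loop = GameLoopSketch()
    loop._deliberation_log.append({"tick": 1, "agent": "medic", "status": "success"})
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "scenario_deliberations.jsonl"
        asyncio.run(export_deliberations(loop, out))
        lines = out.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines]
```

Because the copy is shallow, appending to the returned list leaves the internal log untouched, while mutating a record dict would still be visible to the loop.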

Plan item A2. The DELIBERATION events emitted to events.jsonl previously carried only latency_ms / confidence / num_actions; the rich payload (actions array with type/target/priority/reason, plus the plan summary) lived only in the in-memory _deliberation_log mirror. Now both writers emit the same data so events.jsonl is self-contained for downstream analysis without needing to cross-reference *_deliberations.jsonl.

Also surfaces the raw LLM content (first 500 chars) on parse_failure ERROR events, matching what _deliberation_log already records.

Extracted three truncation constants (_ACTION_REASON_CHARS=200, _PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500) so the two writers share the same shape and the limits are documented in one place.

Tests: 384 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
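
One way the shared truncation limits could shape both payloads, as a rough sketch. The three constant values come from the commit message; the helper names `build_deliberation_payload` and `build_parse_failure_payload` and the exact key names are assumptions made for illustration.

```python
from typing import Any

# Limits quoted in the commit message.
_ACTION_REASON_CHARS = 200
_PLAN_SUMMARY_CHARS = 300
_PARSE_FAILURE_RAW_CHARS = 500


def build_deliberation_payload(
    actions: list[dict[str, Any]],
    summary: str,
    confidence: float,
    latency_ms: float,
) -> dict[str, Any]:
    """Hypothetical helper: one payload shape that both writers could reuse."""
    return {
        "latency_ms": latency_ms,
        "confidence": confidence,
        "num_actions": len(actions),
        "summary": summary[:_PLAN_SUMMARY_CHARS],
        "actions": [
            # Copy each action and clip only its free-text reason.
            {**action, "reason": str(action.get("reason", ""))[:_ACTION_REASON_CHARS]}
            for action in actions
        ],
    }


def build_parse_failure_payload(raw: str, reason: str) -> dict[str, Any]:
    """Hypothetical helper: keep only the head of the unparseable LLM text."""
    return {
        "error_type": "parse_failure",
        "reason": reason,
        "raw": raw[:_PARSE_FAILURE_RAW_CHARS],
    }
```

Keeping the slicing in one helper per event type means the in-memory log and events.jsonl cannot drift apart on field names or limits.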

Plan item A3. _last_raw_output (the full LLM completion text) was stored in-memory on each agent but never exported, so post-hoc analysis of a weird deliberation required keeping the process alive. Now every PROVIDER_CALL event carries the agent's raw completion text (head-truncated to 4 KB) plus a raw_output_truncated bool so consumers know whether the JSONL has the full response or just a prefix.

PROVIDER_CALL is the right semantic home: it's the provider's response to the call we already log token counts for.

The truncation length is configurable via _RAW_OUTPUT_CHARS (matches the other event-log truncation constants from A2). 4 KB covers typical action-plan JSON plus a reasoning preamble.

Tests: 385 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
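
The head truncation plus flag can be sketched in a few lines. The constant name `_RAW_OUTPUT_CHARS` and the `raw_output_truncated` field are from the commit message; the value 4096, the `raw_output` key, and the helper `raw_output_fields` are assumptions.

```python
from __future__ import annotations

# Assumed value for "4 KB"; the commit message names the constant but not its value.
_RAW_OUTPUT_CHARS = 4096


def raw_output_fields(
    raw_output: str | None, limit: int = _RAW_OUTPUT_CHARS
) -> dict[str, object]:
    """Hypothetical helper: fields to merge into a PROVIDER_CALL event payload."""
    text = raw_output or ""
    return {
        # Keep the head of the completion; the tail is dropped.
        "raw_output": text[:limit],
        # True only when something was actually cut off.
        "raw_output_truncated": len(text) > limit,
    }
```

A consumer that needs the full completion can filter on `raw_output_truncated` instead of guessing from the string length.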

Plan item A6 (commit 1 of 2). Surfaces the metadata a benchmark replay kit needs: a versioned scoring identifier, the deployed RIMAPI DLL hash, the RIMAPI fork commit, and an explicit seed for RLE-side stochasticity. Scenario save_sha256 + loader validation lands in a follow-up.

metadata.py:
* SCORING_VERSION constant ("1.0"): bumped when DEFAULT_WEIGHTS, metric implementations, or composite math change in a way that makes older runs not directly comparable. Surfaced in every run summary.
* file_sha256(path): stdlib-only chunked SHA-256, returns None on missing/unreadable so the metadata dict stays serializable.
* _rimapi_dll_path(): env override ($RIMAPI_DLL_PATH) → Workshop default (Steam path). Hashed via file_sha256.
* _rimapi_fork_commit(): env override ($RIMAPI_FORK_PATH) → sibling ../RIMAPI checkout. Empty string when not reachable.
* collect_metadata(random_seed=None): threads the seed through so the summary records what was actually set. Docstring notes that the seed only affects RLE-side randomness, not RimWorld's internal RNG.

CLI:
* Both run_benchmark.py and run_scenario.py gain --seed INT. When set, random.seed() is called before any work and the value is threaded through to collect_metadata() at every summary write.

run_scenario.py:
* Writes <scenario>_summary.json alongside the CSV/deliberations, including the metadata block, model/provider config, outcome, final composite, ticks_run, cost_snapshot, and event_summary. The new helper _build_run_summary keeps main() from getting longer.

Tests: 391 pass (+6 new metadata tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
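
A stdlib-only chunked SHA-256 that returns None on a missing or unreadable file might look like the sketch below. The function name and the None-on-failure behaviour follow the commit message; the 1 MiB chunk size and the `demo` helper are assumptions.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: str | Path, chunk_size: int = 1 << 20) -> str | None:
    """Hash a file in fixed-size chunks; return None if it cannot be read."""
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as fh:
            # iter() with a sentinel keeps reading until read() returns b"".
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
    except OSError:
        # Missing file, permission error, or a directory: stay JSON-serializable.
        return None
    return digest.hexdigest()


def demo() -> tuple[str | None, str, str | None]:
    """Hash a small temp file and a path that does not exist."""
    payload = b"example save bytes"
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.rws"
        target.write_bytes(payload)
        hashed = file_sha256(target, chunk_size=4)  # tiny chunks exercise the loop
        missing = file_sha256(Path(tmp) / "missing.dll")
    return hashed, hashlib.sha256(payload).hexdigest(), missing
```

Reading in chunks keeps memory flat for large save files or DLLs, and returning None rather than raising lets the metadata dict be written even when the DLL path cannot be resolved.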

Plan item A6 (commit 2 of 2). Live game runs can silently drift from the canonical docker/saves/ mirror (RimWorld saves into AppData on Windows; nothing forces it to match the repo). When the agent + scoring code are working off a different save than the one the leaderboard tracks, every result is uncomparable. Pin and verify.

schema.py:
* ScenarioConfig.save_sha256: str | None, an optional pinned hash of the docker/saves/<save_name>.rws file.

loader.py:
* ScenarioSaveMismatchError: distinct exception so callers can handle this specifically (e.g., a CLI flag --allow-unpinned).
* canonical_save_path(save_name): resolves to <repo>/docker/saves/.
* load_scenario(path, allow_unpinned=False): if save_sha256 is set AND the canonical save exists on disk AND the actual SHA doesn't match, raises. allow_unpinned=True is the intentional escape hatch. Short-circuits silently when the canonical file is missing (CI without docker volumes); that's a separate failure mode caught at game-load time.

scripts/hash_saves.py:
* New helper that hashes every canonical save and updates the matching scenario YAML's save_sha256 field. --print for dry-run.

definitions/*.yaml:
* All 6 scenarios now carry pinned save_sha256 fields (the actual hashes of the current docker/saves/*.rws files). Re-run scripts/hash_saves.py after rebuilding saves to re-pin.

Tests: 394 pass (+3 new SHA validation tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
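
The three-way condition in load_scenario (hash pinned, file present, hash differs) can be sketched as a standalone check. `ScenarioSaveMismatchError` and `allow_unpinned` are named in the commit message; the function `verify_pinned_save`, its boolean return value, and the `demo` helper are assumptions for illustration.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


class ScenarioSaveMismatchError(Exception):
    """Raised when a save on disk does not match its pinned SHA-256."""


def verify_pinned_save(
    save_path: Path, expected_sha256: str | None, allow_unpinned: bool = False
) -> bool:
    """Return True if the hash was checked and matched, False if the check was skipped."""
    # Skip when the escape hatch is set, nothing is pinned, or the file is absent.
    if allow_unpinned or expected_sha256 is None or not save_path.is_file():
        return False
    actual = hashlib.sha256(save_path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ScenarioSaveMismatchError(
            f"{save_path.name}: expected {expected_sha256}, got {actual}"
        )
    return True


def demo() -> dict[str, object]:
    """Exercise the match, mismatch, missing-file, and escape-hatch paths."""
    with tempfile.TemporaryDirectory() as tmp:
        save = Path(tmp) / "example.rws"
        save.write_bytes(b"save-v1")
        pinned = hashlib.sha256(b"save-v1").hexdigest()
        results: dict[str, object] = {
            "match": verify_pinned_save(save, pinned),
            "missing": verify_pinned_save(Path(tmp) / "absent.rws", pinned),
        }
        save.write_bytes(b"save-v2")  # simulate drift from the pinned save
        results["escape_hatch"] = verify_pinned_save(save, pinned, allow_unpinned=True)
        try:
            verify_pinned_save(save, pinned)
            results["mismatch_raised"] = False
        except ScenarioSaveMismatchError:
            results["mismatch_raised"] = True
    return results
```

A dedicated exception type lets a CLI catch exactly this failure and point the operator at re-pinning, without swallowing unrelated loader errors.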

Plan item A7. A single hung LLM call previously blocked the entire tick because asyncio.gather has no per-task timeout. Now each deliberation runs under a wait_for(role_timeout_s) wrapper; on timeout the orchestrator emits a deliberation_timeout ERROR event, records the agent's empty contribution in the deliberation log, and proceeds. The hung thread keeps running in the background until the provider's own timeout cleans it up (Python threads can't be force-killed) but the tick advances regardless.

Why it matters for the benchmark: cheap and premium models can be in the same matrix only if tail latency is bounded. Without this, a stuck 30B/120B request stalls every other agent's runtime measurement.

config.py:
* RLEConfig.role_timeout_s: float = 60.0 (~8x the 7s avg deliberation observed in docker-benchmark; tunable via .env or CLI override).

game_loop.py:
* _deliberate_agent_with_timeout(): new async wrapper, runs _deliberate_agent in a worker thread under asyncio.wait_for. Emits EventType.ERROR with error_type=deliberation_timeout + timeout_s on TimeoutError; returns (agent, None).
* _deliberate_parallel(): collapsed the inline _run closure into the new wrapper.
* _deliberate_sequential(): is now async and awaits the wrapper. Sequential-mode dispatch in run_tick() now awaited.
* MapAnalyst dispatch also goes through the timeout wrapper.

Test: TestMultiAgent.test_role_timeout_emits_event_and_other_agents_continue injects a 5s-sleep provider with role_timeout_s=0.05; asserts 7 deliberation_timeout events fire, merged plan has 0 actions, tick completes successfully.

Tests: 395 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
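
The per-task timeout pattern described here, a blocking call moved to a worker thread and bounded by `asyncio.wait_for`, can be sketched as follows. The function names, the event dict shape, and the timings are illustrative assumptions; only `asyncio.wait_for`, `asyncio.to_thread`, and the `deliberation_timeout` / `timeout_s` fields come from the commit message.

```python
import asyncio
import time
from typing import Any, Callable, Optional

Plan = dict[str, Any]


async def deliberate_with_timeout(
    agent: str,
    deliberate: Callable[[str], Plan],
    timeout_s: float,
    events: list[dict[str, Any]],
) -> tuple[str, Optional[Plan]]:
    """Run a blocking deliberation in a thread, bounded by timeout_s."""
    try:
        plan = await asyncio.wait_for(
            asyncio.to_thread(deliberate, agent), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        # The worker thread cannot be killed; it finishes on its own later.
        events.append(
            {
                "type": "ERROR",
                "error_type": "deliberation_timeout",
                "agent": agent,
                "timeout_s": timeout_s,
            }
        )
        return agent, None
    return agent, plan


def slow_provider(agent: str) -> Plan:
    time.sleep(0.5)  # stands in for a hung LLM call
    return {"actions": ["never_used"]}


def fast_provider(agent: str) -> Plan:
    return {"actions": ["haul"]}


async def run_demo_tick(
    events: list[dict[str, Any]],
) -> list[tuple[str, Optional[Plan]]]:
    # gather still runs both in parallel; the timeout is applied per task.
    return await asyncio.gather(
        deliberate_with_timeout("slow_agent", slow_provider, 0.1, events),
        deliberate_with_timeout("fast_agent", fast_provider, 0.1, events),
    )
```

In this sketch the slow agent contributes `None` and one ERROR event while the fast agent's plan is returned normally. Note that `asyncio.run` still waits for the leftover worker thread when it shuts down the default executor, which matches the commit's remark that the hung thread keeps running until the provider's own timeout fires.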

Plan item A8. run_benchmark.py already calls append_history (line 711); run_scenario.py never did. Combined with the fact that the user has been doing live smoke testing via run_scenario.py (not run_benchmark.py), results/benchmark_history.jsonl stayed at 1 byte even though dozens of runs were persisted in per-run subdirectories.

Now run_scenario.py appends one history entry per --output run, tagged run_type: "scenario" so consumers can distinguish single-scenario runs from full benchmark batteries (which would tag run_type: "benchmark" when run_benchmark.py is updated; that's a separate trivial follow-up).

Skips the history append when cost_tracker.num_calls == 0 (smoke tests where the LLM was never reached, e.g. today's broken nano-4b run that generated parse failures across the board because LM Studio was down). Those rows would skew the leaderboard's avg-score-per-model otherwise.

Tests: 395 pass. ruff + mypy src/ clean. (Script-level mypy errors at run_scenario.py:171, :175 are pre-existing in untouched code.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
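
A rough sketch of the guarded append: one JSON line per run, tagged with `run_type`, and skipped when no LLM call was made. The helper name `append_history_entry`, its signature, and the sample summary fields are assumptions; the real `append_history` in the repo may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def append_history_entry(
    history_path: Path,
    summary: dict[str, Any],
    num_calls: int,
    run_type: str = "scenario",
) -> bool:
    """Append one JSONL row; return False when the run is skipped."""
    if num_calls == 0:
        # No LLM call reached the provider, so the score is not meaningful.
        return False
    history_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {**summary, "run_type": run_type}
    # Mode "a" preserves earlier runs; each run adds exactly one line.
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return True


def demo() -> list[dict[str, Any]]:
    """Append one real run and one smoke-test run, then read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        history = Path(tmp) / "results" / "benchmark_history.jsonl"
        append_history_entry(history, {"scenario": "example", "composite": 0.5}, num_calls=12)
        append_history_entry(history, {"scenario": "example", "composite": 0.0}, num_calls=0)
        lines = history.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines]
```

Only the first call in `demo` produces a row, so averages computed from the history file are not diluted by runs where the model was never reached.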

Plan item A9. The live smoke test reported $0.044 internal vs $0.168 actual OpenRouter charge. Internal estimates depend on the public /models pricing, which can diverge from billed cost due to provider routing surcharges, BYOK markups, or stale prices. The tracker would silently fall back to $0/token if the model wasn't found or the API was unreachable. Now those failure modes are visible, and operators can plug in the actual price they're seeing on their bill.

cost_tracker.py:
* CostSnapshot gains prompt_price_per_token, completion_price_per_token, and pricing_source (one of: openrouter_api / override / unknown). "unknown" flags that estimated_cost_usd is $0 because pricing resolution failed; consumers should not aggregate it.
* create_cost_tracker() accepts prompt_price_override / completion_price_override (per-token USD); when both are passed, the live fetch is skipped entirely (source="override").
* Logs the resolved prices at INFO ("prompt=$X.XX/MTok completion=$Y.YY/MTok source=...") and a WARNING when the fetch resolved to $0 for a model the user explicitly named.

CLI:
* --prompt-price-per-mtok / --completion-price-per-mtok on both run_scenario.py and run_benchmark.py. Per-MTok is the standard human-readable unit on OpenRouter's pricing page; the scripts convert to per-token before handing to create_cost_tracker.

Tests: 398 pass (+3 new pricing-source / override tests; +1 nemotron-120B reconciliation test that pins the exact math from the docker-benchmark cost recording). ruff + mypy strict clean.

Reconciliation against an actual billed amount is still a follow-up: needs an authenticated query to OpenRouter's /credits or per-generation endpoints. For now the override is the unblock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
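
The unit conversion and source tagging can be illustrated with a small sketch. The three `pricing_source` values and the per-token field names are from the commit message; `ResolvedPricing`, `resolve_pricing`, `per_mtok_to_per_token`, `estimate_cost_usd`, and the treatment of a failed fetch as `fetched=None` are assumptions, and the prices in the usage note are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedPricing:
    """Hypothetical container mirroring the new CostSnapshot fields."""

    prompt_price_per_token: float
    completion_price_per_token: float
    pricing_source: str  # "openrouter_api" | "override" | "unknown"


def per_mtok_to_per_token(price_per_mtok: float) -> float:
    """Convert USD per million tokens to USD per token."""
    return price_per_mtok / 1_000_000


def resolve_pricing(
    prompt_override: float | None,
    completion_override: float | None,
    fetched: tuple[float, float] | None = None,
) -> ResolvedPricing:
    # Both overrides present: the live fetch result is ignored entirely.
    if prompt_override is not None and completion_override is not None:
        return ResolvedPricing(prompt_override, completion_override, "override")
    if fetched is not None:
        return ResolvedPricing(fetched[0], fetched[1], "openrouter_api")
    # Nothing resolved: the $0 estimate must be flagged, not aggregated.
    return ResolvedPricing(0.0, 0.0, "unknown")


def estimate_cost_usd(
    pricing: ResolvedPricing, prompt_tokens: int, completion_tokens: int
) -> float:
    return (
        prompt_tokens * pricing.prompt_price_per_token
        + completion_tokens * pricing.completion_price_per_token
    )
```

For example, with made-up overrides of $0.50/MTok prompt and $2.00/MTok completion, a run that used 1,000,000 prompt tokens and 500,000 completion tokens is estimated at 0.50 + 1.00 = $1.50, and the snapshot would carry `pricing_source == "override"`.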

jkbennitt and others added 8 commits May 16, 2026 02:43

jkbennitt wants to merge 8 commits into fix/post-live-test-findings from feat/restore-deliberations-export
