Skip to content

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20

Open
jkbennitt wants to merge 8 commits into
fix/post-live-test-findingsfrom
feat/restore-deliberations-export
Open

Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20
jkbennitt wants to merge 8 commits into
fix/post-live-test-findingsfrom
feat/restore-deliberations-export

Conversation

@jkbennitt
Copy link
Copy Markdown
Member

Summary

Phase A of the observability-first restructuring plan. Eight commits, +963 lines of source, +56 lines of cost-tracker docs/tests. Goal: the next live benchmark run produces a fully reviewable, replayable artifact.

Stacked on #19 (fix/post-live-test-findings). Merge after #19.

# Commit What
A1 ee52eaf Restore per-scenario *_deliberations.jsonl export — was in docker-benchmark (Apr 12), lost in pr17-live. New RLEGameLoop.deliberation_log public property; both run_benchmark.py and run_scenario.py now write the JSONL.
A2 c958dba Expand EventType.DELIBERATION payload — events.jsonl now carries the full actions array + summary + reason text (was only latency/confidence/num_actions). Truncation constants _ACTION_REASON_CHARS=200, _PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500 consolidated at module top.
A3 9e5dae5 _last_raw_output exported via PROVIDER_CALL events (4 KB head-truncated). Means a weird deliberation is grep-able without keeping the process alive.
A6 45cb9fb + da5e745 Replay-grade metadata: SCORING_VERSION="1.0", file_sha256(), RIMAPI DLL hash via Workshop path lookup, RIMAPI fork commit lookup, --seed CLI flag. Plus save_sha256 on every scenario YAML (all 6 now pinned) + loader validation. scripts/hash_saves.py helper.
A7 e2d6695 Per-task LLM timeout via asyncio.wait_for. Hung deliberation emits deliberation_timeout ERROR event and the tick proceeds. Default 60s (~8× the docker-benchmark 7s avg). All three deliberation paths covered (parallel, sequential, MapAnalyst).
A8 83586bf run_scenario.py now appends to results/benchmark_history.jsonl with run_type: "scenario" so single-scenario runs are distinguishable from full batteries. Skipped when no LLM was actually called (smoke tests).
A9 ed2720d Cost tracker surfaces pricing_source (openrouter_api / override / unknown) on every snapshot, plus the actual prices used. New --prompt-price-per-mtok / --completion-price-per-mtok CLI overrides for when OpenRouter's /models pricing diverges from billed cost.

Test results

pytest398 passed (+13 new tests covering deliberation log, raw-output event, save SHA validation, metadata, timeout, pricing source, overrides). ruff + mypy strict clean.

Why this matters for the benchmark

The original review found that pr17-live (the post-PR-#17 live run) had:

After this PR, a fresh single-scenario run produces:

  • <scenario>.csv (per-tick time series)
  • <scenario>_deliberations.jsonl (full agent rationale per tick)
  • <scenario>_summary.json (metadata + outcome + cost + event summary)
  • events.jsonl (every tick event, with actions/reasons/raw LLM output inline)
  • An entry in results/benchmark_history.jsonl tagged run_type: "scenario"

That's the artifact format Phase B (baseline calibration + first real long run) will produce.

What's still open in Phase A

  • A10: re-run the live smoke test on this branch (Crashlanded, 10 ticks, Nemotron 120B) to measure whether the action-error feedback loop from Surface RIMAPI action errors to agents + fix load-settle race #19 moves the mood (0.408) / research (0.226) bottleneck metrics. Cost ~$0.17. Cannot run from CI — needs live RimWorld + RIMAPI.
  • A11: upstream RIMAPI PR to tighten ResponceBuilder.DetermineStatusCode. Low urgency since the RLE-side ActionOutcome workaround from Surface RIMAPI action errors to agents + fix load-settle race #19 is in place.
  • A12: wire pawn-social endpoints (opinions / relations / interactions) into SocialOverseer, filtered to mood<0.4. Should be its own PR.

Test plan

  • pytest (398 pass)
  • ruff check src/ tests/ scripts/ (clean)
  • mypy src/ (clean, 42 source files)
  • python scripts/hash_saves.py --print (round-trips all 6 scenario saves)
  • python scripts/run_scenario.py --help shows --seed, --prompt-price-per-mtok, --completion-price-per-mtok
  • Live run (A10) — pending operator

🤖 Generated with Claude Code

jkbennitt and others added 8 commits May 16, 2026 02:43
Plan item A1. docker-benchmark (Apr 12) wrote *_deliberations.jsonl per
scenario via run_benchmark.py; pr17-live (run via run_scenario.py)
silently dropped the export, so post-PR-#17 runs lost the per-tick
agent rationale (actions, reasons, summary). Re-add it via a public
deliberation_log property on RLEGameLoop so both scripts use the same
contract.

- RLEGameLoop.deliberation_log: public property returning a shallow copy
  of the per-tick records (tick, agent, status, plus actions/summary/
  confidence for success and raw/reason for failures).
- run_scenario.py: writes <scenario>_deliberations.jsonl alongside the
  CSV when --output is given, off-thread via asyncio.to_thread.
- run_benchmark.py: switched to the public property (was reading the
  private _deliberation_log attribute).
- Integration test asserts record count and shallow-copy semantics.

Tests: 383 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A2. The DELIBERATION events emitted to events.jsonl previously
carried only latency_ms / confidence / num_actions; the rich payload
(actions array with type/target/priority/reason, plus the plan summary)
lived only in the in-memory _deliberation_log mirror. Now both writers
emit the same data so events.jsonl is self-contained for downstream
analysis without needing to cross-reference *_deliberations.jsonl.

Also surfaces the raw LLM content (first 500 chars) on parse_failure
ERROR events, matching what _deliberation_log already records.

Extracted three truncation constants (_ACTION_REASON_CHARS=200,
_PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500) so the two
writers share the same shape and the limits are documented in one place.

Tests: 384 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A3. _last_raw_output (the full LLM completion text) was stored
in-memory on each agent but never exported, so post-hoc analysis of a
weird deliberation required keeping the process alive. Now every
PROVIDER_CALL event carries the agent's raw completion text (head-
truncated to 4 KB) plus a raw_output_truncated bool so consumers know
whether the JSONL has the full response or just a prefix.

PROVIDER_CALL is the right semantic home — it's the provider's response
to the call we already log token counts for. The truncation length is
configurable via _RAW_OUTPUT_CHARS (matches the other event-log
truncation constants from A2). 4 KB covers typical action-plan JSON
plus a reasoning preamble.

Tests: 385 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A6 (commit 1 of 2). Surfaces the metadata a benchmark replay
kit needs: a versioned scoring identifier, the deployed RIMAPI DLL hash,
the RIMAPI fork commit, and an explicit seed for RLE-side stochasticity.
Scenario save_sha256 + loader validation lands in a follow-up.

metadata.py:
  * SCORING_VERSION constant ("1.0") — bumped when DEFAULT_WEIGHTS,
    metric implementations, or composite math change in a way that makes
    older runs not directly comparable. Surfaced in every run summary.
  * file_sha256(path): stdlib-only chunked SHA-256, returns None on
    missing/unreadable so the metadata dict stays serializable.
  * _rimapi_dll_path(): env override ($RIMAPI_DLL_PATH) → Workshop
    default (Steam path). Hashed via file_sha256.
  * _rimapi_fork_commit(): env override ($RIMAPI_FORK_PATH) → sibling
    ../RIMAPI checkout. Empty string when not reachable.
  * collect_metadata(random_seed=None): threads the seed through so the
    summary records what was actually set. Docstring notes that the
    seed only affects RLE-side randomness, not RimWorld's internal RNG.

CLI:
  * Both run_benchmark.py and run_scenario.py gain --seed INT. When
    set, random.seed() is called before any work and the value is
    threaded through to collect_metadata() at every summary write.

run_scenario.py:
  * Writes <scenario>_summary.json alongside the CSV/deliberations,
    including the metadata block, model/provider config, outcome,
    final composite, ticks_run, cost_snapshot, and event_summary.
    The new helper _build_run_summary keeps main() from getting longer.

Tests: 391 pass (+6 new metadata tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A6 (commit 2 of 2). Live game runs can silently drift from the
canonical docker/saves/ mirror (RimWorld saves into AppData on Windows;
nothing forces it to match the repo). When the agent + scoring code are
working off a different save than the one the leaderboard tracks, every
result is uncomparable. Pin and verify.

schema.py:
  * ScenarioConfig.save_sha256: str | None — optional pinned hash of
    the docker/saves/<save_name>.rws file.

loader.py:
  * ScenarioSaveMismatchError — distinct exception so callers can
    handle this specifically (e.g., a CLI flag --allow-unpinned).
  * canonical_save_path(save_name) — resolves to <repo>/docker/saves/.
  * load_scenario(path, allow_unpinned=False) — if save_sha256 is set
    AND the canonical save exists on disk AND the actual SHA doesn't
    match, raises. allow_unpinned=True is the intentional escape hatch.
    Short-circuits silently when the canonical file is missing (CI
    without docker volumes) — that's a separate failure mode caught
    at game-load time.

scripts/hash_saves.py:
  * New helper that hashes every canonical save and updates the matching
    scenario YAML's save_sha256 field. --print for dry-run.

definitions/*.yaml:
  * All 6 scenarios now carry pinned save_sha256 fields (the actual
    hashes of the current docker/saves/*.rws files). Re-run
    scripts/hash_saves.py after rebuilding saves to re-pin.

Tests: 394 pass (+3 new SHA validation tests). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A7. A single hung LLM call previously blocked the entire tick
because asyncio.gather has no per-task timeout. Now each deliberation
runs under a wait_for(role_timeout_s) wrapper; on timeout the orchestrator
emits a deliberation_timeout ERROR event, records the agent's empty
contribution in the deliberation log, and proceeds. The hung thread keeps
running in the background until the provider's own timeout cleans it up
(Python threads can't be force-killed) but the tick advances regardless.

Why it matters for the benchmark: cheap and premium models can be in the
same matrix only if tail latency is bounded. Without this, a stuck
30B/120B request stalls every other agent's runtime measurement.

config.py:
  * RLEConfig.role_timeout_s: float = 60.0 (~8x the 7s avg deliberation
    observed in docker-benchmark; tunable via .env or CLI override).

game_loop.py:
  * _deliberate_agent_with_timeout(): new async wrapper, runs
    _deliberate_agent in a worker thread under asyncio.wait_for.
    Emits EventType.ERROR with error_type=deliberation_timeout +
    timeout_s on TimeoutError; returns (agent, None).
  * _deliberate_parallel(): collapsed the inline _run closure into the
    new wrapper.
  * _deliberate_sequential(): is now async and awaits the wrapper.
    Sequential-mode dispatch in run_tick() now awaited.
  * MapAnalyst dispatch also goes through the timeout wrapper.

Test: TestMultiAgent.test_role_timeout_emits_event_and_other_agents_continue
injects a 5s-sleep provider with role_timeout_s=0.05; asserts 7
deliberation_timeout events fire, merged plan has 0 actions, tick
completes successfully.

Tests: 395 pass (+1). ruff + mypy strict clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A8. run_benchmark.py already calls append_history (line 711);
run_scenario.py never did. Combined with the fact that the user has been
doing live smoke testing via run_scenario.py (not run_benchmark.py),
results/benchmark_history.jsonl stayed at 1 byte even though dozens of
runs were persisted in per-run subdirectories.

Now run_scenario.py appends one history entry per --output run, tagged
run_type: "scenario" so consumers can distinguish single-scenario runs
from full benchmark batteries (which would tag run_type: "benchmark"
when run_benchmark.py is updated; that's a separate trivial follow-up).

Skips the history append when cost_tracker.num_calls == 0 (smoke tests
where the LLM was never reached, e.g. today's broken nano-4b run that
generated parse failures across the board because LM Studio was down).
Those rows would skew the leaderboard's avg-score-per-model otherwise.

Tests: 395 pass. ruff + mypy src/ clean. (Script-level mypy errors at
run_scenario.py:171, :175 are pre-existing in untouched code.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A9. The live smoke test reported \$0.044 internal vs \$0.168
actual OpenRouter charge. Internal estimates depend on the public
/models pricing, which can diverge from billed cost due to provider
routing surcharges, BYOK markups, or stale prices. The tracker would
silently fall back to \$0/token if the model wasn't found or the API
was unreachable. Now those failure modes are visible, and operators
can plug in the actual price they're seeing on their bill.

cost_tracker.py:
  * CostSnapshot gains prompt_price_per_token, completion_price_per_token,
    and pricing_source (one of: openrouter_api / override / unknown).
    "unknown" flags that estimated_cost_usd is \$0 because pricing
    resolution failed — consumers should not aggregate it.
  * create_cost_tracker() accepts prompt_price_override /
    completion_price_override (per-token USD); when both are passed,
    the live fetch is skipped entirely (source="override").
  * Logs the resolved prices at INFO ("prompt=\$X.XX/MTok completion=
    \$Y.YY/MTok source=...") and a WARNING when the fetch resolved
    to \$0 for a model the user explicitly named.

CLI:
  * --prompt-price-per-mtok / --completion-price-per-mtok on both
    run_scenario.py and run_benchmark.py. Per-MTok is the standard
    human-readable unit on OpenRouter's pricing page; the scripts
    convert to per-token before handing to create_cost_tracker.

Tests: 398 pass (+3 new pricing-source / override tests; +1
nemotron-120B reconciliation test that pins the exact math from
the docker-benchmark cost recording). ruff + mypy strict clean.

Reconciliation against an actual billed amount is still a follow-up:
needs an authenticated query to OpenRouter's /credits or per-generation
endpoints. For now the override is the unblock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant