Phase A observability floor: deliberation logs, replay metadata, timeouts, cost source#20
Open
jkbennitt wants to merge 8 commits into
Open
Conversation
Plan item A1. docker-benchmark (Apr 12) wrote *_deliberations.jsonl per scenario via run_benchmark.py; pr17-live (run via run_scenario.py) silently dropped the export, so post-PR-#17 runs lost the per-tick agent rationale (actions, reasons, summary). Re-add it via a public deliberation_log property on RLEGameLoop so both scripts use the same contract. - RLEGameLoop.deliberation_log: public property returning a shallow copy of the per-tick records (tick, agent, status, plus actions/summary/ confidence for success and raw/reason for failures). - run_scenario.py: writes <scenario>_deliberations.jsonl alongside the CSV when --output is given, off-thread via asyncio.to_thread. - run_benchmark.py: switched to the public property (was reading the private _deliberation_log attribute). - Integration test asserts record count and shallow-copy semantics. Tests: 383 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A2. The DELIBERATION events emitted to events.jsonl previously carried only latency_ms / confidence / num_actions; the rich payload (actions array with type/target/priority/reason, plus the plan summary) lived only in the in-memory _deliberation_log mirror. Now both writers emit the same data so events.jsonl is self-contained for downstream analysis without needing to cross-reference *_deliberations.jsonl. Also surfaces the raw LLM content (first 500 chars) on parse_failure ERROR events, matching what _deliberation_log already records. Extracted three truncation constants (_ACTION_REASON_CHARS=200, _PLAN_SUMMARY_CHARS=300, _PARSE_FAILURE_RAW_CHARS=500) so the two writers share the same shape and the limits are documented in one place. Tests: 384 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A3. _last_raw_output (the full LLM completion text) was stored in-memory on each agent but never exported, so post-hoc analysis of a weird deliberation required keeping the process alive. Now every PROVIDER_CALL event carries the agent's raw completion text (head- truncated to 4 KB) plus a raw_output_truncated bool so consumers know whether the JSONL has the full response or just a prefix. PROVIDER_CALL is the right semantic home — it's the provider's response to the call we already log token counts for. The truncation length is configurable via _RAW_OUTPUT_CHARS (matches the other event-log truncation constants from A2). 4 KB covers typical action-plan JSON plus a reasoning preamble. Tests: 385 pass (+1). ruff + mypy strict clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A6 (commit 1 of 2). Surfaces the metadata a benchmark replay
kit needs: a versioned scoring identifier, the deployed RIMAPI DLL hash,
the RIMAPI fork commit, and an explicit seed for RLE-side stochasticity.
Scenario save_sha256 + loader validation lands in a follow-up.
metadata.py:
* SCORING_VERSION constant ("1.0") — bumped when DEFAULT_WEIGHTS,
metric implementations, or composite math change in a way that makes
older runs not directly comparable. Surfaced in every run summary.
* file_sha256(path): stdlib-only chunked SHA-256, returns None on
missing/unreadable so the metadata dict stays serializable.
* _rimapi_dll_path(): env override ($RIMAPI_DLL_PATH) → Workshop
default (Steam path). Hashed via file_sha256.
* _rimapi_fork_commit(): env override ($RIMAPI_FORK_PATH) → sibling
../RIMAPI checkout. Empty string when not reachable.
* collect_metadata(random_seed=None): threads the seed through so the
summary records what was actually set. Docstring notes that the
seed only affects RLE-side randomness, not RimWorld's internal RNG.
CLI:
* Both run_benchmark.py and run_scenario.py gain --seed INT. When
set, random.seed() is called before any work and the value is
threaded through to collect_metadata() at every summary write.
run_scenario.py:
* Writes <scenario>_summary.json alongside the CSV/deliberations,
including the metadata block, model/provider config, outcome,
final composite, ticks_run, cost_snapshot, and event_summary.
The new helper _build_run_summary keeps main() from getting longer.
Tests: 391 pass (+6 new metadata tests). ruff + mypy strict clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A6 (commit 2 of 2). Live game runs can silently drift from the
canonical docker/saves/ mirror (RimWorld saves into AppData on Windows;
nothing forces it to match the repo). When the agent + scoring code are
working off a different save than the one the leaderboard tracks, every
result is uncomparable. Pin and verify.
schema.py:
* ScenarioConfig.save_sha256: str | None — optional pinned hash of
the docker/saves/<save_name>.rws file.
loader.py:
* ScenarioSaveMismatchError — distinct exception so callers can
handle this specifically (e.g., a CLI flag --allow-unpinned).
* canonical_save_path(save_name) — resolves to <repo>/docker/saves/.
* load_scenario(path, allow_unpinned=False) — if save_sha256 is set
AND the canonical save exists on disk AND the actual SHA doesn't
match, raises. allow_unpinned=True is the intentional escape hatch.
Short-circuits silently when the canonical file is missing (CI
without docker volumes) — that's a separate failure mode caught
at game-load time.
scripts/hash_saves.py:
* New helper that hashes every canonical save and updates the matching
scenario YAML's save_sha256 field. --print for dry-run.
definitions/*.yaml:
* All 6 scenarios now carry pinned save_sha256 fields (the actual
hashes of the current docker/saves/*.rws files). Re-run
scripts/hash_saves.py after rebuilding saves to re-pin.
Tests: 394 pass (+3 new SHA validation tests). ruff + mypy strict clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A7. A single hung LLM call previously blocked the entire tick
because asyncio.gather has no per-task timeout. Now each deliberation
runs under a wait_for(role_timeout_s) wrapper; on timeout the orchestrator
emits a deliberation_timeout ERROR event, records the agent's empty
contribution in the deliberation log, and proceeds. The hung thread keeps
running in the background until the provider's own timeout cleans it up
(Python threads can't be force-killed) but the tick advances regardless.
Why it matters for the benchmark: cheap and premium models can be in the
same matrix only if tail latency is bounded. Without this, a stuck
30B/120B request stalls every other agent's runtime measurement.
config.py:
* RLEConfig.role_timeout_s: float = 60.0 (~8x the 7s avg deliberation
observed in docker-benchmark; tunable via .env or CLI override).
game_loop.py:
* _deliberate_agent_with_timeout(): new async wrapper, runs
_deliberate_agent in a worker thread under asyncio.wait_for.
Emits EventType.ERROR with error_type=deliberation_timeout +
timeout_s on TimeoutError; returns (agent, None).
* _deliberate_parallel(): collapsed the inline _run closure into the
new wrapper.
* _deliberate_sequential(): is now async and awaits the wrapper.
Sequential-mode dispatch in run_tick() now awaited.
* MapAnalyst dispatch also goes through the timeout wrapper.
Test: TestMultiAgent.test_role_timeout_emits_event_and_other_agents_continue
injects a 5s-sleep provider with role_timeout_s=0.05; asserts 7
deliberation_timeout events fire, merged plan has 0 actions, tick
completes successfully.
Tests: 395 pass (+1). ruff + mypy strict clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A8. run_benchmark.py already calls append_history (line 711); run_scenario.py never did. Combined with the fact that the user has been doing live smoke testing via run_scenario.py (not run_benchmark.py), results/benchmark_history.jsonl stayed at 1 byte even though dozens of runs were persisted in per-run subdirectories. Now run_scenario.py appends one history entry per --output run, tagged run_type: "scenario" so consumers can distinguish single-scenario runs from full benchmark batteries (which would tag run_type: "benchmark" when run_benchmark.py is updated; that's a separate trivial follow-up). Skips the history append when cost_tracker.num_calls == 0 (smoke tests where the LLM was never reached, e.g. today's broken nano-4b run that generated parse failures across the board because LM Studio was down). Those rows would skew the leaderboard's avg-score-per-model otherwise. Tests: 395 pass. ruff + mypy src/ clean. (Script-level mypy errors at run_scenario.py:171, :175 are pre-existing in untouched code.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan item A9. The live smoke test reported \$0.044 internal vs \$0.168
actual OpenRouter charge. Internal estimates depend on the public
/models pricing, which can diverge from billed cost due to provider
routing surcharges, BYOK markups, or stale prices. The tracker would
silently fall back to \$0/token if the model wasn't found or the API
was unreachable. Now those failure modes are visible, and operators
can plug in the actual price they're seeing on their bill.
cost_tracker.py:
* CostSnapshot gains prompt_price_per_token, completion_price_per_token,
and pricing_source (one of: openrouter_api / override / unknown).
"unknown" flags that estimated_cost_usd is \$0 because pricing
resolution failed — consumers should not aggregate it.
* create_cost_tracker() accepts prompt_price_override /
completion_price_override (per-token USD); when both are passed,
the live fetch is skipped entirely (source="override").
* Logs the resolved prices at INFO ("prompt=\$X.XX/MTok completion=
\$Y.YY/MTok source=...") and a WARNING when the fetch resolved
to \$0 for a model the user explicitly named.
CLI:
* --prompt-price-per-mtok / --completion-price-per-mtok on both
run_scenario.py and run_benchmark.py. Per-MTok is the standard
human-readable unit on OpenRouter's pricing page; the scripts
convert to per-token before handing to create_cost_tracker.
Tests: 398 pass (+3 new pricing-source / override tests; +1
nemotron-120B reconciliation test that pins the exact math from
the docker-benchmark cost recording). ruff + mypy strict clean.
Reconciliation against an actual billed amount is still a follow-up:
needs an authenticated query to OpenRouter's /credits or per-generation
endpoints. For now the override is the unblock.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase A of the observability-first restructuring plan. Eight commits, +963 lines of source, +56 lines of cost-tracker docs/tests. Goal: the next live benchmark run produces a fully reviewable, replayable artifact.
Stacked on #19 (fix/post-live-test-findings). Merge after #19.
ee52eaf*_deliberations.jsonlexport — was in docker-benchmark (Apr 12), lost in pr17-live. NewRLEGameLoop.deliberation_logpublic property; bothrun_benchmark.pyandrun_scenario.pynow write the JSONL.c958dbaEventType.DELIBERATIONpayload — events.jsonl now carries the full actions array + summary + reason text (was only latency/confidence/num_actions). Truncation constants_ACTION_REASON_CHARS=200,_PLAN_SUMMARY_CHARS=300,_PARSE_FAILURE_RAW_CHARS=500consolidated at module top.9e5dae5_last_raw_outputexported viaPROVIDER_CALLevents (4 KB head-truncated). Means a weird deliberation is grep-able without keeping the process alive.45cb9fb+da5e745SCORING_VERSION="1.0",file_sha256(), RIMAPI DLL hash via Workshop path lookup, RIMAPI fork commit lookup,--seedCLI flag. Plussave_sha256on every scenario YAML (all 6 now pinned) + loader validation.scripts/hash_saves.pyhelper.e2d6695asyncio.wait_for. Hung deliberation emitsdeliberation_timeoutERROR event and the tick proceeds. Default 60s (~8× the docker-benchmark 7s avg). All three deliberation paths covered (parallel, sequential, MapAnalyst).83586bfrun_scenario.pynow appends toresults/benchmark_history.jsonlwithrun_type: "scenario"so single-scenario runs are distinguishable from full batteries. Skipped when no LLM was actually called (smoke tests).ed2720dpricing_source(openrouter_api/override/unknown) on every snapshot, plus the actual prices used. New--prompt-price-per-mtok/--completion-price-per-mtokCLI overrides for when OpenRouter's /models pricing diverges from billed cost.Test results
pytest— 398 passed (+13 new tests covering deliberation log, raw-output event, save SHA validation, metadata, timeout, pricing source, overrides). ruff + mypy strict clean.Why this matters for the benchmark
The original review found that pr17-live (the post-PR-#17 live run) had:
benchmark_history.jsonldespite dozens of runs persisted to subdirsAfter this PR, a fresh single-scenario run produces:
<scenario>.csv(per-tick time series)<scenario>_deliberations.jsonl(full agent rationale per tick)<scenario>_summary.json(metadata + outcome + cost + event summary)events.jsonl(every tick event, with actions/reasons/raw LLM output inline)results/benchmark_history.jsonltaggedrun_type: "scenario"That's the artifact format Phase B (baseline calibration + first real long run) will produce.
What's still open in Phase A
ResponceBuilder.DetermineStatusCode. Low urgency since the RLE-sideActionOutcomeworkaround from Surface RIMAPI action errors to agents + fix load-settle race #19 is in place.Test plan
pytest(398 pass)ruff check src/ tests/ scripts/(clean)mypy src/(clean, 42 source files)python scripts/hash_saves.py --print(round-trips all 6 scenario saves)python scripts/run_scenario.py --helpshows--seed,--prompt-price-per-mtok,--completion-price-per-mtok🤖 Generated with Claude Code