feat(eval): foundations P0c — cost caps + wall-clock watchdog + save guards + cleanup by 7xuanlu · Pull Request #191 · 7xuanlu/origin

7xuanlu · 2026-05-25T09:36:35Z

Summary

Phase 0c of the eval-foundations refactor: additive types, a new save path, and loud errors in place of silently swallowed ones.

Builds on P0a (#178, merged 46a9703) + P0b (#190, merged 032ce63).

What changed

Cost caps

parse_eval_max_usd(value: Option<&str>) -> anyhow::Result<Option<f64>> in eval/anthropic.rs. Replaces silent unwrap_or(0.0) with explicit bails: parse-error / non-finite / <= 0 / > $10 (unless EVAL_I_REALLY_MEAN_IT=1). Both submit_batch + submit_batch_with_tool call sites converted.
RunCostTracker in new eval/cost.rs — AtomicU64 millicents accumulator with soft-fence cap. record_usd refunds the increment on cap-overage so total_usd() stays honest after a failed call.
reconcile_cost_usd(input_tokens, output_tokens) in eval/anthropic.rs — Haiku batch rates ($0.25/$1.25 per MTok). P1 will wire this into batch judge path.
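The reconciliation arithmetic is small enough to sketch. This is a minimal illustration using the rates quoted above, not the code in eval/anthropic.rs; the `u64` argument types and the constant names are assumptions.

```rust
/// Rates quoted in this PR description, in USD per million tokens.
/// The constant names are hypothetical.
const INPUT_USD_PER_MTOK: f64 = 0.25;
const OUTPUT_USD_PER_MTOK: f64 = 1.25;

/// Converts token counts into an estimated USD cost.
/// Assumption: token counts arrive as `u64`.
pub fn reconcile_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    let input = input_tokens as f64 * INPUT_USD_PER_MTOK / 1_000_000.0;
    let output = output_tokens as f64 * OUTPUT_USD_PER_MTOK / 1_000_000.0;
    input + output
}
```

Under these rates, one million input tokens plus one million output tokens reconcile to $1.50.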

Wall-clock

WallClockWatchdog in new eval/wall_clock.rs. start(cap) / start_with_check_interval(cap, check) / disabled() / from_env(). Reads EVAL_MAX_WALL_SECS (default 14400 = 4h). Spawns tokio task that flips is_exceeded() atomic when cap elapses. Uses log crate (origin-core convention; tracing not in deps).
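The flag-flipping idea can be sketched with the standard library alone. The PR spawns a tokio task; this sketch substitutes `std::thread` so it stays dependency-free, and it omits `from_env()` and the check interval. Treat it as an illustration of the shape, not the code in eval/wall_clock.rs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Sketch of a watchdog that flips a shared flag once `cap` has elapsed.
pub struct WallClockWatchdog {
    exceeded: Arc<AtomicBool>,
}

impl WallClockWatchdog {
    /// Starts a background timer. Stand-in: a plain thread instead of a tokio task.
    pub fn start(cap: Duration) -> Self {
        let exceeded = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&exceeded);
        // The thread is detached, so it can outlive the struct, which mirrors
        // the behavior noted in the adversarial review.
        thread::spawn(move || {
            thread::sleep(cap);
            flag.store(true, Ordering::Release);
        });
        Self { exceeded }
    }

    /// A watchdog that never fires.
    pub fn disabled() -> Self {
        Self { exceeded: Arc::new(AtomicBool::new(false)) }
    }

    /// True once the cap has elapsed.
    pub fn is_exceeded(&self) -> bool {
        self.exceeded.load(Ordering::Acquire)
    }
}
```

A runner would poll `is_exceeded()` between scenarios and stop cleanly when it returns true, rather than being killed mid-write.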

Save guards

save_full_report(&Path, &EvalReport) -> anyhow::Result<PathBuf> in eval/report.rs. Strict guards:
- env: Some(...) required (panic-with-message otherwise)
- All metric f64 fields finite (15 fields covered: 12 mandatory + 3 Option<f64>). Walks struct directly because serde_json silently maps NaN to null — a JSON-walk would miss the very value we're rejecting.
- Skip rate ≤ 5% when total_scenarios > 0
- enrichment_failures == 0 unless EVAL_ACCEPT_PARTIAL=1
Atomic write: <final>.tmp.<pid>.<nanos> in SAME directory as final, then rename. Same-filesystem guaranteed.
save_partial_report writes to partial/<runid>__<layer>__<task>__<variant>.json — never baselines/. Stamps truncated_reason on the report copy.
EvalReport extended with total_scenarios, skipped_scenarios: Vec<String>, enrichment_failures: usize, truncated_reason: Option<String> — all #[serde(default)].
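The guards and the atomic write described above can be sketched as follows. This is an assumption-laden illustration, not the code in eval/report.rs: the helper names, the `(name, Option<f64>)` slice, and the `std::io::Result` return type (instead of `anyhow::Result`) are all hypothetical.

```rust
use std::fs;
use std::io::Write;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns the name of the first metric that is NaN or infinite.
/// Checks the values directly, since a serialized NaN would already be `null`.
pub fn first_non_finite<'a>(fields: &[(&'a str, Option<f64>)]) -> Option<&'a str> {
    fields
        .iter()
        .find(|(_, v)| matches!(v, Some(x) if !x.is_finite()))
        .map(|(name, _)| *name)
}

/// Skip-rate guard: at most 5% skipped, enforced only when there are scenarios.
pub fn skip_rate_ok(total: usize, skipped: usize) -> bool {
    total == 0 || (skipped as f64 / total as f64) <= 0.05
}

/// Writes to `<final>.tmp.<pid>.<nanos>` next to the target, then renames.
/// Same directory means same filesystem, so the rename does not cross devices.
pub fn write_atomic(final_path: &Path, bytes: &[u8]) -> std::io::Result<PathBuf> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let name = final_path
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("report.json");
    let tmp = final_path.with_file_name(format!("{name}.tmp.{}.{nanos}", std::process::id()));
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    drop(file);
    fs::rename(&tmp, final_path)?;
    Ok(final_path.to_path_buf())
}
```

Readers of the final path see either the previous file or the complete new one, never a half-written report.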

Cleanup

eval/shared.rs: 9 let _ = ... swallow sites converted to if let Err(e) = ... { log::warn!(...) } with context (memory_id / entity_id / source_id + error). 2 sites simplified let _ = expr.await? → expr.await? (was just discarding success type). 6 best-effort filesystem I/O sites kept with explicit // best-effort: ... comments.
Behavioral fix bundled in cleanup: chunk_linked += 1 now increments only when update_memory_entity_id succeeds (previously it overcounted on silent failures).
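A small sketch of the converted pattern and the counter fix. The `update_memory_entity_id` body here is a hypothetical stand-in that fails on an empty id, `link_chunks` is an invented wrapper, and `eprintln!` stands in for `log::warn!` to keep the example dependency-free.

```rust
/// Hypothetical stand-in for the real update call: fails when either id is empty.
fn update_memory_entity_id(memory_id: &str, entity_id: &str) -> Result<(), String> {
    if memory_id.is_empty() || entity_id.is_empty() {
        return Err("empty id".to_string());
    }
    Ok(())
}

/// Links each (memory_id, entity_id) pair and returns how many links succeeded.
pub fn link_chunks(pairs: &[(&str, &str)]) -> usize {
    let mut chunk_linked = 0;
    for (memory_id, entity_id) in pairs {
        // Before: `let _ = update_memory_entity_id(..);` followed by an
        // unconditional `chunk_linked += 1`, which also counted failures.
        if let Err(e) = update_memory_entity_id(memory_id, entity_id) {
            // Stand-in for log::warn! with the same context fields.
            eprintln!("warn: link failed memory_id={memory_id} entity_id={entity_id}: {e}");
            continue;
        }
        chunk_linked += 1;
    }
    chunk_linked
}
```

With one failing pair out of three, the old pattern would have reported 3 linked chunks; this version reports 2 and logs the failure.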

Adversarial review (3 NITs, all non-blocking)

RunCostTracker::record_usd saturating cast on pathological huge USD inputs — defensible per soft-fence semantics; parse_eval_max_usd caps inputs to $10 unless override.
WallClockWatchdog tokio task outlives struct drop — documented as acceptable per spec.
save_full_report tmp filename (pid + nanos) could collide if the same process makes concurrent calls within the same nanosecond — practically impossible given macOS clock resolution.

Test plan

cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings → clean
cargo test -p origin-core --lib --features eval-harness → 1160 lib tests pass
cargo test -p origin-core --test eval_cost_caps --features eval-harness → 15 tests pass (6 parse + 5 tracker + 1 reconcile + 3 watchdog)
cargo test -p origin-core --test eval_save_guards --features eval-harness → 7 tests pass (4 original + 3 Option regression)
Pre-existing failures unchanged: eval::retrieval::tests::test_multi_turn_eval (FastEmbed network) + cmd_backfill::tests::check_service_unloaded_returns_ok_when_no_service_installed (env-specific to dev machines with daemon installed)

Follow-ups for P1+

reconcile_cost_usd + RunCostTracker get wired through answer_quality.rs batch judge in P1 Task 4
save_full_report becomes the canonical save path for L1 baselines in P1 Task 4
WallClockWatchdog::from_env() invoked at L1 / L2 runner start in P1 / P2
EVAL_MAX_USD_RUN (cumulative) cap — declared via RunCostTracker::new(parse_eval_max_usd(env::var("EVAL_MAX_USD_RUN").ok().as_deref())?) in P1 orchestration

🤖 Generated with Claude Code

Add parse_eval_max_usd() with explicit failure modes: garbage input, non-finite, <= 0, and > $10 without EVAL_I_REALLY_MEAN_IT=1. Replace both unwrap_or(0.0) sites in submit_batch and submit_batch_with_tool. 6 new tests in eval_cost_caps.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
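The failure modes listed in this commit can be sketched as below. This is not the code in eval/anthropic.rs: it returns `Result<_, String>` instead of `anyhow::Result`, the split into a pure `parse_max_usd` helper is hypothetical (it keeps the logic testable without touching the environment), and treating an unset value as "no cap" is an assumption.

```rust
/// Ceiling above which the override flag is required (from the PR text).
const HARD_CAP_USD: f64 = 10.0;

/// Pure core. `really_mean_it` stands in for `EVAL_I_REALLY_MEAN_IT=1`.
pub fn parse_max_usd(value: Option<&str>, really_mean_it: bool) -> Result<Option<f64>, String> {
    // Assumption: an unset value means no cap was configured.
    let Some(raw) = value else { return Ok(None) };
    let usd: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("max-USD value {raw:?} is not a number: {e}"))?;
    if !usd.is_finite() {
        return Err(format!("max-USD value {raw:?} must be finite"));
    }
    if usd <= 0.0 {
        return Err(format!("max-USD value {raw:?} must be > 0"));
    }
    if usd > HARD_CAP_USD && !really_mean_it {
        return Err(format!(
            "max-USD value {raw:?} exceeds ${HARD_CAP_USD}; set EVAL_I_REALLY_MEAN_IT=1 to allow it"
        ));
    }
    Ok(Some(usd))
}

/// Thin wrapper that reads the override flag from the process environment.
pub fn parse_eval_max_usd(value: Option<&str>) -> Result<Option<f64>, String> {
    let really_mean_it = std::env::var("EVAL_I_REALLY_MEAN_IT").as_deref() == Ok("1");
    parse_max_usd(value, really_mean_it)
}
```

Note that `"NaN"` and `"inf"` parse successfully as `f64`, which is why the finiteness check is separate from the parse-error check.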

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… cap On cap-exceeded, fetch_sub refunds the increment so total_usd() reflects only successful spend. Negative/non-finite cap_usd saturated to 0 via .max(0.0) cast with debug_assert for visibility. Two regression tests added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
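The refund-on-overage behavior can be illustrated with a minimal sketch. It is not the code in eval/cost.rs: the error type, the helper name `to_millicents`, and the memory ordering are assumptions, and the `debug_assert` mentioned above is omitted.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// 1 USD = 100 cents = 100_000 millicents.
const MILLICENTS_PER_USD: f64 = 100_000.0;

/// Negative or NaN inputs clamp to 0; float-to-int `as` casts saturate.
fn to_millicents(usd: f64) -> u64 {
    (usd.max(0.0) * MILLICENTS_PER_USD).round() as u64
}

pub struct RunCostTracker {
    spent_millicents: AtomicU64,
    cap_millicents: Option<u64>,
}

impl RunCostTracker {
    /// `None` means no cumulative cap.
    pub fn new(cap_usd: Option<f64>) -> Self {
        Self {
            spent_millicents: AtomicU64::new(0),
            cap_millicents: cap_usd.map(to_millicents),
        }
    }

    /// Adds `usd` to the running total, or refunds it and errors if the cap is exceeded.
    pub fn record_usd(&self, usd: f64) -> Result<(), String> {
        let inc = to_millicents(usd);
        let prev = self.spent_millicents.fetch_add(inc, Ordering::SeqCst);
        if let Some(cap) = self.cap_millicents {
            if prev.saturating_add(inc) > cap {
                // Refund so total_usd() reflects only accepted spend.
                self.spent_millicents.fetch_sub(inc, Ordering::SeqCst);
                return Err(format!("cost cap exceeded: recording ${usd} would pass the cap"));
            }
        }
        Ok(())
    }

    pub fn total_usd(&self) -> f64 {
        self.spent_millicents.load(Ordering::SeqCst) as f64 / MILLICENTS_PER_USD
    }
}
```

The fence is "soft" because a concurrent reader can briefly observe the total above the cap between the `fetch_add` and the refund.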

Add reconcile_cost_usd(input_tokens, output_tokens) -> f64 to eval::anthropic using Claude 3.5 Haiku batch-discounted pricing ($0.25/MTok input, $1.25/MTok output). Companion test in eval_cost_caps verifies all four cases (input-only, output-only, mixed, zero). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds save_full_report() in eval/report.rs writing full EvalReport to layered path via encode_baseline_path. Guards: env required, finite metrics, skip ≤5%, enrichment_failures==0 unless EVAL_ACCEPT_PARTIAL=1. Atomic same-dir tmp+rename. Adds save_partial_report() to partial/ dir with truncated_reason stamp. EvalReport extended with total_scenarios, skipped_scenarios, enrichment_failures, truncated_reason — all additive #[serde(default)]. NaN detection via first_non_finite_field() (serde_json maps NaN to null). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

9 sites converted to if-let-Err with log::warn! and full context (entity/memory IDs, error). 2 let _ = expr.await? reduced to plain expr.await? (error already propagated via ?; let _ = was discarding only the success usize). 6 filesystem I/O sites kept as let _ = with explicit // best-effort: comments (create_dir_all, writeln, flush on cache files). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

7xuanlu and others added 9 commits May 25, 2026 01:55

test(eval): serialize EVAL_I_REALLY_MEAN_IT env touches via static Mutex

a7d2d8e

Mirrors the pattern from PR #160 (eval_harness.rs:3770). Without a lock, parallel test execution races on the shared process env. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
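One way to serialize environment access across parallel tests is a process-wide lock held for the duration of each env-touching test. The helper below is a hypothetical sketch, not the pattern at eval_harness.rs:3770; `ENV_LOCK` and `with_env_var` are invented names.

```rust
use std::sync::Mutex;

/// Process-wide lock that every env-touching test takes first.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Sets (or removes) `key` while holding the lock, runs `f`, then restores the old value.
pub fn with_env_var<T>(key: &str, value: Option<&str>, f: impl FnOnce() -> T) -> T {
    // Recover from poisoning so one panicking test does not cascade into others.
    let _guard = ENV_LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    let previous = std::env::var(key).ok();
    // Mutating the environment is `unsafe` in the 2024 edition; the lock is what
    // makes it sound here, provided all env access in tests goes through it.
    unsafe {
        match value {
            Some(v) => std::env::set_var(key, v),
            None => std::env::remove_var(key),
        }
    }
    let out = f();
    unsafe {
        match previous {
            Some(v) => std::env::set_var(key, v),
            None => std::env::remove_var(key),
        }
    }
    out
}
```

The lock only helps if every test that reads or writes the variable goes through it; a single unguarded `env::var` call elsewhere can still race.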

feat(eval): RunCostTracker for cumulative EVAL_MAX_USD_RUN cap

b8a8a93

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(eval): WallClockWatchdog for EVAL_MAX_WALL_SECS cap

9048abd

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(eval): cover 3 missing Option<f64> metric fields in non-finite guard

2fcf1d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

7xuanlu wants to merge 9 commits into main from worktree-feature+eval-foundations-p0c
