feat(eval): foundations P0c — cost caps + wall-clock watchdog + save guards + cleanup#191
Open
7xuanlu wants to merge 9 commits into
Add parse_eval_max_usd() with explicit failure modes: garbage input, non-finite, <= 0, and > $10 without EVAL_I_REALLY_MEAN_IT=1. Replace both unwrap_or(0.0) sites in submit_batch and submit_batch_with_tool. 6 new tests in eval_cost_caps.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
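The failure modes listed in this commit could be sketched roughly as follows. This is a std-only illustration, not the code from the PR: the name `parse_max_usd_sketch`, the `String` error type (the real function returns `anyhow::Result`), the `really_mean_it` flag standing in for `EVAL_I_REALLY_MEAN_IT=1`, and the choice to treat an unset value as "no cap" are all assumptions.

```rust
/// Hypothetical stand-in for `parse_eval_max_usd`. Uses `String` errors so the
/// sketch needs no external crates; `really_mean_it` models the
/// `EVAL_I_REALLY_MEAN_IT=1` override.
pub fn parse_max_usd_sketch(
    value: Option<&str>,
    really_mean_it: bool,
) -> Result<Option<f64>, String> {
    // Assumption: an unset value means no cap was configured.
    let Some(raw) = value else {
        return Ok(None);
    };
    // Garbage input fails loudly instead of silently becoming 0.0.
    let usd = raw
        .trim()
        .parse::<f64>()
        .map_err(|e| format!("cap {raw:?} is not a number: {e}"))?;
    // "NaN" and "inf" parse successfully as f64, so reject them explicitly.
    if !usd.is_finite() {
        return Err(format!("cap {raw:?} is not finite"));
    }
    if usd <= 0.0 {
        return Err(format!("cap {raw:?} must be greater than 0"));
    }
    if usd > 10.0 && !really_mean_it {
        return Err(format!("cap {raw:?} exceeds $10 without the override"));
    }
    Ok(Some(usd))
}
```

A caller would propagate the error with `?` rather than falling back to a default, which is the behavior change the commit describes.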
Mirrors the pattern from PR #160 (eval_harness.rs:3770). Without a lock, parallel test execution races on the shared process env. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… cap On cap-exceeded, fetch_sub refunds the increment so total_usd() reflects only successful spend. Negative/non-finite cap_usd saturated to 0 via .max(0.0) cast with debug_assert for visibility. Two regression tests added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
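A minimal sketch of the refund-on-overage idea, assuming an `AtomicU64` millicents counter as described in the PR summary. The type name `CostTrackerSketch`, the `Result<(), String>` return type, and the strict `>` comparison against the cap are assumptions made for illustration; the real `RunCostTracker` may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// 1 USD = 100 cents = 100_000 millicents.
const MILLICENTS_PER_USD: f64 = 100_000.0;

/// Hypothetical stand-in for `RunCostTracker`.
pub struct CostTrackerSketch {
    spent_millicents: AtomicU64,
    cap_millicents: u64,
}

impl CostTrackerSketch {
    pub fn new(cap_usd: f64) -> Self {
        // `f64::max` returns the non-NaN operand, so negative and NaN caps
        // become 0; the float-to-int `as` cast saturates rather than wrapping.
        let cap_millicents = (cap_usd.max(0.0) * MILLICENTS_PER_USD) as u64;
        Self {
            spent_millicents: AtomicU64::new(0),
            cap_millicents,
        }
    }

    /// Adds the spend optimistically, then refunds it if the cap is exceeded.
    pub fn record_usd(&self, usd: f64) -> Result<(), String> {
        let delta = (usd.max(0.0) * MILLICENTS_PER_USD).round() as u64;
        let previous = self.spent_millicents.fetch_add(delta, Ordering::SeqCst);
        if previous.saturating_add(delta) > self.cap_millicents {
            // Refund so the total reflects only successful spend.
            self.spent_millicents.fetch_sub(delta, Ordering::SeqCst);
            return Err(format!("cost cap exceeded by recording ${usd}"));
        }
        Ok(())
    }

    pub fn total_usd(&self) -> f64 {
        self.spent_millicents.load(Ordering::SeqCst) as f64 / MILLICENTS_PER_USD
    }
}
```

Because the increment and the refund are separate atomic operations, concurrent callers can briefly observe a total above the cap, which matches the "soft-fence" wording used elsewhere in this PR.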
Add reconcile_cost_usd(input_tokens, output_tokens) -> f64 to eval::anthropic using Claude 3.5 Haiku batch-discounted pricing ($0.25/MTok input, $1.25/MTok output). Companion test in eval_cost_caps verifies all four cases (input-only, output-only, mixed, zero). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
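The arithmetic behind this commit can be illustrated as below, using the two per-MTok rates quoted in the message. The function name `reconcile_cost_usd_sketch` and the `u64` parameter types are assumptions; only the rates come from the commit text.

```rust
/// Rates quoted in the commit message, in USD per million tokens.
const INPUT_USD_PER_MTOK: f64 = 0.25;
const OUTPUT_USD_PER_MTOK: f64 = 1.25;
const TOKENS_PER_MTOK: f64 = 1_000_000.0;

/// Hypothetical stand-in for `reconcile_cost_usd`: converts token counts
/// into a USD cost at the rates above.
pub fn reconcile_cost_usd_sketch(input_tokens: u64, output_tokens: u64) -> f64 {
    let input_usd = input_tokens as f64 * INPUT_USD_PER_MTOK;
    let output_usd = output_tokens as f64 * OUTPUT_USD_PER_MTOK;
    (input_usd + output_usd) / TOKENS_PER_MTOK
}
```

For example, 2,000,000 input tokens and 1,000,000 output tokens come to 2 × $0.25 + 1 × $1.25 = $1.75 under these rates.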
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds save_full_report() in eval/report.rs writing full EvalReport to layered path via encode_baseline_path. Guards: env required, finite metrics, skip ≤5%, enrichment_failures==0 unless EVAL_ACCEPT_PARTIAL=1. Atomic same-dir tmp+rename. Adds save_partial_report() to partial/ dir with truncated_reason stamp. EvalReport extended with total_scenarios, skipped_scenarios, enrichment_failures, truncated_reason — all additive #[serde(default)]. NaN detection via first_non_finite_field() (serde_json maps NaN to null). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
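Two of the mechanisms named in this commit, the typed non-finite check and the same-directory tmp+rename write, might look roughly like this. It is a std-only sketch under stated assumptions: `first_non_finite` here takes a flat list of hypothetical `(name, value)` pairs rather than walking the real `EvalReport`, and `write_atomic` is an invented helper name.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

/// Returns the name of the first field holding NaN or an infinity.
/// Checking typed values matters because NaN serializes to JSON null,
/// which is indistinguishable from an absent `Option<f64>`.
pub fn first_non_finite<'a>(fields: &[(&'a str, Option<f64>)]) -> Option<&'a str> {
    fields
        .iter()
        .find(|(_, v)| matches!(v, Some(x) if !x.is_finite()))
        .map(|(name, _)| *name)
}

/// Writes `bytes` to `<final>.tmp.<pid>.<nanos>` next to `final_path`,
/// then renames it into place. Same directory implies same filesystem,
/// which is what lets the rename replace the target in one step.
pub fn write_atomic(final_path: &Path, bytes: &[u8]) -> io::Result<PathBuf> {
    let file_name = final_path
        .file_name()
        .and_then(|n| n.to_str())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no UTF-8 file name"))?;
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let tmp = final_path.with_file_name(format!(
        "{file_name}.tmp.{}.{nanos}",
        std::process::id()
    ));
    fs::write(&tmp, bytes)?;
    if let Err(e) = fs::rename(&tmp, final_path) {
        // best-effort: do not leave the tmp file behind on failure
        let _ = fs::remove_file(&tmp);
        return Err(e);
    }
    Ok(final_path.to_path_buf())
}
```

In this sketch a caller would run `first_non_finite` over the report's metrics and bail before serializing, so a rejected report never reaches `write_atomic`.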
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9 sites converted to if-let-Err with log::warn! and full context (entity/memory IDs, error). 2 let _ = expr.await? reduced to plain expr.await? (error already propagated via ?; let _ = was discarding only the success usize). 6 filesystem I/O sites kept as let _ = with explicit // best-effort: comments (create_dir_all, writeln, flush on cache files). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Phase 0c of the eval-foundations refactor. Additive types + new save path + loud error replacement.
Builds on P0a (#178, merged 46a9703) + P0b (#190, merged 032ce63).
What changed
Cost caps
- `parse_eval_max_usd(value: Option<&str>) -> anyhow::Result<Option<f64>>` in `eval/anthropic.rs`. Replaces the silent `unwrap_or(0.0)` with explicit bails on parse error, non-finite values, `<= 0`, and `> $10` (unless `EVAL_I_REALLY_MEAN_IT=1`). Both the `submit_batch` and `submit_batch_with_tool` call sites are converted.
- `RunCostTracker` in new `eval/cost.rs`: an `AtomicU64` millicents accumulator with a soft-fence cap. `record_usd` refunds the increment on cap overage so `total_usd()` reflects only successful spend after a failed call.
- `reconcile_cost_usd(input_tokens, output_tokens)` in `eval/anthropic.rs`: Haiku batch rates ($0.25/$1.25 per MTok for input/output). P1 will wire this into the batch judge path.
Wall-clock
- `WallClockWatchdog` in new `eval/wall_clock.rs`, with constructors `start(cap)`, `start_with_check_interval(cap, check)`, `disabled()`, and `from_env()`.
- Reads `EVAL_MAX_WALL_SECS` (default 14400 = 4h).
- Spawns a tokio task that flips the `is_exceeded()` atomic when the cap elapses.
- Uses the `log` crate (origin-core convention; `tracing` is not in deps).
Save guards
- `save_full_report(&Path, &EvalReport) -> anyhow::Result<PathBuf>` in `eval/report.rs`. Strict guards:
  - `env: Some(...)` required (panic-with-message otherwise).
  - Finite metrics (including `Option<f64>`). Walks the struct directly because `serde_json` silently maps NaN to null, so a JSON walk would miss the very value being rejected.
  - `total_scenarios > 0`.
  - `enrichment_failures == 0` unless `EVAL_ACCEPT_PARTIAL=1`.
- Atomic write: the report goes to `<final>.tmp.<pid>.<nanos>` in the same directory as the final path, followed by `rename`. Same filesystem guaranteed.
- `save_partial_report` writes to `partial/<runid>__<layer>__<task>__<variant>.json`, never to `baselines/`. Stamps `truncated_reason` on the report copy.
- `EvalReport` extended with `total_scenarios`, `skipped_scenarios: Vec<String>`, `enrichment_failures: usize`, and `truncated_reason: Option<String>`, all `#[serde(default)]`.
Cleanup
- `eval/shared.rs`: 9 `let _ = ...` swallow sites converted to `if let Err(e) = ... { log::warn!(...) }` with context (memory_id / entity_id / source_id + error).
- 2 sites simplified from `let _ = expr.await?` to `expr.await?` (the `let _ =` was only discarding the success value).
- 6 best-effort filesystem I/O sites kept with explicit `// best-effort: ...` comments.
- `chunk_linked += 1` now increments only on a successful `update_memory_entity_id` (previously it overcounted on silent failures).
Adversarial review (3 NITs, all non-blocking)
- `RunCostTracker::record_usd` uses a saturating cast on pathologically large USD inputs. Defensible under soft-fence semantics; `parse_eval_max_usd` caps inputs to $10 unless overridden.
- The `WallClockWatchdog` tokio task outlives the struct's drop. Documented as acceptable per spec.
- The `save_full_report` tmp filename (`pid + nanos`) could collide on same-nanosecond concurrent calls from the same process. Practically impossible at macOS clock resolution.
Test plan
- `cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings` → clean
- `cargo test -p origin-core --lib --features eval-harness` → 1160 lib tests pass
- `cargo test -p origin-core --test eval_cost_caps --features eval-harness` → 15 tests pass (6 parse + 5 tracker + 1 reconcile + 3 watchdog)
- `cargo test -p origin-core --test eval_save_guards --features eval-harness` → 7 tests pass (4 original + 3 Option regression)
- `eval::retrieval::tests::test_multi_turn_eval` (FastEmbed network) + `cmd_backfill::tests::check_service_unloaded_returns_ok_when_no_service_installed` (env-specific to dev machines with daemon installed)
Follow-ups for P1+
- `reconcile_cost_usd` + `RunCostTracker` get wired through the `answer_quality.rs` batch judge in P1 Task 4.
- `save_full_report` becomes the canonical save path for L1 baselines in P1 Task 4.
- `WallClockWatchdog::from_env()` is invoked at L1 / L2 runner start in P1 / P2.
- `EVAL_MAX_USD_RUN` (cumulative) cap: declared via `RunCostTracker::new(parse_eval_max_usd(env::var("EVAL_MAX_USD_RUN").ok().as_deref())?)` in P1 orchestration.
🤖 Generated with Claude Code