Skip to content

feat(eval): foundations P0a — additive types + LatencySummary wiring#178

Open
7xuanlu wants to merge 13 commits into
mainfrom
worktree-feature+eval-foundations-p0a
Open

feat(eval): foundations P0a — additive types + LatencySummary wiring#178
7xuanlu wants to merge 13 commits into
mainfrom
worktree-feature+eval-foundations-p0a

Conversation

@7xuanlu
Copy link
Copy Markdown
Owner

@7xuanlu 7xuanlu commented May 24, 2026

Summary

Phase 0a of the eval-foundations refactor. Pure additive type extension; zero behavior change for current callers. Foundation for P0b (paths + port discovery), P0c (cost caps + save guards), P1 (L1 baselines), P2 (L2 live-daemon), P3 (L3 MCP contract tests).

What changed

  • EvalLayer enum (L1Db / L2Http / L3Mcp / L4Skill) in new eval/layer.rs. L4Skill reserved.
  • ReportEnv extended with 22 additive fields: layer, task, variant, embed_dim, similarity_fn_name, judge_model_id, mcp_schema_hash, skill_prompt_hash, schema_version, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations, eval_max_usd_baseline_cap, eval_max_usd_run_cap, eval_max_wall_secs_cap, total_cost_usd, total_wall_secs. All Option<T> or #[serde(default)] — legacy JSON still deserializes.
  • LatencySummary wired in eval/runner.rs (was hardcoded None); collected via std::time::Instant per query, summarized post-loop. from_micros() helper added to eval/latency.rs.
  • fixture_set_hash() in eval/fixtures.rs for directory-wide TOML hashing (sort + per-file inner sha256 + outer hash, sha256[..16]).
  • build_locomo_env() + build_lme_env() helpers populate the new env fields in all locomo + lme runners (including _with_gate for both).
  • Cargo feature eval-harness gates tests/eval_harness.rs compilation so unrelated PRs don't break on eval refactors.
  • pub const SCHEMA_VERSION: u32 = 51 exposed in db.rs (eval cache invalidation key for P1).
  • fs2 workspace dep added (used by P0c/P1 for scenario.db file locks).
  • Hand-rolled Default for ReportEnv mirrors serde default = "fn" attributes (#[derive(Default)] doesn't honor those — caught by adversarial review).

Honest deviations from plan (all approved in review)

  1. Plan listed origin_version: Option<String> as new — already exists as String. Omitted duplicate.
  2. Plan listed llm_provider_kind / llm_model_id as MTEB-aligned new field names — these don't exist on actual ReportEnv. Used legacy names (llm_provider_class, llm_model). Note for P0b: comparable_env_hash must hash the legacy names.
  3. Roundtrip back-compat test had to include all 9 pre-existing required fields in legacy JSON (the plan's minimal { "fixture_revision": ... } would've failed required-field deserialization).
  4. Both _with_gate runners converted to use build_*_env() helpers — adversarial review caught locomo _with_gate initially using ..Default::default() which silently produced inconsistent env state. Fixed in bed96dd9.

Test plan

  • cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings → clean
  • cargo test --workspace --features origin-core/eval-harness → 1152 passed, 1 pre-existing failure (eval::retrieval::tests::test_multi_turn_eval — FastEmbed model network download, baseline carried, unrelated to P0a)
  • cargo test --workspace (default features, no eval-harness) → eval_harness.rs gated out, no regression
  • Existing baseline JSON files still deserialize via #[serde(default)] on new fields
  • ReportEnv::default() returns same values as serde_json::from_str of a legacy-shape JSON — drift-catcher test added in bed96dd9

Follow-ups for P0b+

  • fs2 dep added but unused at P0a; P0c/P1 use it for scenario.lock.
  • migrations_hash field reads option_env!(\"ORIGIN_MIGRATIONS_HASH\")None until P1's build.rs sets it.
  • LatencySummary truncates sub-ms to 0ms (ms-resolution storage). Acceptable for current callers; revisit if sub-ms percentile needed.

Adversarial review

Full review of integrated impl ran before merge per CLAUDE.md "Code review before merge" rule. Two IMPORTANT findings fixed in bed96dd9:

  1. locomo _with_gate runner converted to build_locomo_env
  2. #[derive(Default)] replaced with hand-rolled impl Default that mirrors serde defaults

🤖 Generated with Claude Code

7xuanlu and others added 13 commits May 24, 2026 15:02
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer nit: pub use layer::EvalLayer; was visually attached to the
backward-compat alias's doc comment. Add blank line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `pub fn fixture_set_hash(dir: &Path) -> anyhow::Result<String>` that
hashes a directory of TOML files into a stable 16-char hex digest. Files
are sorted by path before hashing so the result is independent of filesystem
ordering. Three TDD tests cover: stability across invocations, sensitivity
to content changes, and ordering independence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, caps

Add 22 additive fields to ReportEnv: layer (EvalLayer), task/variant
identifiers, embed_dim, similarity_fn_name, judge_model_id,
mcp_schema_hash, skill_prompt_hash, schema_version, schema_db_version,
migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha,
warmup_iterations, eval_max_usd_baseline_cap, eval_max_usd_run_cap,
eval_max_wall_secs_cap, total_cost_usd, total_wall_secs.

All new fields carry #[serde(default)] so existing baseline JSON files
deserialize without change. Fix 9 struct literal call sites in locomo.rs
and longmemeval.rs to use ..Default::default() spread.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add build_locomo_env() helper that fills both the legacy 9 ReportEnv
fields and all 22 new P0a additive fields (layer, task, variant,
embed_dim, similarity_fn_name, judge_model_id, schema_db_version,
migrations_hash, n_runs, is_single_run, run_id, timestamp_utc,
git_sha, warmup_iterations). Replace the three inline ReportEnv
literal blocks in run_locomo_eval / run_locomo_eval_reranked /
run_locomo_eval_expanded with calls to the helper.

Also add pub const SCHEMA_VERSION: u32 = 51 to db.rs (highest
user_version applied by migrate()); used as schema_db_version in env.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add build_lme_env() helper mirroring build_locomo_env() with task="lme".
Replace 4 inline ReportEnv literal blocks in run_longmemeval_eval,
run_longmemeval_eval_reranked, run_longmemeval_eval_expanded, and
run_longmemeval_eval_with_gate with calls to the helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_harness.rs:3440 had an inline ReportEnv literal that fell out of date
when Task 5 extended the struct (the file is feature-gated so cargo check
without --features didn't catch it). Add ..ReportEnv::default() spread.

eval_report_roundtrip.rs: replace assert_eq\!(bool, literal) with assert\!()
per clippy bool_assert_comparison lint that fires under -D warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…runner

Replace #[derive(Default)] on ReportEnv with a hand-rolled impl that mirrors
the serde default= attributes (similarity_fn_name="cosine", schema_version=1,
n_runs=1). The derived impl produced 0/"" for those fields, inconsistent with
deserialized-from-legacy-JSON state.

Convert run_locomo_eval_with_gate to call build_locomo_env("gated", ...)
instead of an inline ReportEnv literal with ..Default::default(), matching
the equivalent pattern in longmemeval's _with_gate runner.

Add report_env_default_matches_serde_defaults test to eval_report_roundtrip.rs
to catch any future drift.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant