feat(eval): foundations P0a — additive types + LatencySummary wiring by 7xuanlu · Pull Request #178 · 7xuanlu/origin

7xuanlu · 2026-05-24T23:03:11Z

Summary

Phase 0a of the eval-foundations refactor. Pure additive type extension; zero behavior change for current callers. Foundation for P0b (paths + port discovery), P0c (cost caps + save guards), P1 (L1 baselines), P2 (L2 live-daemon), P3 (L3 MCP contract tests).

What changed

EvalLayer enum (L1Db / L2Http / L3Mcp / L4Skill) in new eval/layer.rs. L4Skill reserved.
ReportEnv extended with 22 additive fields: layer, task, variant, embed_dim, similarity_fn_name, judge_model_id, mcp_schema_hash, skill_prompt_hash, schema_version, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations, eval_max_usd_baseline_cap, eval_max_usd_run_cap, eval_max_wall_secs_cap, total_cost_usd, total_wall_secs. All Option<T> or #[serde(default)] — legacy JSON still deserializes.
LatencySummary wired in eval/runner.rs (was hardcoded None); collected via std::time::Instant per query, summarized post-loop. from_micros() helper added to eval/latency.rs.
fixture_set_hash() in eval/fixtures.rs for directory-wide TOML hashing (sort + per-file inner sha256 + outer hash, sha256[..16]).
build_locomo_env() + build_lme_env() helpers populate the new env fields in all locomo + lme runners (including _with_gate for both).
Cargo feature eval-harness gates tests/eval_harness.rs compilation so unrelated PRs don't break on eval refactors.
pub const SCHEMA_VERSION: u32 = 51 exposed in db.rs (eval cache invalidation key for P1).
fs2 workspace dep added (used by P0c/P1 for scenario.db file locks).
Hand-rolled Default for ReportEnv mirrors serde default = "fn" attributes (#[derive(Default)] doesn't honor those — caught by adversarial review).

Honest deviations from plan (all approved in review)

Plan listed origin_version: Option<String> as new — already exists as String. Omitted duplicate.
Plan listed llm_provider_kind / llm_model_id as MTEB-aligned new field names — these don't exist on actual ReportEnv. Used legacy names (llm_provider_class, llm_model). Note for P0b: comparable_env_hash must hash the legacy names.
Roundtrip back-compat test had to include all 9 pre-existing required fields in legacy JSON (the plan's minimal { "fixture_revision": ... } would've failed required-field deserialization).
Both _with_gate runners converted to use build_*_env() helpers — adversarial review caught locomo _with_gate initially using ..Default::default() which silently produced inconsistent env state. Fixed in bed96dd9.

Test plan

cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings → clean
cargo test --workspace --features origin-core/eval-harness → 1152 passed, 1 pre-existing failure (eval::retrieval::tests::test_multi_turn_eval — FastEmbed model network download, baseline carried, unrelated to P0a)
cargo test --workspace (default features, no eval-harness) → eval_harness.rs gated out, no regression
Existing baseline JSON files still deserialize via #[serde(default)] on new fields
ReportEnv::default() returns same values as serde_json::from_str of a legacy-shape JSON — drift-catcher test added in bed96dd9

Follow-ups for P0b+

fs2 dep added but unused at P0a; P0c/P1 use it for scenario.lock.
migrations_hash field reads option_env!(\"ORIGIN_MIGRATIONS_HASH\") — None until P1's build.rs sets it.
LatencySummary truncates sub-ms to 0ms (ms-resolution storage). Acceptable for current callers; revisit if sub-ms percentile needed.

Adversarial review

Full review of integrated impl ran before merge per CLAUDE.md "Code review before merge" rule. Two IMPORTANT findings fixed in bed96dd9:

locomo _with_gate runner converted to build_locomo_env
#[derive(Default)] replaced with hand-rolled impl Default that mirrors serde defaults

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Reviewer nit: pub use layer::EvalLayer; was visually attached to the backward-compat alias's doc comment. Add blank line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `pub fn fixture_set_hash(dir: &Path) -> anyhow::Result<String>` that hashes a directory of TOML files into a stable 16-char hex digest. Files are sorted by path before hashing so the result is independent of filesystem ordering. Three TDD tests cover: stability across invocations, sensitivity to content changes, and ordering independence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…, caps Add 22 additive fields to ReportEnv: layer (EvalLayer), task/variant identifiers, embed_dim, similarity_fn_name, judge_model_id, mcp_schema_hash, skill_prompt_hash, schema_version, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations, eval_max_usd_baseline_cap, eval_max_usd_run_cap, eval_max_wall_secs_cap, total_cost_usd, total_wall_secs. All new fields carry #[serde(default)] so existing baseline JSON files deserialize without change. Fix 9 struct literal call sites in locomo.rs and longmemeval.rs to use ..Default::default() spread. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…one) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add build_locomo_env() helper that fills both the legacy 9 ReportEnv fields and all 22 new P0a additive fields (layer, task, variant, embed_dim, similarity_fn_name, judge_model_id, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations). Replace the three inline ReportEnv literal blocks in run_locomo_eval / run_locomo_eval_reranked / run_locomo_eval_expanded with calls to the helper. Also add pub const SCHEMA_VERSION: u32 = 51 to db.rs (highest user_version applied by migrate()); used as schema_db_version in env. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add build_lme_env() helper mirroring build_locomo_env() with task="lme". Replace 4 inline ReportEnv literal blocks in run_longmemeval_eval, run_longmemeval_eval_reranked, run_longmemeval_eval_expanded, and run_longmemeval_eval_with_gate with calls to the helper. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eval_harness.rs:3440 had an inline ReportEnv literal that fell out of date when Task 5 extended the struct (the file is feature-gated so cargo check without --features didn't catch it). Add ..ReportEnv::default() spread. eval_report_roundtrip.rs: replace assert_eq\!(bool, literal) with assert\!() per clippy bool_assert_comparison lint that fires under -D warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…runner Replace #[derive(Default)] on ReportEnv with a hand-rolled impl that mirrors the serde default= attributes (similarity_fn_name="cosine", schema_version=1, n_runs=1). The derived impl produced 0/"" for those fields, inconsistent with deserialized-from-legacy-JSON state. Convert run_locomo_eval_with_gate to call build_locomo_env("gated", ...) instead of an inline ReportEnv literal with ..Default::default(), matching the equivalent pattern in longmemeval's _with_gate runner. Add report_env_default_matches_serde_defaults test to eval_report_roundtrip.rs to catch any future drift. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

7xuanlu and others added 13 commits May 24, 2026 15:02

feat(eval): gate eval-harness tests behind cargo feature

549c03f

feat(eval): add fs2 dependency for scenario.db lock files

309666f

feat(eval): add EvalLayer enum (L1Db/L2Http/L3Mcp/L4Skill)

e2a22d5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

style(eval): blank line before backward-compat alias doc comment

770b36c

Reviewer nit: pub use layer::EvalLayer; was visually attached to the backward-compat alias's doc comment. Add blank line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(eval): roundtrip tests for ReportEnv additive fields

2a783a5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(eval): wire LatencySummary collection in runner (was hardcoded N…

711d45e

…one) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(eval): bind chrono::Utc::now() once in build_locomo_env

cfcfbb0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): foundations P0a — additive types + LatencySummary wiring#178

feat(eval): foundations P0a — additive types + LatencySummary wiring#178
7xuanlu wants to merge 13 commits into
mainfrom
worktree-feature+eval-foundations-p0a

7xuanlu commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

7xuanlu commented May 24, 2026

Summary

What changed

Honest deviations from plan (all approved in review)

Test plan

Follow-ups for P0b+

Adversarial review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant