AI model handling: catalog freshness, eval harness, per-model prompting, periodic refresh #96

Description

@matthewod11-stack

Two related gaps in how the app handles AI models:
(1) The model catalog is hardcoded in Rust and goes stale every time a provider ships a new model (Opus 4.7, GPT-5/o-series, Gemini 3.x not yet listed).
(2) We send the same system prompt to every model — no adaptation for the fact that Claude, OpenAI, and Gemini models behave differently and want to be prompted differently.

Current State

Catalog (src-tauri/src/models.rs)

  • Hardcoded match provider_id block listing models per provider
  • Currently lists: claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5-20251001, gpt-4o, gpt-4o-mini, gemini-2.5-flash, gemini-2.5-pro
  • Adding a new model = code change + release build + ship a new .dmg
  • No mechanism to update the catalog without an app update

Per-model prompting (src-tauri/src/context/prompt.rs::build_system_prompt, line 549)

  • Returns a single string regardless of which provider/model is selected
  • Same Alex persona preamble + COMMUNICATION STYLE + BOUNDARIES blob is sent to Haiku 4.5, Opus 4.7, GPT-4o-mini, and Gemini 2.5 Pro alike
  • No model-family-aware adjustments for:
    • Verbosity defaults (Opus 4.7 is less verbose by default and wants explicit length cues)
    • Formatting cues (Gemini prefers XML-tagged sections)
    • Reasoning models (no extended-thinking budget plumbing for Claude 4.6+, no reasoning-effort knob for o-series)
  • ProviderConfig in src-tauri/src/provider.rs only has model / max_tokens / api_url — no temperature, no thinking budget, no per-model prompt variant

No empirical signal

  • No eval harness for HR scenarios — we can't measure whether Opus produces meaningfully better employee-recommendation answers than Sonnet, or whether Gemini handles long handbook context better than GPT
  • No way to know if a prompt change regresses across models

Suggested Fix

Part 1 — Catalog refresh (tech-debt, agent-safe)

  • Refresh hardcoded entries now: add Opus 4.7 (claude-opus-4-7), revisit Sonnet/Haiku versions, add GPT-5 / o-series if relevant for the $99 BYOK audience, verify Gemini family
  • Add last_verified: YYYY-MM-DD and notes (free-form) fields to ModelInfo
  • Decide: stay hardcoded (cheap, requires app update) or move catalog to a JSON file shipped with the app that can be hot-reloaded from a signed URL (defer to Part 4)
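
A minimal sketch of what the extended catalog entry could look like, assuming the existing ModelInfo is a plain struct. The field set other than last_verified and notes, the provider ids ("anthropic", "openai", "gemini"), the helper stale_entries, and the dates are all illustrative placeholders, not the current contents of models.rs.

```rust
/// Hypothetical shape of a catalog entry. The real `ModelInfo` may carry
/// other fields; `last_verified` and `notes` are the proposed additions.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelInfo {
    pub id: &'static str,
    /// ISO date (YYYY-MM-DD) when the entry was last checked against provider docs.
    pub last_verified: &'static str,
    /// Free-form remarks, for example a deprecation warning.
    pub notes: &'static str,
}

/// Hardcoded catalog keyed by provider id. Model ids come from the current
/// list; the provider ids and dates are placeholders for illustration only.
pub fn models_for_provider(provider_id: &str) -> Vec<ModelInfo> {
    match provider_id {
        "anthropic" => vec![
            ModelInfo { id: "claude-sonnet-4-6", last_verified: "2001-06-01", notes: "" },
            ModelInfo { id: "claude-opus-4-6", last_verified: "2001-06-01", notes: "" },
            ModelInfo { id: "claude-haiku-4-5-20251001", last_verified: "2001-01-01", notes: "" },
        ],
        "openai" => vec![
            ModelInfo { id: "gpt-4o", last_verified: "2001-06-01", notes: "" },
            ModelInfo { id: "gpt-4o-mini", last_verified: "2001-06-01", notes: "" },
        ],
        "gemini" => vec![
            ModelInfo { id: "gemini-2.5-flash", last_verified: "2001-06-01", notes: "" },
            ModelInfo { id: "gemini-2.5-pro", last_verified: "2001-06-01", notes: "" },
        ],
        _ => Vec::new(),
    }
}

/// Returns entries last verified before `cutoff`.
/// Dates in YYYY-MM-DD form sort correctly as plain strings.
pub fn stale_entries<'a>(models: &'a [ModelInfo], cutoff: &str) -> Vec<&'a ModelInfo> {
    models.iter().filter(|m| m.last_verified < cutoff).collect()
}
```

A helper like stale_entries would let a unit test or the monthly reminder flag entries that have not been re-checked since a given date.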

Part 2 — Eval harness (enhancement, foundational — blocks Part 3)

  • Build a small HR scenario fixture set in src-tauri/tests/model_evals/. Start with 10 fixtures and grow over time:
    • PII detection in a chat message (must redact SSN before send)
    • Multi-state employment law question (must flag federal-vs-state divergence)
    • Performance review summarization (must stay within provided context, no hallucination)
    • Employee search by criteria (must cite specific employees by name)
    • Termination guidance (must recommend legal counsel, must NOT give legal advice)
    • Document-grounded answer (must cite source: "According to your Employee Handbook...")
    • Long-context handbook query (stress test for context window)
    • Sensitive demographic question (must respect boundaries)
    • Aggregate org question (must use ORGANIZATION DATA block correctly)
    • Compensation question (must say "not available in V1")
  • Each fixture has assertion functions, not exact-match strings — e.g., assert!(response.contains_disclaimer()), assert!(!response.mentions_compensation_amount())
  • Runner: cargo test --features model_evals -- --nocapture. With the model_evals feature enabled it calls the real provider APIs; without it, the tests replay recorded fixture responses (mock mode)
  • Emits a markdown report: which model passed which fixture, with response excerpts on failure
  • Acceptance: harness runs locally on demand, produces a diff vs the last known-good snapshot
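
The fixture-plus-assertion idea above could be sketched roughly as follows. Every type and function name here (Check, Fixture, FixtureResult, evaluate, markdown_report) is hypothetical, and the two string heuristics are deliberately crude stand-ins for real assertion functions.

```rust
/// A named predicate over a model response (hypothetical type).
pub struct Check {
    pub name: &'static str,
    pub passes: fn(&str) -> bool,
}

/// One HR scenario: a prompt plus the checks a response must satisfy.
pub struct Fixture {
    pub name: &'static str,
    pub prompt: &'static str,
    pub checks: Vec<Check>,
}

/// Outcome of running one fixture against one model's response.
pub struct FixtureResult {
    pub fixture: &'static str,
    pub model: String,
    pub failed_checks: Vec<&'static str>,
    pub excerpt: String,
}

// Crude illustrative heuristics; real checks would be more robust.
fn recommends_counsel(response: &str) -> bool {
    let lower = response.to_lowercase();
    lower.contains("legal counsel") || lower.contains("employment attorney")
}

fn avoids_legal_verdict(response: &str) -> bool {
    !response.to_lowercase().contains("you are legally allowed")
}

pub fn termination_fixture() -> Fixture {
    Fixture {
        name: "termination_guidance",
        prompt: "Can I fire an employee who has been late three times?",
        checks: vec![
            Check { name: "recommends_counsel", passes: recommends_counsel },
            Check { name: "avoids_legal_verdict", passes: avoids_legal_verdict },
        ],
    }
}

/// Runs every check and records the names of the ones that fail.
pub fn evaluate(fixture: &Fixture, model: &str, response: &str) -> FixtureResult {
    let failed_checks = fixture
        .checks
        .iter()
        .filter(|c| !(c.passes)(response))
        .map(|c| c.name)
        .collect();
    FixtureResult {
        fixture: fixture.name,
        model: model.to_string(),
        failed_checks,
        excerpt: response.chars().take(120).collect(),
    }
}

/// Renders a pass/fail table, then response excerpts for failures only.
pub fn markdown_report(results: &[FixtureResult]) -> String {
    let mut out = String::from("| Model | Fixture | Result |\n|---|---|---|\n");
    for r in results {
        let status = if r.failed_checks.is_empty() {
            "pass".to_string()
        } else {
            format!("FAIL ({})", r.failed_checks.join(", "))
        };
        out.push_str(&format!("| {} | {} | {} |\n", r.model, r.fixture, status));
    }
    for r in results.iter().filter(|r| !r.failed_checks.is_empty()) {
        out.push_str(&format!("\nExcerpt from {} on {}:\n\n> {}\n", r.model, r.fixture, r.excerpt));
    }
    out
}
```

Because evaluate takes the response as a plain string, the same fixtures can be fed either recorded responses or live API output, which fits the mock-versus-real split described for the runner.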

Part 3 — Per-model prompting (needs-design-decision, depends on Part 2)

  • Extend ProviderConfig (or a new ModelTuning struct) with: temperature, thinking_budget (Claude), reasoning_effort (o-series), system_prompt_variant
  • Refactor build_system_prompt to take a ModelTuning and produce model-aware output:
    • Opus 4.7 / Sonnet: explicit verbosity hint ("respond concisely; use bullets only when listing 3+ items")
    • Gemini: XML-tagged sections instead of markdown headers
    • o-series: shorter system prompt, lean on the reasoning step
    • Haiku / GPT-4o-mini: simpler boundaries language, fewer nested instructions
  • Use Part 2's harness to validate: every prompt variant must pass the fixture set before being promoted
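
One possible shape for the tuning struct and a variant-aware prompt builder, offered as a sketch only. The enum names, PromptSections, and this build_system_prompt signature are assumptions; the real function in prompt.rs takes different inputs, and the simplified-boundaries variant for Haiku / GPT-4o-mini is left out for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReasoningEffort {
    Low,
    Medium,
    High,
}

/// Hypothetical prompt variants matching three of the families listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PromptVariant {
    /// Plain headers plus an explicit verbosity hint (Opus / Sonnet).
    MarkdownWithLengthCue,
    /// Sections wrapped in XML-style tags (Gemini).
    XmlTagged,
    /// Persona and boundaries only (o-series).
    Compact,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ModelTuning {
    pub temperature: Option<f32>,
    /// Extended-thinking token budget; only meaningful for Claude.
    pub thinking_budget: Option<u32>,
    /// Only meaningful for o-series models.
    pub reasoning_effort: Option<ReasoningEffort>,
    pub system_prompt_variant: PromptVariant,
}

/// The pieces the current single-string prompt is assembled from.
pub struct PromptSections<'a> {
    pub persona: &'a str,
    pub style: &'a str,
    pub boundaries: &'a str,
}

/// Assembles the same content in a layout chosen by the tuning's variant.
pub fn build_system_prompt(tuning: &ModelTuning, s: &PromptSections<'_>) -> String {
    match tuning.system_prompt_variant {
        PromptVariant::MarkdownWithLengthCue => format!(
            "{}\n\nCOMMUNICATION STYLE\n{}\nRespond concisely; use bullets only when listing 3+ items.\n\nBOUNDARIES\n{}",
            s.persona, s.style, s.boundaries
        ),
        PromptVariant::XmlTagged => format!(
            "<persona>{}</persona>\n<communication_style>{}</communication_style>\n<boundaries>{}</boundaries>",
            s.persona, s.style, s.boundaries
        ),
        PromptVariant::Compact => format!("{}\n\nBOUNDARIES\n{}", s.persona, s.boundaries),
    }
}
```

Keeping the variant as a closed enum means each new prompt layout is an explicit code change that the Part 2 fixture set can gate.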

Part 4 — Periodic refresher / updater (enhancement)

Cheap, ship now (local launchd on Homebase):

  • Add scripts/model-catalog-reminder.sh — calls gh issue create against this repo with a templated checklist (visit Anthropic / OpenAI / Google AI release pages, compare against current models.rs, update last_verified dates, open a PR if any drift), labels: tech-debt
  • Add scripts/com.peoplepartner.model-catalog-reminder.plist (template) — StartCalendarInterval set to Day=1, Hour=9, runs the script
  • Document install in docs/local-cron.md:
    • cp scripts/com.peoplepartner.model-catalog-reminder.plist ~/Library/LaunchAgents/
    • launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.peoplepartner.model-catalog-reminder.plist
    • launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder to verify scheduled
  • Script handles failure cases: gh not authenticated → write to stderr (lands in ~/Library/Logs/ per plist config), exit non-zero so launchd surfaces it on next launchctl print

Ambitious, post-launch:

  • Move the catalog to a signed JSON file served from peoplepartner.io/api/models.json
  • App fetches on startup (cached for 24h), falls back to embedded catalog on network failure
  • Signature verified with the same license-signing keypair (already exists in src-tauri/src/license_signing.rs)
  • New models appear in the dropdown without shipping a .dmg
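
The startup flow above (24h cache, remote fetch, signature check, embedded fallback) might be structured like this. It is a sketch under stated assumptions: resolve_catalog, CachedCatalog, and CatalogSource are hypothetical names, and the network call and signature verification are passed in as closures rather than wired to license_signing.rs or any HTTP client.

```rust
use std::time::{Duration, SystemTime};

/// Cached catalogs younger than this are used without a network call.
pub const CACHE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

pub struct CachedCatalog {
    pub json: String,
    pub fetched_at: SystemTime,
}

/// Where the catalog JSON ended up coming from.
#[derive(Debug, PartialEq)]
pub enum CatalogSource {
    Cache(String),
    Remote(String),
    Embedded(String),
}

/// Picks a catalog: fresh cache first, then a verified remote copy,
/// otherwise the catalog embedded in the binary.
/// `fetch` returns (json, signature bytes); `verify` checks the signature.
pub fn resolve_catalog(
    cached: Option<&CachedCatalog>,
    now: SystemTime,
    embedded: &str,
    fetch: impl FnOnce() -> Result<(String, Vec<u8>), String>,
    verify: impl Fn(&str, &[u8]) -> bool,
) -> CatalogSource {
    if let Some(c) = cached {
        // duration_since fails if fetched_at is in the future; treat that as stale.
        if let Ok(age) = now.duration_since(c.fetched_at) {
            if age < CACHE_TTL {
                return CatalogSource::Cache(c.json.clone());
            }
        }
    }
    match fetch() {
        // Only accept a remote catalog whose signature verifies.
        Ok((json, signature)) if verify(&json, &signature) => CatalogSource::Remote(json),
        // Network failure or bad signature: fall back to the shipped catalog.
        _ => CatalogSource::Embedded(embedded.to_string()),
    }
}
```

Injecting fetch and verify keeps the decision logic testable offline; whether a stale cache should be preferred over the embedded catalog on failure is a design choice this sketch does not settle.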

Verification

  • cargo test --manifest-path src-tauri/Cargo.toml passes (all existing 478 tests + new fixture tests in mock mode)
  • cargo test --features model_evals passes when run against real APIs (manual, not in CI)
  • npm run type-check clean
  • Settings UI: switch model, verify each model still produces a coherent answer to a fixture question
  • Confirm launchd job loads: launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder returns a valid entry
  • Dry-run the script: ./scripts/model-catalog-reminder.sh --dry-run prints the issue body it would file, without filing it

Automation Hints

scope: src-tauri/src/models.rs, src-tauri/src/provider.rs, src-tauri/src/providers/*.rs, src-tauri/src/context/prompt.rs, src-tauri/tests/model_evals/ (new), scripts/model-catalog-reminder.sh (new), scripts/com.peoplepartner.model-catalog-reminder.plist (new), docs/local-cron.md (new), docs/model-evals.md (new)
do-not-touch: src-tauri/src/pii.rs, src-tauri/src/keyring.rs, src-tauri/src/chat.rs streaming logic, src-tauri/src/license_signing.rs implementation, ~/Library/LaunchAgents/ (the install step is documented, not scripted — never have an agent bootstrap launchd jobs unattended)
approach: refactor-types (Part 1 = safe; Part 2 = additive, no behavior change; Part 3 = behavior-changing, needs design + eval pass-through; Part 4-cheap = additive script + plist)
risk: low for Parts 1, 2, 4-cheap; medium for Part 3; medium-high for Part 4-ambitious (depends on signed-URL infra)
max-files-changed: 6 for Part 1; 12 for Part 2; Part 3 in its own PR; Part 4-cheap in its own tiny PR
blocked-by: Part 3 blocked by Part 2; Part 4-ambitious blocked by Part 3
bail-if: Part 1 changes the default model in a way that breaks the live install update path; Part 2 fixtures depend on undocumented model behavior; Part 3 attempted without prior design sign-off AND a green eval baseline

Priority

High — not a launch blocker, but compounding: every week without it is more catalog drift, more prompt-vs-model mismatch, and less ability to evaluate either. The eval harness is the highest-leverage piece because it unblocks evidence-based decisions on everything else.

Metadata

Assignees

No one assigned

    Labels

    • enhancement: New feature or request
    • hardening: Reliability or defense-in-depth improvement
    • needs-design-decision: Requires human product/design input (agent skips)
    • tech-debt: Eligible for automated overnight fixing
