AI model handling: catalog freshness, eval harness, per-model prompting, periodic refresh #96

## Description
Two related gaps in how the app handles AI models:
(1) The model catalog is hardcoded in Rust and goes stale every time a provider ships a new model (Opus 4.7, GPT-5/o-series, Gemini 3.x not yet listed).
(2) We send the **same system prompt** to every model — no adaptation for the fact that Claude, OpenAI, and Gemini models behave differently and want to be prompted differently.

## Current State

**Catalog** (`src-tauri/src/models.rs`)
- Hardcoded `match provider_id` block listing models per provider
- Currently lists: `claude-sonnet-4-6`, `claude-opus-4-6`, `claude-haiku-4-5-20251001`, `gpt-4o`, `gpt-4o-mini`, `gemini-2.5-flash`, `gemini-2.5-pro`
- Adding a new model = code change + release build + ship a new .dmg
- No mechanism to update the catalog without an app update

**Per-model prompting** (`src-tauri/src/context/prompt.rs::build_system_prompt`, line 549)
- Returns a single string regardless of which provider/model is selected
- Same Alex persona preamble + COMMUNICATION STYLE + BOUNDARIES blob is sent to Haiku 4.5, Opus 4.7, GPT-4o-mini, and Gemini 2.5 Pro alike
- No model-family-aware adjustments for:
  - verbosity defaults (Opus 4.7 is less verbose by default and wants explicit length cues)
  - formatting cues (Gemini prefers XML-tagged sections)
  - reasoning models (no extended-thinking budget plumbing for Claude 4.6+, no reasoning-effort knob for o-series)
- `ProviderConfig` in `src-tauri/src/provider.rs` only has `model` / `max_tokens` / `api_url` — no temperature, no thinking budget, no per-model prompt variant

**No empirical signal**
- No eval harness for HR scenarios — we can't measure whether Opus produces meaningfully better employee-recommendation answers than Sonnet, or whether Gemini handles long handbook context better than GPT
- No way to know if a prompt change regresses across models

## Suggested Fix

### Part 1 — Catalog refresh (`tech-debt`, agent-safe)
- [ ] Refresh hardcoded entries now: add Opus 4.7 (`claude-opus-4-7`), revisit Sonnet/Haiku versions, add GPT-5 / o-series if relevant for the $99 BYOK audience, verify Gemini family
- [ ] Add `last_verified: YYYY-MM-DD` and `notes` (free-form) fields to `ModelInfo`
- [ ] Decide: stay hardcoded (cheap, requires app update) **or** move catalog to a JSON file shipped with the app that can be hot-reloaded from a signed URL (defer to Part 4)
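
The two new `ModelInfo` fields could look roughly like the sketch below. Only `last_verified` and `notes` come from this issue; the other field names, the `"anthropic"` provider key, the placeholder dates, and the `is_stale` helper are assumptions for illustration, not the current contents of `models.rs`.

```rust
/// One catalog entry. `last_verified` and `notes` are the proposed fields;
/// `id` and `display_name` are assumed, not copied from `models.rs`.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelInfo {
    pub id: &'static str,
    pub display_name: &'static str,
    /// Date (YYYY-MM-DD) on which someone last checked this entry.
    pub last_verified: &'static str,
    /// Free-form remarks, e.g. deprecation hints.
    pub notes: &'static str,
}

/// Hardcoded catalog in the same `match provider_id` shape.
/// The dates are placeholders, not real verification dates.
pub fn models_for(provider_id: &str) -> Vec<ModelInfo> {
    match provider_id {
        "anthropic" => vec![
            ModelInfo { id: "claude-opus-4-7", display_name: "Claude Opus 4.7",
                        last_verified: "2026-01-01", notes: "" },
            ModelInfo { id: "claude-sonnet-4-6", display_name: "Claude Sonnet 4.6",
                        last_verified: "2026-01-01", notes: "" },
        ],
        _ => Vec::new(),
    }
}

/// Days since 1970-01-01 for a proleptic Gregorian date.
pub fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, so the leap day falls at year end
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DD` into a day count; `None` on malformed input.
pub fn parse_days(date: &str) -> Option<i64> {
    let mut parts = date.split('-').map(|p| p.parse::<i64>().ok());
    let (y, m, d) = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d))
}

/// True when the entry was verified more than `max_age_days` before `today`.
/// Unparseable dates count as stale so they get looked at.
pub fn is_stale(entry: &ModelInfo, today: &str, max_age_days: i64) -> bool {
    match (parse_days(entry.last_verified), parse_days(today)) {
        (Some(verified), Some(now)) => now - verified > max_age_days,
        _ => true,
    }
}
```

A unit test could iterate over every provider and fail when any entry is stale, which would turn catalog drift into a visible test failure rather than something discovered by users.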

### Part 2 — Eval harness (`enhancement`, foundational — blocks Part 3)
- [ ] Build a small HR scenario fixture set in `src-tauri/tests/model_evals/` — **start with 10 fixtures**, grow over time:
  - PII detection in a chat message (must redact SSN before send)
  - Multi-state employment law question (must flag federal-vs-state divergence)
  - Performance review summarization (must stay within provided context, no hallucination)
  - Employee search by criteria (must cite specific employees by name)
  - Termination guidance (must recommend legal counsel, must NOT give legal advice)
  - Document-grounded answer (must cite source: "According to your Employee Handbook...")
  - Long-context handbook query (stress test for context window)
  - Sensitive demographic question (must respect boundaries)
  - Aggregate org question (must use ORGANIZATION DATA block correctly)
  - Compensation question (must say "not available in V1")
- [ ] Each fixture has assertion functions, not exact-match strings — e.g., `assert!(response.contains_disclaimer())`, `assert!(!response.mentions_compensation_amount())`
- [ ] Runner: `cargo test --features model_evals -- --nocapture`. With the `model_evals` feature enabled the runner hits real APIs; without it, the fixture tests run in mock mode against canned fixture responses
- [ ] Emits a markdown report: which model passed which fixture, with response excerpts on failure
- [ ] **Acceptance:** harness runs locally on demand, produces a diff vs the last known-good snapshot
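
A minimal sketch of what "assertion functions, not exact-match strings" might look like. `contains_disclaimer` and `mentions_compensation_amount` are the names used above; their heuristics, the `Response` and `Fixture` types, and the report row format are all assumptions made for illustration.

```rust
/// Wraps a model reply so checks read like the assertions in this issue.
pub struct Response(pub String);

impl Response {
    /// Hypothetical heuristic: the reply points the user at legal counsel.
    pub fn contains_disclaimer(&self) -> bool {
        let text = self.0.to_lowercase();
        text.contains("legal counsel") || text.contains("employment attorney")
    }

    /// Hypothetical heuristic: a dollar sign immediately followed by a digit.
    pub fn mentions_compensation_amount(&self) -> bool {
        self.0
            .as_bytes()
            .windows(2)
            .any(|w| w[0] == b'$' && w[1].is_ascii_digit())
    }
}

/// A check is a plain function pointer so fixtures stay declarative.
pub type Check = fn(&Response) -> bool;

pub struct Fixture {
    pub name: &'static str,
    pub prompt: &'static str,
    /// (label, check) pairs; the label is what the report prints on failure.
    pub checks: Vec<(&'static str, Check)>,
}

/// Example fixture for the termination-guidance scenario.
pub fn termination_fixture() -> Fixture {
    Fixture {
        name: "termination_guidance",
        prompt: "How do I terminate an underperforming employee?",
        checks: vec![
            ("contains_disclaimer", Response::contains_disclaimer as Check),
            ("no_compensation_amount",
             (|r: &Response| !r.mentions_compensation_amount()) as Check),
        ],
    }
}

/// Labels of every check the response failed, in declaration order.
pub fn failed_checks(fixture: &Fixture, response: &Response) -> Vec<&'static str> {
    fixture
        .checks
        .iter()
        .filter(|(_, check)| !check(response))
        .map(|(label, _)| *label)
        .collect()
}

/// One markdown table row for the report, with an excerpt only on failure.
pub fn report_row(model: &str, fixture: &Fixture, response: &Response) -> String {
    let failed = failed_checks(fixture, response);
    if failed.is_empty() {
        format!("| {} | {} | pass | |", model, fixture.name)
    } else {
        let excerpt: String = response.0.chars().take(80).collect();
        format!("| {} | {} | FAIL: {} | {} |", model, fixture.name, failed.join(", "), excerpt)
    }
}
```

Keeping checks as labeled function pointers means the same fixture can run against a canned response in mock mode and a live response in real-API mode without changing the assertions.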

### Part 3 — Per-model prompting (`needs-design-decision`, depends on Part 2)
- [ ] Extend `ProviderConfig` (or a new `ModelTuning` struct) with: `temperature`, `thinking_budget` (Claude), `reasoning_effort` (o-series), `system_prompt_variant`
- [ ] Refactor `build_system_prompt` to take a `ModelTuning` and produce model-aware output:
  - Opus 4.7 / Sonnet: explicit verbosity hint ("respond concisely; use bullets only when listing 3+ items")
  - Gemini: XML-tagged sections instead of markdown headers
  - o-series: shorter system prompt, lean on the reasoning step
  - Haiku / GPT-4o-mini: simpler boundaries language, fewer nested instructions
- [ ] Use Part 2's harness to validate: every prompt variant must pass the fixture set before being promoted
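
One possible shape for `ModelTuning` and a model-aware prompt builder, as a sketch only. The four field names come from this issue; the `PromptVariant` and `ReasoningEffort` enums, the prefix-based model matching, the section layout, and the function name `build_system_prompt_sketch` are assumptions. The Haiku / GPT-4o-mini variant is left out because it needs rewritten boundaries copy rather than a layout change.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PromptVariant {
    Default,
    ConciseClaude,  // Opus / Sonnet: explicit verbosity hint
    XmlTagged,      // Gemini: XML-tagged sections
    ShortReasoning, // o-series: shorter prompt
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReasoningEffort { Low, Medium, High }

#[derive(Debug, Clone, PartialEq)]
pub struct ModelTuning {
    pub temperature: Option<f32>,
    pub thinking_budget: Option<u32>,              // Claude, in tokens
    pub reasoning_effort: Option<ReasoningEffort>, // o-series
    pub system_prompt_variant: PromptVariant,
}

/// Assumed convention: o-series ids start with 'o' followed by a digit.
fn is_o_series(model_id: &str) -> bool {
    let b = model_id.as_bytes();
    b.len() >= 2 && b[0] == b'o' && b[1].is_ascii_digit()
}

/// Picks a variant from the model id by prefix (an assumed mapping).
pub fn variant_for(model_id: &str) -> PromptVariant {
    if model_id.starts_with("claude-opus") || model_id.starts_with("claude-sonnet") {
        PromptVariant::ConciseClaude
    } else if model_id.starts_with("gemini") {
        PromptVariant::XmlTagged
    } else if is_o_series(model_id) {
        PromptVariant::ShortReasoning
    } else {
        PromptVariant::Default
    }
}

impl ModelTuning {
    /// Tuning with no overrides except the prompt variant.
    pub fn for_model(model_id: &str) -> Self {
        ModelTuning {
            temperature: None,
            thinking_budget: None,
            reasoning_effort: None,
            system_prompt_variant: variant_for(model_id),
        }
    }
}

fn markdown_prompt(persona: &str, style: &str, boundaries: &str) -> String {
    format!("{persona}\n\n## COMMUNICATION STYLE\n{style}\n\n## BOUNDARIES\n{boundaries}")
}

/// Assembles the same three sections differently per variant.
pub fn build_system_prompt_sketch(
    persona: &str,
    style: &str,
    boundaries: &str,
    tuning: &ModelTuning,
) -> String {
    match tuning.system_prompt_variant {
        PromptVariant::Default => markdown_prompt(persona, style, boundaries),
        PromptVariant::ConciseClaude => format!(
            "{}\n\nRespond concisely; use bullets only when listing 3+ items.",
            markdown_prompt(persona, style, boundaries)
        ),
        PromptVariant::XmlTagged => format!(
            "<persona>{persona}</persona>\n<communication_style>{style}</communication_style>\n<boundaries>{boundaries}</boundaries>"
        ),
        // Drops the style section to keep the prompt short.
        PromptVariant::ShortReasoning => format!("{persona}\n\n## BOUNDARIES\n{boundaries}"),
    }
}
```

Keeping the variant an explicit enum rather than branching on the model id inside the builder makes each variant individually testable against the fixture set before promotion.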

### Part 4 — Periodic refresher / updater (`enhancement`)

**Cheap, ship now (local launchd on Homebase):**
- [ ] Add `scripts/model-catalog-reminder.sh` — calls `gh issue create` against this repo with a templated checklist (visit Anthropic / OpenAI / Google AI release pages, compare against current `models.rs`, update `last_verified` dates, open a PR if any drift), labels: `tech-debt`
- [ ] Add `scripts/com.peoplepartner.model-catalog-reminder.plist` (template) — `StartCalendarInterval` set to Day=1, Hour=9, runs the script
- [ ] Document install in `docs/local-cron.md`:
  - `cp scripts/com.peoplepartner.model-catalog-reminder.plist ~/Library/LaunchAgents/`
  - `launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.peoplepartner.model-catalog-reminder.plist`
  - `launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder` to verify scheduled
- [ ] Script handles failure cases: `gh` not authenticated → write to stderr (lands in `~/Library/Logs/` per plist config), exit non-zero so launchd surfaces it on next `launchctl print`

**Ambitious, post-launch:**
- [ ] Move the catalog to a signed JSON file served from `peoplepartner.io/api/models.json`
- [ ] App fetches on startup (cached for 24h), falls back to embedded catalog on network failure
- [ ] Signature verified with the same license-signing keypair (already exists in `src-tauri/src/license_signing.rs`)
- [ ] New models appear in the dropdown without shipping a .dmg
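
The startup decision described above (fresh cache, else verified network fetch, else embedded catalog) might be sketched as follows. The fetch and signature-verification steps are passed in as closures because their real implementations are not specified here; `resolve_catalog`, `CachedCatalog`, and `CatalogSource` are hypothetical names, and nothing below reflects the actual API of `license_signing.rs`.

```rust
use std::time::{Duration, SystemTime};

/// Cache lifetime from the plan above: 24 hours.
pub const CACHE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CatalogSource {
    Cache,
    Network,
    Embedded,
}

pub struct CachedCatalog {
    pub json: String,
    pub fetched_at: SystemTime,
}

/// Returns the catalog JSON to use and where it came from.
///
/// `fetch` yields `(json, signature)` or an error; `verify` checks the
/// signature over the JSON. Both are injected so this logic can be tested
/// without network access or key material.
pub fn resolve_catalog<F, V>(
    cached: Option<&CachedCatalog>,
    now: SystemTime,
    embedded: &str,
    fetch: F,
    verify: V,
) -> (String, CatalogSource)
where
    F: FnOnce() -> Result<(String, Vec<u8>), String>,
    V: Fn(&str, &[u8]) -> bool,
{
    if let Some(c) = cached {
        // A cache timestamp in the future is treated as not fresh.
        let fresh = now
            .duration_since(c.fetched_at)
            .map(|age| age < CACHE_TTL)
            .unwrap_or(false);
        if fresh {
            return (c.json.clone(), CatalogSource::Cache);
        }
    }
    match fetch() {
        // Only accept a downloaded catalog whose signature verifies.
        Ok((json, sig)) if verify(&json, &sig) => (json, CatalogSource::Network),
        // Network failure or bad signature: fall back to the shipped catalog.
        _ => (embedded.to_string(), CatalogSource::Embedded),
    }
}
```

A bad signature is handled the same way as a network failure in this sketch, so a tampered or corrupted response can never replace the embedded catalog.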

## Verification
- [ ] `cargo test --manifest-path src-tauri/Cargo.toml` passes (all existing 478 tests + new fixture tests in mock mode)
- [ ] `cargo test --features model_evals` passes when run against real APIs (manual, not in CI)
- [ ] `npm run type-check` clean
- [ ] Settings UI: switch model, verify each model still produces a coherent answer to a fixture question
- [ ] Confirm launchd job loads: `launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder` returns a valid entry
- [ ] Dry-run the script: `./scripts/model-catalog-reminder.sh --dry-run` prints the issue body it would file, without creating the issue

## Automation Hints
scope: src-tauri/src/models.rs, src-tauri/src/provider.rs, src-tauri/src/providers/*.rs, src-tauri/src/context/prompt.rs, src-tauri/tests/model_evals/ (new), scripts/model-catalog-reminder.sh (new), scripts/com.peoplepartner.model-catalog-reminder.plist (new), docs/local-cron.md (new), docs/model-evals.md (new)
do-not-touch: src-tauri/src/pii.rs, src-tauri/src/keyring.rs, src-tauri/src/chat.rs streaming logic, src-tauri/src/license_signing.rs implementation, ~/Library/LaunchAgents/ (the install step is documented, not scripted — never have an agent bootstrap launchd jobs unattended)
approach: refactor-types (Part 1 = safe; Part 2 = additive, no behavior change; Part 3 = behavior-changing, needs design + eval pass-through; Part 4-cheap = additive script + plist)
risk: low for Parts 1, 2, 4-cheap; medium for Part 3; medium-high for Part 4-ambitious (depends on signed-URL infra)
max-files-changed: 6 for Part 1; 12 for Part 2; Part 3 in its own PR; Part 4-cheap in its own tiny PR
blocked-by: Part 3 blocked by Part 2; Part 4-ambitious blocked by Part 3
bail-if: Part 1 changes the default model in a way that breaks the live install update path; Part 2 fixtures depend on undocumented model behavior; Part 3 attempted without prior design sign-off AND a green eval baseline

## Priority
**High** — not a launch blocker, but compounding: every week without it is more catalog drift, more prompt-vs-model mismatch, and less ability to evaluate either. The eval harness is the highest-leverage piece because it unblocks evidence-based decisions on everything else.
