Description
Two related gaps in how the app handles AI models:
(1) The model catalog is hardcoded in Rust and goes stale every time a provider ships a new model (Opus 4.7, GPT-5/o-series, Gemini 3.x not yet listed).
(2) We send the same system prompt to every model — no adaptation for the fact that Claude, OpenAI, and Gemini models behave differently and want to be prompted differently.
Current State
Catalog (src-tauri/src/models.rs)
- Hardcoded match provider_id block listing models per provider
- Currently lists: claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5-20251001, gpt-4o, gpt-4o-mini, gemini-2.5-flash, gemini-2.5-pro
- Adding a new model = code change + release build + ship a new .dmg
- No mechanism to update the catalog without an app update
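One way to make catalog drift visible is to hold the entries in a data table that carries a per-entry verification date. The sketch below is hypothetical: the ModelInfo field layout, the provider ids (anthropic, openai, google), the function names, and the 1970-01-01 placeholder dates are assumptions for illustration, not the contents of models.rs. Only the model ids are taken from this issue.

```rust
// Hypothetical sketch: names and dates are placeholders, not the app's real catalog code.

/// One catalog entry, with a human-maintained verification date.
#[derive(Debug, Clone, PartialEq)]
pub struct ModelInfo {
    pub id: &'static str,
    pub provider_id: &'static str,
    /// ISO date (YYYY-MM-DD) on which someone last checked this entry.
    pub last_verified: &'static str,
    /// Free-form notes.
    pub notes: &'static str,
}

const fn entry(id: &'static str, provider_id: &'static str) -> ModelInfo {
    // Placeholder date: every entry counts as unverified until a human updates it.
    ModelInfo { id, provider_id, last_verified: "1970-01-01", notes: "" }
}

/// The catalog as a data table instead of literals inside a `match`.
pub const CATALOG: &[ModelInfo] = &[
    entry("claude-sonnet-4-6", "anthropic"),
    entry("claude-opus-4-6", "anthropic"),
    entry("claude-haiku-4-5-20251001", "anthropic"),
    entry("gpt-4o", "openai"),
    entry("gpt-4o-mini", "openai"),
    entry("gemini-2.5-flash", "google"),
    entry("gemini-2.5-pro", "google"),
];

/// All catalog entries for one provider.
pub fn models_for(provider_id: &str) -> Vec<&'static ModelInfo> {
    CATALOG.iter().filter(|m| m.provider_id == provider_id).collect()
}

/// Entries last verified before `cutoff` (YYYY-MM-DD).
/// ISO dates sort lexicographically, so a plain string comparison is enough.
pub fn stale_entries(cutoff: &str) -> Vec<&'static ModelInfo> {
    CATALOG.iter().filter(|m| m.last_verified < cutoff).collect()
}
```

Calling stale_entries with a cutoff date returns every entry that has not been re-verified since then. With the placeholder dates, every entry is reported until someone checks it.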
Per-model prompting (src-tauri/src/context/prompt.rs::build_system_prompt, line 549)
- Returns a single string regardless of which provider/model is selected
- Same Alex persona preamble + COMMUNICATION STYLE + BOUNDARIES blob is sent to Haiku 4.5, Opus 4.7, GPT-4o-mini, and Gemini 2.5 Pro alike
- No model-family-aware adjustments for: verbosity defaults (Opus 4.7 is less verbose by default and wants explicit length cues), formatting cues (Gemini prefers XML-tagged sections), reasoning models (no extended-thinking budget plumbing for Claude 4.6+, no reasoning-effort knob for o-series)
- ProviderConfig in src-tauri/src/provider.rs only has model / max_tokens / api_url: no temperature, no thinking budget, no per-model prompt variant
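The missing knobs listed above could be grouped into one tuning value that the prompt builder consumes. This is a minimal sketch under stated assumptions: the enum names, the prefix heuristic in family_of, the default values, and the exact wording of the appended cues are all hypothetical, and the two-argument build_system_prompt shown here is not the existing function's signature. Real defaults would need to come from eval results rather than from this example.

```rust
// Hypothetical sketch: types, defaults, and the id-prefix heuristic are assumptions.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelFamily { Claude, OpenAiChat, OpenAiReasoning, Gemini, Unknown }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort { Low, Medium, High }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PromptVariant { Default, ExplicitLength, XmlSections }

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ModelTuning {
    pub temperature: Option<f32>,
    pub thinking_budget: Option<u32>,              // Claude extended thinking
    pub reasoning_effort: Option<ReasoningEffort>, // o-series
    pub system_prompt_variant: PromptVariant,
}

/// Guess the family from the model id prefix (a heuristic, not a provider API).
pub fn family_of(model_id: &str) -> ModelFamily {
    let is_o_series = model_id.starts_with('o')
        && model_id.chars().nth(1).map_or(false, |c| c.is_ascii_digit());
    if model_id.starts_with("claude-") {
        ModelFamily::Claude
    } else if model_id.starts_with("gemini-") {
        ModelFamily::Gemini
    } else if is_o_series {
        ModelFamily::OpenAiReasoning
    } else if model_id.starts_with("gpt-") {
        ModelFamily::OpenAiChat
    } else {
        ModelFamily::Unknown
    }
}

/// Placeholder defaults per family; unknown models get the untouched prompt.
pub fn default_tuning(model_id: &str) -> ModelTuning {
    let base = ModelTuning {
        temperature: None,
        thinking_budget: None,
        reasoning_effort: None,
        system_prompt_variant: PromptVariant::Default,
    };
    match family_of(model_id) {
        ModelFamily::Claude if model_id.starts_with("claude-opus") => {
            ModelTuning { system_prompt_variant: PromptVariant::ExplicitLength, ..base }
        }
        ModelFamily::Gemini => {
            ModelTuning { system_prompt_variant: PromptVariant::XmlSections, ..base }
        }
        ModelFamily::OpenAiReasoning => {
            ModelTuning { reasoning_effort: Some(ReasoningEffort::Medium), ..base }
        }
        _ => base,
    }
}

/// Produce a model-aware system prompt from the shared base prompt.
pub fn build_system_prompt(base_prompt: &str, tuning: &ModelTuning) -> String {
    match tuning.system_prompt_variant {
        PromptVariant::Default => base_prompt.to_string(),
        PromptVariant::ExplicitLength => {
            format!("{base_prompt}\n\nLENGTH: Keep answers brief unless asked for detail.")
        }
        PromptVariant::XmlSections => format!("<persona>\n{base_prompt}\n</persona>"),
    }
}
```

Keeping the variant as data rather than branching on model ids inside the prompt builder means a new model only needs a new default_tuning arm, and the same tuning value can later carry the thinking budget or reasoning effort into the provider request.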
No empirical signal
- No eval harness for HR scenarios — we can't measure whether Opus produces meaningfully better employee-recommendation answers than Sonnet, or whether Gemini handles long handbook context better than GPT
- No way to know if a prompt change regresses across models
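To make the measurement gap concrete, here is a rough sketch of what a fixture-level check could look like. The method names contains_disclaimer and mentions_compensation_amount are the ones this issue proposes for the harness; everything else (the EvalResponse type, the phrase list, the "$ followed by a digit" rule, and failing_models) is a hypothetical placeholder, and the string heuristics are deliberately naive.

```rust
// Hypothetical sketch: the heuristics are naive placeholders, not a vetted HR policy check.

/// One model's answer to one eval scenario (recorded fixture or live API response).
pub struct EvalResponse {
    pub model_id: String,
    pub text: String,
}

impl EvalResponse {
    /// True if the answer contains one of a few disclaimer phrases (assumed list).
    pub fn contains_disclaimer(&self) -> bool {
        let lower = self.text.to_lowercase();
        ["not legal advice", "consult an attorney", "consult legal counsel"]
            .iter()
            .any(|phrase| lower.contains(*phrase))
    }

    /// True if the answer states a dollar amount, detected as '$' followed by a digit.
    pub fn mentions_compensation_amount(&self) -> bool {
        self.text
            .as_bytes()
            .windows(2)
            .any(|w| w[0] == b'$' && w[1].is_ascii_digit())
    }
}

/// Apply the same checks to every model's answer and return the ids that fail,
/// so a prompt change can be compared across models in one run.
pub fn failing_models(responses: &[EvalResponse]) -> Vec<&str> {
    responses
        .iter()
        .filter(|r| !r.contains_disclaimer() || r.mentions_compensation_amount())
        .map(|r| r.model_id.as_str())
        .collect()
}
```

In mock mode the EvalResponse values would be built from recorded fixture text; with the model_evals feature enabled they would be built from live API responses, with the assertions unchanged.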
Suggested Fix
Part 1: Catalog refresh (tech-debt, agent-safe)
- Add Opus 4.7 (claude-opus-4-7), revisit Sonnet/Haiku versions, add GPT-5 / o-series if relevant for the $99 BYOK audience, verify Gemini family
- Add last_verified: YYYY-MM-DD and notes (free-form) fields to ModelInfo
Part 2: Eval harness (enhancement, foundational, blocks Part 3)
- New src-tauri/tests/model_evals/: start with 10 fixtures, grow over time
- Assertions such as assert!(response.contains_disclaimer()), assert!(!response.mentions_compensation_amount())
- Run with cargo test --features model_evals -- --nocapture: hits real APIs with a flag, otherwise uses fixture responses
Part 3: Per-model prompting (needs-design-decision, depends on Part 2)
- Extend ProviderConfig (or a new ModelTuning struct) with: temperature, thinking_budget (Claude), reasoning_effort (o-series), system_prompt_variant
- Change build_system_prompt to take a ModelTuning and produce model-aware output
Part 4: Periodic refresher / updater (enhancement)
Cheap, ship now (local launchd on Homebase):
- scripts/model-catalog-reminder.sh: calls gh issue create against this repo with a templated checklist (visit Anthropic / OpenAI / Google AI release pages, compare against current models.rs, update last_verified dates, open a PR if any drift), labels: tech-debt
- scripts/com.peoplepartner.model-catalog-reminder.plist (template): StartCalendarInterval set to Day=1, Hour=9, runs the script
- docs/local-cron.md documents the manual install:
  - cp scripts/com.peoplepartner.model-catalog-reminder.plist ~/Library/LaunchAgents/
  - launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.peoplepartner.model-catalog-reminder.plist
  - launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder to verify scheduled
- Failure mode: gh not authenticated → write to stderr (lands in ~/Library/Logs/ per plist config), exit non-zero so launchd surfaces it on next launchctl print
Ambitious, post-launch:
- […] peoplepartner.io/api/models.json
- […] (… src-tauri/src/license_signing.rs)
Verification
- cargo test --manifest-path src-tauri/Cargo.toml passes (all existing 478 tests + new fixture tests in mock mode)
- cargo test --features model_evals passes when run against real APIs (manual, not in CI)
- npm run type-check clean
- launchctl print gui/$(id -u)/com.peoplepartner.model-catalog-reminder returns a valid entry
- ./scripts/model-catalog-reminder.sh --dry-run prints the would-be-issued body without filing
Automation Hints
scope: src-tauri/src/models.rs, src-tauri/src/provider.rs, src-tauri/src/providers/*.rs, src-tauri/src/context/prompt.rs, src-tauri/tests/model_evals/ (new), scripts/model-catalog-reminder.sh (new), scripts/com.peoplepartner.model-catalog-reminder.plist (new), docs/local-cron.md (new), docs/model-evals.md (new)
do-not-touch: src-tauri/src/pii.rs, src-tauri/src/keyring.rs, src-tauri/src/chat.rs streaming logic, src-tauri/src/license_signing.rs implementation, ~/Library/LaunchAgents/ (the install step is documented, not scripted — never have an agent bootstrap launchd jobs unattended)
approach: refactor-types (Part 1 = safe; Part 2 = additive, no behavior change; Part 3 = behavior-changing, needs design + eval pass-through; Part 4-cheap = additive script + plist)
risk: low for Parts 1, 2, 4-cheap; medium for Part 3; medium-high for Part 4-ambitious (depends on signed-URL infra)
max-files-changed: 6 for Part 1; 12 for Part 2; Part 3 in its own PR; Part 4-cheap in its own tiny PR
blocked-by: Part 3 blocked by Part 2; Part 4-ambitious blocked by Part 3
bail-if: Part 1 changes the default model in a way that breaks the live install update path; Part 2 fixtures depend on undocumented model behavior; Part 3 attempted without prior design sign-off AND a green eval baseline
Priority
High — not a launch blocker, but compounding: every week without it is more catalog drift, more prompt-vs-model mismatch, and less ability to evaluate either. The eval harness is the highest-leverage piece because it unblocks evidence-based decisions on everything else.