feat(admin): configure LLM provider with primary + optional fallback model #1

Description

@cdcore09

Summary

When configuring LLMs in the admin app, instructors should pick a provider, enter the API key, choose a primary model, and optionally specify a fallback model that takes over on rate-limit (429) or upstream errors. This was surfaced by a real incident in which google/gemma-4-31b-it:free hit upstream rate limits with no graceful degradation: the worker returned 200 but streamed zero events.

Requirements

  • Admin UI: provider dropdown, API key input (masked, write-only), primary model picker (live catalog), optional fallback model picker
  • Live model catalog fetch proxied via the worker for each provider (OpenRouter, OpenAI, Anthropic)
  • Test-connection button that validates the key + model availability before save
  • Schema delta: add fallback_llm_config_id self-FK on llm_configs (nullable, ON DELETE SET NULL)
  • Runtime: chat handler catches AI_RetryError / AI_APICallError with status 429 or 5xx, retries via the resolved fallback config
  • Cycle / chain-depth protection (e.g. cap at 3 hops)
  • Provider-agnostic — fallback should work for Anthropic, OpenAI, and any future provider, not just OpenRouter
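
The cycle / chain-depth requirement could be sketched roughly as below. This is only an illustration under assumptions: the `LlmConfigRow` shape, the `resolveFallbackChain` helper, the `lookup` callback, and the `MAX_FALLBACK_HOPS` constant are hypothetical names, and the cap of 3 comes from the "e.g." in the requirement rather than a settled decision.

```typescript
// Hypothetical in-memory view of an llm_configs row; the real table has more columns.
interface LlmConfigRow {
  id: string;
  provider: string;
  model: string;
  fallbackLlmConfigId: string | null; // maps to fallback_llm_config_id
}

// Assumed cap on fallback hops (the requirement says "e.g. cap at 3 hops").
const MAX_FALLBACK_HOPS = 3;

// Returns the ordered list of configs to try: the primary first, then each fallback.
// The walk stops on a missing row, on a repeated id (cycle), or once
// maxHops fallbacks have been collected after the primary.
function resolveFallbackChain(
  primaryId: string,
  lookup: (id: string) => LlmConfigRow | undefined,
  maxHops: number = MAX_FALLBACK_HOPS,
): LlmConfigRow[] {
  const chain: LlmConfigRow[] = [];
  const seen = new Set<string>();
  let nextId: string | null = primaryId;

  while (nextId !== null && !seen.has(nextId) && chain.length <= maxHops) {
    const row = lookup(nextId);
    if (row === undefined) break; // dangling pointer, e.g. row deleted mid-request
    seen.add(row.id);
    chain.push(row);
    nextId = row.fallbackLlmConfigId;
  }
  return chain;
}
```

Because the helper only depends on a `lookup` function, it stays provider-agnostic and can be exercised without a database. The same check could also run at save time in the admin UI to reject a fallback that would close a loop.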

Context

  • Surfaced 2026-06-03 during chat smoke-testing on cdcore/chore/research. google/gemma-4-31b-it:free hit 429 upstream from Google AI Studio. The AI SDK retried 3 times (reason: 'maxRetriesExceeded') and closed the stream silently.
  • The existing schema in apps/web/src/db/schema/content.ts already covers provider, model, and credential pointer on LLMConfig. Fallback is the one new piece.
  • Architecture doc: docs/architecture/multi-tenant-data-model.md §6.2 (LLMConfig + OrganizationCredential).

Implementation Notes

  • Self-FK pattern matches existing PromptTemplate.previous_version_id — Drizzle migration is a single column.
  • Runtime change in apps/web/src/server/routes/chat.ts is ~20 lines once the schema lands: wrap streamText in a try/catch keyed on retryable errors, look up the fallback config, retry once.
  • Admin UI is larger — needs a /api/admin/llm-providers/<provider>/models endpoint per supported provider so the dropdown shows live availability, ideally with a short cache TTL.
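
A minimal sketch of the runtime wrapper described above, under stated assumptions. `isRetryableUpstreamError` and `runWithFallback` are hypothetical helpers, not existing code in chat.ts. The sketch matches errors structurally by `name` (the `AI_RetryError` / `AI_APICallError` names from the requirements) and assumes a numeric `statusCode` on API call errors and a `lastError` on retry errors; those field names should be checked against the installed AI SDK version.

```typescript
// Structural view of the errors we care about; field names are assumptions
// to verify against the AI SDK version in use.
interface ErrorLike {
  name?: string;
  statusCode?: number;
  lastError?: unknown;
}

// True for a 429 or 5xx API call error, or a retry error wrapping one.
function isRetryableUpstreamError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as ErrorLike;
  if (e.name === "AI_RetryError") {
    // If the underlying error is not exposed, treat exhausted retries as retryable.
    return e.lastError === undefined ? true : isRetryableUpstreamError(e.lastError);
  }
  if (e.name === "AI_APICallError" && typeof e.statusCode === "number") {
    return e.statusCode === 429 || (e.statusCode >= 500 && e.statusCode <= 599);
  }
  return false;
}

// Tries each config in order. Moves to the next config only on a retryable
// upstream error; any other error is rethrown immediately.
async function runWithFallback<C, R>(
  configs: readonly C[],
  attempt: (config: C) => Promise<R>,
): Promise<R> {
  let lastError: unknown = new Error("no LLM config available");
  for (const config of configs) {
    try {
      return await attempt(config);
    } catch (err) {
      if (!isRetryableUpstreamError(err)) throw err;
      lastError = err;
    }
  }
  throw lastError;
}
```

Passing `[primary, fallback]` as `configs` gives the "retry once" behavior. One caveat: depending on the SDK version, `streamText` may report failures through the stream or an error callback instead of throwing, which would be consistent with the 200-with-zero-events incident. In that case `attempt` would need to surface the failure (for example by awaiting the first chunk) before any bytes are sent to the client.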

Open Questions

  • Does fallback resolution walk the existing inheritance chain (Homework → Course → Organization), or is it pinned to whichever LLMConfig actually fired?
  • When fallback fires, does the user see anything (muted toast: "switched to backup model") or is it silent?
  • Test-connection in admin, or rely on first real chat call to surface bad config?
