Add per-task model routing and LLM usage telemetry#59
Conversation
…hrottle speculative jobs Token-spend levers: - Per-task model selection via src/utils/models.ts. Each task has an optional MODEL_ID_* env override and falls back to MODEL_ID_GRAPH_EXTRACTION (the prior single model), so behavior is unchanged until a task is deliberately pointed elsewhere — a safe, one-task-at-a-time rollout/rollback. - Cap output tokens (MODEL_MAX_OUTPUT_TOKENS, default 16000) on every completion to bound runaway generations; per-call overridable. - Tag every LLM call with a Helicone-Property-Task header so spend can be broken down per task in Helicone even before models are split. - Throttle speculative background work: DEEP_RESEARCH_PROBABILITY now defaults to 0 (opt-in to experiment), DREAM_PROBABILITY lowered to 0.05. Both are env-overridable. Threaded task/maxTokens through performStructuredAnalysis, crateTextCompletion, and every direct chat.completions call site (graph extraction, document spine, transcript segmentation, conversation summary, graph cleanup, atlas, profile synthesis, dream, deep research).
…d `user` field Helicone's per-request properties die with Helicone, so add a portable telemetry layer that survives a switch to any other backend: - New src/utils/llm-telemetry.ts emits one structured `llm_usage` stdout event per LLM call (task, model, userId, prompt/cached/completion/total tokens). A log drain / OTel collector / Sentry / Postgres sink can consume it to build per-task and per-model cost dashboards — `task` is our own concept and only survives if we emit it ourselves. `cachedTokens` reads OpenRouter's prompt_tokens_details.cached_tokens so prefix-cache effectiveness is visible without a provider dashboard. - Set the OpenAI-standard, vendor-neutral `user` field on every request (OpenRouter passes it through), alongside the existing Helicone header. - Emission is centralized in parseStructuredCompletion and crateTextCompletion (the two choke points), so there's one seam to repoint later. Threaded task + userId into parseStructuredCompletion. - Fold atlas-assistant's bespoke chat.completions.create call into crateTextCompletion so it inherits tiering, the output cap, telemetry, and the `user` field instead of duplicating them.
…mic payload⚠️ NOT YET EVAL-VALIDATED — run `pnpm eval:ingest` and `pnpm eval:memory` before merging/deploying. This reorders the core extraction prompt, which can shift extraction quality; the sandbox this was authored in has no model key / Postgres, so the eval harness could not be run here. What changed: - The graph-extraction prompt was a single user message that placed the dynamic payload (existing-node <context>, commitments, speaker map, source refs, and the source content) in the MIDDLE, ahead of the large static rule block. That left no stable prefix, so OpenRouter/OpenAI prompt caching never engaged on our highest-volume, most input-heavy call. - Now sent as two messages: a `system` message with the static instructions (role, IMPORTANT, CRITICAL RULES, assertionKind, few-shot, graph rules, metrics) — byte-identical across ingests of the same shape and well over the ~1024-token minimum, so providers auto-cache it as the request prefix — and a `user` message with the per-call dynamic payload, which can no longer invalidate the cached prefix. Instruction text is otherwise unchanged. - Adjusted one positional reference ("allowed source refs above" → "provided with the content below") now that refs live in the trailing user message. - onLlmIO debug hook now logs `${systemPrompt}\n\n${userPrompt}`. - Tests: stub capture now joins all messages, matching the combined prompt. Risk to check in eval: the transcript / multi-party path, where assertionKind rules reference the speaker list that now sits in the trailing user message rather than ahead of the rules.
There was a problem hiding this comment.
Code Review
This pull request introduces LLM task tiering and telemetry tracking, allowing different logical LLM tasks to be routed to specific models via environment variable overrides, with a fallback to a base model. It also implements structured LLM usage telemetry to track token usage per task and optimizes prompt caching for graph extraction by separating static instructions from dynamic payloads. The review feedback suggests lowering the default MODEL_MAX_OUTPUT_TOKENS from 16000 to 4096 to avoid potential API errors with models that have lower output limits, avoiding direct mutation of the body parameter in parseStructuredCompletion, and removing redundant z.string().parse calls where responses are already validated.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // generation (providers can emit up to 128k); set well above the observed | ||
| // average output (~850) so it effectively never truncates real work. | ||
| // Per-call overrides are allowed in code. | ||
| MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000), |
There was a problem hiding this comment.
Setting the default MODEL_MAX_OUTPUT_TOKENS to 16000 can cause 400 Bad Request errors from upstream LLM providers (including OpenAI and OpenRouter) when routing tasks to models with lower maximum output limits (e.g., gpt-4-turbo or Claude models, which strictly limit output tokens to 4096 or 8192). If a request specifies a max_tokens value greater than the model's maximum supported output tokens, the API call will fail.
Since the observed average output is only ~850 tokens, lowering the default to 4096 provides a safe buffer that prevents runaway generations while ensuring broad compatibility across different routed models.
| MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000), | |
| MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(4096), |
There was a problem hiding this comment.
Keeping the default at 16000 deliberately. Two reasons:
- Extraction has a long output tail. Graph extraction (especially documents, where the prompt says "exhaustively extract every concrete fact") can emit dense node/claim JSON well above 4096 tokens. A 4096 cap would truncate those into
LengthFinishReasonretries (wasted tokens) or incomplete graphs — the opposite of the goal here. - The 400 risk is narrow and self-announcing. It only triggers if a task is tiered onto a model whose max output < 16000, and the API 400s on the first such call — easy to spot and fix per-case. It's also safe on the current base model and modern targets (GPT-4o 16384, Claude ≥8192).
If we later tier cheap tasks onto small-output models, the clean fix is per-task output caps (extraction stays high, cheap tasks low) rather than a single low global default. max_tokens is already overridable per call, so this needs no schema change.
Generated by Claude Code
| if (opts?.userId) body.user = opts.userId; | ||
| let lastError: unknown; | ||
| for ( | ||
| let attempt = 1; | ||
| attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS; | ||
| attempt++ | ||
| ) { | ||
| try { | ||
| return await client.chat.completions.parse(body); | ||
| const completion = await client.chat.completions.parse(body); |
There was a problem hiding this comment.
Mutating the body parameter directly can lead to unexpected side-effects if the caller passes a shared, reused, or frozen configuration object. In JavaScript/TypeScript, mutating function arguments is generally considered an anti-pattern (often flagged by no-param-reassign).
We can avoid mutating the parameter by creating a shallow copy of the body and casting it back to Body to preserve the generic type inference for .parsed.
| if (opts?.userId) body.user = opts.userId; | |
| let lastError: unknown; | |
| for ( | |
| let attempt = 1; | |
| attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS; | |
| attempt++ | |
| ) { | |
| try { | |
| return await client.chat.completions.parse(body); | |
| const completion = await client.chat.completions.parse(body); | |
| const finalBody = opts?.userId ? ({ ...body, user: opts.userId } as Body) : body; | |
| let lastError: unknown; | |
| for ( | |
| let attempt = 1; | |
| attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS; | |
| attempt++ | |
| ) { | |
| try { | |
| const completion = await client.chat.completions.parse(finalBody); |
| schema: AtlasOutputSchema, | ||
| task: "atlas", | ||
| }); | ||
| derivedAtlas = z.string().parse(parsed["atlas"]).trim(); |
There was a problem hiding this comment.
Since performStructuredAnalysis already validates and parses the response against the provided AtlasOutputSchema (which defines atlas as a string), parsed.atlas is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.
| derivedAtlas = z.string().parse(parsed["atlas"]).trim(); | |
| derivedAtlas = parsed.atlas.trim(); |
| task: "profile_synthesis", | ||
| }); | ||
|
|
||
| const description = z.string().parse(parsed["description"]).trim(); |
There was a problem hiding this comment.
Since performStructuredAnalysis already validates the response against ProfileSynthesisOutputSchema (where description is defined as a string), parsed.description is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.
| const description = z.string().parse(parsed["description"]).trim(); | |
| const description = parsed.description.trim(); |
…mpletion Addresses PR #59 review: avoid mutating the caller's `body` argument when injecting the `user` field (no-param-reassign anti-pattern; unsafe if a caller ever passes a shared/frozen object). Build a shallow copy cast back to `Body` so the generic inference that keeps `.parsed` typed is preserved.
Summary
This PR introduces per-task model routing and structured LLM usage telemetry to enable cost tracking and optimization by task type. The system now supports routing different LLM tasks to different models via environment variables while maintaining backward compatibility.
Key Changes
New telemetry infrastructure (
src/utils/llm-telemetry.ts):LlmUsageEventinterface for provider-neutral LLM call accountingnormalizeUsage()to flatten OpenAI/OpenRouter usage objectsrecordLlmUsage()to emit structured telemetry as JSON to stdoutPer-task model routing (
src/utils/models.ts):ModelTasktype with 9 task categories (graph_extraction, document_spine, transcript_segmentation, conversation_summary, graph_cleanup, atlas, profile_synthesis, dream, deep_research)modelForTask()function with safe fallback toMODEL_ID_GRAPH_EXTRACTIONEnhanced AI client functions (
src/lib/ai.ts):parseStructuredCompletion()to accept optionaltaskanduserIdparameterscreateCompletionClient()to accept task parameter and setHelicone-Property-Taskheader for spend breakdowncrateTextCompletion()andperformStructuredAnalysis()to support task routing and telemetry recordingmax_tokensparameter support withMODEL_MAX_OUTPUT_TOKENSdefaultEnvironment configuration (
src/utils/env.ts):MODEL_MAX_OUTPUT_TOKENSconfiguration (default 16000)DREAM_PROBABILITY(0.1 → 0.05),DEEP_RESEARCH_PROBABILITY(0.5 → 0)Updated all LLM call sites to pass task parameter:
extract-graph.ts): Refactored prompt into static system message and dynamic user message for OpenRouter prompt cachingsummarize-conversation.ts)segment-transcript.ts)extract-document-spine.ts)cleanup-graph.ts)atlas-assistant.ts): Switched to usecrateTextCompletion()helperdream.ts)deep-research.ts)atlas-user.ts)profile-synthesis.ts)Test updates (
extract-graph.test.ts): Updated mock to handle multi-message prompts by concatenating all message contentNotable Implementation Details
MODEL_ID_GRAPH_EXTRACTION, preserving existing behavior until explicitly configuredHelicone-Property-Taskheader for vendor-side spend breakdownhttps://claude.ai/code/session_01NQ5oJ6eoQiKqmxk3E1NJxr