Skip to content

Add per-task model routing and LLM usage telemetry#59

Merged
marcelsamyn merged 4 commits into
mainfrom
claude/zen-davinci-4yo8O
Jun 9, 2026
Merged

Add per-task model routing and LLM usage telemetry#59
marcelsamyn merged 4 commits into
mainfrom
claude/zen-davinci-4yo8O

Conversation

@marcelsamyn

Copy link
Copy Markdown
Owner

Summary

This PR introduces per-task model routing and structured LLM usage telemetry to enable cost tracking and optimization by task type. The system now supports routing different LLM tasks to different models via environment variables while maintaining backward compatibility.

Key Changes

  • New telemetry infrastructure (src/utils/llm-telemetry.ts):

    • Added LlmUsageEvent interface for provider-neutral LLM call accounting
    • Implemented normalizeUsage() to flatten OpenAI/OpenRouter usage objects
    • Implemented recordLlmUsage() to emit structured telemetry as JSON to stdout
    • Captures task, model, user, token counts, and cached token metrics
  • Per-task model routing (src/utils/models.ts):

    • Defined ModelTask type with 9 task categories (graph_extraction, document_spine, transcript_segmentation, conversation_summary, graph_cleanup, atlas, profile_synthesis, dream, deep_research)
    • Implemented modelForTask() function with safe fallback to MODEL_ID_GRAPH_EXTRACTION
    • Allows independent model assignment per task via environment variables
  • Enhanced AI client functions (src/lib/ai.ts):

    • Updated parseStructuredCompletion() to accept optional task and userId parameters
    • Updated createCompletionClient() to accept task parameter and set Helicone-Property-Task header for spend breakdown
    • Updated crateTextCompletion() and performStructuredAnalysis() to support task routing and telemetry recording
    • Added max_tokens parameter support with MODEL_MAX_OUTPUT_TOKENS default
  • Environment configuration (src/utils/env.ts):

    • Added 8 optional per-task model override variables
    • Added MODEL_MAX_OUTPUT_TOKENS configuration (default 16000)
    • Adjusted default probabilities: DREAM_PROBABILITY (0.1 → 0.05), DEEP_RESEARCH_PROBABILITY (0.5 → 0)
  • Updated all LLM call sites to pass task parameter:

    • Graph extraction (extract-graph.ts): Refactored prompt into static system message and dynamic user message for OpenRouter prompt caching
    • Conversation summarization (summarize-conversation.ts)
    • Transcript segmentation (segment-transcript.ts)
    • Document spine extraction (extract-document-spine.ts)
    • Graph cleanup (cleanup-graph.ts)
    • Atlas assistant (atlas-assistant.ts): Switched to use crateTextCompletion() helper
    • Dream job (dream.ts)
    • Deep research (deep-research.ts)
    • Atlas user job (atlas-user.ts)
    • Profile synthesis (profile-synthesis.ts)
  • Test updates (extract-graph.test.ts): Updated mock to handle multi-message prompts by concatenating all message content

Notable Implementation Details

  • Backward compatibility: All task-specific model overrides are optional; undefined entries inherit MODEL_ID_GRAPH_EXTRACTION, preserving existing behavior until explicitly configured
  • Telemetry ownership: Telemetry is emitted independently of vendor integrations (Helicone), ensuring cost tracking survives infrastructure changes
  • Prompt caching optimization: Graph extraction now uses system/user message split to enable OpenRouter/OpenAI automatic prompt caching of the static instruction block
  • Helicone integration: Task information is passed via Helicone-Property-Task header for vendor-side spend breakdown
  • Token accounting: Captures both regular and cached prompt tokens to track cache hit efficiency

https://claude.ai/code/session_01NQ5oJ6eoQiKqmxk3E1NJxr

claude added 3 commits June 9, 2026 09:03
…hrottle speculative jobs

Token-spend levers:

- Per-task model selection via src/utils/models.ts. Each task has an
  optional MODEL_ID_* env override and falls back to
  MODEL_ID_GRAPH_EXTRACTION (the prior single model), so behavior is
  unchanged until a task is deliberately pointed elsewhere — a safe,
  one-task-at-a-time rollout/rollback.
- Cap output tokens (MODEL_MAX_OUTPUT_TOKENS, default 16000) on every
  completion to bound runaway generations; per-call overridable.
- Tag every LLM call with a Helicone-Property-Task header so spend can
  be broken down per task in Helicone even before models are split.
- Throttle speculative background work: DEEP_RESEARCH_PROBABILITY now
  defaults to 0 (opt-in to experiment), DREAM_PROBABILITY lowered to
  0.05. Both are env-overridable.

Threaded task/maxTokens through performStructuredAnalysis,
crateTextCompletion, and every direct chat.completions call site
(graph extraction, document spine, transcript segmentation,
conversation summary, graph cleanup, atlas, profile synthesis, dream,
deep research).
…d `user` field

Helicone's per-request properties die with Helicone, so add a portable
telemetry layer that survives a switch to any other backend:

- New src/utils/llm-telemetry.ts emits one structured `llm_usage` stdout
  event per LLM call (task, model, userId, prompt/cached/completion/total
  tokens). A log drain / OTel collector / Sentry / Postgres sink can
  consume it to build per-task and per-model cost dashboards — `task` is
  our own concept and only survives if we emit it ourselves. `cachedTokens`
  reads OpenRouter's prompt_tokens_details.cached_tokens so prefix-cache
  effectiveness is visible without a provider dashboard.
- Set the OpenAI-standard, vendor-neutral `user` field on every request
  (OpenRouter passes it through), alongside the existing Helicone header.
- Emission is centralized in parseStructuredCompletion and
  crateTextCompletion (the two choke points), so there's one seam to
  repoint later. Threaded task + userId into parseStructuredCompletion.
- Fold atlas-assistant's bespoke chat.completions.create call into
  crateTextCompletion so it inherits tiering, the output cap, telemetry,
  and the `user` field instead of duplicating them.
…mic payload

⚠️ NOT YET EVAL-VALIDATED — run `pnpm eval:ingest` and `pnpm eval:memory`
before merging/deploying. This reorders the core extraction prompt, which
can shift extraction quality; the sandbox this was authored in has no model
key / Postgres, so the eval harness could not be run here.

What changed:
- The graph-extraction prompt was a single user message that placed the
  dynamic payload (existing-node <context>, commitments, speaker map, source
  refs, and the source content) in the MIDDLE, ahead of the large static rule
  block. That left no stable prefix, so OpenRouter/OpenAI prompt caching never
  engaged on our highest-volume, most input-heavy call.
- Now sent as two messages: a `system` message with the static instructions
  (role, IMPORTANT, CRITICAL RULES, assertionKind, few-shot, graph rules,
  metrics) — byte-identical across ingests of the same shape and well over the
  ~1024-token minimum, so providers auto-cache it as the request prefix — and
  a `user` message with the per-call dynamic payload, which can no longer
  invalidate the cached prefix. Instruction text is otherwise unchanged.
- Adjusted one positional reference ("allowed source refs above" → "provided
  with the content below") now that refs live in the trailing user message.
- onLlmIO debug hook now logs `${systemPrompt}\n\n${userPrompt}`.
- Tests: stub capture now joins all messages, matching the combined prompt.

Risk to check in eval: the transcript / multi-party path, where assertionKind
rules reference the speaker list that now sits in the trailing user message
rather than ahead of the rules.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces LLM task tiering and telemetry tracking, allowing different logical LLM tasks to be routed to specific models via environment variable overrides, with a fallback to a base model. It also implements structured LLM usage telemetry to track token usage per task and optimizes prompt caching for graph extraction by separating static instructions from dynamic payloads. The review feedback suggests lowering the default MODEL_MAX_OUTPUT_TOKENS from 16000 to 4096 to avoid potential API errors with models that have lower output limits, avoiding direct mutation of the body parameter in parseStructuredCompletion, and removing redundant z.string().parse calls where responses are already validated.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/utils/env.ts
// generation (providers can emit up to 128k); set well above the observed
// average output (~850) so it effectively never truncates real work.
// Per-call overrides are allowed in code.
MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Setting the default MODEL_MAX_OUTPUT_TOKENS to 16000 can cause 400 Bad Request errors from upstream LLM providers (including OpenAI and OpenRouter) when routing tasks to models with lower maximum output limits (e.g., gpt-4-turbo or Claude models, which strictly limit output tokens to 4096 or 8192). If a request specifies a max_tokens value greater than the model's maximum supported output tokens, the API call will fail.

Since the observed average output is only ~850 tokens, lowering the default to 4096 provides a safe buffer that prevents runaway generations while ensuring broad compatibility across different routed models.

Suggested change
MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000),
MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(4096),

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the default at 16000 deliberately. Two reasons:

  • Extraction has a long output tail. Graph extraction (especially documents, where the prompt says "exhaustively extract every concrete fact") can emit dense node/claim JSON well above 4096 tokens. A 4096 cap would truncate those into LengthFinishReason retries (wasted tokens) or incomplete graphs — the opposite of the goal here.
  • The 400 risk is narrow and self-announcing. It only triggers if a task is tiered onto a model whose max output < 16000, and the API 400s on the first such call — easy to spot and fix per-case. It's also safe on the current base model and modern targets (GPT-4o 16384, Claude ≥8192).

If we later tier cheap tasks onto small-output models, the clean fix is per-task output caps (extraction stays high, cheap tasks low) rather than a single low global default. max_tokens is already overridable per call, so this needs no schema change.


Generated by Claude Code

Comment thread src/lib/ai.ts Outdated
Comment on lines +103 to +111
if (opts?.userId) body.user = opts.userId;
let lastError: unknown;
for (
let attempt = 1;
attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;
attempt++
) {
try {
return await client.chat.completions.parse(body);
const completion = await client.chat.completions.parse(body);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mutating the body parameter directly can lead to unexpected side-effects if the caller passes a shared, reused, or frozen configuration object. In JavaScript/TypeScript, mutating function arguments is generally considered an anti-pattern (often flagged by no-param-reassign).

We can avoid mutating the parameter by creating a shallow copy of the body and casting it back to Body to preserve the generic type inference for .parsed.

Suggested change
if (opts?.userId) body.user = opts.userId;
let lastError: unknown;
for (
let attempt = 1;
attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;
attempt++
) {
try {
return await client.chat.completions.parse(body);
const completion = await client.chat.completions.parse(body);
const finalBody = opts?.userId ? ({ ...body, user: opts.userId } as Body) : body;
let lastError: unknown;
for (
let attempt = 1;
attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;
attempt++
) {
try {
const completion = await client.chat.completions.parse(finalBody);

schema: AtlasOutputSchema,
task: "atlas",
});
derivedAtlas = z.string().parse(parsed["atlas"]).trim();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since performStructuredAnalysis already validates and parses the response against the provided AtlasOutputSchema (which defines atlas as a string), parsed.atlas is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.

Suggested change
derivedAtlas = z.string().parse(parsed["atlas"]).trim();
derivedAtlas = parsed.atlas.trim();

task: "profile_synthesis",
});

const description = z.string().parse(parsed["description"]).trim();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since performStructuredAnalysis already validates the response against ProfileSynthesisOutputSchema (where description is defined as a string), parsed.description is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.

Suggested change
const description = z.string().parse(parsed["description"]).trim();
const description = parsed.description.trim();

…mpletion

Addresses PR #59 review: avoid mutating the caller's `body` argument when
injecting the `user` field (no-param-reassign anti-pattern; unsafe if a
caller ever passes a shared/frozen object). Build a shallow copy cast back
to `Body` so the generic inference that keeps `.parsed` typed is preserved.
@marcelsamyn marcelsamyn merged commit e36b835 into main Jun 9, 2026
1 check passed
@marcelsamyn marcelsamyn deleted the claude/zen-davinci-4yo8O branch June 9, 2026 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants