Add per-task model routing and LLM usage telemetry by marcelsamyn · Pull Request #59 · marcelsamyn/assistant-memory

marcelsamyn · 2026-06-09T10:34:18Z

Summary

This PR introduces per-task model routing and structured LLM usage telemetry to enable cost tracking and optimization by task type. The system now supports routing different LLM tasks to different models via environment variables while maintaining backward compatibility.

Key Changes

New telemetry infrastructure (src/utils/llm-telemetry.ts):
- Added LlmUsageEvent interface for provider-neutral LLM call accounting
- Implemented normalizeUsage() to flatten OpenAI/OpenRouter usage objects
- Implemented recordLlmUsage() to emit structured telemetry as JSON to stdout
- Captures task, model, user, token counts, and cached token metrics
Per-task model routing (src/utils/models.ts):
- Defined ModelTask type with 9 task categories (graph_extraction, document_spine, transcript_segmentation, conversation_summary, graph_cleanup, atlas, profile_synthesis, dream, deep_research)
- Implemented modelForTask() function with safe fallback to MODEL_ID_GRAPH_EXTRACTION
- Allows independent model assignment per task via environment variables
Enhanced AI client functions (src/lib/ai.ts):
- Updated parseStructuredCompletion() to accept optional task and userId parameters
- Updated createCompletionClient() to accept task parameter and set Helicone-Property-Task header for spend breakdown
- Updated crateTextCompletion() and performStructuredAnalysis() to support task routing and telemetry recording
- Added max_tokens parameter support with MODEL_MAX_OUTPUT_TOKENS default
Environment configuration (src/utils/env.ts):
- Added 8 optional per-task model override variables
- Added MODEL_MAX_OUTPUT_TOKENS configuration (default 16000)
- Adjusted default probabilities: DREAM_PROBABILITY (0.1 → 0.05), DEEP_RESEARCH_PROBABILITY (0.5 → 0)
Updated all LLM call sites to pass task parameter:
- Graph extraction (extract-graph.ts): Refactored prompt into static system message and dynamic user message for OpenRouter prompt caching
- Conversation summarization (summarize-conversation.ts)
- Transcript segmentation (segment-transcript.ts)
- Document spine extraction (extract-document-spine.ts)
- Graph cleanup (cleanup-graph.ts)
- Atlas assistant (atlas-assistant.ts): Switched to use crateTextCompletion() helper
- Dream job (dream.ts)
- Deep research (deep-research.ts)
- Atlas user job (atlas-user.ts)
- Profile synthesis (profile-synthesis.ts)
Test updates (extract-graph.test.ts): Updated mock to handle multi-message prompts by concatenating all message content

Notable Implementation Details

Backward compatibility: All task-specific model overrides are optional; undefined entries inherit MODEL_ID_GRAPH_EXTRACTION, preserving existing behavior until explicitly configured
Telemetry ownership: Telemetry is emitted independently of vendor integrations (Helicone), ensuring cost tracking survives infrastructure changes
Prompt caching optimization: Graph extraction now uses system/user message split to enable OpenRouter/OpenAI automatic prompt caching of the static instruction block
Helicone integration: Task information is passed via Helicone-Property-Task header for vendor-side spend breakdown
Token accounting: Captures both regular and cached prompt tokens to track cache hit efficiency

https://claude.ai/code/session_01NQ5oJ6eoQiKqmxk3E1NJxr

…hrottle speculative jobs Token-spend levers: - Per-task model selection via src/utils/models.ts. Each task has an optional MODEL_ID_* env override and falls back to MODEL_ID_GRAPH_EXTRACTION (the prior single model), so behavior is unchanged until a task is deliberately pointed elsewhere — a safe, one-task-at-a-time rollout/rollback. - Cap output tokens (MODEL_MAX_OUTPUT_TOKENS, default 16000) on every completion to bound runaway generations; per-call overridable. - Tag every LLM call with a Helicone-Property-Task header so spend can be broken down per task in Helicone even before models are split. - Throttle speculative background work: DEEP_RESEARCH_PROBABILITY now defaults to 0 (opt-in to experiment), DREAM_PROBABILITY lowered to 0.05. Both are env-overridable. Threaded task/maxTokens through performStructuredAnalysis, crateTextCompletion, and every direct chat.completions call site (graph extraction, document spine, transcript segmentation, conversation summary, graph cleanup, atlas, profile synthesis, dream, deep research).

…d `user` field Helicone's per-request properties die with Helicone, so add a portable telemetry layer that survives a switch to any other backend: - New src/utils/llm-telemetry.ts emits one structured `llm_usage` stdout event per LLM call (task, model, userId, prompt/cached/completion/total tokens). A log drain / OTel collector / Sentry / Postgres sink can consume it to build per-task and per-model cost dashboards — `task` is our own concept and only survives if we emit it ourselves. `cachedTokens` reads OpenRouter's prompt_tokens_details.cached_tokens so prefix-cache effectiveness is visible without a provider dashboard. - Set the OpenAI-standard, vendor-neutral `user` field on every request (OpenRouter passes it through), alongside the existing Helicone header. - Emission is centralized in parseStructuredCompletion and crateTextCompletion (the two choke points), so there's one seam to repoint later. Threaded task + userId into parseStructuredCompletion. - Fold atlas-assistant's bespoke chat.completions.create call into crateTextCompletion so it inherits tiering, the output cap, telemetry, and the `user` field instead of duplicating them.

…mic payload ⚠️ NOT YET EVAL-VALIDATED — run `pnpm eval:ingest` and `pnpm eval:memory` before merging/deploying. This reorders the core extraction prompt, which can shift extraction quality; the sandbox this was authored in has no model key / Postgres, so the eval harness could not be run here. What changed: - The graph-extraction prompt was a single user message that placed the dynamic payload (existing-node <context>, commitments, speaker map, source refs, and the source content) in the MIDDLE, ahead of the large static rule block. That left no stable prefix, so OpenRouter/OpenAI prompt caching never engaged on our highest-volume, most input-heavy call. - Now sent as two messages: a `system` message with the static instructions (role, IMPORTANT, CRITICAL RULES, assertionKind, few-shot, graph rules, metrics) — byte-identical across ingests of the same shape and well over the ~1024-token minimum, so providers auto-cache it as the request prefix — and a `user` message with the per-call dynamic payload, which can no longer invalidate the cached prefix. Instruction text is otherwise unchanged. - Adjusted one positional reference ("allowed source refs above" → "provided with the content below") now that refs live in the trailing user message. - onLlmIO debug hook now logs `${systemPrompt}\n\n${userPrompt}`. - Tests: stub capture now joins all messages, matching the combined prompt. Risk to check in eval: the transcript / multi-party path, where assertionKind rules reference the speaker list that now sits in the trailing user message rather than ahead of the rules.

gemini-code-assist

Code Review

This pull request introduces LLM task tiering and telemetry tracking, allowing different logical LLM tasks to be routed to specific models via environment variable overrides, with a fallback to a base model. It also implements structured LLM usage telemetry to track token usage per task and optimizes prompt caching for graph extraction by separating static instructions from dynamic payloads. The review feedback suggests lowering the default MODEL_MAX_OUTPUT_TOKENS from 16000 to 4096 to avoid potential API errors with models that have lower output limits, avoiding direct mutation of the body parameter in parseStructuredCompletion, and removing redundant z.string().parse calls where responses are already validated.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-09T10:36:32Z

+  // generation (providers can emit up to 128k); set well above the observed
+  // average output (~850) so it effectively never truncates real work.
+  // Per-call overrides are allowed in code.
+  MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000),


Setting the default MODEL_MAX_OUTPUT_TOKENS to 16000 can cause 400 Bad Request errors from upstream LLM providers (including OpenAI and OpenRouter) when routing tasks to models with lower maximum output limits (e.g., gpt-4-turbo or Claude models, which strictly limit output tokens to 4096 or 8192). If a request specifies a max_tokens value greater than the model's maximum supported output tokens, the API call will fail.

Since the observed average output is only ~850 tokens, lowering the default to 4096 provides a safe buffer that prevents runaway generations while ensuring broad compatibility across different routed models.

Suggested change

MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000),

MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(4096),

Keeping the default at 16000 deliberately. Two reasons:

Extraction has a long output tail. Graph extraction (especially documents, where the prompt says "exhaustively extract every concrete fact") can emit dense node/claim JSON well above 4096 tokens. A 4096 cap would truncate those into LengthFinishReason retries (wasted tokens) or incomplete graphs — the opposite of the goal here.

The 400 risk is narrow and self-announcing. It only triggers if a task is tiered onto a model whose max output < 16000, and the API 400s on the first such call — easy to spot and fix per-case. It's also safe on the current base model and modern targets (GPT-4o 16384, Claude ≥8192).

If we later tier cheap tasks onto small-output models, the clean fix is per-task output caps (extraction stays high, cheap tasks low) rather than a single low global default. max_tokens is already overridable per call, so this needs no schema change.

Generated by Claude Code

gemini-code-assist · 2026-06-09T10:36:32Z

+  if (opts?.userId) body.user = opts.userId;
  let lastError: unknown;
  for (
    let attempt = 1;
    attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;
    attempt++
  ) {
    try {
-      return await client.chat.completions.parse(body);
+      const completion = await client.chat.completions.parse(body);


Mutating the body parameter directly can lead to unexpected side-effects if the caller passes a shared, reused, or frozen configuration object. In JavaScript/TypeScript, mutating function arguments is generally considered an anti-pattern (often flagged by no-param-reassign).

We can avoid mutating the parameter by creating a shallow copy of the body and casting it back to Body to preserve the generic type inference for .parsed.

Suggested change

if (opts?.userId) body.user = opts.userId;

let lastError: unknown;

for (

let attempt = 1;

attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;

attempt++

) {

try {

return await client.chat.completions.parse(body);

const completion = await client.chat.completions.parse(body);

const finalBody = opts?.userId ? ({ ...body, user: opts.userId } as Body) : body;

let lastError: unknown;

for (

let attempt = 1;

attempt <= STRUCTURED_COMPLETION_MAX_ATTEMPTS;

attempt++

) {

try {

const completion = await client.chat.completions.parse(finalBody);

gemini-code-assist · 2026-06-09T10:36:32Z

      schema: AtlasOutputSchema,
+      task: "atlas",
    });
    derivedAtlas = z.string().parse(parsed["atlas"]).trim();


Since performStructuredAnalysis already validates and parses the response against the provided AtlasOutputSchema (which defines atlas as a string), parsed.atlas is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.

Suggested change

derivedAtlas = z.string().parse(parsed["atlas"]).trim();

derivedAtlas = parsed.atlas.trim();

gemini-code-assist · 2026-06-09T10:36:32Z

+    task: "profile_synthesis",
  });

  const description = z.string().parse(parsed["description"]).trim();


Since performStructuredAnalysis already validates the response against ProfileSynthesisOutputSchema (where description is defined as a string), parsed.description is guaranteed to be a string. Re-parsing it with z.string().parse is redundant and can be simplified to direct property access.

Suggested change

const description = z.string().parse(parsed["description"]).trim();

const description = parsed.description.trim();

…mpletion Addresses PR #59 review: avoid mutating the caller's `body` argument when injecting the `user` field (no-param-reassign anti-pattern; unsafe if a caller ever passes a shared/frozen object). Build a shallow copy cast back to `Body` so the generic inference that keeps `.parsed` typed is preserved.

claude added 3 commits June 9, 2026 09:03

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

marcelsamyn merged commit e36b835 into main Jun 9, 2026
1 check passed

marcelsamyn deleted the claude/zen-davinci-4yo8O branch June 9, 2026 10:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-task model routing and LLM usage telemetry#59

Add per-task model routing and LLM usage telemetry#59
marcelsamyn merged 4 commits into
mainfrom
claude/zen-davinci-4yo8O

marcelsamyn commented Jun 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Uh oh!

marcelsamyn Jun 9, 2026

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(16000),
	MODEL_MAX_OUTPUT_TOKENS: z.coerce.number().int().positive().default(4096),

	derivedAtlas = z.string().parse(parsed["atlas"]).trim();
	derivedAtlas = parsed.atlas.trim();

	const description = z.string().parse(parsed["description"]).trim();
	const description = parsed.description.trim();

Conversation

marcelsamyn commented Jun 9, 2026

Summary

Key Changes

Notable Implementation Details

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

marcelsamyn Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants