feat(0.22.0): OpenAI-compat backend exposes tools + transport errors fail loud #52
Merged
Two substrate fixes surfaced by the gtm-agent delegation eval smoke
(2026-05-24). The smoke ran composite 0.07 with delegate_research 0/3
because the eval's OpenAI-compat backend (a) had no way to advertise
MCP tools to the LLM and (b) silently swallowed the router's HTTP 402
free-tier denial into an empty finalText + null error.
createOpenAICompatibleBackend
- New `tools?: OpenAIChatTool[]` option — forwarded verbatim on every
/chat/completions request when set. OpenAI Chat Completions tools[]
shape; router proxies it to Anthropic, DeepSeek, Groq, etc.
- New `toolChoice?` option — 'auto' | 'none' | 'required' | function pin.
Omitted by default; provider falls back to its own default.
- Streamed `tool_calls` deltas (OpenAI shape) and `tool_use`
content blocks (Anthropic shape proxied by the router) are
accumulated across SSE chunks and emitted as a single
`tool_call` RuntimeStreamEvent per call. Args JSON-parsed when
valid, raw string otherwise (truncation safety).
- Parallel tool calls, mixed text+tool streams, and routers that
drop the terminal `finish_reason` chunk all flush cleanly.
Transport errors fail loud
- BackendTransportError now carries `status` + truncated upstream
`body` (≤2 KiB). Non-success HTTP responses capture the response
text before throwing so `free_tier_limit`, `invalid_api_key`, etc.
surface in the error envelope.
- backend_error and final events both carry a typed
`error: { kind: 'transport' | 'backend', message, status?, body? }`
field. Consumers building a RunRecord MUST map final.error onto
RunRecord.error — silent empty finalText hides credit exhaustion.
- Sanitized telemetry exposes `error.kind` + `error.status` by
default; `error.body` is gated behind `includeControlPayloads`
because it can echo user-visible text from a provider's error page.
Tests
- 11 new in backends-openai-tools.test.ts: request shape with/without
tools, tool_choice variants, OpenAI streaming tool_calls (fragmented
args, parallel calls, truncated JSON, no-finish-reason flush,
mixed text+tool, single-chunk message.tool_calls), Anthropic
tool_use content blocks (assembled + interleaved with text).
- 10 new in backends-fail-loud.test.ts: 402/401/403/404 with typed
detail, 5xx after retry exhaustion, network failure (status=0),
body truncation, non-transport custom-backend errors, success path
leaves error undefined, BackendTransportError catchable directly.
- 235 prior tests unchanged.
Migration
- Consumers that silently treated empty finalText on transport failure
as "agent produced nothing" will now see typed errors on
final.error. Usually correct behavior; sometimes requires updating
downstream error handling.
Root cause
Live full-loop smoke against gtm-agent's delegation eval on 2026-05-24 (`pnpm eval:delegation --scenario competitor-dashboard --backend tcloud`) returned composite 0.07 with `delegate_research: 0/3` calls — the agent never fired any delegation tool. Diagnostic: `/tmp/live-smoke-evidence.md`.
Two distinct substrate bugs collided:
1. `createOpenAICompatibleBackend` could not advertise tools to the LLM. The streaming request body was hard-coded to `{ model, stream, stream_options, messages }` with no `tools` field. Even when the caller's AgentProfile mounted the 5 MCP delegation tools, the OpenAI-compat backend had no way to tell the LLM they existed — production routes through MCP-aware Claude Code / Codex / etc., but the eval routes through the bare LLM via this backend. The eval rubric was checking for tool_call events that physically could not fire on this transport.
2. Transport errors silently swallowed. When the router returned HTTP 402 `Model "claude-sonnet-4-6" requires credits`, `BackendTransportError` fired inside the backend, the runtime emitted a `backend_error` event, the stream drained cleanly, and the resulting `RunRecord` showed `finalText: ""`, `toolCalls: []`, `error: null`. Operator sees "agent produced nothing" with no signal pointing at credit exhaustion. Direct violation of the runtime's fail-loud doctrine — "External-boundary calls return typed outcomes; callers MUST inspect succeeded before using value."
Changes
`createOpenAICompatibleBackend` — `tools` + `toolChoice`
```ts
createOpenAICompatibleBackend({
  apiKey, baseUrl, model,
  tools?: ReadonlyArray<OpenAIChatTool>,
  toolChoice?: 'auto' | 'none' | 'required' | { type: 'function'; function: { name: string } },
  // ... existing fields
})
```
`OpenAIChatTool` mirrors the OpenAI Chat Completions spec exactly (`{ type: 'function', function: { name, description?, parameters? } }`) so callers pass tool definitions through without runtime translation. The router proxies this shape verbatim to Anthropic, DeepSeek, Groq, OpenAI, Gemini.
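As a rough illustration of that shape, the sketch below defines one tool and builds a simplified request body that only carries `tools` and `tool_choice` when they are set. The local `OpenAIChatTool` interface is a stand-in for the exported type, and `buildRequestBody`, the `delegate_research` parameter schema, and the omission of `stream_options` are assumptions made for the example, not the backend's actual code.

```ts
// Local stand-in for the exported OpenAIChatTool type (assumption: the real
// type may be stricter about `parameters`).
interface OpenAIChatTool {
  type: 'function';
  function: {
    name: string;
    description?: string;
    parameters?: Record<string, unknown>; // JSON Schema object
  };
}

type ToolChoice =
  | 'auto'
  | 'none'
  | 'required'
  | { type: 'function'; function: { name: string } };

// Hypothetical tool definition; the schema is invented for illustration.
const delegateResearch: OpenAIChatTool = {
  type: 'function',
  function: {
    name: 'delegate_research',
    description: 'Hand a research question to a sub-agent.',
    parameters: {
      type: 'object',
      properties: { question: { type: 'string' } },
      required: ['question'],
    },
  },
};

// Hypothetical helper: forwards `tools` / `tool_choice` only when set, so the
// provider falls back to its own defaults otherwise.
function buildRequestBody(
  model: string,
  messages: ReadonlyArray<{ role: string; content: string }>,
  tools?: ReadonlyArray<OpenAIChatTool>,
  toolChoice?: ToolChoice,
): Record<string, unknown> {
  return {
    model,
    stream: true,
    messages,
    ...(tools !== undefined ? { tools } : {}),
    ...(toolChoice !== undefined ? { tool_choice: toolChoice } : {}),
  };
}
```

Passing `[delegateResearch]` as `tools` adds a `tools` array to the body; leaving both options out yields the same keys as before minus the tool fields.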
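Streaming tool call assembly handles both OpenAI-shape streamed `tool_calls` deltas and Anthropic-shape `tool_use` content blocks (as proxied by the router), accumulating each across SSE chunks.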
Each finalized call surfaces as a single `tool_call` RuntimeStreamEvent with `{ toolName, toolCallId, args }`. Args JSON-parsed when valid, raw string fallback when truncated.
The backend does not execute tools — surfacing the call is the contract; dispatch is the caller's problem (typically through their MCP / sandbox runtime).
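One way to picture the accumulation for the OpenAI delta shape is the sketch below. `ToolCallAccumulator` and `parseArgs` are hypothetical names, the delta fields follow the OpenAI streaming format (`index`, optional `id`, optional `function.name` / `function.arguments`), and the Anthropic `tool_use` path is left out. It is a simplified illustration, not the backend's implementation.

```ts
// One streamed fragment of a tool call, keyed by `index` (OpenAI delta shape).
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCallEvent {
  type: 'tool_call';
  toolName: string;
  toolCallId: string;
  args: unknown; // parsed JSON when valid, raw string otherwise
}

// JSON-parse the accumulated arguments; fall back to the raw text if the
// stream was cut off mid-object.
function parseArgs(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    return text;
  }
}

// Hypothetical accumulator: collects fragments per call index and emits one
// event per call when flushed.
class ToolCallAccumulator {
  private readonly pending = new Map<number, { id: string; name: string; argsText: string }>();

  push(delta: ToolCallDelta): void {
    const slot = this.pending.get(delta.index) ?? { id: '', name: '', argsText: '' };
    if (delta.id) slot.id = delta.id;
    if (delta.function?.name) slot.name = delta.function.name;
    if (delta.function?.arguments) slot.argsText += delta.function.arguments;
    this.pending.set(delta.index, slot);
  }

  // Call on finish_reason or at end of stream, so routers that drop the
  // terminal chunk still produce events.
  flush(): ToolCallEvent[] {
    const events = [...this.pending.entries()]
      .sort(([a], [b]) => a - b)
      .map(([, slot]): ToolCallEvent => ({
        type: 'tool_call',
        toolName: slot.name,
        toolCallId: slot.id,
        args: parseArgs(slot.argsText),
      }));
    this.pending.clear();
    return events;
  }
}
```

Keying by `index` is what lets parallel calls interleave their fragments without mixing arguments.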
Transport errors fail loud
`BackendTransportError` now carries:
```ts
class BackendTransportError extends AgentEvalError {
  readonly backend: string
  readonly status?: number
  readonly body?: string // truncated to 2 KiB
}
```
Non-success HTTP responses read `response.text()` (best-effort, truncated) before throwing — so `free_tier_limit`, `invalid_api_key`, `model_not_found` reach the operator's eye.
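A minimal sketch of that capture-then-throw step might look like the following. The class here is a local stand-in that extends `Error` rather than `AgentEvalError`, `truncateBody` and `throwIfNotOk` are hypothetical helper names, and the 2 KiB cap is approximated as 2048 string characters, which is an assumption about how the limit is measured.

```ts
const MAX_BODY_CHARS = 2048; // assumption: cap measured in string characters

// Local stand-in for the real BackendTransportError.
class BackendTransportError extends Error {
  readonly backend: string;
  readonly status?: number;
  readonly body?: string;

  constructor(message: string, backend: string, status?: number, body?: string) {
    super(message);
    this.name = 'BackendTransportError';
    this.backend = backend;
    if (status !== undefined) this.status = status;
    if (body !== undefined) this.body = body;
  }
}

function truncateBody(text: string, max: number = MAX_BODY_CHARS): string {
  return text.length <= max ? text : text.slice(0, max);
}

// Structural subset of a fetch Response, so the sketch needs no DOM types.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}

// Hypothetical helper: reads the upstream body best-effort before throwing,
// so provider error codes survive into the error envelope.
async function throwIfNotOk(backend: string, response: ResponseLike): Promise<void> {
  if (response.ok) return;
  let body: string | undefined;
  try {
    body = truncateBody(await response.text());
  } catch {
    body = undefined; // body unreadable; still fail loud with the status
  }
  throw new BackendTransportError(
    `${backend} responded with HTTP ${response.status}`,
    backend,
    response.status,
    body,
  );
}
```

Reading the body inside its own `try` keeps a broken response stream from masking the original HTTP status.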
`backend_error` and `final` events both carry a typed `error` field:
```ts
interface BackendErrorDetail {
  kind: 'transport' | 'backend'
  message: string
  status?: number
  body?: string
}
```
`run.ts` discriminates: `BackendTransportError` → `kind: 'transport'` with `status` + `body`; any other thrown error → `kind: 'backend'` (custom adapter / sandbox crashes).
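The discrimination described above can be sketched roughly as follows. `toErrorDetail` is a hypothetical name, and the `BackendTransportError` class is a local stand-in extending `Error`; how `run.ts` actually structures this is not shown in this PR description.

```ts
interface BackendErrorDetail {
  kind: 'transport' | 'backend';
  message: string;
  status?: number;
  body?: string;
}

// Local stand-in for the real BackendTransportError.
class BackendTransportError extends Error {
  readonly backend: string;
  readonly status?: number;
  readonly body?: string;

  constructor(message: string, backend: string, status?: number, body?: string) {
    super(message);
    this.name = 'BackendTransportError';
    this.backend = backend;
    if (status !== undefined) this.status = status;
    if (body !== undefined) this.body = body;
  }
}

// Hypothetical mapper: transport failures keep status + body; anything else
// thrown (custom adapter or sandbox crash) is classified as 'backend'.
function toErrorDetail(err: unknown): BackendErrorDetail {
  if (err instanceof BackendTransportError) {
    return {
      kind: 'transport',
      message: err.message,
      ...(err.status !== undefined ? { status: err.status } : {}),
      ...(err.body !== undefined ? { body: err.body } : {}),
    };
  }
  return {
    kind: 'backend',
    message: err instanceof Error ? err.message : String(err),
  };
}
```

Accepting `unknown` covers non-`Error` throwables such as strings, which still end up with a usable `message`.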
Sanitized telemetry exposes `error.kind` + `error.status` always (operators need failure classification regardless of payload opt-in), but redacts `error.body` behind `RuntimeTelemetryOptions.includeControlPayloads` — a 4xx body can echo user-visible text from the provider's error page.
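The gating rule can be illustrated with a small sketch. `sanitizeErrorForTelemetry` and the `SanitizedError` shape are hypothetical; only `includeControlPayloads` comes from the description above. Whether `message` is part of sanitized telemetry is not stated here, so this sketch leaves it out.

```ts
interface BackendErrorDetail {
  kind: 'transport' | 'backend';
  message: string;
  status?: number;
  body?: string;
}

// Hypothetical sanitized shape: classification fields only, body on opt-in.
interface SanitizedError {
  kind: 'transport' | 'backend';
  status?: number;
  body?: string;
}

function sanitizeErrorForTelemetry(
  error: BackendErrorDetail,
  options: { includeControlPayloads?: boolean } = {},
): SanitizedError {
  return {
    // Always exposed: operators need failure classification.
    kind: error.kind,
    ...(error.status !== undefined ? { status: error.status } : {}),
    // Gated: a provider error page can echo user-visible text.
    ...(options.includeControlPayloads === true && error.body !== undefined
      ? { body: error.body }
      : {}),
  };
}
```

With the default options, a 402 detail is reduced to `kind` and `status`; the body appears only when the caller opts in.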
Tests
```
Test Files 26 passed (26)
Tests 256 passed (256)
```
`pnpm typecheck` clean. `pnpm build` clean. `biome check` clean.
Migration
Consumers that silently treated empty `finalText` on a transport failure as "agent produced nothing" will now see typed errors on `final.error`. Usually correct behavior; occasionally requires updating downstream error handling.
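Concretely, the gtm-agent (and every other delegation-eval consumer) needs to map `final.error` onto `RunRecord.error` when building a RunRecord, so that a transport failure no longer reads as an empty `finalText`. To exercise delegation tools through this backend, it also has to pass them via the new `tools` option.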
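A consumer-side sketch of mapping `final.error` onto `RunRecord.error` is shown below. The event union and the `RunRecord` fields are simplified assumptions based on the fields named in this description (`finalText`, `toolCalls`, `error`), and `buildRunRecord` is a hypothetical helper; real consumers will have richer types.

```ts
interface BackendErrorDetail {
  kind: 'transport' | 'backend';
  message: string;
  status?: number;
  body?: string;
}

// Simplified event union (assumption: only the two events relevant here).
type StreamEvent =
  | { type: 'tool_call'; toolName: string; toolCallId: string; args: unknown }
  | { type: 'final'; finalText: string; error?: BackendErrorDetail };

interface RunRecord {
  finalText: string;
  toolCalls: Array<{ toolName: string; toolCallId: string; args: unknown }>;
  error: BackendErrorDetail | null;
}

// Hypothetical record builder: carries final.error through instead of
// leaving it null when the stream ends with an empty finalText.
function buildRunRecord(events: Iterable<StreamEvent>): RunRecord {
  const record: RunRecord = { finalText: '', toolCalls: [], error: null };
  for (const event of events) {
    if (event.type === 'tool_call') {
      record.toolCalls.push({
        toolName: event.toolName,
        toolCallId: event.toolCallId,
        args: event.args,
      });
    } else {
      record.finalText = event.finalText;
      record.error = event.error ?? null;
    }
  }
  return record;
}
```

A rubric can then distinguish "agent produced nothing" (`error === null`) from credit exhaustion (`error.kind === 'transport'` with `status` 402) before scoring.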
Version
`0.22.0` minor. Additive surface (`tools`, `toolChoice`, `BackendErrorDetail`, `OpenAIChatTool`, `OpenAIChatToolChoice` exports). Behavior change is fail-loud-on-error — strictly more truthful telemetry, technically observable to consumers.
Test plan