feat(0.22.0): OpenAI-compat backend exposes tools + transport errors fail loud #52

Merged
tangletools merged 1 commit into main from feat/openai-compat-tools-and-fail-loud-errors on May 24, 2026

@tangletools
Contributor

Root cause

Live full-loop smoke against gtm-agent's delegation eval on 2026-05-24 (`pnpm eval:delegation --scenario competitor-dashboard --backend tcloud`) returned composite 0.07 with `delegate_research: 0/3` calls — the agent never fired any delegation tool. Diagnostic: `/tmp/live-smoke-evidence.md`.

Two distinct substrate bugs collided:

1. `createOpenAICompatibleBackend` could not advertise tools to the LLM. The streaming request body was hard-coded to `{ model, stream, stream_options, messages }` with no `tools` field. Even when the caller's AgentProfile mounted the 5 MCP delegation tools, the OpenAI-compat backend had no way to tell the LLM they existed — production routes through MCP-aware Claude Code / Codex / etc., but the eval routes through the bare LLM via this backend. The eval rubric was checking for tool_call events that physically could not fire on this transport.

2. Transport errors were silently swallowed. When the router returned HTTP 402 `Model "claude-sonnet-4-6" requires credits`, `BackendTransportError` fired inside the backend, the runtime emitted a `backend_error` event, the stream drained cleanly, and the resulting `RunRecord` showed `finalText: ""`, `toolCalls: []`, `error: null`. The operator saw "agent produced nothing" with no signal pointing at credit exhaustion. This directly violates the runtime's fail-loud doctrine: "External-boundary calls return typed outcomes; callers MUST inspect succeeded before using value."

Changes

`createOpenAICompatibleBackend` — `tools` + `toolChoice`

```ts
createOpenAICompatibleBackend({
  apiKey, baseUrl, model,
  tools?: ReadonlyArray<OpenAIChatTool>,
  toolChoice?: 'auto' | 'none' | 'required' | { type: 'function'; function: { name: string } },
  // ... existing fields
})
```

`OpenAIChatTool` mirrors the OpenAI Chat Completions spec exactly (`{ type: 'function', function: { name, description?, parameters? } }`) so callers pass tool definitions through without runtime translation. The router proxies this shape verbatim to Anthropic, DeepSeek, Groq, OpenAI, Gemini.
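As a rough illustration of the pass-through, the sketch below defines a local mirror of the tool shape and a hypothetical `buildRequestBody` helper that adds `tools` and `tool_choice` only when the caller set them. The helper name, the `stream_options` value, and the `delegate_research` parameter schema are assumptions made for the example, not the backend's actual internals.

```typescript
// Local mirror of the shape described above. The package exports the real
// types as `OpenAIChatTool` / `OpenAIChatToolChoice`.
interface ChatTool {
  type: 'function'
  function: { name: string; description?: string; parameters?: Record<string, unknown> }
}
type ChatToolChoice =
  | 'auto'
  | 'none'
  | 'required'
  | { type: 'function'; function: { name: string } }

// Hypothetical helper (not the backend's real code): builds the streaming
// request body and forwards `tools` / `tool_choice` verbatim when set.
function buildRequestBody(
  model: string,
  messages: ReadonlyArray<{ role: string; content: string }>,
  tools?: ReadonlyArray<ChatTool>,
  toolChoice?: ChatToolChoice,
): Record<string, unknown> {
  const body: Record<string, unknown> = {
    model,
    stream: true,
    stream_options: { include_usage: true }, // assumed value
    messages,
  }
  if (tools !== undefined) body.tools = tools
  if (toolChoice !== undefined) body.tool_choice = toolChoice
  return body
}

// Example tool definition. The parameter schema is invented for illustration.
const delegateResearch: ChatTool = {
  type: 'function',
  function: {
    name: 'delegate_research',
    description: 'Hand a research question to a delegate agent',
    parameters: {
      type: 'object',
      properties: { topic: { type: 'string' } },
      required: ['topic'],
    },
  },
}
```

Because the shape already matches the Chat Completions spec, the helper never rewrites a tool definition; it only decides whether the field is present.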

Streaming tool call assembly handles both:

  • OpenAI shape: incremental `delta.tool_calls[index].function.arguments` chunks per index, finalized by `finish_reason: 'tool_calls'`, plus the non-streamed `message.tool_calls` collapse.
  • Anthropic shape (router-proxied): `content_block_start` (`tool_use` block with id+name), `content_block_delta` (`input_json_delta.partial_json`), `content_block_stop`.

Each finalized call surfaces as a single `tool_call` RuntimeStreamEvent with `{ toolName, toolCallId, args }`. Args are JSON-parsed when valid, with a raw-string fallback when the JSON is truncated.
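The OpenAI-shaped half of that assembly can be sketched roughly as follows. `ToolCallAccumulator` and `parseArgs` are hypothetical names, and the sketch covers only `delta.tool_calls` fragments keyed by index; the Anthropic content-block path and the actual flush triggers are left out.

```typescript
// One fragment from `delta.tool_calls` in a streamed chunk (OpenAI shape).
interface ToolCallDelta {
  index: number
  id?: string
  function?: { name?: string; arguments?: string }
}

// The finalized event payload described above.
interface ToolCallEvent {
  toolName: string
  toolCallId: string
  args: unknown
}

// Parse the accumulated arguments, falling back to the raw string when
// the JSON is invalid or truncated.
function parseArgs(text: string): unknown {
  try {
    return JSON.parse(text)
  } catch {
    return text
  }
}

// Hypothetical accumulator: collects fragments per index until flushed.
class ToolCallAccumulator {
  private readonly pending = new Map<number, { id: string; name: string; argsText: string }>()

  push(deltas: ReadonlyArray<ToolCallDelta>): void {
    for (const d of deltas) {
      const slot = this.pending.get(d.index) ?? { id: '', name: '', argsText: '' }
      if (d.id) slot.id = d.id
      if (d.function?.name) slot.name = d.function.name
      if (d.function?.arguments) slot.argsText += d.function.arguments
      this.pending.set(d.index, slot)
    }
  }

  // Emit one event per call, in index order, then reset.
  flush(): ToolCallEvent[] {
    const events = Array.from(this.pending.entries())
      .sort((a, b) => a[0] - b[0])
      .map((entry) => ({
        toolName: entry[1].name,
        toolCallId: entry[1].id,
        args: parseArgs(entry[1].argsText),
      }))
    this.pending.clear()
    return events
  }
}
```

A caller would invoke `flush()` on `finish_reason: 'tool_calls'` and again at end of stream, so a router that drops the terminal chunk still yields the calls.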

The backend does not execute tools — surfacing the call is the contract; dispatch is the caller's problem (typically through their MCP / sandbox runtime).

Transport errors fail loud

`BackendTransportError` now carries:

```ts
class BackendTransportError extends AgentEvalError {
  readonly backend: string
  readonly status?: number
  readonly body?: string // truncated to 2 KiB
}
```

Non-success HTTP responses read `response.text()` (best-effort, truncated) before throwing — so `free_tier_limit`, `invalid_api_key`, `model_not_found` reach the operator's eye.

`backend_error` and `final` events both carry a typed `error` field:

```ts
interface BackendErrorDetail {
  kind: 'transport' | 'backend'
  message: string
  status?: number
  body?: string
}
```

`run.ts` discriminates: `BackendTransportError` → `kind: 'transport'` with `status` + `body`; any other thrown error → `kind: 'backend'` (custom adapter / sandbox crashes).
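A minimal sketch of that discrimination, assuming a stand-in `TransportError` class in place of the real `BackendTransportError` (whose `AgentEvalError` base is not reproduced here) and a hypothetical `toErrorDetail` helper:

```typescript
interface BackendErrorDetail {
  kind: 'transport' | 'backend'
  message: string
  status?: number
  body?: string
}

// Stand-in for BackendTransportError. It extends Error directly so the
// example stays self-contained.
class TransportError extends Error {
  constructor(
    message: string,
    readonly backend: string,
    readonly status?: number,
    readonly body?: string,
  ) {
    super(message)
    this.name = 'TransportError'
    // Keeps `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Hypothetical helper: transport errors keep status + body, anything else
// thrown is classified as a generic backend failure.
function toErrorDetail(err: unknown): BackendErrorDetail {
  if (err instanceof TransportError) {
    const detail: BackendErrorDetail = { kind: 'transport', message: err.message }
    if (err.status !== undefined) detail.status = err.status
    if (err.body !== undefined) detail.body = err.body
    return detail
  }
  return { kind: 'backend', message: err instanceof Error ? err.message : String(err) }
}
```

Classifying on the error class rather than on message text means a custom adapter that throws a plain `Error` can never be mistaken for a credit or auth failure.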

Sanitized telemetry exposes `error.kind` + `error.status` always (operators need failure classification regardless of payload opt-in), but redacts `error.body` behind `RuntimeTelemetryOptions.includeControlPayloads` — a 4xx body can echo user-visible text from the provider's error page.
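The redaction rule could look roughly like this. `sanitizeErrorForTelemetry`, `TelemetryOptions`, and `SanitizedError` are hypothetical names; only the `includeControlPayloads` flag comes from the description above, and how `message` is treated is not stated, so the sketch leaves it out.

```typescript
interface BackendErrorDetail {
  kind: 'transport' | 'backend'
  message: string
  status?: number
  body?: string
}

// Mirrors only the one flag named in the description.
interface TelemetryOptions {
  includeControlPayloads?: boolean
}

interface SanitizedError {
  kind: 'transport' | 'backend'
  status?: number
  body?: string
}

// Hypothetical sanitizer: classification fields always pass through,
// the upstream body only when the operator opted in.
function sanitizeErrorForTelemetry(
  detail: BackendErrorDetail,
  options: TelemetryOptions = {},
): SanitizedError {
  const out: SanitizedError = { kind: detail.kind }
  if (detail.status !== undefined) out.status = detail.status
  if (options.includeControlPayloads === true && detail.body !== undefined) {
    out.body = detail.body
  }
  return out
}
```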

Tests

  • `tests/backends-openai-tools.test.ts` (11 new): request shape with/without tools; `toolChoice` variants; OpenAI streaming tool_calls — fragmented args reassembly, parallel calls, truncated JSON fallback, no-finish-reason flush, mixed text+tool streams, single-chunk `message.tool_calls` collapse; Anthropic `tool_use` content blocks assembled + interleaved with text.
  • `tests/backends-fail-loud.test.ts` (10 new): 402 free-tier denial with typed detail + upstream body; 401/403/404 parametric; 5xx after retry exhaustion; `fetch failed` (status=0); 2 KiB body truncation cap; non-transport custom-backend errors as `kind: 'backend'`; success path leaves `final.error` undefined; `BackendTransportError` catchable directly with `status` + `body`.
  • 235 prior tests unchanged. Final count: 256 passing.

```
Test Files 26 passed (26)
Tests 256 passed (256)
```

`pnpm typecheck` clean. `pnpm build` clean. `biome check` clean.

Migration

Consumers that silently treated empty `finalText` on a transport failure as "agent produced nothing" will now see typed errors on `final.error`. This is usually the correct behavior, but it occasionally requires updating downstream error handling.

Concretely, the gtm-agent (and every other delegation-eval consumer) needs to:

  1. Pass its delegation tools through `tools:` on `createOpenAICompatibleBackend` for the eval to be structurally capable of measuring tool dispatch.
  2. Map `final.error` onto `RunRecord.error` so 402/401/5xx failures no longer become a silent `error: null`.
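For the second step, a consumer-side mapping might look like the sketch below. The `FinalEvent` and `RunRecord` shapes are assumptions (in particular, `RunRecord.error` is assumed to be `string | null`), and `toRunRecord` / `describeError` are hypothetical helpers.

```typescript
interface BackendErrorDetail {
  kind: 'transport' | 'backend'
  message: string
  status?: number
  body?: string
}

// Assumed shape of the `final` stream event.
interface FinalEvent {
  type: 'final'
  finalText: string
  error?: BackendErrorDetail
}

// Assumed shape of the consumer's record.
interface RunRecord {
  finalText: string
  toolCalls: unknown[]
  error: string | null
}

// Render the typed detail as a one-line summary for the record.
function describeError(e: BackendErrorDetail): string {
  const status = e.status !== undefined ? ` (HTTP ${e.status})` : ''
  return `${e.kind} error${status}: ${e.message}`
}

// Hypothetical mapper: a present `final.error` always lands on the record,
// so an empty finalText can be told apart from a failed call.
function toRunRecord(final: FinalEvent, toolCalls: unknown[]): RunRecord {
  return {
    finalText: final.finalText,
    toolCalls,
    error: final.error ? describeError(final.error) : null,
  }
}
```

With this in place, the 402 from the smoke run would have produced a non-null `error` instead of an empty, apparently successful record.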

Version

`0.22.0` minor. Additive surface (`tools`, `toolChoice`, `BackendErrorDetail`, `OpenAIChatTool`, `OpenAIChatToolChoice` exports). Behavior change is fail-loud-on-error — strictly more truthful telemetry, technically observable to consumers.

Test plan

  • Existing 235 tests pass unchanged
  • 21 new tests pass
  • `pnpm typecheck` clean
  • `pnpm build` clean
  • `biome check` clean
  • Downstream smoke: gtm-agent delegation eval reruns with this version + tools wired through the eval-backend factory

…fail loud

Two substrate fixes surfaced by the gtm-agent delegation eval smoke
(2026-05-24). The smoke ran composite 0.07 with delegate_research 0/3
because the eval's OpenAI-compat backend (a) had no way to advertise
MCP tools to the LLM and (b) silently swallowed the router's HTTP 402
free-tier denial into an empty finalText + null error.

createOpenAICompatibleBackend
- New `tools?: OpenAIChatTool[]` option — forwarded verbatim on every
  /chat/completions request when set. OpenAI Chat Completions tools[]
  shape; router proxies it to Anthropic, DeepSeek, Groq, etc.
- New `toolChoice?` option — 'auto' | 'none' | 'required' | function pin.
  Omitted by default; provider falls back to its own default.
- Streamed `tool_calls` deltas (OpenAI shape) and `tool_use`
  content blocks (Anthropic shape proxied by the router) are
  accumulated across SSE chunks and emitted as a single
  `tool_call` RuntimeStreamEvent per call. Args JSON-parsed when
  valid, raw string otherwise (truncation safety).
- Parallel tool calls, mixed text+tool streams, and routers that
  drop the terminal `finish_reason` chunk all flush cleanly.

Transport errors fail loud
- BackendTransportError now carries `status` + truncated upstream
  `body` (≤2 KiB). Non-success HTTP responses capture the response
  text before throwing so `free_tier_limit`, `invalid_api_key`, etc.
  surface in the error envelope.
- backend_error and final events both carry a typed
  `error: { kind: 'transport' | 'backend', message, status?, body? }`
  field. Consumers building a RunRecord MUST map final.error onto
  RunRecord.error — silent empty finalText hides credit exhaustion.
- Sanitized telemetry exposes `error.kind` + `error.status` by
  default; `error.body` is gated behind `includeControlPayloads`
  because it can echo user-visible text from a provider's error page.

Tests
- 11 new in backends-openai-tools.test.ts: request shape with/without
  tools, tool_choice variants, OpenAI streaming tool_calls (fragmented
  args, parallel calls, truncated JSON, no-finish-reason flush,
  mixed text+tool, single-chunk message.tool_calls), Anthropic
  tool_use content blocks (assembled + interleaved with text).
- 10 new in backends-fail-loud.test.ts: 402/401/403/404 with typed
  detail, 5xx after retry exhaustion, network failure (status=0),
  body truncation, non-transport custom-backend errors, success path
  leaves error undefined, BackendTransportError catchable directly.
- 235 prior tests unchanged.

Migration
- Consumers that silently treated empty finalText on transport failure
  as "agent produced nothing" will now see typed errors on
  final.error. Usually correct behavior; sometimes requires updating
  downstream error handling.
@tangletools tangletools merged commit bba2aeb into main May 24, 2026
1 check passed
@tangletools tangletools deleted the feat/openai-compat-tools-and-fail-loud-errors branch May 24, 2026 22:12