Skip to content

Streaming timeouts: cover the pre-first-chunk window with headersMs and firstChunkMs #16538

Description

@zirkelc

Streaming timeouts: cover the pre‑first‑chunk window with headersMs and firstChunkMs

Summary

streamText/generateText support a timeout config with totalMs, stepMs, chunkMs, toolMs, and per‑tool tools overrides. For streaming, totalMs/stepMs guard the coarse outer bounds and chunkMs guards the gap between chunks. But there are two early‑stall windows that no dedicated, fast deadline covers:

  1. No response at all. The request is dispatched but the provider never returns response headers (connection queued/hung, provider overloaded). The internal await doStream(...) sits here, bounded only by the coarse totalMs/stepMs.
  2. Headers, but no content. The provider returns 200 immediately, then stalls before emitting the first content chunk (only keep‑alive/ping framing or an empty role prelude). chunkMs never arms in this window — see the reproduction below.

I'd like to propose two additional, independently‑tunable deadlines that close these windows:

  • headersMs: dispatch to response headers received.
  • firstChunkMs: stream start to first content chunk emitted.

The timeline

request                    RESPONSE                 first content        subsequent
dispatched                 headers received         chunk/token          chunks
   │                          │                        │                    │
   ├──── connect + send ─────►│                        │                    │
   │                          ├──── provider stalls ──►│──gap──gap──gap────►│
   │                          │   (keep-alives only)   │                    │
   │                          │                        │                    │
   │◄──── headersMs ─────────►│                        │                    │
   │                          │◄──── firstChunkMs ────►│                    │
   │                          │       (proposed)       │◄─── chunkMs ──────►│
   │                          │                        │   (existing)       │
   │◄──────────────────────────── stepMs / totalMs ───────────────────────►│
  • headersMs covers connect + request to response headers — i.e. the internal await doStream(...). Ends at headers.
  • firstChunkMs covers stream start to first content chunk. This is the window chunkMs cannot see, because chunkMs only measures gaps between chunks and never arms before the first one.
  • chunkMs (existing) covers gaps once content is flowing.
  • stepMs/totalMs (existing) remain the coarse outer bounds.

Why the existing knobs don't cover this

  • totalMs / stepMs are the outer bounds of a whole generation. For reasoning models they're intentionally large (a legitimate first token can take tens of seconds), so relying on them to catch a dead stream means waiting far too long before failing over. They do currently bound the dispatch→headers window (the merged abort signal is passed to doStream), but only at that coarse scale — hence headersMs.

  • chunkMs measures the delay between successive chunks. The chunk timer is only (re)armed from inside the stream transform, and that transform isn't invoked until the first content chunk surfaces — provider prelude parts (stream start, response metadata) are buffered upstream and don't reach it beforehand. So before the first content chunk there is no armed timer, and a stream that produces no content never trips chunkMs. The exact stall we want to catch is the one it's blind to.

Minimal reproduction

Deterministic, no network: a custom fetch returns 200 immediately, emits the role prelude plus an SSE keep-alive comment, then goes silent — never producing a content chunk. This is the "headers but no content" stall.

// repro.ts — `pnpm tsx repro.ts` (ai@7, @ai-sdk/openai@4)
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

const enc = new TextEncoder();

/** 200 OK, role prelude + keep-alive comment, then silence. No content chunk. */
const preludeThenStall = (signal?: AbortSignal) =>
  new ReadableStream<Uint8Array>({
    start(c) {
      c.enqueue(enc.encode(
        `data: ${JSON.stringify({ id: 'x', object: 'chat.completion.chunk', choices: [{ index: 0, delta: { role: 'assistant' }, finish_reason: null }] })}\n\n`,
      ));
      c.enqueue(enc.encode(': keep-alive ping\n\n')); // SSE comment: resets any byte-level timer
      signal?.addEventListener('abort', () => { try { c.error(signal.reason); } catch {} });
    },
  });

const openai = createOpenAI({
  apiKey: 'test',
  fetch: async (_i, init) =>
    new Response(preludeThenStall(init?.signal ?? undefined), {
      status: 200,
      headers: { 'content-type': 'text/event-stream' },
    }),
});

const run = async (label: string, opts: Record<string, unknown>) => {
  const result = streamText({ model: openai.chat('gpt-4o'), prompt: 'hi', onError: () => {}, ...opts });
  const t0 = Date.now();
  const consume = (async () => {
    const parts: string[] = [];
    try { for await (const p of result.fullStream) parts.push(p.type); }
    catch (e) { return `aborted@${Date.now() - t0}ms (${(e as Error).name})`; }
    return `ended@${Date.now() - t0}ms parts=[${parts}]`;
  })();
  const guard = new Promise((r) => setTimeout(() => r(`HANG (nothing in 2000ms) — deadline never fired`), 2000));
  console.log(`${label.padEnd(12)} -> ${await Promise.race([consume, guard])}`);
};

await run('chunkMs:500', { timeout: { chunkMs: 500 } });
await run('totalMs:500', { timeout: { totalMs: 500 } });
process.exit(0);

Output:

chunkMs:500  -> HANG (nothing in 2000ms) — deadline never fired
totalMs:500  -> ended@546ms parts=[start,abort]
  • chunkMs:500 never fires — 2s elapse with no content, because the timer only measures gaps between content chunks and never arms before the first one. This is the exact stall it's blind to.
  • totalMs:500 is the only deadline that catches it, and only at its coarse bound (parts=[start, abort]). In production totalMs/stepMs are set high for slow reasoning generations, so relying on them here means waiting far too long to fail over.

firstChunkMs would fire in this window. Note the keep-alive comment: a byte-level transport timeout (undici bodyTimeout, or a hand-rolled "time to first byte" fetch wrapper) would be reset by the prelude/ping bytes and by the 200 headers, so it can't distinguish "stalled before content" from "actively streaming keep-alives" — only the SDK, which parses the SSE framing, can.

Why not just do this in a custom fetch / transport layer?

headersMs: possible at the transport layer, but awkward. A custom fetch can enforce a header deadline today (for example undici's headersTimeout via a custom dispatcher). But that's runtime‑specific boilerplate: Node/undici‑only, doesn't translate to Edge/Workers/browser fetch, and forces every consumer to hand‑roll a dispatcher just to bound "did the provider answer at all." An SDK‑level headersMs works uniformly across every provider and runtime. The argument here is ergonomics: it can be done below, but it shouldn't have to be.

firstChunkMs: genuinely cannot be done reliably at the transport layer. A transport‑level body/inactivity timeout (e.g. undici bodyTimeout) resets on any received bytes. Streaming providers emit keep‑alive/ping/comment lines and prelude events (and the 200 headers arrive immediately) before the first real content token. To the socket the response is "active," so a body‑inactivity timeout keeps resetting and never fires. Only the SDK parses the SSE framing and the provider's event schema, so only the SDK can distinguish a genuine content chunk from keep‑alive noise and enforce a true "time to first content" deadline. This is the deadline that has to live upstream.

Implementation notes (hooks that already exist)

  • The SDK already classifies content vs. non‑content parts via an isOutputChunkType map and tracks a hasReceivedOutputChunk flag in the stream transform. firstChunkMs can reuse that classification: arm at step start, disarm on the first part where isOutputChunkType is true.
  • The chunk/step timers are created with a shared setAbortTimeout helper and merged into the call's abort signal; headersMs/firstChunkMs can follow the same pattern (a dedicated AbortController armed with setTimeout, merged into the signal). headersMs arms before await doStream(...) and clears when it resolves.
  • Timers are set up inside the step loop, so per‑step re‑arming (below) is the natural default.

Proposed API

Extend the existing timeout object:

const result = streamText({
  model,
  prompt,
  timeout: {
    headersMs: 5_000,      // new: no response headers in 5s -> fail fast (provider not answering)
    firstChunkMs: 30_000,  // new: headers but no content chunk in 30s -> abort (stalled before first token)
    chunkMs: 30_000,       // existing: gap between chunks once content is flowing
    stepMs: 60_000,        // existing
    totalMs: 180_000,      // existing
    // toolMs / tools: existing per-tool timeouts (unchanged)
  },
});

Semantics:

  • Per step. In multi‑step / tool‑calling runs each step is its own underlying model call, so headersMs and firstChunkMs re‑arm per step (like stepMs/chunkMs), not just on the first.
  • First content chunk. firstChunkMs is satisfied by the first content‑bearing part (text/reasoning delta, tool‑input/tool‑call parts) — i.e. isOutputChunkType === true — not by keep‑alive/ping framing or an empty role prelude. This is the crux, and it's why it can't be a transport‑level "first byte" check.
  • On expiry. Abort the underlying request and reject with a timeout error, consistent with how stepMs/chunkMs abort today. Because these fire before any content is emitted, an abort here is safe to retry without duplicating output.
  • Opt‑in. Both default to unset (current behavior). Setting either only tightens the covered window.

Alternatives considered

  • Lower chunkMs. Doesn't help — it never arms before the first content chunk (see reproduction).
  • Lower stepMs/totalMs. Forces the outer bound down, killing legitimately slow reasoning generations. The point is to fail fast on dead streams while staying patient with slow but healthy ones — that needs separate, earlier deadlines.
  • Custom fetch for everything. Works for headersMs (with runtime‑specific effort) but cannot implement firstChunkMs reliably, because keep‑alives defeat transport‑level inactivity timers.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions