Streaming timeouts: cover the pre-first-chunk window with `headersMs` and `firstChunkMs`

# Streaming timeouts: cover the pre‑first‑chunk window with `headersMs` and `firstChunkMs`

## Summary

`streamText`/`generateText` support a `timeout` config with `totalMs`, `stepMs`, `chunkMs`, `toolMs`, and per‑tool `tools` overrides. For streaming, `totalMs`/`stepMs` guard the coarse outer bounds and `chunkMs` guards the gap between chunks. But there are two early‑stall windows that no *dedicated, fast* deadline covers:

1. **No response at all.** The request is dispatched but the provider never returns response headers (connection queued/hung, provider overloaded). The internal `await doStream(...)` sits here, bounded only by the coarse `totalMs`/`stepMs`.
2. **Headers, but no content.** The provider returns `200` immediately, then stalls before emitting the first content chunk (only keep‑alive/ping framing or an empty role prelude). `chunkMs` never arms in this window — see the reproduction below.

I'd like to propose two additional, independently‑tunable deadlines that close these windows:

- **`headersMs`**: dispatch to response headers received.
- **`firstChunkMs`**: stream start to first *content* chunk emitted.

## The timeline

```
request                    RESPONSE                 first content        subsequent
dispatched                 headers received         chunk/token          chunks
   │                          │                        │                    │
   ├──── connect + send ─────►│                        │                    │
   │                          ├──── provider stalls ──►│──gap──gap──gap────►│
   │                          │   (keep-alives only)   │                    │
   │                          │                        │                    │
   │◄──── headersMs ─────────►│                        │                    │
   │                          │◄──── firstChunkMs ────►│                    │
   │                          │       (proposed)       │◄─── chunkMs ──────►│
   │                          │                        │   (existing)       │
   │◄──────────────────────────── stepMs / totalMs ───────────────────────►│
```

- **`headersMs`** covers `connect + request` to `response headers` — i.e. the internal `await doStream(...)`. Ends at headers.
- **`firstChunkMs`** covers `stream start` to `first content chunk`. This is the window `chunkMs` cannot see, because `chunkMs` only measures gaps *between* chunks and never arms before the first one.
- **`chunkMs`** (existing) covers gaps once content is flowing.
- **`stepMs`/`totalMs`** (existing) remain the coarse outer bounds.

## Why the existing knobs don't cover this

- **`totalMs` / `stepMs`** are the *outer* bounds of a whole generation. For reasoning models they're intentionally large (a legitimate first token can take tens of seconds), so relying on them to catch a dead stream means waiting far too long before failing over. They *do* currently bound the dispatch→headers window (the merged abort signal is passed to `doStream`), but only at that coarse scale — hence `headersMs`.

- **`chunkMs`** measures the delay *between* successive chunks. The chunk timer is only (re)armed from inside the stream transform, and that transform isn't invoked until the first content chunk surfaces — provider prelude parts (stream start, response metadata) are buffered upstream and don't reach it beforehand. So before the first content chunk there is no armed timer, and a stream that produces no content never trips `chunkMs`. The exact stall we want to catch is the one it's blind to.

## Minimal reproduction

Deterministic, no network: a custom `fetch` returns `200` immediately, emits the role prelude plus an SSE keep-alive comment, then goes silent — never producing a content chunk. This is the "headers but no content" stall.

```ts
// repro.ts — `pnpm tsx repro.ts` (ai@7, @ai-sdk/openai@4)
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

const enc = new TextEncoder();

/** 200 OK, role prelude + keep-alive comment, then silence. No content chunk. */
const preludeThenStall = (signal?: AbortSignal) =>
  new ReadableStream<Uint8Array>({
    start(c) {
      c.enqueue(enc.encode(
        `data: ${JSON.stringify({ id: 'x', object: 'chat.completion.chunk', choices: [{ index: 0, delta: { role: 'assistant' }, finish_reason: null }] })}\n\n`,
      ));
      c.enqueue(enc.encode(': keep-alive ping\n\n')); // SSE comment: resets any byte-level timer
      signal?.addEventListener('abort', () => { try { c.error(signal.reason); } catch {} });
    },
  });

const openai = createOpenAI({
  apiKey: 'test',
  fetch: async (_i, init) =>
    new Response(preludeThenStall(init?.signal ?? undefined), {
      status: 200,
      headers: { 'content-type': 'text/event-stream' },
    }),
});

const run = async (label: string, opts: Record<string, unknown>) => {
  const result = streamText({ model: openai.chat('gpt-4o'), prompt: 'hi', onError: () => {}, ...opts });
  const t0 = Date.now();
  const consume = (async () => {
    const parts: string[] = [];
    try { for await (const p of result.fullStream) parts.push(p.type); }
    catch (e) { return `aborted@${Date.now() - t0}ms (${(e as Error).name})`; }
    return `ended@${Date.now() - t0}ms parts=[${parts}]`;
  })();
  const guard = new Promise((r) => setTimeout(() => r(`HANG (nothing in 2000ms) — deadline never fired`), 2000));
  console.log(`${label.padEnd(12)} -> ${await Promise.race([consume, guard])}`);
};

await run('chunkMs:500', { timeout: { chunkMs: 500 } });
await run('totalMs:500', { timeout: { totalMs: 500 } });
process.exit(0);
```

Output:

```
chunkMs:500  -> HANG (nothing in 2000ms) — deadline never fired
totalMs:500  -> ended@546ms parts=[start,abort]
```

- **`chunkMs:500` never fires** — 2s elapse with no content, because the timer only measures gaps *between* content chunks and never arms before the first one. This is the exact stall it's blind to.
- **`totalMs:500` is the only deadline that catches it**, and only at its coarse bound (`parts=[start, abort]`). In production `totalMs`/`stepMs` are set high for slow reasoning generations, so relying on them here means waiting far too long to fail over.

`firstChunkMs` would fire in this window. Note the keep-alive comment: a byte-level transport timeout (undici `bodyTimeout`, or a hand-rolled "time to first byte" `fetch` wrapper) would be **reset** by the prelude/ping bytes and by the `200` headers, so it can't distinguish "stalled before content" from "actively streaming keep-alives" — only the SDK, which parses the SSE framing, can.

## Why not just do this in a custom `fetch` / transport layer?

**`headersMs`: possible at the transport layer, but awkward.** A custom `fetch` can enforce a header deadline today (for example undici's `headersTimeout` via a custom dispatcher). But that's runtime‑specific boilerplate: Node/undici‑only, doesn't translate to Edge/Workers/browser `fetch`, and forces every consumer to hand‑roll a dispatcher just to bound "did the provider answer at all." An SDK‑level `headersMs` works uniformly across every provider and runtime. The argument here is **ergonomics**: it *can* be done below, but it shouldn't have to be.

**`firstChunkMs`: genuinely cannot be done reliably at the transport layer.** A transport‑level body/inactivity timeout (e.g. undici `bodyTimeout`) resets on *any* received bytes. Streaming providers emit keep‑alive/ping/comment lines and prelude events (and the `200` headers arrive immediately) before the first real content token. To the socket the response is "active," so a body‑inactivity timeout keeps resetting and never fires. Only the SDK parses the SSE framing and the provider's event schema, so **only the SDK can distinguish a genuine content chunk from keep‑alive noise** and enforce a true "time to first content" deadline. This is the deadline that has to live upstream.

## Implementation notes (hooks that already exist)

- The SDK already classifies content vs. non‑content parts via an `isOutputChunkType` map and tracks a `hasReceivedOutputChunk` flag in the stream transform. `firstChunkMs` can reuse that classification: arm at step start, disarm on the first part where `isOutputChunkType` is `true`.
- The chunk/step timers are created with a shared `setAbortTimeout` helper and merged into the call's abort signal; `headersMs`/`firstChunkMs` can follow the same pattern (a dedicated `AbortController` armed with `setTimeout`, merged into the signal). `headersMs` arms before `await doStream(...)` and clears when it resolves.
- Timers are set up *inside* the step loop, so per‑step re‑arming (below) is the natural default.

## Proposed API

Extend the existing `timeout` object:

```ts
const result = streamText({
  model,
  prompt,
  timeout: {
    headersMs: 5_000,      // new: no response headers in 5s -> fail fast (provider not answering)
    firstChunkMs: 30_000,  // new: headers but no content chunk in 30s -> abort (stalled before first token)
    chunkMs: 30_000,       // existing: gap between chunks once content is flowing
    stepMs: 60_000,        // existing
    totalMs: 180_000,      // existing
    // toolMs / tools: existing per-tool timeouts (unchanged)
  },
});
```

Semantics:

- **Per step.** In multi‑step / tool‑calling runs each step is its own underlying model call, so `headersMs` and `firstChunkMs` re‑arm per step (like `stepMs`/`chunkMs`), not just on the first.
- **First *content* chunk.** `firstChunkMs` is satisfied by the first content‑bearing part (text/reasoning delta, tool‑input/tool‑call parts) — i.e. `isOutputChunkType === true` — not by keep‑alive/ping framing or an empty role prelude. This is the crux, and it's why it can't be a transport‑level "first byte" check.
- **On expiry.** Abort the underlying request and reject with a timeout error, consistent with how `stepMs`/`chunkMs` abort today. Because these fire *before* any content is emitted, an abort here is safe to retry without duplicating output.
- **Opt‑in.** Both default to unset (current behavior). Setting either only tightens the covered window.

## Alternatives considered

- **Lower `chunkMs`.** Doesn't help — it never arms before the first content chunk (see reproduction).
- **Lower `stepMs`/`totalMs`.** Forces the outer bound down, killing legitimately slow reasoning generations. The point is to fail fast on *dead* streams while staying patient with *slow but healthy* ones — that needs separate, earlier deadlines.
- **Custom `fetch` for everything.** Works for `headersMs` (with runtime‑specific effort) but *cannot* implement `firstChunkMs` reliably, because keep‑alives defeat transport‑level inactivity timers.

## Related

- #15430: streamText hangs when the fetch body stalls mid‑stream; `abortSignal` doesn't propagate.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Streaming timeouts: cover the pre-first-chunk window with `headersMs` and `firstChunkMs` #16538

Streaming timeouts: cover the pre‑first‑chunk window with `headersMs` and `firstChunkMs`

Summary

The timeline

Why the existing knobs don't cover this

Minimal reproduction

Why not just do this in a custom `fetch` / transport layer?

Implementation notes (hooks that already exist)

Proposed API

Alternatives considered

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Streaming timeouts: cover the pre-first-chunk window with headersMs and firstChunkMs #16538

Description

Streaming timeouts: cover the pre‑first‑chunk window with headersMs and firstChunkMs

Summary

The timeline

Why the existing knobs don't cover this

Minimal reproduction

Why not just do this in a custom fetch / transport layer?

Implementation notes (hooks that already exist)

Proposed API

Alternatives considered

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Streaming timeouts: cover the pre-first-chunk window with `headersMs` and `firstChunkMs` #16538

Streaming timeouts: cover the pre‑first‑chunk window with `headersMs` and `firstChunkMs`

Why not just do this in a custom `fetch` / transport layer?