OpenRouter backend + Anthropic Citations revamp + GenerationResult.usage by random-walks · Pull Request #7 · random-walks/citeformer

random-walks · 2026-04-25T20:17:32Z

Summary

Adds OpenRouter, Fireworks, and Together backends and revamps the Anthropic backend: prompt caching via cache_control on document blocks, real messages.stream() block-level streaming replacing the prior pseudo-stream, temperature no longer silently dropped, and rich citation metadata preserved through to Citation. Token usage and per-call cost are threaded through every API backend onto a new GenerationResult.usage field. verify() now uses cited_text as the NLI premise when populated (a sharper signal on long docs). The tier-honesty docs are rewritten, since "schema-tier vs logit-tier" is no longer the right framing for the API/local split: every modern provider's strict structured-outputs mode is now real token-level constrained sampling.

The headline win: FireworksBackend is true logit-tier on a hosted API — Fireworks's native response_format={"type":"grammar"} mode accepts citeformer's existing cite-id GBNF rule unchanged, so the same constraint that masks logits inside HFBackend runs inside the Fireworks runtime. The OpenAI-class refactor that enables this (extracted _build_response_format + _decode_response_text hooks) is reusable by any future backend with non-OpenAI response shapes.
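The two-hook refactor described above can be sketched roughly as follows. This is a minimal illustration, not citeformer's code: `JsonModeBackend`, `GrammarModeBackend`, `handle`, and `build_cite_id_rule` are hypothetical names, the JSON-schema body is elided, and the GBNF rule is a simplified stand-in for the project's real `cite-id` grammar.

```python
import json
from typing import Any


def build_cite_id_rule(n_sources: int) -> str:
    """Illustrative GBNF rule allowing only cite ids 1..n_sources (hypothetical helper)."""
    alternatives = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return f"cite-id ::= {alternatives}"


class JsonModeBackend:
    """Hypothetical base: strict-JSON request, segments-JSON response."""

    def _build_response_format(self, n_sources: int) -> dict[str, Any]:
        # Schema body elided for brevity; only the overall shape matters here.
        return {"type": "json_schema", "json_schema": {"name": "segments", "strict": True}}

    def _decode_response_text(self, raw: str) -> str:
        # Flatten a {"segments": [{"text": ...}, ...]} payload into plain text.
        segments = json.loads(raw)["segments"]
        return "".join(segment["text"] for segment in segments)

    def handle(self, raw: str, n_sources: int) -> tuple[dict[str, Any], str]:
        # Shared path: subclasses swap the two hooks and never touch this method.
        return self._build_response_format(n_sources), self._decode_response_text(raw)


class GrammarModeBackend(JsonModeBackend):
    """Hypothetical subclass: grammar-mode request, plain-text response."""

    def _build_response_format(self, n_sources: int) -> dict[str, Any]:
        return {"type": "grammar", "grammar": build_cite_id_rule(n_sources)}

    def _decode_response_text(self, raw: str) -> str:
        # Grammar mode already returns marked-up text, so decoding is a passthrough.
        return raw
```

Because only the hooks differ, the shared request path stays identical for both response shapes, which is the property the refactor is meant to preserve.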

Two ADRs document the schema_version 2 → 3 bump (ADR-012 for usage, ADR-013 for Citation rich attribution); both ship inside the same single bump rather than two consecutive bumps in one branch.

Doc-pin verification against current OpenRouter / Anthropic / OpenAI / Pydantic docs surfaced two real OpenRouter correctness issues, both fixed inside this branch: dropped the deprecated extra_body={"usage": {"include": true}} flag (no-op as of OR structured-outputs GA), and renamed TokenUsage.cost_usd → cost_credits (OR usage.cost is denominated in credits, not USD — the old label was actively misleading).

A new tests/integration/test_env_connectivity.py smoke (~$0.01 per full pass) probes every API backend live, asserts the structural §10.1 invariant against the real provider, and asserts last_usage populates with non-zero token counts. Live results from this branch: OpenAI / Anthropic / Anthropic-streaming / OpenRouter all PASSED (OR cost_credits lands on the response live); Fireworks / Together / Gemini / Mistral skipped cleanly with no key.

Invariant touched?

§10.3 output schemas — GenerationResult.schema_version bumped 2 → 3. Two shape changes ship together inside the single bump:

New optional usage: TokenUsage | None on GenerationResult (ADR-012).
Three new optional fields on Citation: cited_text: str | None, source_span: tuple[int, int] | None, document_title: str | None (ADR-013).

Ceremony:

tests/integration/test_schemas/test_generation_result_canonical_snapshot.yml regenerated with the new fields (all defaulting to null).
test_generation_result_schema_version_is_3 (was _is_2) bumped in both tests/integration/test_schemas.py and tests/unit/test_core.py.
Hypothesis fuzz default-version assert updated.
CHANGELOG Contracts (§10) block documents the bump and links both ADRs.
Pre-bump v2 serialisations deserialise cleanly into v3 (the new fields default to None); no migration shim needed.
Backend ABC unchanged — orchestrator pulls last_usage / last_rich_citations via getattr(...), so out-of-tree backends written against v0.1 keep working untouched.
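As a rough illustration of why no migration shim is needed, the sketch below models the v3 shapes with stdlib dataclasses (the real models are pydantic, per the commit notes). The `id` and `text` fields, the `load_result` helper, and the decision to ignore the stored `schema_version` are assumptions made for the example only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: Optional[int] = None
    cache_read_input_tokens: Optional[int] = None
    cost_credits: Optional[float] = None


@dataclass(frozen=True)
class Citation:
    id: int  # hypothetical pre-existing field
    # The three ADR-013 additions, all optional so older payloads still load.
    cited_text: Optional[str] = None
    source_span: Optional[tuple[int, int]] = None
    document_title: Optional[str] = None


@dataclass(frozen=True)
class GenerationResult:
    text: str  # hypothetical pre-existing field
    citations: tuple[Citation, ...] = ()
    usage: Optional[TokenUsage] = None  # the ADR-012 addition
    schema_version: int = 3


def load_result(payload: dict[str, Any]) -> GenerationResult:
    """Load a serialised result; keys absent from older payloads fall back to None."""
    citations = tuple(Citation(**c) for c in payload.get("citations", []))
    usage = payload.get("usage")
    # Assumption: the stored schema_version is ignored and the result is treated as v3.
    return GenerationResult(
        text=payload["text"],
        citations=citations,
        usage=TokenUsage(**usage) if usage else None,
    )
```

A v2-shaped payload such as `{"schema_version": 2, "text": "A [1].", "citations": [{"id": 1}]}` loads with `usage` and the three new `Citation` fields all set to `None`.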

Test plan

make lint green (ruff check + format --check both clean — 116 files formatted).
make test green (619 unit tests, was 499 at branch start; 4 schema integration tests).
make docs-build green (sphinx-build -W succeeds).
mypy src/citeformer strict — no issues found in 53 source files (was 51 at branch start).
Live env-connectivity smoke with make test-integration tests/integration/test_env_connectivity.py:
- test_connectivity_openai — PASSED (structural invariant holds, last_usage non-zero).
- test_connectivity_anthropic — PASSED.
- test_connectivity_anthropic_streaming_yields_chunks — PASSED (real per-block streaming, usage populated from get_final_message()).
- test_connectivity_openrouter — PASSED (cost_credits populated on the live response).
- test_connectivity_fireworks — SKIPPED (no FIREWORKS_API_KEY).
- test_connectivity_together — SKIPPED (no TOGETHER_API_KEY).
- test_connectivity_gemini — SKIPPED (no GEMINI_API_KEY).
- test_connectivity_mistral — SKIPPED (no MISTRAL_API_KEY).
Standalone live smoke against Anthropic + OpenAI + OpenRouter confirmed: real per-block streaming yields N chunks for N-sentence response, usage threading works through both generate() and stream().finalize(), every emitted cite id in [1..N].
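The "every emitted cite id in [1..N]" check used by the smoke tests can be approximated with a few lines. This sketch assumes a bracketed `[n]` marker style and hypothetical helper names; citeformer supports several marker styles, and its real parser is not shown in this PR description.

```python
import re

# Assumption: markers look like "[3]"; other marker styles would need other patterns.
MARKER = re.compile(r"\[(\d+)\]")


def cite_ids(text: str) -> list[int]:
    """Return every cite id found in the text, in left-to-right order."""
    return [int(match) for match in MARKER.findall(text)]


def out_of_range_ids(text: str, n_sources: int) -> list[int]:
    """Return the cite ids that fall outside 1..n_sources (empty list means the invariant holds)."""
    return [i for i in cite_ids(text) if not 1 <= i <= n_sources]
```

A test would then assert `out_of_range_ids(result_text, n_sources) == []` against the live response.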

What's in the box

Backends now: 10 (was 7 at PR start). Two enforcement loci, one GenerationResult:

| Backend | Enforcement | Notes |
| --- | --- | --- |
| `HFBackend` | In-process (XGrammar) | Existing flagship. |
| `LlamaCppBackend` | In-process (GBNF) | Existing. |
| `VLLMBackend` | In-process (XGrammar/llguidance) | Existing. |
| `FireworksBackend` | Provider-runtime (native GBNF) | NEW. Drops citeformer's `cite-id` grammar in unchanged via `response_format={"type":"grammar"}`. |
| `OpenAIBackend` | Provider-runtime (strict JSON) | Refactored: extracted `_build_response_format` + `_decode_response_text` hooks. |
| `AnthropicBackend` | Provider-native (Citations API) | Revamp: prompt caching, real `messages.stream()`, `cited_text`/`source_span` preservation. |
| `OpenRouterBackend` | Provider-runtime (per-upstream) | NEW. Multi-provider routing on OpenAI wire format. `provider.require_parameters: true` keeps strict mode end-to-end. |
| `TogetherBackend` | Provider-runtime (strict json_schema) | NEW. Strict structured outputs on Llama / Qwen / DeepSeek. |
| `GeminiBackend` | Provider-runtime (`response_schema`) | Existing. |
| `MistralBackend` | Provider-runtime (strict JSON) | Existing. |
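For the `AnthropicBackend` row, the prompt-caching change amounts to attaching `cache_control` to each document block. The sketch below shows one plausible way to build such blocks; `build_document_blocks` is a hypothetical helper, and the block layout follows the plain-text document shape from Anthropic's Citations documentation as best understood here, so it should be checked against the current API reference before use.

```python
from typing import Any


def build_document_blocks(
    sources: list[tuple[str, str]], use_prompt_cache: bool = True
) -> list[dict[str, Any]]:
    """Turn (title, content) pairs into citation-enabled document blocks."""
    blocks: list[dict[str, Any]] = []
    for title, content in sources:
        block: dict[str, Any] = {
            "type": "document",
            "source": {"type": "text", "media_type": "text/plain", "data": content},
            "title": title,
            "citations": {"enabled": True},
        }
        if use_prompt_cache:
            # Marks the block as cacheable so repeat calls can bill cache-read prices.
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks
```

Passing `use_prompt_cache=False` yields the same blocks without the `cache_control` key, mirroring the opt-out described in the commit notes.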

Highlights for reviewers

Where to look first:

src/citeformer/backends/fireworks.py — new file. Subclasses OpenAIBackend and overrides only _build_response_format (return {"type":"grammar","grammar":<GBNF>} from build_grammar) and _decode_response_text (passthrough — grammar mode returns plain text). The whole backend is ~130 lines because all the heavy lifting is inherited.
src/citeformer/backends/openai.py — refactor. New _build_response_format + _decode_response_text hooks make the response-shape pluggable for any future provider. Pure refactor for the existing call paths.
src/citeformer/backends/openrouter.py — new file. Subclasses OpenAIBackend and overrides only _augment_create_kwargs to inject extra_body routing fields.
src/citeformer/backends/together.py — new file. Even thinner subclass than OpenRouter — just defaults pointing at Together's base URL + TOGETHER_API_KEY pickup.
src/citeformer/backends/anthropic.py — substantial revamp. New _build_request helper centralises the request shape so generate() and stream() stay consistent. _flatten_blocks now takes an optional record= list parameter so rich-citation metadata is captured as a side-channel without changing the return type.
src/citeformer/citeformer.py — orchestrator changes are tiny: _pull_usage and _pull_rich_citations helpers do getattr(backend, ...) lookups so the Backend ABC stays untouched. _parse_citations zips the rich list onto parsed markers by index; length mismatch falls through silently.
src/citeformer/verify/entailment.py — score_entailment now uses citation.cited_text as the NLI premise when populated (sharper signal than the whole source content), falls back to Source.content otherwise. Each citation in a mixed-backend pipeline uses its sharpest premise independently.
docs/reference/architecture.md — the "Tiered enforcement" section was rewritten to reflect that strict structured-outputs is real token-level constrained sampling on every modern provider now (with cited per-provider docs). Honest distinction is "where the masking runs" (in-process vs provider-runtime), not "logit vs schema".
docs/decisions/012-generation-result-schema-v3.md and docs/decisions/013-citation-rich-attribution.md — full ADR write-ups for the two §10.3 shape changes that share the single bump.
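The "zip by index, fall through silently on length mismatch" behaviour described for `_parse_citations` could look something like the following. It is a simplified sketch using plain dicts and a hypothetical `attach_rich` helper; the real implementation operates on `Citation` models.

```python
from typing import Any, Optional


def attach_rich(
    parsed_ids: list[int], rich: Optional[list[dict[str, Any]]]
) -> list[dict[str, Any]]:
    """Pair parsed markers with rich metadata by position, or leave the new fields None."""
    citations = [
        {"id": i, "cited_text": None, "source_span": None, "document_title": None}
        for i in parsed_ids
    ]
    # Misaligned data is worse than no data: any length mismatch leaves the fields as None.
    if rich is None or len(rich) != len(parsed_ids):
        return citations
    for citation, meta in zip(citations, rich):
        citation["cited_text"] = meta.get("cited_text")
        span = meta.get("source_span")
        citation["source_span"] = tuple(span) if span is not None else None
        citation["document_title"] = meta.get("document_title")
    return citations
```

Backends that never set `last_rich_citations` take the `rich is None` path, so their citations come back with the three new fields unset.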

Explicitly deferred to follow-up PRs:

Async surface (agenerate / astream) — large cross-cutting refactor, deserves its own PR.
Bedrock + Vertex AI backends — would round out the major hosted-LLM landscape but each has its own non-trivial auth + SDK story.
verify() fine-grain windowing — now that we preserve cited_text, we could optionally also score both the cited span AND a small surrounding context window for richer attribution semantics. Marginal vs the current win.
Provider-specific cost tables — for backends that don't expose cost directly (everyone except OpenRouter), a small pricing table per model would let usage carry an inferred cost_credits too. Not load-bearing.

🤖 Generated with Claude Code

…ult.usage

**OpenRouter backend** (new): `citeformer.backends.openrouter.OpenRouterBackend` under the new `[openrouter]` extra. Subclasses `OpenAIBackend` (OpenRouter is OpenAI wire-compatible) and adds a `_augment_create_kwargs` hook that threads `provider.require_parameters: true` (refuses to route to upstreams that drop strict-mode parameters), `models=[primary, *fallbacks]`, and `usage.include` (per-call USD cost) onto the request via `extra_body`. App-attribution `HTTP-Referer` / `X-Title` headers are wired through `default_headers` from the `app_url` / `app_name` constructor kwargs.

**Anthropic backend revamp** (existing): three load-bearing fixes:

1. `cache_control: {"type": "ephemeral"}` on every document block by default (`use_prompt_cache=True`); repeat-source RAG now bills cache-read prices on subsequent calls.
2. `temperature` is now honoured when supplied (it was previously dropped silently; only `max_tokens` was extracted from `**options`).
3. Real `messages.stream()` block-level streaming via the SDK's stream context manager. Citation events only land at `content_block_stop`, so per-block is the natural granularity for the Citations API; the prior pseudo-stream (call generate, slice on `.!?`) is gone. Falls back to the non-streaming path when the client lacks a `stream` attribute (older SDKs / test stand-ins that mock only `create`).

**GenerationResult.usage + ADR-012**: schema_version bumped 2 → 3 with new optional `usage: TokenUsage | None`. `TokenUsage` is a frozen pydantic model carrying `input_tokens`, `output_tokens`, optional `cache_creation_input_tokens` / `cache_read_input_tokens` (Anthropic prompt-caching), and `cost_usd` (OpenRouter exposes this directly).

Each API backend (OpenAI, Anthropic, Gemini, Mistral, OpenRouter) populates `self.last_usage` at the end of `generate()`; the `Citeformer` orchestrator pulls it via `getattr(backend, "last_usage", None)` and threads it onto `GenerationResult.usage` for both `generate()` and `stream().finalize()`. Local backends leave it `None`: token accounting is meaningless when you control the runtime. The orchestrator uses `getattr` rather than a typed property on the `Backend` ABC so out-of-tree backends written against the v0.1 ABC keep working untouched.

CHANGELOG entry + tier-honesty doc updates land in follow-up commits on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
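The `getattr`-based pull described in this commit can be sketched as below. `pull_usage`, `LegacyBackend`, and `ApiBackend` are hypothetical names, and the usage payload is a plain dict rather than the real `TokenUsage` model.

```python
from typing import Any, Optional


def pull_usage(backend: Any) -> Optional[dict[str, int]]:
    """Read last_usage if the backend exposes it; otherwise report no usage."""
    return getattr(backend, "last_usage", None)


class LegacyBackend:
    """Hypothetical out-of-tree backend written before last_usage existed."""

    def generate(self, prompt: str) -> str:
        return prompt


class ApiBackend:
    """Hypothetical API backend that records usage at the end of generate()."""

    def __init__(self) -> None:
        self.last_usage: Optional[dict[str, int]] = None

    def generate(self, prompt: str) -> str:
        # Token counts here are made up; a real backend would copy them from the response.
        self.last_usage = {"input_tokens": len(prompt.split()), "output_tokens": 1}
        return "ok"
```

Because the lookup tolerates a missing attribute, no change to the abstract base class is required for older backends to keep working.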

…tes, tier-honesty docs

**Cross-backend conformance test** (`tests/unit/test_backend_conformance.py`, 33 grid cells). Parametrised over MockBackend + all five API backends with fake clients. Asserts the §10.1 / §10.3 contracts hold uniformly: every cite id in [1..N], empty-source rejection, marker styles propagate across all 4 shapes, `last_usage` populated on API backends (per ADR-012), and `stream().finalize().citations == generate().citations`. Runs in <1s without network.

**Anthropic unit suite expansion** (10 new tests in `tests/unit/test_api_backends.py`): cache_control on/off, temperature threading (was silently dropped pre-revamp), real `messages.stream()` event handling (with a `_FakeAnthropicStream` context-manager stand-in), fallback-when-no-stream, and `last_usage` extraction (object + dict shapes, missing-field path).

**OpenRouter unit suite** (`tests/unit/test_openrouter_backend.py`, 13 tests). `provider.require_parameters: true` default + opt-out, fallback model ordering, `usage.include` flag, app-attribution headers via `monkeypatch.setattr(openai_sdk, "OpenAI", ...)`, `OPENROUTER_API_KEY` env-var fallback, inheritance sanity (strict JSON schema + segment flattening still come from `OpenAIBackend` unchanged), and merging of caller-supplied `extra_body` with the routing block.

**Tier-honesty docs revamp.** README, `docs/index.md`, `docs/reference/architecture.md` all framed the API/local split as "schema-tier vs logit-tier", but as of late 2025 every modern provider's strict structured-outputs mode is real token-level constrained sampling. The new framing is **where the masking runs**: in-process (HF / vLLM / llama.cpp) vs provider-runtime (everything else). Architecture doc has the per-provider table; README updated to "eight backends" with OpenRouter joining; index.md anchor reference updated to the new section id.

**CHANGELOG entry** under [Unreleased] documents OpenRouter, Anthropic revamp, ADR-012 schema bump, conformance test, and the tier-honesty doc rewrite.

Live smoke (Anthropic + OpenAI) confirmed end-to-end: real per-block streaming yielded 2 chunks for 2-sentence response, `usage` populated on `generate()` and stream-`finalize()`, every cite id in [1..3]. OpenRouter live smoke pending: `OPENROUTER_API_KEY` not yet in `.env`.

Suite: 575 unit tests (was 499), ruff clean, mypy strict clean, docs-build green with `-W`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
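The "merging of caller-supplied `extra_body` with the routing block" that the OpenRouter suite covers might be implemented along these lines. This is a guess at the behaviour, not the project's code: `build_extra_body` is a hypothetical helper, and the precedence chosen here (routing fields win, caller `provider` preferences are merged key by key) is an assumption.

```python
from typing import Any, Optional, Sequence


def build_extra_body(
    primary: str,
    fallbacks: Sequence[str] = (),
    require_parameters: bool = True,
    caller_extra_body: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Combine routing fields with whatever extra_body the caller passed in."""
    merged: dict[str, Any] = dict(caller_extra_body or {})
    # Preserve caller-supplied provider preferences, then layer the strict-mode flag on top.
    provider: dict[str, Any] = dict(merged.get("provider", {}))
    if require_parameters:
        provider["require_parameters"] = True
    if provider:
        merged["provider"] = provider
    # Primary model first, then fallbacks in the order given.
    merged["models"] = [primary, *fallbacks]
    return merged
```

The result would be passed as `extra_body=` on the chat-completions call, leaving the caller's original dict unmodified.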

…tivity test

A docs-verification pass against current OpenRouter / Anthropic / OpenAI docs surfaced two real OpenRouter correctness issues; rich-citation metadata that the Anthropic Citations API has been returning all along was being discarded; and the branch needed a live env-connectivity smoke covering every API backend.

**OpenRouter doc-pin fixes:**

- Drop `extra_body={"usage": {"include": true}}`. Per OpenRouter's usage-accounting docs the flag is *deprecated and a no-op*: cost is returned on every response unconditionally as of structured-outputs GA. We were sending a meaningless flag.
- Rename `TokenUsage.cost_usd` → `cost_credits`. OpenRouter `usage.cost` is denominated in OpenRouter credits, not USD. The old label was actively misleading. Renamed before any release ships, no migration needed.
- Drop the now-meaningless `include_cost` constructor kwarg.
- Docstring note on `anthropic/claude-sonnet-4.6` (OR, dot) vs `claude-sonnet-4-6` (native Anthropic, dash), an easy footgun.

**Anthropic rich-citation preservation (ADR-013).** `Citation` gains three optional fields: `cited_text: str | None`, `source_span: tuple[int, int] | None`, `document_title: str | None`. Anthropic's Citations API returns this on every citation event; `AnthropicBackend._flatten_blocks` now records it onto a `last_rich_citations: list[dict]` instance attribute (one entry per marker emitted, in left-to-right order). The orchestrator pulls it via `getattr(backend, "last_rich_citations", None)` (mirrors the `last_usage` pattern) and zips it with the parsed marker list inside `_parse_citations`. Length mismatch falls through silently with the new fields left None, since misaligned data is worse than no data. Other backends leave the new fields None, which honestly signals that schema-tier providers don't have span-level attribution.

The `Citation` change is the second shape change inside `schema_version 3` (the first was `usage` from ADR-012). Both ship together inside the single 2 → 3 bump rather than two consecutive bumps in one branch. Snapshot regenerated.

**Env-connectivity test** (`tests/integration/test_env_connectivity.py`). Six `@pytest.mark.integration` tests: one per API backend, plus a dedicated Anthropic-streaming test. Each issues the smallest possible request (1 source, 80 max_tokens), asserts the structural §10.1 invariant against the live provider, and asserts `backend.last_usage` populates with non-zero token counts (live verification of the ADR-012 token-accounting contract). OpenRouter additionally asserts `cost_credits` lands on the response. Total cost across all 5 backends per full pass: ~$0.01.

Live results from this branch (4 of 6 with keys present):

- test_connectivity_openai PASSED
- test_connectivity_anthropic PASSED
- test_connectivity_anthropic_streaming_yields_chunks PASSED (real per-block streaming, usage from get_final_message)
- test_connectivity_openrouter PASSED (cost_credits populates live)
- test_connectivity_gemini SKIPPED (no key)
- test_connectivity_mistral SKIPPED (no key)

Suite: 585 unit tests (was 575) + 4 schema integration. Ruff clean, mypy strict clean, sphinx-build -W green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

**FireworksBackend (extra `fireworks`)**: the cleanest "true logit-tier on a hosted API" backend possible. Fireworks's `response_format={"type": "grammar", "grammar": "<GBNF>"}` mode accepts citeformer's existing `cite-id` GBNF rule UNCHANGED, so the same grammar that masks logits inside `HFBackend` runs inside the Fireworks runtime. Subclasses `OpenAIBackend` and overrides only `_build_response_format` (swap strict-JSON for grammar mode) and `_decode_response_text` (Fireworks returns plain text with markers, not segments JSON, so flattening is a no-op). Default model `accounts/fireworks/models/llama-v3p1-8b-instruct`. Env: `FIREWORKS_API_KEY`.

**TogetherBackend (extra `together`)**: strict `json_schema` constrained decoding on Together's open-weight upstreams (Llama / Qwen / DeepSeek). OpenAI-wire-compatible: schema construction, segment flattening, streaming, and `last_usage` extraction are all inherited unchanged. Default `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`. Env: `TOGETHER_API_KEY`.

**OpenAIBackend refactor: extracted two hooks.** `_build_response_format` and `_decode_response_text` factor out the OpenAI-specific bits so backends with non-OpenAI response shapes (Fireworks's grammar mode today, Together's regex mode some day) can swap them without touching `generate()`. Pure refactor, no behavioural change for existing backends.

**verify() uses cited_text as NLI premise when populated** (builds on the ADR-013 work). When `Citation.cited_text` is set (Anthropic Citations API path), `score_entailment` scores entailment against that span instead of the whole source content. Sharper signal, especially on long documents where the relevant assertion is buried past DeBERTa's 512-token horizon. Falls back to full source when cited_text is None (every backend except Anthropic today). Mixed-citation results work too: each citation uses the sharpest premise available to *it*.

**Cross-backend conformance grid extended to 8 backends** (was 6). Fireworks's fake client introspects the GBNF payload and emits text with matching delimiters, simulating provider-side grammar-constrained sampling so the marker-style propagation grid runs against Fireworks just like the others.

**Env-connectivity test extended** with `test_connectivity_fireworks` and `test_connectivity_together`. Both skip cleanly when the matching `*_API_KEY` is absent. Live results from this branch (4 of 8 with keys present): OpenAI / Anthropic / Anthropic-streaming / OpenRouter all PASSED; Fireworks / Together / Gemini / Mistral SKIPPED (no keys).

Suite: **619 unit tests** (was 581) + 4 schema integration. Ruff clean, mypy strict clean (53 source files now), sphinx-build -W green.

Documentation:

- Backend count "eight" → "ten" across README + index.md.
- README backend table grows two rows; install snippet adds two extras.
- architecture.md tier table grows two rows; Fireworks called out as the cleanest "logit-tier on hosted API" path with the docs link.
- backends/__init__.py docstring updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
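The per-citation premise choice described for `score_entailment` reduces to a small fallback rule. The sketch below is illustrative only: `pick_premise` and `CitationView` are hypothetical names, and treating an empty string the same as `None` is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CitationView:
    """Hypothetical minimal view of a citation for premise selection."""

    source_content: str
    cited_text: Optional[str] = None


def pick_premise(citation: CitationView) -> str:
    """Prefer the cited span as the NLI premise; fall back to the whole source."""
    # Assumption: an empty cited_text counts as "not populated".
    if citation.cited_text:
        return citation.cited_text
    return citation.source_content


def pick_premises(citations: list[CitationView]) -> list[str]:
    # Each citation decides independently, so mixed-backend results work unchanged.
    return [pick_premise(c) for c in citations]
```

The chosen premise would then be paired with the generated claim and handed to the NLI model, which is outside the scope of this sketch.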

Pure formatter fixes that CI's `ruff format --check` flagged. Ran `uv run ruff format .` over the 8 files I edited in the OpenRouter + Fireworks + Together work; the formatter collapsed some multi-line statements and re-wrapped a couple of single-line continuations. No behavioural change. `make lint` (which runs both `ruff check` and `ruff format --check`) is now green locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds OPENROUTER_API_KEY / FIREWORKS_API_KEY / TOGETHER_API_KEY / GEMINI_API_KEY / MISTRAL_API_KEY to the example file with a one-line hint per backend. Each line is commented out so copying to .env stays opt-in. Also broadens the surrounding comment to point at the new test_env_connectivity.py suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ine-grain windowing, cost tables

ADR-014 (Accepted): async surface (`agenerate` / `astream`) on `Backend` ABC + `Citeformer` orchestrator. Sensible to_thread defaults so every backend works in async code unchanged; native overrides land for OpenAI + Anthropic in the implementation commits that follow this one.

ADR-015 (Decided, deferred): Bedrock + Vertex AI backends. Both proxy to providers we already support directly (Anthropic, Gemini), both have non-trivial enterprise auth (SigV4, GCP service accounts), and existing backends already accept Bedrock/Vertex clients via `client=…` injection. Documented signals that would justify the work.

ADR-016 (Decided, deferred): fine-grain windowing in `verify()` (score cited_text + surrounding context window, take max entailment). The current cited_text-as-premise change (ADR-013) is itself uncalibrated; adding window hyperparameters before measurement is a tuning knob in search of a problem. Documented the calibration data that would unlock the work.

ADR-017 (Decided, not planned): provider-specific pricing tables to infer USD cost from token counts. Pricing changes constantly, the data is one multiply away from the token counts we already expose, OpenRouter solved it correctly by reporting cost server-side, and cached/batch pricing tiers compound any error. We surface tokens; consumers price.

ADR-014 sets the design that the next 4 commits will implement. ADRs 015-017 lock in the deferral reasoning so future contributors don't quietly pick them up without weighing tradeoffs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements ADR-014. Adds parallel async surface to every layer:

**Backend ABC**: `agenerate()` and `astream()` with `asyncio.to_thread` defaults. Every existing backend (HF, vLLM, llama.cpp, Mock, Gemini, Mistral, plus any out-of-tree backend) gains async support without any override. The defaults are correct but bounded by the sync SDK clients underneath; backends with native async clients override for genuine concurrency.

**Citeformer orchestrator**: `agenerate()` returns a `GenerationResult` (mirrors `generate()` exactly, just awaited); `astream()` returns a new `AsyncStreamingResult` with `__aiter__` / `__anext__` / `await stream.finalize()` symmetry to the existing `StreamingResult`. Same parsing, rendering, usage threading, rich-citation metadata flow as the sync path.

**OpenAIBackend native async**: `agenerate()` mirrors `generate()` but uses `self.async_client` (lazy `AsyncOpenAI`); `astream()` yields the same sentence-chunked output via `agenerate()`. Cascades to `OpenRouterBackend` / `FireworksBackend` / `TogetherBackend` since they subclass without touching the async path. The `_chunk_on_sentences` helper was extracted so sync and async paths emit byte-identical chunks.

**AnthropicBackend native async**: `agenerate()` uses `AsyncAnthropic`; `astream()` uses the SDK's `async with client.messages.stream(...)` context manager + `async for event in stream` + `await stream.get_final_message()`. New `_aiter_completed_blocks` helper mirrors the sync `_iter_completed_blocks` so the per-block citation extraction logic is shared via the `_block_from_event` factor-out.

**Lazy property pattern for sync + async clients** on `OpenAIBackend` and `AnthropicBackend`: both `client` and `async_client` build on first access. Async-only callers don't pay the sync construction cost (no `OPENAI_API_KEY` needed in tests that inject only an `async_client=fake`); sync-only callers don't pay the async one. Subclasses (OpenRouter / Fireworks / Together) accept and forward `async_client` via `__init__`. Updated the two env-var-pickup tests that monkeypatched `openai.OpenAI` to force lazy construction via `_ = backend.client`.

**ADRs 014-017** committed in the prior commit (26dfa1a) document the design here plus three deferral decisions (Bedrock/Vertex, fine-grain windowing, cost tables).

Suite: **644 unit tests** (was 619) + 4 schema integration. 25 new async tests cover ABC defaults, orchestrator path, native overrides on OpenAI + Anthropic (including async streaming events with rich citations), lazy-client construction. Ruff check + format clean, mypy strict clean, sphinx-build -W green.

Out of scope (separate PRs): native async for Gemini + Mistral (SDKs support it; just need parallel implementations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
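The `asyncio.to_thread` default on the ABC, with native overrides layered on top, can be sketched as follows. The class names and the single-string `generate` signature are simplifications for illustration; the real `Backend` ABC takes sources and options that are not shown in this PR description.

```python
import asyncio
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical simplified ABC: sync generate() plus an inherited async default."""

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    async def agenerate(self, prompt: str) -> str:
        # Default: run the blocking sync call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.generate, prompt)


class SyncOnlyBackend(Backend):
    """Gets async support for free from the ABC default."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


class NativeAsyncBackend(SyncOnlyBackend):
    """Overrides agenerate() where a genuinely async client is available."""

    async def agenerate(self, prompt: str) -> str:
        await asyncio.sleep(0)  # stand-in for awaiting a real async SDK call
        return "native:" + prompt


async def demo() -> list[str]:
    # Both backends can be awaited side by side regardless of how they implement async.
    return await asyncio.gather(
        SyncOnlyBackend().agenerate("a"),
        NativeAsyncBackend().agenerate("b"),
    )
```

Running `asyncio.run(demo())` exercises both paths; the thread-based default is correct but limited by the underlying sync client, which is why native overrides are worth adding where the SDK supports them.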

random-walks and others added 8 commits April 25, 2026 15:45

random-walks merged commit fa6a810 into main Apr 25, 2026
7 checks passed

random-walks mentioned this pull request Apr 25, 2026

release: v0.3.0 #8

Merged

8 tasks


random-walks merged 8 commits into main from feat/openrouter-and-claude-revamp
