OpenRouter backend + Anthropic Citations revamp + GenerationResult.usage #7
Merged
…ult.usage
**OpenRouter backend** (new) — `citeformer.backends.openrouter.OpenRouterBackend`
under the new `[openrouter]` extra. Subclasses `OpenAIBackend` (OpenRouter
is OpenAI wire-compatible) and adds a `_augment_create_kwargs` hook that
threads `provider.require_parameters: true` (refuses to route to upstreams
that drop strict-mode parameters), `models=[primary, *fallbacks]`, and
`usage.include` (per-call USD cost) onto the request via `extra_body`.
App-attribution `HTTP-Referer` / `X-Title` headers wired through
`default_headers` from `app_url` / `app_name` constructor kwargs.
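A minimal sketch of what such a kwargs-augmenting hook might look like, written as a free function rather than the real `_augment_create_kwargs` method. The function name, its parameters, and the merge order are assumptions for illustration; only the `provider.require_parameters` and `models` field names come from the description above. The `usage.include` flag is left out because a later commit on this branch drops it.

```python
from __future__ import annotations

from typing import Any


def augment_create_kwargs(
    kwargs: dict[str, Any],
    model: str,
    fallbacks: list[str] | None = None,
    require_parameters: bool = True,
) -> dict[str, Any]:
    """Return a copy of kwargs with routing fields merged into extra_body.

    Hypothetical helper: the real hook lives on the backend class.
    """
    out = dict(kwargs)
    # Start from any caller-supplied extra_body so it is merged, not clobbered.
    extra_body = dict(out.get("extra_body") or {})
    if require_parameters:
        provider = dict(extra_body.get("provider") or {})
        provider["require_parameters"] = True
        extra_body["provider"] = provider
    if fallbacks:
        # Primary model first, then fallbacks in the order given.
        extra_body["models"] = [model, *fallbacks]
    if extra_body:
        out["extra_body"] = extra_body
    return out
```

Copying the caller's `extra_body` before merging keeps the hook free of side effects on the dict the caller passed in.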
**Anthropic backend revamp** (existing) — three load-bearing fixes:
1. `cache_control: {"type": "ephemeral"}` on every document block by
default (`use_prompt_cache=True`); repeat-source RAG now bills
cache-read prices on subsequent calls.
2. `temperature` is now honoured when supplied (was previously dropped
silently — only `max_tokens` was extracted from `**options`).
3. Real `messages.stream()` block-level streaming via the SDK's stream
context manager. Citation events only land at `content_block_stop`,
so per-block is the natural granularity for the Citations API; the
prior pseudo-stream (call generate, slice on `.!?`) is gone.
Falls back to the non-streaming path when the client lacks a `stream`
attribute (older SDKs / test stand-ins that mock only `create`).
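To make fix 1 concrete, here is a rough sketch of building document blocks with citations enabled and an optional `cache_control` marker. The helper name and the `title` / `content` keys on the input dicts are assumptions; the block shape follows the plain-text document form used by the Anthropic Citations API.

```python
from __future__ import annotations

from typing import Any


def build_document_blocks(
    sources: list[dict[str, str]],
    use_prompt_cache: bool = True,
) -> list[dict[str, Any]]:
    """Build one citable document block per source (hypothetical helper)."""
    blocks: list[dict[str, Any]] = []
    for src in sources:
        block: dict[str, Any] = {
            "type": "document",
            "source": {
                "type": "text",
                "media_type": "text/plain",
                "data": src["content"],
            },
            "title": src["title"],
            "citations": {"enabled": True},
        }
        if use_prompt_cache:
            # Mark the block as cacheable so repeat calls can hit the cache.
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks
```

Anthropic documents a limit on the number of cache breakpoints per request, so whether every block can carry the marker for large source sets is worth checking against the current docs.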
**GenerationResult.usage + ADR-012** — schema_version bumped 2 → 3 with
new optional `usage: TokenUsage | None`. `TokenUsage` is a frozen
pydantic model carrying `input_tokens`, `output_tokens`, optional
`cache_creation_input_tokens` / `cache_read_input_tokens` (Anthropic
prompt-caching), and `cost_usd` (OpenRouter exposes this directly).
Each API backend (OpenAI, Anthropic, Gemini, Mistral, OpenRouter)
populates `self.last_usage` at the end of `generate()`; the
`Citeformer` orchestrator pulls it via `getattr(backend, "last_usage",
None)` and threads it onto `GenerationResult.usage` for both `generate()`
and `stream().finalize()`. Local backends leave it `None` — token
accounting is meaningless when you control the runtime.
The orchestrator uses `getattr` rather than a typed property on the
`Backend` ABC so out-of-tree backends written against the v0.1 ABC keep
working untouched.
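The `getattr` pattern described above can be sketched as follows. This is an illustration under assumptions: `TokenUsage` is modelled as a frozen dataclass standing in for the pydantic model, the cost field is omitted, and `ApiBackend`, `LegacyBackend`, and `pull_usage` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Stand-in for the frozen pydantic model (assumption: dataclass here)."""

    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: int | None = None
    cache_read_input_tokens: int | None = None


class ApiBackend:
    """Hypothetical API backend that records usage after each call."""

    def __init__(self) -> None:
        self.last_usage: TokenUsage | None = None

    def generate(self, prompt: str) -> str:
        # Fake accounting: count whitespace-separated words as tokens.
        self.last_usage = TokenUsage(input_tokens=len(prompt.split()), output_tokens=1)
        return "ok"


class LegacyBackend:
    """Hypothetical out-of-tree backend with no last_usage attribute at all."""

    def generate(self, prompt: str) -> str:
        return "ok"


def pull_usage(backend: object) -> TokenUsage | None:
    # getattr with a default tolerates backends that predate the attribute.
    return getattr(backend, "last_usage", None)
```

Because the lookup never assumes the attribute exists, a backend written before `last_usage` was introduced simply yields `None`.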
CHANGELOG entry + tier-honesty doc updates land in follow-up commits on
this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tes, tier-honesty docs
**Cross-backend conformance test** (`tests/unit/test_backend_conformance.py`, 33 grid cells). Parametrised over MockBackend + all five API backends with fake clients. Asserts the §10.1 / §10.3 contracts hold uniformly: every cite id in [1..N], empty-source rejection, marker styles propagate across all 4 shapes, `last_usage` populated on API backends (per ADR-012), and `stream().finalize().citations == generate().citations`. Runs in <1s without network.
**Anthropic unit suite expansion** (10 new tests in `tests/unit/test_api_backends.py`): cache_control on/off, temperature threading (was silently dropped pre-revamp), real `messages.stream()` event handling (with a `_FakeAnthropicStream` context-manager stand-in), fallback-when-no-stream, and `last_usage` extraction (object + dict shapes, missing-field path).
**OpenRouter unit suite** (`tests/unit/test_openrouter_backend.py`, 13 tests). `provider.require_parameters: true` default + opt-out, fallback model ordering, `usage.include` flag, app-attribution headers via `monkeypatch.setattr(openai_sdk, "OpenAI", ...)`, `OPENROUTER_API_KEY` env-var fallback, inheritance sanity (strict JSON schema + segment flattening still come from `OpenAIBackend` unchanged), and merging of caller-supplied `extra_body` with the routing block.
**Tier-honesty docs revamp.** README, `docs/index.md`, `docs/reference/architecture.md` all framed the API/local split as "schema-tier vs logit-tier", but as of late 2025 every modern provider's strict structured-outputs mode is real token-level constrained sampling. The new framing is **where the masking runs**: in-process (HF / vLLM / llama.cpp) vs provider-runtime (everything else). Architecture doc has the per-provider table; README updated to "eight backends" with OpenRouter joining; index.md anchor reference updated to the new section id.
**CHANGELOG entry** under [Unreleased] documents OpenRouter, Anthropic revamp, ADR-012 schema bump, conformance test, and the tier-honesty doc rewrite.
Live smoke (Anthropic + OpenAI) confirmed end-to-end: real per-block streaming yielded 2 chunks for 2-sentence response, `usage` populated on `generate()` and stream-`finalize()`, every cite id in [1..3]. OpenRouter live smoke pending: `OPENROUTER_API_KEY` not yet in `.env`.
Suite: 575 unit tests (was 499), ruff clean, mypy strict clean, docs-build green with `-W`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
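The "every cite id in [1..N]" contract asserted by the conformance grid can be sketched as a small check. This assumes bracket-style markers such as `[2]` only; the regex and function name are hypothetical and do not cover the other marker shapes the grid exercises.

```python
import re

# Assumption: markers look like "[3]"; other marker styles are not handled.
MARKER = re.compile(r"\[(\d+)\]")


def cite_ids_in_range(text: str, n_sources: int) -> bool:
    """Return True when every cited id falls inside 1..n_sources."""
    ids = [int(raw) for raw in MARKER.findall(text)]
    return all(1 <= cite_id <= n_sources for cite_id in ids)
```

Text with no markers passes trivially, so a real conformance test would likely pair this with an assertion that at least one citation was emitted.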
…tivity test
A docs-verification pass against current OpenRouter / Anthropic / OpenAI
docs surfaced two real OpenRouter correctness issues; rich-citation
metadata that the Anthropic Citations API has been returning all along
was being discarded; and the branch needed a live env-connectivity
smoke covering every API backend.
**OpenRouter doc-pin fixes:**
- Drop `extra_body={"usage": {"include": true}}`. Per OpenRouter's
usage-accounting docs the flag is *deprecated and a no-op* — cost is
returned on every response unconditionally as of structured-outputs
GA. We were sending a meaningless flag.
- Rename `TokenUsage.cost_usd` → `cost_credits`. OpenRouter
`usage.cost` is denominated in OpenRouter credits, not USD. The old
label was actively misleading. Renamed before any release ships, no
migration needed.
- Drop the now-meaningless `include_cost` constructor kwarg.
- Docstring note on `anthropic/claude-sonnet-4.6` (OR, dot) vs
`claude-sonnet-4-6` (native Anthropic, dash) — easy footgun.
**Anthropic rich-citation preservation (ADR-013).**
`Citation` gains three optional fields: `cited_text: str | None`,
`source_span: tuple[int, int] | None`, `document_title: str | None`.
Anthropic's Citations API returns this on every citation event;
`AnthropicBackend._flatten_blocks` now records it onto a
`last_rich_citations: list[dict]` instance attribute (one entry per
marker emitted, in left-to-right order). The orchestrator pulls it via
`getattr(backend, "last_rich_citations", None)` (mirrors the
`last_usage` pattern) and zips it with the parsed marker list inside
`_parse_citations`. Length mismatch falls through silently with the
new fields left None — misaligned data is worse than no data. Other
backends leave the new fields None — honest signalling that schema-tier
providers don't have span-level attribution.
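The zip-with-fallthrough behaviour described above might look roughly like this. It is a sketch under assumptions: `Citation` is modelled as a frozen dataclass, `source_id` is a hypothetical field standing in for whatever the parsed marker carries, and `attach_rich` is an illustrative name rather than the real `_parse_citations`.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class Citation:
    source_id: int  # hypothetical field for the parsed marker id
    cited_text: str | None = None
    source_span: tuple[int, int] | None = None
    document_title: str | None = None


def attach_rich(
    citations: list[Citation],
    rich: list[dict[str, Any]] | None,
) -> list[Citation]:
    """Zip rich metadata onto parsed citations by index."""
    # Missing or misaligned metadata: leave the optional fields as None.
    if not rich or len(rich) != len(citations):
        return citations
    return [
        replace(
            cite,
            cited_text=meta.get("cited_text"),
            source_span=meta.get("source_span"),
            document_title=meta.get("document_title"),
        )
        for cite, meta in zip(citations, rich)
    ]
```

Checking the lengths before zipping matters because `zip` would otherwise truncate silently and pair markers with the wrong spans.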
The `Citation` change is the second shape change inside `schema_version
3` (the first was `usage` from ADR-012). Both ship together inside the
single 2 → 3 bump rather than two consecutive bumps in one branch.
Snapshot regenerated.
**Env-connectivity test** (`tests/integration/test_env_connectivity.py`).
Six `@pytest.mark.integration` tests — one per API backend, plus a
dedicated Anthropic-streaming test. Each issues the smallest possible
request (1 source, 80 max_tokens), asserts the structural §10.1
invariant against the live provider, and asserts `backend.last_usage`
populates with non-zero token counts (live verification of the
ADR-012 token-accounting contract). OpenRouter additionally asserts
`cost_credits` lands on the response. Total cost across all 5
backends per full pass: ~$0.01.
Live results from this branch (4 of 6 with keys present):
- test_connectivity_openai PASSED
- test_connectivity_anthropic PASSED
- test_connectivity_anthropic_streaming_yields_chunks PASSED (real
per-block streaming, usage from get_final_message)
- test_connectivity_openrouter PASSED (cost_credits populates live)
- test_connectivity_gemini SKIPPED (no key)
- test_connectivity_mistral SKIPPED (no key)
Suite: 585 unit tests (was 575) + 4 schema integration. Ruff clean,
mypy strict clean, sphinx-build -W green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**FireworksBackend (extra `fireworks`)** — the cleanest "true logit-tier on
a hosted API" backend possible. Fireworks's `response_format={"type":
"grammar", "grammar": "<GBNF>"}` mode accepts citeformer's existing
`cite-id` GBNF rule UNCHANGED, so the same grammar that masks logits
inside `HFBackend` runs inside the Fireworks runtime. Subclasses
`OpenAIBackend` and overrides only `_build_response_format` (swap
strict-JSON for grammar mode) and `_decode_response_text` (Fireworks
returns plain text with markers, not segments JSON, so flattening is a
no-op). Default model `accounts/fireworks/models/llama-v3p1-8b-instruct`.
Env: `FIREWORKS_API_KEY`.
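As an illustration of the grammar-mode payload, the sketch below builds a small GBNF grammar whose `cite-id` rule only admits ids 1..N and wraps it in the `response_format` shape quoted above. The grammar text is a hypothetical stand-in, not citeformer's actual rule, and both function names are assumptions.

```python
from __future__ import annotations

from typing import Any


def build_cite_grammar(n_sources: int) -> str:
    """Build an illustrative GBNF grammar limiting cite ids to 1..n_sources."""
    if n_sources < 1:
        raise ValueError("at least one source is required")
    ids = " | ".join(f'"{i}"' for i in range(1, n_sources + 1))
    return "\n".join(
        [
            "root ::= (text | cite)*",
            r"text ::= [^\[\]]+",
            'cite ::= "[" cite-id "]"',
            f"cite-id ::= {ids}",
        ]
    )


def build_response_format(n_sources: int) -> dict[str, Any]:
    # Shape taken from the description above: type "grammar" plus the GBNF text.
    return {"type": "grammar", "grammar": build_cite_grammar(n_sources)}
```

With three sources the last rule reads `cite-id ::= "1" | "2" | "3"`, so a marker such as `[4]` cannot be produced under the constraint.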
**TogetherBackend (extra `together`)** — strict `json_schema` constrained
decoding on Together's open-weight upstreams (Llama / Qwen / DeepSeek).
OpenAI-wire-compatible — schema construction, segment flattening,
streaming, and `last_usage` extraction all inherited unchanged. Default
`meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`. Env: `TOGETHER_API_KEY`.
**OpenAIBackend refactor — extracted two hooks.** `_build_response_format`
and `_decode_response_text` factor out the OpenAI-specific bits so
backends with non-OpenAI response shapes (Fireworks's grammar mode
today, Together's regex mode some day) can swap them without touching
`generate()`. Pure refactor, no behavioural change for existing
backends.
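The two-hook split can be pictured as a template method: `generate()` stays fixed while subclasses swap how the response format is built and how the raw text is decoded. The sketch below is an assumption-laden illustration; the class names, the `transport` callable, and the segments JSON shape (`text` / `cites`) are hypothetical and not the real wire format.

```python
from __future__ import annotations

import json
from typing import Any, Callable


class JsonSegmentsBackend:
    """Base flow: request strict JSON, then flatten segments into marked text."""

    def __init__(self, transport: Callable[[dict[str, Any]], str]) -> None:
        self._transport = transport  # fake network call: response_format -> raw text

    def _build_response_format(self) -> dict[str, Any]:
        # Placeholder schema; the real one is more detailed.
        schema = {"name": "segments", "strict": True, "schema": {"type": "object"}}
        return {"type": "json_schema", "json_schema": schema}

    def _decode_response_text(self, raw: str) -> str:
        segments = json.loads(raw)["segments"]
        parts = [
            seg["text"] + "".join(f" [{i}]" for i in seg["cites"]) for seg in segments
        ]
        return " ".join(parts)

    def generate(self) -> str:
        # Unchanged by subclasses: only the two hooks vary.
        raw = self._transport(self._build_response_format())
        return self._decode_response_text(raw)


class GrammarBackend(JsonSegmentsBackend):
    """Overrides only the hooks: grammar mode in, plain text out."""

    def __init__(self, transport: Callable[[dict[str, Any]], str], grammar: str) -> None:
        super().__init__(transport)
        self._grammar = grammar

    def _build_response_format(self) -> dict[str, Any]:
        return {"type": "grammar", "grammar": self._grammar}

    def _decode_response_text(self, raw: str) -> str:
        return raw  # already plain text with markers
```

Keeping `generate()` out of the subclass is what lets a grammar-mode backend stay small: everything except the response shape is inherited.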
**verify() uses cited_text as NLI premise when populated**, building on
the ADR-013 work. When `Citation.cited_text` is set (Anthropic Citations
API path), `score_entailment` scores entailment against that span
instead of the whole source content. Sharper signal, especially on
long documents where the relevant assertion is buried past DeBERTa's
512-token horizon. Falls back to full source when cited_text is None
(every backend except Anthropic today). Mixed-citation results work
too — each citation uses the sharpest premise available to *it*.
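The premise selection described above reduces to a per-citation fallback. A minimal sketch, assuming simplified `Source` and `Citation` shapes; `source_index` and `select_premise` are hypothetical names, and the actual NLI scoring is out of scope here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    content: str


@dataclass(frozen=True)
class Citation:
    source_index: int  # hypothetical 0-based index into the source list
    cited_text: str | None = None


def select_premise(citation: Citation, sources: list[Source]) -> str:
    """Pick the sharpest premise available for this one citation."""
    if citation.cited_text is not None:
        return citation.cited_text
    # No span-level attribution: fall back to the whole source.
    return sources[citation.source_index].content
```

Because the choice is made per citation, a result that mixes span-attributed and plain citations needs no special handling.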
**Cross-backend conformance grid extended to 8 backends** (was 6).
Fireworks's fake client introspects the GBNF payload and emits text
with matching delimiters, simulating provider-side grammar-constrained
sampling so the marker-style propagation grid runs against Fireworks
just like the others.
**Env-connectivity test extended** with `test_connectivity_fireworks`
and `test_connectivity_together`. Both skip cleanly when the matching
`*_API_KEY` is absent. Live results from this branch (4 of 8 with keys
present): OpenAI / Anthropic / Anthropic-streaming / OpenRouter all
PASSED; Fireworks / Together / Gemini / Mistral SKIPPED (no keys).
Suite: **619 unit tests** (was 581) + 4 schema integration. Ruff clean,
mypy strict clean (53 source files now), sphinx-build -W green.
Documentation:
- Backend count "eight" → "ten" across README + index.md.
- README backend table grows two rows; install snippet adds two extras.
- architecture.md tier table grows two rows; Fireworks called out as the
cleanest "logit-tier on hosted API" path with the docs link.
- backends/__init__.py docstring updated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure formatter fixes that CI's `ruff format --check` flagged. Ran `uv run ruff format .` over the 8 files I edited in the OpenRouter + Fireworks + Together work; the formatter collapsed some multi-line statements and rewrapped a couple of single-line continuations. No behavioural change. `make lint` (which runs both `ruff check` and `ruff format --check`) is now green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds OPENROUTER_API_KEY / FIREWORKS_API_KEY / TOGETHER_API_KEY / GEMINI_API_KEY / MISTRAL_API_KEY to the example file with a one-line hint per backend. Each line is commented out so copying to .env stays opt-in. Also broadens the surrounding comment to point at the new test_env_connectivity.py suite.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ine-grain windowing, cost tables
ADR-014 (Accepted): async surface (`agenerate` / `astream`) on `Backend` ABC + `Citeformer` orchestrator. Sensible to_thread defaults so every backend works in async code unchanged; native overrides land for OpenAI + Anthropic in the implementation commits that follow this one.
ADR-015 (Decided, deferred): Bedrock + Vertex AI backends. Both proxy to providers we already support directly (Anthropic, Gemini), both have non-trivial enterprise auth (SigV4, GCP service accounts), and existing backends already accept Bedrock/Vertex clients via `client=…` injection. Documented signals that would justify the work.
ADR-016 (Decided, deferred): fine-grain windowing in `verify()` (score cited_text + surrounding context window, take max entailment). The current cited_text-as-premise change (ADR-013) is itself uncalibrated; adding window hyperparameters before measurement is a tuning knob in search of a problem. Documented the calibration data that would unlock the work.
ADR-017 (Decided, not planned): provider-specific pricing tables to infer USD cost from token counts. Pricing changes constantly, the data is one multiply away from the token counts we already expose, OpenRouter solved it correctly by reporting cost server-side, and cached/batch pricing tiers compound any error. We surface tokens; consumers price.
ADR-014 sets the design that the next 4 commits will implement. ADRs 015-017 lock in the deferral reasoning so future contributors don't quietly pick them up without weighing tradeoffs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements ADR-014. Adds parallel async surface to every layer:
**Backend ABC**: `agenerate()` and `astream()` with `asyncio.to_thread` defaults. Every existing backend (HF, vLLM, llama.cpp, Mock, Gemini, Mistral, plus any out-of-tree backend) gains async support without any override. The defaults are correct but bounded by the sync SDK clients underneath; backends with native async clients override for genuine concurrency.
**Citeformer orchestrator**: `agenerate()` returns a `GenerationResult` (mirrors `generate()` exactly, just awaited); `astream()` returns a new `AsyncStreamingResult` with `__aiter__` / `__anext__` / `await stream.finalize()` symmetry to the existing `StreamingResult`. Same parsing, rendering, usage threading, rich-citation metadata flow as the sync path.
**OpenAIBackend native async**: `agenerate()` mirrors `generate()` but uses `self.async_client` (lazy `AsyncOpenAI`); `astream()` yields the same sentence-chunked output via `agenerate()`. Cascades to `OpenRouterBackend` / `FireworksBackend` / `TogetherBackend` since they subclass without touching the async path. The `_chunk_on_sentences` helper extracted so sync and async paths emit byte-identical chunks.
**AnthropicBackend native async**: `agenerate()` uses `AsyncAnthropic`; `astream()` uses the SDK's `async with client.messages.stream(...)` context manager + `async for event in stream` + `await stream.get_final_message()`. New `_aiter_completed_blocks` helper mirrors the sync `_iter_completed_blocks` so the per-block citation extraction logic is shared via the `_block_from_event` factor-out.
**Lazy property pattern for sync + async clients** on `OpenAIBackend` and `AnthropicBackend`: both `client` and `async_client` build on first access. Async-only callers don't pay the sync construction cost (no `OPENAI_API_KEY` needed in tests that inject only an `async_client=fake`); sync-only callers don't pay the async one. Subclasses (OpenRouter / Fireworks / Together) accept and forward `async_client` via `__init__`. Updated the two env-var-pickup tests that monkeypatched `openai.OpenAI` to force lazy construction via `_ = backend.client`.
**ADRs 014-017** committed in the prior commit (26dfa1a) document the design here plus three deferral decisions (Bedrock/Vertex, fine-grain windowing, cost tables).
Suite: **644 unit tests** (was 619) + 4 schema integration. 25 new async tests cover ABC defaults, orchestrator path, native overrides on OpenAI + Anthropic (including async streaming events with rich citations), lazy-client construction. Ruff check + format clean, mypy strict clean, sphinx-build -W green.
Out of scope (separate PRs): native async for Gemini + Mistral (SDKs support it; just need parallel implementations).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds OpenRouter, Fireworks, and Together backends, revamps the Anthropic backend (prompt caching via `cache_control` on document blocks, real `messages.stream()` block-level streaming replacing the prior pseudo-stream, `temperature` no longer silently dropped, rich citation metadata preserved through to `Citation`), threads token usage + per-call cost through every API backend onto a new `GenerationResult.usage` field, uses `cited_text` as the NLI premise in `verify()` when populated (sharper signal on long docs), and rewrites the tier-honesty docs since "schema-tier vs logit-tier" is no longer the right framing for the API/local split (every modern provider's strict structured-outputs mode is real token-level constrained sampling now).

The headline win: `FireworksBackend` is true logit-tier on a hosted API. Fireworks's native `response_format={"type":"grammar"}` mode accepts citeformer's existing `cite-id` GBNF rule unchanged, so the same constraint that masks logits inside `HFBackend` runs inside the Fireworks runtime. The OpenAI-class refactor that enables this (extracted `_build_response_format` + `_decode_response_text` hooks) is reusable by any future backend with non-OpenAI response shapes.

Two ADRs document the `schema_version` 2 → 3 bump (ADR-012 for `usage`, ADR-013 for `Citation` rich attribution); both ship inside the same single bump rather than two consecutive bumps in one branch.

Doc-pin verification against current OpenRouter / Anthropic / OpenAI / Pydantic docs surfaced two real OpenRouter correctness issues, both fixed inside this branch: dropped the deprecated `extra_body={"usage": {"include": true}}` flag (no-op as of OR structured-outputs GA), and renamed `TokenUsage.cost_usd` → `cost_credits` (OR `usage.cost` is denominated in credits, not USD; the old label was actively misleading).

A new `tests/integration/test_env_connectivity.py` smoke (~$0.01 per full pass) probes every API backend live, asserts the structural §10.1 invariant against the real provider, and asserts `last_usage` populates with non-zero token counts. Live results from this branch: OpenAI / Anthropic / Anthropic-streaming / OpenRouter all PASSED (OR `cost_credits` lands on the response live); Fireworks / Together / Gemini / Mistral skipped cleanly with no key.

Invariant touched?
§10.3 output schemas: `GenerationResult.schema_version` bumped 2 → 3. Two shape changes ship together inside the single bump:
- `usage: TokenUsage | None` on `GenerationResult` (ADR-012).
- `Citation`: `cited_text: str | None`, `source_span: tuple[int, int] | None`, `document_title: str | None` (ADR-013).

Ceremony:
- `tests/integration/test_schemas/test_generation_result_canonical_snapshot.yml` regenerated with the new fields (all defaulting to `null`).
- `test_generation_result_schema_version_is_3` (was `_is_2`) bumped in both `tests/integration/test_schemas.py` and `tests/unit/test_core.py`.
- `Contracts (§10)` block documents the bump and links both ADRs.
- … `None`); no migration shim needed.
- `Backend` ABC unchanged: orchestrator pulls `last_usage` / `last_rich_citations` via `getattr(...)`, so out-of-tree backends written against v0.1 keep working untouched.

Test plan
- `make lint` green (`ruff` `check` + `format --check` both clean; 116 files formatted).
- `make test` green (619 unit tests, was 499 at branch start; 4 schema integration tests).
- `make docs-build` green (`sphinx-build -W` succeeds).
- `mypy src/citeformer` strict: no issues found in 53 source files (was 51 at branch start).
- `make test-integration tests/integration/test_env_connectivity.py`:
  - `test_connectivity_openai`: PASSED (structural invariant holds, `last_usage` non-zero).
  - `test_connectivity_anthropic`: PASSED.
  - `test_connectivity_anthropic_streaming_yields_chunks`: PASSED (real per-block streaming, `usage` populated from `get_final_message()`).
  - `test_connectivity_openrouter`: PASSED (`cost_credits` populated on the live response).
  - `test_connectivity_fireworks`: SKIPPED (no `FIREWORKS_API_KEY`).
  - `test_connectivity_together`: SKIPPED (no `TOGETHER_API_KEY`).
  - `test_connectivity_gemini`: SKIPPED (no `GEMINI_API_KEY`).
  - `test_connectivity_mistral`: SKIPPED (no `MISTRAL_API_KEY`).
- `usage` threading works through both `generate()` and `stream().finalize()`, every emitted cite id in `[1..N]`.

What's in the box
Backends now: 10 (was 7 at PR start). Two enforcement loci, one `GenerationResult`:
- `HFBackend`
- `LlamaCppBackend`
- `VLLMBackend`
- `FireworksBackend`: … `cite-id` grammar in unchanged via `response_format={"type":"grammar"}`.
- `OpenAIBackend`: … `_build_response_format` + `_decode_response_text` hooks.
- `AnthropicBackend`: … `messages.stream()`, `cited_text` / `source_span` preservation.
- `OpenRouterBackend`: … `provider.require_parameters: true` keeps strict mode end-to-end.
- `TogetherBackend`
- `GeminiBackend`: (… `response_schema`)
- `MistralBackend`

Highlights for reviewers
Where to look first:
- `src/citeformer/backends/fireworks.py`: new file. Subclasses `OpenAIBackend` and overrides only `_build_response_format` (return `{"type":"grammar","grammar":<GBNF>}` from `build_grammar`) and `_decode_response_text` (passthrough; grammar mode returns plain text). The whole backend is ~130 lines because all the heavy lifting is inherited.
- `src/citeformer/backends/openai.py`: refactor. New `_build_response_format` + `_decode_response_text` hooks make the response-shape pluggable for any future provider. Pure refactor for the existing call paths.
- `src/citeformer/backends/openrouter.py`: new file. Subclasses `OpenAIBackend` and overrides only `_augment_create_kwargs` to inject `extra_body` routing fields.
- `src/citeformer/backends/together.py`: new file. Even thinner subclass than OpenRouter: just defaults pointing at Together's base URL + `TOGETHER_API_KEY` pickup.
- `src/citeformer/backends/anthropic.py`: substantial revamp. New `_build_request` helper centralises the request shape so `generate()` and `stream()` stay consistent. `_flatten_blocks` now takes an optional `record=` list parameter so rich-citation metadata is captured as a side-channel without changing the return type.
- `src/citeformer/citeformer.py`: orchestrator changes are tiny. `_pull_usage` and `_pull_rich_citations` helpers do `getattr(backend, ...)` lookups so the `Backend` ABC stays untouched. `_parse_citations` zips the rich list onto parsed markers by index; length mismatch falls through silently.
- `src/citeformer/verify/entailment.py`: `score_entailment` now uses `citation.cited_text` as the NLI premise when populated (sharper signal than the whole source content), falls back to `Source.content` otherwise. Each citation in a mixed-backend pipeline uses its sharpest premise independently.
- `docs/reference/architecture.md`: the "Tiered enforcement" section was rewritten to reflect that strict structured-outputs is real token-level constrained sampling on every modern provider now (with cited per-provider docs). Honest distinction is "where the masking runs" (in-process vs provider-runtime), not "logit vs schema".
- `docs/decisions/012-generation-result-schema-v3.md` and `docs/decisions/013-citation-rich-attribution.md`: full ADR write-ups for the two §10.3 shape changes that share the single bump.

Explicitly deferred to follow-up PRs:
- … (`agenerate` / `astream`): large cross-cutting refactor, deserves its own PR.
- `verify()` fine-grain windowing: now that we preserve `cited_text`, we could optionally also score both the cited span AND a small surrounding context window for richer attribution semantics. Marginal vs the current win.
- … `cost` directly (everyone except OpenRouter), a small pricing table per model would let `usage` carry an inferred `cost_credits` too. Not load-bearing.

🤖 Generated with Claude Code