feat(llm): multi-provider Open Responses support for user simulator #170

Open: itsarbit wants to merge 23 commits into main from feat/responses-api

itsarbit (Contributor) commented May 18, 2026

Summary

  • Extend OpenAILLM to accept optional base_url and api_key constructor kwargs, with empty-string coercion so YAML configs with blank values fall back to env vars
  • Register responses as a provider alias for OpenAILLM in LLM._get_provider
  • Add matching base_url and api_key fields to SimulationInput and EvaluationInput Pydantic models and forward them at both LLM(...) construction sites
  • Add evaluator_* overrides on EvaluationInput with per-group fallback semantics: when the evaluator endpoint differs from the simulator, shared base_url/api_key are NOT forwarded (prevents cross-endpoint credential leaks)
  • Surface OpenAI prompt-cache hits via per-instance counters on OpenAILLM and an end-of-run log line ("simulation LLM cache stats: N calls, X input tokens, Y cached (Z% hit rate)")
  • Document the multi-provider Open Responses story and the existing OpenAI prompt-caching benefit; ship an Ollama swap example in the customer-service README
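The empty-string coercion in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `blank_to_none` and `client_kwargs` are hypothetical names, and treating whitespace-only values as blank is an assumption. The idea is that a blank YAML value becomes `None` and is left out of the client kwargs, so whatever env-var fallback the underlying client has can take over.

```python
from typing import Dict, Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Treat None, "" and whitespace-only strings as "not set" (assumption)."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def client_kwargs(
    base_url: Optional[str] = None, api_key: Optional[str] = None
) -> Dict[str, str]:
    """Build constructor kwargs, omitting anything that was left blank.

    Omitted keys are what let an env-var fallback apply downstream.
    """
    candidates = {
        "base_url": blank_to_none(base_url),
        "api_key": blank_to_none(api_key),
    }
    return {key: value for key, value in candidates.items() if value is not None}
```

With this shape, a config that contains `base_url: ""` behaves the same as one that omits the key entirely.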

The OpenAI provider has run on /v1/responses since v0.3.0. This change unlocks pointing the user-simulator LLM at any Open Responses-conforming backend (OpenAI, Ollama, vLLM, NVIDIA NIM, Vercel AI Gateway, OpenRouter, LM Studio, Llama Stack) via flat top-level model, provider: responses, base_url, api_key keys in config.yaml.
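As a rough sketch of how a `responses` alias might resolve to the same provider class as `openai`: the registry layout, `OpenAILLMSketch`, and `get_provider` below are hypothetical stand-ins, not the actual `LLM._get_provider` implementation, and the case-insensitive lookup is an assumption.

```python
from typing import Dict, Type


class OpenAILLMSketch:
    """Hypothetical stand-in for the real OpenAILLM provider class."""


# Canonical provider names mapped to classes (illustrative only).
_PROVIDERS: Dict[str, Type] = {"openai": OpenAILLMSketch}

# Aliases that resolve to a canonical provider name.
_ALIASES: Dict[str, str] = {"responses": "openai"}


def get_provider(name: str) -> Type:
    """Resolve a provider name or alias to its class."""
    key = name.strip().lower()
    key = _ALIASES.get(key, key)
    try:
        return _PROVIDERS[key]
    except KeyError:
        raise ValueError(f"Unknown provider: {name!r}") from None
```

Under this sketch, `provider: responses` and `provider: openai` in config.yaml would construct the same class, and only `base_url` and `api_key` decide which backend receives the requests.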

Cache benefit, verified

Verified end-to-end by running the unmodified examples/customer-service config against gpt-5.1 (the arksim DEFAULT_MODEL), 7 scenarios x 2 conversations x 3 turns, with the same model used for both simulation and evaluation:

simulation LLM cache stats:  41 calls,  16,622 input tokens,      0 cached  (0.0% hit rate),  1,358 output tokens
evaluation LLM cache stats: 144 calls, 153,309 input tokens, 80,512 cached (52.5% hit rate), 13,075 output tokens

The evaluator phase realizes ~52% cache hits because its prompts share a large, stable rubric prefix across every evaluation call. The simulator phase, with the default max_turns: 3 and a short per-scenario prompt, builds prompts that average ~400 tokens per call (below OpenAI's 1,024-token cache floor) and so registers no hits in this short-run example. For multi-turn scenarios where the simulator prompt clears 1,024 tokens, the cache fires as shown below.
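The hit rates quoted above are consistent with cached tokens divided by input tokens. A minimal sketch of per-instance counters that reproduces the log line format follows; the `CacheStats` class and its method names are hypothetical, and the real counters on `OpenAILLM` may be structured differently. In the Responses API these figures would typically come from the response's `usage` object.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Running totals for one LLM instance (illustrative, not the PR's class)."""

    calls: int = 0
    input_tokens: int = 0
    cached_tokens: int = 0

    def record(self, input_tokens: int, cached_tokens: int) -> None:
        """Add one call's token usage to the totals."""
        self.calls += 1
        self.input_tokens += input_tokens
        self.cached_tokens += cached_tokens

    @property
    def hit_rate(self) -> float:
        """Percentage of input tokens served from cache; 0.0 with no input."""
        if self.input_tokens == 0:
            return 0.0
        return 100.0 * self.cached_tokens / self.input_tokens

    def summary(self, label: str) -> str:
        """Format the totals in the same shape as the end-of-run log line."""
        return (
            f"{label} LLM cache stats: {self.calls} calls, "
            f"{self.input_tokens:,} input tokens, "
            f"{self.cached_tokens:,} cached ({self.hit_rate:.1f}% hit rate)"
        )
```

Plugging in the evaluator totals from the run above (144 calls, 153,309 input tokens, 80,512 cached) gives 52.5%, matching the reported figure.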

The realistic multi-turn shape is exercised by test_cache_hits_grow_across_simulator_turns, which mirrors arksim's [system + scenario + growing history + trigger] call pattern on gpt-5.1 and observes (cold-cache run):

| Turn | Per-turn hit rate |
|---|---|
| 1 | 0.0% |
| 2 | 86.0% |
| 3 | 92.2% |
| 4 | 89.2% |
| 5 | 94.8% |
| Cumulative | 73.9% |
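One way to see why the per-turn hit rate climbs after turn 1 is to measure how much of each prompt is an exact prefix of the previous one. The sketch below uses made-up token counts (600 system, 500 scenario, 20 trigger, 150 new history tokens per turn) and placeholder string tokens; it only measures prefix overlap and does not model the provider's actual cache thresholds or eviction.

```python
from typing import List, Sequence


def shared_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Count leading tokens that are identical in both prompts."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def simulate_turns(num_turns: int, tokens_per_turn: int = 150) -> List[int]:
    """Return, per turn, how many prompt tokens repeat the previous prompt's prefix.

    Prompts follow the [system + scenario + growing history + trigger] shape.
    All sizes are hypothetical.
    """
    system = ["sys"] * 600
    scenario = ["scn"] * 500
    trigger = ["go"] * 20
    history: List[str] = []
    previous: List[str] = []
    overlaps: List[int] = []
    for turn in range(1, num_turns + 1):
        prompt = system + scenario + history + trigger
        overlaps.append(shared_prefix_len(previous, prompt))
        previous = prompt
        # The next turn's history gains this turn's (unique) exchange.
        history = history + [f"t{turn}-{i}" for i in range(tokens_per_turn)]
    return overlaps
```

In this toy model the first turn shares nothing with a prior prompt, while every later turn reuses the system, scenario, and earlier history tokens, which matches the qualitative pattern in the table: a cold first call followed by high hit rates.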

A separate per-model matrix (test_responses_provider_cache_matrix) sends the same ~2,054-token prefix four times to confirm caching wiring across the gpt-5 family. Picking two for context:

| Model | Cache hit rate (4-call matrix) |
|---|---|
| gpt-5.1 (arksim DEFAULT_MODEL) | ~99.7% |
| gpt-5 | ~98.1% |

The matrix is opt-in (skipped without OPENAI_API_KEY) and costs roughly $0.01-0.05 per model per run.

Test plan

  • 13 unit tests for OpenAILLM (constructor + cache stats)
  • 4 unit tests for LLM._get_provider alias resolution
  • Tests for SimulationInput / EvaluationInput field declarations + evaluator override fallback
  • End-to-end YAML to SDK forwarding chain tests (tests/unit/test_responses_yaml_plumbing.py)
  • 3 opt-in integration tests against api.openai.com/v1/responses (smoke + explicit base_url + cache hit), skipped without OPENAI_API_KEY
  • Opt-in 8-model cache verification matrix (test_responses_provider_cache_matrix) against real OpenAI
  • Opt-in multi-turn growth test (test_cache_hits_grow_across_simulator_turns) verifying cache fires on arksim's real call shape
  • Real arksim simulation run end-to-end against gpt-5.1: 52.5% evaluator cache hit rate
  • make lint clean; 936 unit tests pass
  • Docs page added under Core Capabilities; Note in simulate-conversation.mdx cross-links
  • examples/customer-service/README.md updated with Ollama + OpenAI-evaluator swap snippet using the flat YAML schema
  • CHANGELOG entries under [Unreleased] > Added for both llm and evaluator scopes
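The "skipped without OPENAI_API_KEY" behaviour of the opt-in tests can be sketched with the standard library's `unittest.skipUnless`. The project's real tests may use a different runner or marker; the class and test names here are hypothetical, and the test body is a placeholder rather than a live API call.

```python
import os
import unittest

# True only when a non-blank key is present in the environment.
HAS_KEY = bool(os.environ.get("OPENAI_API_KEY", "").strip())


@unittest.skipUnless(HAS_KEY, "OPENAI_API_KEY not set; skipping live OpenAI test")
class LiveResponsesSmokeTest(unittest.TestCase):
    """Opt-in test class: runs only when credentials are available."""

    def test_placeholder(self) -> None:
        # A real test would call the live endpoint here; this placeholder
        # only confirms the gate let the test through.
        self.assertTrue(HAS_KEY)
```

Gating at the class level keeps a default run free of network calls and API spend while still reporting the tests as skipped rather than silently absent.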

@itsarbit itsarbit requested a review from a team as a code owner May 18, 2026 16:57

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 91.25000% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| arksim/llms/chat/utils.py | 63.63% | 4 Missing ⚠️ |
| arksim/llms/chat/providers/openai.py | 93.33% | 2 Missing ⚠️ |
| arksim/simulation_engine/simulator.py | 50.00% | 1 Missing ⚠️ |

itsarbit added 20 commits May 19, 2026 11:22
…overrides

- _evaluator_endpoint_differs now compares evaluator_provider/evaluator_base_url against shared peers, so a redundant override (same value) no longer drops the shared api_key and crashes the evaluator at run time.

- Add a WARNING when evaluator_api_key is set but the endpoint does not differ, so users notice when the override is silently ignored.

- Strip whitespace in the existing endpoint-split INFO log so empty / whitespace-only values surface as None instead of literal blanks.

- Drop dead `or settings.model` / `or settings.provider` fallbacks in the HTML report kwargs (evaluator_llm_kwargs already guarantees both keys).

- Reorder examples/customer-service/README.md so the Ollama prereqs (`ollama serve`, model pull, OPENAI_API_KEY) appear before `arksim simulate-evaluate`, preventing a top-to-bottom reader from hitting a connection error.

Adds 3 tests covering the new endpoint-differs branches and the discarded-override warning.

The _INVISIBLE_CHARS handling was added in 466d7af as part of the zero-nits
sweep, defending against zero-width characters from rich-text copy-paste
into config.yaml. Probability of organic occurrence is near-zero per the
original adversary review; the feature is generic string handling unrelated
to the Open Responses scope of this PR. Keep _norm_endpoint_value with
plain str.strip() only.
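The endpoint comparison and credential fallback described in the commit messages above might look roughly like this. Everything here is a sketch under assumptions: `Settings`, `endpoint_differs`, and `evaluator_kwargs` are hypothetical names, the field set mirrors the PR summary, and falling back to the shared provider when only the base URL differs is a guess rather than confirmed behaviour.

```python
import logging
from dataclasses import dataclass
from typing import Dict, Optional

logger = logging.getLogger("endpoint_sketch")


@dataclass
class Settings:
    """Hypothetical subset of the evaluation settings."""

    provider: Optional[str] = None
    base_url: Optional[str] = None
    api_key: Optional[str] = None
    evaluator_provider: Optional[str] = None
    evaluator_base_url: Optional[str] = None
    evaluator_api_key: Optional[str] = None


def _norm(value: Optional[str]) -> Optional[str]:
    """Plain str.strip(); blank and whitespace-only values become None."""
    if value is None:
        return None
    return value.strip() or None


def endpoint_differs(s: Settings) -> bool:
    """True only if an evaluator override is set AND differs from its shared peer."""
    pairs = (
        (s.evaluator_provider, s.provider),
        (s.evaluator_base_url, s.base_url),
    )
    return any(
        _norm(override) is not None and _norm(override) != _norm(shared)
        for override, shared in pairs
    )


def evaluator_kwargs(s: Settings) -> Dict[str, Optional[str]]:
    """Pick evaluator endpoint settings without leaking shared credentials."""
    if endpoint_differs(s):
        # Different endpoint: shared base_url / api_key are NOT forwarded.
        return {
            "provider": _norm(s.evaluator_provider) or _norm(s.provider),
            "base_url": _norm(s.evaluator_base_url),
            "api_key": _norm(s.evaluator_api_key),
        }
    if _norm(s.evaluator_api_key) is not None:
        logger.warning(
            "evaluator_api_key is set but the evaluator endpoint matches "
            "the shared one; the override is ignored"
        )
    return {
        "provider": _norm(s.provider),
        "base_url": _norm(s.base_url),
        "api_key": _norm(s.api_key),
    }
```

A redundant override such as `evaluator_provider: openai` alongside `provider: openai` therefore keeps the shared key, which is the crash the first commit message describes fixing.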
itsarbit force-pushed the feat/responses-api branch from 1f545da to d808ff5 on May 19, 2026 18:24