feat(llm): multi-provider Open Responses support for user simulator by itsarbit · Pull Request #170 · arklexai/arksim

itsarbit · 2026-05-18T16:57:05Z

Summary

Extend OpenAILLM to accept optional base_url and api_key constructor kwargs, with empty-string coercion so YAML configs with blank values fall back to env vars
Register responses as a provider alias for OpenAILLM in LLM._get_provider
Add matching base_url and api_key fields to SimulationInput and EvaluationInput Pydantic models and forward them at both LLM(...) construction sites
Add evaluator_* overrides on EvaluationInput with per-group fallback semantics: when the evaluator endpoint differs from the simulator, shared base_url/api_key are NOT forwarded (prevents cross-endpoint credential leaks)
Surface OpenAI prompt-cache hits via per-instance counters on OpenAILLM and an end-of-run log line ("simulation LLM cache stats: N calls, X input tokens, Y cached (Z% hit rate)")
Document the multi-provider Open Responses story and the existing OpenAI prompt-caching benefit; ship an Ollama swap example in the customer-service README
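The empty-string coercion in the first bullet can be pictured with a minimal sketch. The helper names `coerce_blank` and `resolve_setting` are hypothetical and are not taken from the PR; only the behavior (blank YAML values fall back to env vars) comes from the summary above.

```python
import os
from typing import Optional


def coerce_blank(value: Optional[str]) -> Optional[str]:
    """Treat None, empty, and whitespace-only strings as 'not set'."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def resolve_setting(value: Optional[str], env_var: str) -> Optional[str]:
    """Prefer an explicit non-blank value; otherwise read the environment."""
    explicit = coerce_blank(value)
    if explicit is not None:
        return explicit
    return coerce_blank(os.environ.get(env_var))


# Placeholder value so the demo is self-contained (not a real key).
os.environ["OPENAI_API_KEY"] = "sk-from-env"

# A blank `api_key:` in YAML usually loads as "" or None.
print(resolve_setting("", "OPENAI_API_KEY"))             # env var is used
print(resolve_setting("sk-explicit", "OPENAI_API_KEY"))  # explicit value wins
```

Returning `None` rather than `""` matters if the value is later handed to a client constructor that only consults its own environment defaults when the argument is `None`.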

The OpenAI provider has run on /v1/responses since v0.3.0. This change lets the user-simulator LLM point at any Open Responses-conforming backend (OpenAI, Ollama, vLLM, NVIDIA NIM, Vercel AI Gateway, OpenRouter, LM Studio, Llama Stack) via four flat top-level keys in config.yaml: model, provider: responses, base_url, and api_key.
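A rough sketch of how a provider alias and the four flat keys might fit together. `OpenAILLMStub`, `PROVIDERS`, and `get_provider` are hypothetical stand-ins for `OpenAILLM` and `LLM._get_provider`, whose real code is not shown on this page. The model name and the Ollama address (its usual local default) are assumptions for illustration.

```python
from typing import Dict, Optional, Type


class OpenAILLMStub:
    """Stand-in for OpenAILLM; it only stores its arguments."""

    def __init__(self, model: str, base_url: Optional[str] = None,
                 api_key: Optional[str] = None) -> None:
        self.model = model
        self.base_url = base_url
        self.api_key = api_key


# Both names resolve to the same class, so "responses" is a pure alias.
PROVIDERS: Dict[str, Type[OpenAILLMStub]] = {
    "openai": OpenAILLMStub,
    "responses": OpenAILLMStub,
}


def get_provider(name: str) -> Type[OpenAILLMStub]:
    key = name.strip().lower()
    if key not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    return PROVIDERS[key]


# Python mirror of the flat config.yaml keys (values are illustrative).
config = {
    "model": "llama3.2",
    "provider": "responses",
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",
}

llm = get_provider(config["provider"])(
    model=config["model"],
    base_url=config["base_url"],
    api_key=config["api_key"],
)
print(type(llm).__name__, llm.model, llm.base_url)
```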

Cache benefit, verified

Verified end-to-end by running the unmodified examples/customer-service config against gpt-5.1 (the arksim DEFAULT_MODEL), 7 scenarios x 2 conversations x 3 turns, with the same model used for both simulation and evaluation:

simulation LLM cache stats:  41 calls,  16,622 input tokens,      0 cached  (0.0% hit rate),  1,358 output tokens
evaluation LLM cache stats: 144 calls, 153,309 input tokens, 80,512 cached (52.5% hit rate), 13,075 output tokens
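For reference, the hit rate in these log lines is cached input tokens divided by total input tokens. A minimal sketch of such a counter follows. The `CacheStats` class is hypothetical and is not the PR's implementation. In the OpenAI Responses API the per-call inputs would come from `usage.input_tokens`, `usage.input_tokens_details.cached_tokens`, and `usage.output_tokens`.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Accumulates token usage across calls (hypothetical helper)."""

    calls: int = 0
    input_tokens: int = 0
    cached_tokens: int = 0
    output_tokens: int = 0

    def record(self, input_tokens: int, cached_tokens: int, output_tokens: int) -> None:
        self.calls += 1
        self.input_tokens += input_tokens
        self.cached_tokens += cached_tokens
        self.output_tokens += output_tokens

    @property
    def hit_rate(self) -> float:
        # Guard against division by zero when no call has been made.
        if self.input_tokens == 0:
            return 0.0
        return 100.0 * self.cached_tokens / self.input_tokens

    def summary(self, label: str) -> str:
        return (
            f"{label} LLM cache stats: {self.calls} calls, "
            f"{self.input_tokens:,} input tokens, {self.cached_tokens:,} cached "
            f"({self.hit_rate:.1f}% hit rate), {self.output_tokens:,} output tokens"
        )


# Totals copied from the evaluation line above.
evaluation = CacheStats(calls=144, input_tokens=153_309,
                        cached_tokens=80_512, output_tokens=13_075)
print(evaluation.summary("evaluation"))
```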

The evaluator phase realizes ~52% cache hits because its prompts share a large, stable rubric prefix across every evaluation call. The simulator phase, with the default max_turns: 3 and a short per-scenario prompt, builds prompts that average ~400 tokens per call (below OpenAI's 1,024-token cache floor) and so registers no hits in this short-run example. For multi-turn scenarios where the simulator prompt clears 1,024 tokens, the cache fires as shown below.

The realistic multi-turn shape is exercised by test_cache_hits_grow_across_simulator_turns, which mirrors arksim's [system + scenario + growing history + trigger] call pattern on gpt-5.1 and observes (cold-cache run):

| Turn | Per-turn hit rate |
| --- | --- |
| 1 | 0.0% |
| 2 | 86.0% |
| 3 | 92.2% |
| 4 | 89.2% |
| 5 | 94.8% |
| Cumulative | 73.9% |
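Why the per-turn rate climbs after turn 1 can be illustrated with a small sketch of the [system + scenario + growing history + trigger] shape. It counts prompt segments rather than tokens, and it assumes the trigger is always the last segment and that history is append-only. Both are assumptions about the call pattern, not details confirmed on this page.

```python
from typing import List, Optional


def build_prompt(system: str, scenario: str, history: List[str], trigger: str) -> List[str]:
    """Assemble one call's prompt as an ordered list of segments."""
    return [system, scenario, *history, trigger]


def shared_prefix_len(previous: List[str], current: List[str]) -> int:
    """Number of leading segments that are identical in both prompts."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


history: List[str] = []
previous: Optional[List[str]] = None
reuse: List[float] = []

for turn in range(1, 6):
    prompt = build_prompt("SYSTEM", "SCENARIO", history, "TRIGGER")
    shared = shared_prefix_len(previous, prompt) if previous else 0
    reuse.append(shared / len(prompt))
    print(f"turn {turn}: {shared}/{len(prompt)} segments match the previous call")
    previous = prompt
    # Each turn appends one user message and one agent reply.
    history = history + [f"user message {turn}", f"agent reply {turn}"]
```

Because only the tail changes between calls, the share of the prompt that matches the previous call grows with every turn, which matches the direction of the measured rates even though real caching operates on tokens.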

A separate per-model matrix (test_responses_provider_cache_matrix) sends the same ~2,054-token prefix four times to confirm caching wiring across the gpt-5 family. Picking two for context:

| Model | Cache hit rate (4-call matrix) |
| --- | --- |
| `gpt-5.1` (arksim DEFAULT_MODEL) | ~99.7% |
| `gpt-5` | ~98.1% |

The matrix is opt-in (skipped without OPENAI_API_KEY) and costs roughly $0.01-0.05 per model per run.
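As a sanity check on these figures, here is a sketch of the best-case hit rate for a fully repeated prompt. It relies on OpenAI's documented prompt-caching behavior (a 1,024-token minimum, with cached prefixes growing in 128-token increments). That increment comes from OpenAI's documentation, not from this PR, and the function names are hypothetical.

```python
CACHE_FLOOR_TOKENS = 1024  # prompts shorter than this are never cached
CACHE_STEP_TOKENS = 128    # cached prefix length grows in these increments


def max_cached_tokens(prompt_tokens: int) -> int:
    """Largest cacheable prefix for a prompt that exactly repeats an earlier one."""
    if prompt_tokens < CACHE_FLOOR_TOKENS:
        return 0
    return (prompt_tokens // CACHE_STEP_TOKENS) * CACHE_STEP_TOKENS


def best_case_hit_rate(prompt_tokens: int) -> float:
    """Upper bound on the per-call hit rate, in percent."""
    if prompt_tokens <= 0:
        return 0.0
    return 100.0 * max_cached_tokens(prompt_tokens) / prompt_tokens


for tokens in (400, 2054):
    print(f"{tokens} tokens: {max_cached_tokens(tokens)} cacheable, "
          f"{best_case_hit_rate(tokens):.1f}% best case")
```

Under these assumptions a ~400-token simulator prompt can never hit, while a 2,054-token prefix tops out at 2,048 cached tokens, or about 99.7%. That is consistent with the `gpt-5.1` row for a warm call, though how the test aggregates its four calls is not stated here.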

Test plan

codecov · 2026-05-18T16:58:41Z

Codecov Report

❌ Patch coverage is 91.25000% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
arksim/llms/chat/utils.py	63.63%	4 Missing ⚠️
arksim/llms/chat/providers/openai.py	93.33%	2 Missing ⚠️
arksim/simulation_engine/simulator.py	50.00%	1 Missing ⚠️

…overrides

- _evaluator_endpoint_differs now compares evaluator_provider/evaluator_base_url against shared peers, so a redundant override (same value) no longer drops the shared api_key and crashes the evaluator at run time.
- Add a WARNING when evaluator_api_key is set but the endpoint does not differ, so users notice when the override is silently ignored.
- Strip whitespace in the existing endpoint-split INFO log so empty / whitespace-only values surface as None instead of literal blanks.
- Drop dead `or settings.model` / `or settings.provider` fallbacks in the HTML report kwargs (evaluator_llm_kwargs already guarantees both keys).
- Reorder examples/customer-service/README.md so the Ollama prereqs (`ollama serve`, model pull, OPENAI_API_KEY) appear before `arksim simulate-evaluate`, preventing a top-to-bottom reader from hitting a connection error.

Adds 3 tests covering the new endpoint-differs branches and the discarded-override warning.
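One possible reading of the endpoint-differs rule and the per-group fallback, as a minimal sketch. The function names, dict layout, and exact fallback order are assumptions made for illustration. The PR's `_evaluator_endpoint_differs` and `evaluator_llm_kwargs` may differ in detail.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("evaluator_sketch")
Settings = Dict[str, Optional[str]]


def norm(value: Optional[str]) -> Optional[str]:
    """Blank or whitespace-only values count as unset."""
    if value is None:
        return None
    return value.strip() or None


def endpoint_differs(shared: Settings, override: Settings) -> bool:
    """True only when an override is set AND differs from its shared peer."""
    for key in ("provider", "base_url"):
        candidate = norm(override.get(key))
        if candidate is not None and candidate != norm(shared.get(key)):
            return True
    return False


def evaluator_kwargs(shared: Settings, override: Settings) -> Settings:
    differs = endpoint_differs(shared, override)
    kwargs: Settings = {}
    # Model/provider group: an override wins, otherwise use the shared value.
    for key in ("model", "provider"):
        kwargs[key] = norm(override.get(key)) or norm(shared.get(key))
    # Endpoint group: never mix shared credentials into a different endpoint.
    source = override if differs else shared
    for key in ("base_url", "api_key"):
        kwargs[key] = norm(source.get(key))
    if not differs and norm(override.get("api_key")) is not None:
        logger.warning("evaluator api_key override ignored: endpoint matches the simulator")
    return kwargs


shared = {"model": "gpt-5.1", "provider": "openai", "base_url": None, "api_key": "sk-shared"}
split = evaluator_kwargs(shared, {"base_url": "http://localhost:11434/v1"})
redundant = evaluator_kwargs(shared, {"provider": "openai"})
print(split["api_key"], redundant["api_key"])
```

In this sketch the split evaluator receives no `api_key` at all, so the shared key is never sent to the other endpoint, while the redundant override keeps the shared key and does not crash at run time.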

…d rail

The _INVISIBLE_CHARS handling was added in 466d7af as part of the zero-nits sweep, defending against zero-width characters from rich-text copy-paste into config.yaml. Probability of organic occurrence is near-zero per the original adversary review; the feature is generic string handling unrelated to the Open Responses scope of this PR. Keep _norm_endpoint_value with plain str.strip() only.
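The trade-off accepted here can be seen in a short sketch. The function below is a stand-in written for illustration, not the PR's `_norm_endpoint_value`. It relies only on the fact that Python's `str.strip()` removes whitespace but not zero-width characters such as U+200B.

```python
from typing import Optional


def norm_endpoint_value(value: Optional[str]) -> Optional[str]:
    """Plain str.strip() normalization; blank values become None."""
    if value is None:
        return None
    return value.strip() or None


# Ordinary whitespace is handled.
print(norm_endpoint_value("   "))                            # None
print(norm_endpoint_value("  http://localhost:11434/v1  "))

# A zero-width space pasted from rich text survives the strip.
pasted = norm_endpoint_value(" \u200bhttp://localhost:11434/v1 ")
print(repr(pasted))
```

The last value still starts with the invisible character, which is the case the removed `_INVISIBLE_CHARS` handling targeted and which the commit message judges too unlikely to justify the extra code.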

itsarbit requested a review from a team as a code owner May 18, 2026 16:57

itsarbit added 20 commits May 19, 2026 11:22

feat(llm): allow base_url and api_key on OpenAILLM constructor

46bdfc6

feat(llm): register responses/open_responses provider aliases

68874cb

refactor(llm): drop open_responses alias, keep responses

200f372

test(integration): add opt-in Open Responses smoke against OpenAI

e4f6722

test(integration): override smoke model via env, document retry cost

478ca8e

docs: add user-simulator-on-open-responses page

e62bb20

docs: polish user-simulator-on-open-responses page

7b8d7e8

docs: link Open Responses user-simulator page into nav

99bae64

docs: use relative link for user-simulator-on-open-responses

0ed8ca7

docs(examples): show Ollama swap for user simulator

ce2b8ea

docs: changelog entry for Open Responses LLM provider

47d8d77

feat(llm): plumb base_url/api_key from YAML to LLM construction

ef7c89b

docs: align Open Responses examples with flat YAML schema

0921734

fix(llm): strip whitespace, wrap missing-credentials error

99c5957

docs: scope Open Responses to simulator, fix vendor accuracy

670f937

feat(evaluator): allow simulator/evaluator LLM split via evaluator_* overrides

c0bf468

fix(evaluator): per-group LLM fallback to prevent cross-endpoint credential leak

e3f98df

fix(evaluator): broader whitespace normalization and field-tuple guard rail

b83c362

itsarbit force-pushed the feat/responses-api branch from 1f545da to d808ff5 May 19, 2026 18:24

itsarbit added 3 commits May 19, 2026 15:21

feat(llm): surface cache-hit stats and verify the cache benefit

a304200

test(integration): add multi-model cache verification matrix

942f214

test(integration): verify cache hits grow across simulator turns

d18b163

feat(llm): multi-provider Open Responses support for user simulator #170
itsarbit wants to merge 23 commits into main from feat/responses-api
