feat(llm): multi-provider Open Responses support for user simulator#170
Open
itsarbit wants to merge 23 commits into
…overrides

- _evaluator_endpoint_differs now compares evaluator_provider/evaluator_base_url against shared peers, so a redundant override (same value) no longer drops the shared api_key and crashes the evaluator at run time.
- Add a WARNING when evaluator_api_key is set but the endpoint does not differ, so users notice when the override is silently ignored.
- Strip whitespace in the existing endpoint-split INFO log so empty / whitespace-only values surface as None instead of literal blanks.
- Drop dead `or settings.model` / `or settings.provider` fallbacks in the HTML report kwargs (evaluator_llm_kwargs already guarantees both keys).
- Reorder examples/customer-service/README.md so the Ollama prereqs (`ollama serve`, model pull, OPENAI_API_KEY) appear before `arksim simulate-evaluate`, preventing a top-to-bottom reader from hitting a connection error.

Adds 3 tests covering the new endpoint-differs branches and the discarded-override warning.
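The endpoint comparison and the discarded-override warning described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `_norm`, `endpoint_differs`, `resolve_evaluator_credentials`, and the plain-dict inputs are hypothetical stand-ins for the real settings objects, and the URLs and keys are placeholders.

```python
import logging
from typing import Optional

logger = logging.getLogger("endpoint_sketch")  # hypothetical logger name


def _norm(value: Optional[str]) -> Optional[str]:
    """Strip whitespace; blank or whitespace-only values become None."""
    if value is None:
        return None
    return value.strip() or None


def endpoint_differs(shared: dict, evaluator: dict) -> bool:
    """True only if an evaluator override is set AND differs from its shared peer."""
    for key in ("provider", "base_url"):
        override = _norm(evaluator.get(key))
        if override is not None and override != _norm(shared.get(key)):
            return True
    return False


def resolve_evaluator_credentials(shared: dict, evaluator: dict) -> dict:
    """Pick the base_url/api_key the evaluator should use (assumed semantics)."""
    if endpoint_differs(shared, evaluator):
        # Different endpoint: never forward the shared credentials.
        return {
            "base_url": _norm(evaluator.get("base_url")),
            "api_key": _norm(evaluator.get("api_key")),
        }
    if _norm(evaluator.get("api_key")) is not None:
        # Same endpoint: the key override is discarded, so say so loudly.
        logger.warning(
            "evaluator_api_key is set but the evaluator endpoint matches "
            "the shared endpoint; the override is ignored"
        )
    return {
        "base_url": _norm(shared.get("base_url")),
        "api_key": _norm(shared.get("api_key")),
    }


shared = {
    "provider": "responses",
    "base_url": "http://localhost:11434/v1",
    "api_key": "shared-key",
}
# A redundant override (same value, extra whitespace) keeps the shared key.
same = resolve_evaluator_credentials(
    shared, {"base_url": " http://localhost:11434/v1 "}
)
# A genuinely different endpoint gets only the evaluator's own key.
other = resolve_evaluator_credentials(
    shared, {"base_url": "https://api.openai.com/v1", "api_key": "eval-key"}
)
```

The point of comparing normalized values is that a redundant override counts as "same endpoint", so the shared `api_key` is still forwarded instead of being dropped.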
The _INVISIBLE_CHARS handling was added in 466d7af as part of the zero-nits sweep, defending against zero-width characters from rich-text copy-paste into config.yaml. Probability of organic occurrence is near-zero per the original adversary review; the feature is generic string handling unrelated to the Open Responses scope of this PR. Keep _norm_endpoint_value with plain str.strip() only.
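To make the trade-off concrete, here is a small sketch of what a plain `str.strip()` normalizer does and does not handle. The function name and its blank-to-`None` behavior are assumptions for illustration; the actual `_norm_endpoint_value` in the PR may differ.

```python
from typing import Optional


def norm_endpoint_value(value: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in: strip whitespace, map blank values to None."""
    if value is None:
        return None
    return value.strip() or None


# Ordinary whitespace is removed, and whitespace-only input becomes None.
clean = norm_endpoint_value("  https://api.openai.com/v1\n")
blank = norm_endpoint_value(" \t ")

# U+200B (zero-width space) is not whitespace to str.strip(), so it survives.
zero_width = norm_endpoint_value("\u200bhttps://api.openai.com/v1")
```

With plain `str.strip()`, a zero-width character pasted from rich text stays in the value, which is the case the removed `_INVISIBLE_CHARS` handling defended against.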
1f545da to d808ff5
Summary
- `OpenAILLM` accepts optional `base_url` and `api_key` constructor kwargs, with empty-string coercion so YAML configs with blank values fall back to env vars
- `responses` is a provider alias for `OpenAILLM` in `LLM._get_provider`
- `SimulationInput` and `EvaluationInput` Pydantic models gain `base_url` and `api_key` fields, forwarded at both `LLM(...)` construction sites
- `EvaluationInput` gains `evaluator_*` overrides with per-group fallback semantics: when the evaluator endpoint differs from the simulator, shared `base_url` / `api_key` are NOT forwarded (prevents cross-endpoint credential leaks)
- Cache stats on `OpenAILLM`, plus an end-of-run log line ("simulation LLM cache stats: N calls, X input tokens, Y cached (Z% hit rate)")

The OpenAI provider has run on `/v1/responses` since v0.3.0. This change unlocks pointing the user-simulator LLM at any Open Responses-conforming backend (OpenAI, Ollama, vLLM, NVIDIA NIM, Vercel AI Gateway, OpenRouter, LM Studio, Llama Stack) via flat top-level `model`, `provider: responses`, `base_url`, `api_key` keys in `config.yaml`.

Cache benefit, verified
Verified end-to-end by running the unmodified `examples/customer-service` config against `gpt-5.1` (the arksim DEFAULT_MODEL), 7 scenarios x 2 conversations x 3 turns, with the same model used for both simulation and evaluation:

The evaluator phase realizes ~52% cache hits because its prompts share a large, stable rubric prefix across every evaluation call. The simulator phase, with the default `max_turns: 3` and a short per-scenario prompt, builds prompts that average ~400 tokens per call (below OpenAI's 1,024-token cache floor) and so registers no hits in this short-run example. For multi-turn scenarios where the simulator prompt clears 1,024 tokens, the cache fires as shown below.

The realistic multi-turn shape is exercised by `test_cache_hits_grow_across_simulator_turns`, which mirrors arksim's `[system + scenario + growing history + trigger]` call pattern on `gpt-5.1` and observes (cold-cache run):

A separate per-model matrix (`test_responses_provider_cache_matrix`) sends the same ~2,054-token prefix four times to confirm caching wiring across the gpt-5 family. Picking two for context:

- `gpt-5.1` (arksim DEFAULT_MODEL)
- `gpt-5`

The matrix is opt-in (skipped without `OPENAI_API_KEY`) and costs roughly $0.01-0.05 per model per run.

Test plan
- `OpenAILLM` (constructor + cache stats)
- `LLM._get_provider` alias resolution
- `SimulationInput` / `EvaluationInput` field declarations + evaluator override fallback
- (`tests/unit/test_responses_yaml_plumbing.py`)
- `api.openai.com/v1/responses` (smoke + explicit base_url + cache hit), skipped without `OPENAI_API_KEY`
- (`test_responses_provider_cache_matrix`) against real OpenAI
- (`test_cache_hits_grow_across_simulator_turns`) verifying cache fires on arksim's real call shape
- `gpt-5.1`: 52.5% evaluator cache hit rate
- `make lint` clean; 936 unit tests pass
- `simulate-conversation.mdx` cross-links
- `examples/customer-service/README.md` updated with Ollama + OpenAI-evaluator swap snippet using the flat YAML schema
- `[Unreleased] > Added` for both `llm` and `evaluator` scopes
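
For readers who want a feel for the constructor coercion and the end-of-run cache line from the Summary, here is a rough sketch under stated assumptions. `blank_to_none`, `resolve_api_key`, and `CacheStats` are hypothetical names, the hit rate is assumed to be cached tokens divided by input tokens, and the token counts are made up for illustration rather than taken from any run.

```python
import os
from dataclasses import dataclass
from typing import Optional


def blank_to_none(value: Optional[str]) -> Optional[str]:
    """Treat '' (for example a blank YAML value) the same as an unset value."""
    return value if value else None


def resolve_api_key(api_key: Optional[str] = None) -> Optional[str]:
    # Assumption: an unset or blank key falls back to OPENAI_API_KEY.
    return blank_to_none(api_key) or os.environ.get("OPENAI_API_KEY")


@dataclass
class CacheStats:
    """Running totals used to build the end-of-run log line."""

    calls: int = 0
    input_tokens: int = 0
    cached_tokens: int = 0

    def record(self, input_tokens: int, cached_tokens: int) -> None:
        self.calls += 1
        self.input_tokens += input_tokens
        self.cached_tokens += cached_tokens

    def summary(self) -> str:
        # Assumed definition: hit rate = cached tokens / input tokens.
        rate = 100.0 * self.cached_tokens / self.input_tokens if self.input_tokens else 0.0
        return (
            f"simulation LLM cache stats: {self.calls} calls, "
            f"{self.input_tokens} input tokens, "
            f"{self.cached_tokens} cached ({rate:.1f}% hit rate)"
        )


stats = CacheStats()
stats.record(input_tokens=2000, cached_tokens=0)     # illustrative cold call
stats.record(input_tokens=2000, cached_tokens=1000)  # illustrative warm call
line = stats.summary()
```

Guarding the division keeps the summary usable for runs that made no calls at all.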