fix: omit unsupported GPT-5 temperature overrides #95
Pull request overview
This PR updates request payload construction in p2m so that unsupported non-default temperature values are no longer sent for GPT-5.x deployments (notably Azure deployment-name models). It also extends the viewer’s metrics pipeline and UI to show policy-violation rates split by behavior permissibility (allowed vs. blocked requests), loading judge taxonomy data to do so.
Changes:
- Omit non-default `temperature` overrides for GPT-5.x models when building LiteLLM chat + Responses API payloads (while preserving explicit `temperature=1` and custom temps for non–GPT-5 models).
- Add server-side metric aggregation for policy-violation outcomes split by behavior permissibility, and surface it in the viewer run page UI for both prompt and audit tabs.
- Add viewer-side taxonomy loading helpers + tests covering permissibility aggregation and taxonomy loading behavior.
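
The permissibility split described in the changes above can be illustrated with a small sketch. It is written in Python for brevity, although the viewer code itself is TypeScript. The `Judgment` and `SplitMetric` shapes, the `split_by_permissibility` function, and the `permissible_by_behavior` mapping are hypothetical stand-ins; only the `count` field and the two metric names are taken from this PR. The sketch assumes the denominator is the set of judgments marked relevant, and that behaviors missing from the taxonomy are skipped.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Judgment:
    # Hypothetical shape of one per-behavior judgment on a node.
    behavior_id: str
    relevant: bool   # judge marked the behavior as relevant
    violated: bool   # judge found a policy violation


@dataclass
class SplitMetric:
    count: int = 0        # relevant judgments (the denominator)
    violations: int = 0   # relevant judgments that were violations

    @property
    def rate(self) -> Optional[float]:
        # None when there is nothing to divide by.
        return self.violations / self.count if self.count else None


def split_by_permissibility(
    judgments: Iterable[Judgment],
    permissible_by_behavior: Dict[str, bool],
) -> Dict[str, SplitMetric]:
    """Aggregate violations separately for allowed and blocked behaviors."""
    allowed, blocked = SplitMetric(), SplitMetric()
    for j in judgments:
        # Assumption: irrelevant judgments and unknown behaviors are skipped.
        if not j.relevant or j.behavior_id not in permissible_by_behavior:
            continue
        bucket = allowed if permissible_by_behavior[j.behavior_id] else blocked
        bucket.count += 1
        bucket.violations += int(j.violated)
    return {
        "policyViolationOnPermissible": allowed,
        "policyViolationOnNotPermissible": blocked,
    }
```

Keeping `count` alongside the violation tally lets the UI hide the summary cards when both counts are zero, which matches the guard in the run page template.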
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| viewer/src/routes/suite/[suite_id]/[run_id]/+page.svelte | Renders new “Allowed/Blocked requests failed” summary cards for prompts and audits. |
| viewer/src/lib/types.ts | Extends RunMetrics/AuditRunMetrics types with permissibility-split fields. |
| viewer/src/lib/server/metrics.ts | Computes policy-violation aggregates split by permissibility from node judgments. |
| viewer/src/lib/server/data.ts | Loads behaviors from taxonomy and threads them into metrics computation + view models. |
| viewer/src/lib/server/artifacts.ts | Adds helpers to load judge taxonomy from artifacts/config/run directory. |
| tests/test_viewer_server_artifacts.py | Adds Node harness tests for permissibility metrics + taxonomy loading. |
| tests/test_model_client.py | Adds focused tests for GPT-5 temperature omission/retention behavior. |
| p2m/core/model_client.py | Implements GPT-5 temperature override omission in payload builders. |

    {#if data.metrics.policyViolationOnPermissible || data.metrics.policyViolationOnNotPermissible}
      {@const promptPerm = data.metrics.policyViolationOnPermissible}
      {@const promptNotPerm = data.metrics.policyViolationOnNotPermissible}
      {#if (promptPerm?.count ?? 0) + (promptNotPerm?.count ?? 0) > 0}
        <div class="mb-4 grid gap-3 sm:grid-cols-2" title="Per-behavior judgments aggregated across prompts. Denominator is judgments the judge marked relevant for that behavior.">

    const rawTaxonomyPath = typeof judge?.taxonomy_path === 'string' ? judge.taxonomy_path : null;
    if (!rawTaxonomyPath) return null;

    const resolved = path.resolve(rawTaxonomyPath);
    return readJsonFile<Taxonomy>(resolved, { missingOk: true });

    temperature = _temperature_for_payload(model, resolved_options.temperature)
    if temperature is not None:
        payload["temperature"] = temperature

      artifacts: Record<string, unknown> | null
    ): Taxonomy | null {
      const systematize = readObject(artifacts?.systematize);
      const artifactTaxonomyPath = typeof systematize?.path === 'string' ? systematize.path : null;

@jakepresent: Build triage on the internal UX-testing chat feedback (5/28–5/29) flagged this PR as the central fix for #50. With #150 merged (literal The PR is on old Full triage rollup in session artifact:
Summary
Validation
Bug bash context
During setup smoke, GPT-5.4 Azure deployments rejected config temperatures like 0.0/0.2 because they only accept the provider default temperature. This keeps existing configs from failing when the selected deployment is GPT-5.x.
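
The behavior described here can be sketched as follows. This is a minimal illustration, not the code in `p2m/core/model_client.py`: the `temperature_for_payload` and `build_payload` functions, the regex used to recognise GPT-5.x names, and the assumption that the provider default temperature is 1 are all hypothetical choices made for the example.

```python
import re
from typing import Any, Dict, List, Optional

# Assumption: a GPT-5.x model or Azure deployment can be recognised by
# "gpt-5" in its name (not followed by another digit). The real detection
# logic in p2m may differ.
_GPT5_PATTERN = re.compile(r"gpt-5(?![0-9])", re.IGNORECASE)

# Assumption: the provider default temperature is 1.
DEFAULT_TEMPERATURE = 1.0


def temperature_for_payload(model: str, temperature: Optional[float]) -> Optional[float]:
    """Return the temperature to send, or None to leave the field out."""
    if temperature is None:
        return None
    # GPT-5.x only accepts the default, so drop any other configured value.
    if _GPT5_PATTERN.search(model) and temperature != DEFAULT_TEMPERATURE:
        return None
    return temperature


def build_payload(
    model: str,
    messages: List[Dict[str, Any]],
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a chat payload, adding temperature only when it is supported."""
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    resolved = temperature_for_payload(model, temperature)
    if resolved is not None:
        payload["temperature"] = resolved
    return payload
```

Under these assumptions, a config temperature of 0.2 is dropped for a name such as `azure/gpt-5.4-prod`, an explicit `temperature=1` is still sent, and non-GPT-5 models keep their custom value. Omitting the key, rather than rewriting it to 1, leaves the choice of default to the provider.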