Skip to content

fix: omit unsupported GPT-5 temperature overrides#95

Open
jakepresent wants to merge 2 commits into
mainfrom
jakepresent/gpt5-temperature-normalization
Open

fix: omit unsupported GPT-5 temperature overrides#95
jakepresent wants to merge 2 commits into
mainfrom
jakepresent/gpt5-temperature-normalization

Conversation

@jakepresent

Copy link
Copy Markdown
Collaborator

Summary

  • omit non-default temperature from GPT-5.x LiteLLM payloads, including Azure deployment names like azure/gpt-5.4-1
  • keep explicit default temperature=1 and keep custom temperatures for non-GPT-5 models
  • cover chat payload normalization with focused model-client tests

Validation

  • python -m pytest tests/test_model_client.py -q
  • python -m pytest tests/test_model_client.py tests/test_runtime_modes.py -q

Bug bash context

During setup smoke, GPT-5.4 Azure deployments rejected config temperatures like 0.0/0.2 because they only accept the provider default temperature. This keeps existing configs from failing when the selected deployment is GPT-5.x.

@jakepresent jakepresent requested a review from changliu2 May 26, 2026 18:35
@jakepresent jakepresent force-pushed the jakepresent/gpt5-temperature-normalization branch 2 times, most recently from f294981 to 86ae430 Compare May 26, 2026 18:56
@AaronAspinwall123 AaronAspinwall123 requested a review from Copilot May 26, 2026 19:31

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates request payload construction in p2m to avoid sending unsupported non-default temperature values for GPT-5.x deployments (notably Azure deployment-name models), and also extends the viewer’s metrics pipeline/UI to show policy-violation rates split by behavior permissibility (allowed vs blocked requests) by loading judge taxonomy data.

Changes:

  • Omit non-default temperature overrides for GPT-5.x models when building LiteLLM chat + Responses API payloads (while preserving explicit temperature=1 and custom temps for non–GPT-5 models).
  • Add server-side metric aggregation for policy-violation outcomes split by behavior permissibility, and surface it in the viewer run page UI for both prompt and audit tabs.
  • Add viewer-side taxonomy loading helpers + tests covering permissibility aggregation and taxonomy loading behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
viewer/src/routes/suite/[suite_id]/[run_id]/+page.svelte Renders new “Allowed/Blocked requests failed” summary cards for prompts and audits.
viewer/src/lib/types.ts Extends RunMetrics/AuditRunMetrics types with permissibility-split fields.
viewer/src/lib/server/metrics.ts Computes policy-violation aggregates split by permissibility from node judgments.
viewer/src/lib/server/data.ts Loads behaviors from taxonomy and threads them into metrics computation + view models.
viewer/src/lib/server/artifacts.ts Adds helpers to load judge taxonomy from artifacts/config/run directory.
tests/test_viewer_server_artifacts.py Adds Node harness tests for permissibility metrics + taxonomy loading.
tests/test_model_client.py Adds focused tests for GPT-5 temperature omission/retention behavior.
p2m/core/model_client.py Implements GPT-5 temperature override omission in payload builders.

Comment on lines +923 to +927
{#if data.metrics.policyViolationOnPermissible || data.metrics.policyViolationOnNotPermissible}
{@const promptPerm = data.metrics.policyViolationOnPermissible}
{@const promptNotPerm = data.metrics.policyViolationOnNotPermissible}
{#if (promptPerm?.count ?? 0) + (promptNotPerm?.count ?? 0) > 0}
<div class="mb-4 grid gap-3 sm:grid-cols-2" title="Per-behavior judgments aggregated across prompts. Denominator is judgments the judge marked relevant for that behavior.">
Comment on lines +579 to +583
const rawTaxonomyPath = typeof judge?.taxonomy_path === 'string' ? judge.taxonomy_path : null;
if (!rawTaxonomyPath) return null;

const resolved = path.resolve(rawTaxonomyPath);
return readJsonFile<Taxonomy>(resolved, { missingOk: true });
Comment thread p2m/core/model_client.py
Comment on lines +553 to +555
temperature = _temperature_for_payload(model, resolved_options.temperature)
if temperature is not None:
payload["temperature"] = temperature
artifacts: Record<string, unknown> | null
): Taxonomy | null {
const systematize = readObject(artifacts?.systematize);
const artifactTaxonomyPath = typeof systematize?.path === 'string' ? systematize.path : null;
@changliu2

Copy link
Copy Markdown
Collaborator

@jakepresent — Build triage on the internal UX-testing chat feedback (5/28–5/29) flagged this PR as the central fix for #50. With #150 merged (literal temperature: 0.7 gone from example YAMLs), this becomes the durable fix so future GPT-5 SKUs don't paper-cut users with custom temperatures.

The PR is on old p2m/ paths; needs rebase onto current main (which has the assert_eval/ rename). Could you rebase? If not, I can do a path-rename pass on a side branch for you to cherry-pick.

Full triage rollup in session artifact: build-triage-final-rollup.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants