Feature request: GA Realtime audio models gpt-realtime-translate & gpt-realtime-whisper #283

Aloha Lou,

First, thank you for AIProxySwift and for the GA realtime migration. I've got `gpt-realtime-2` running on a real device through it (device to AIProxy to OpenAI, DeviceCheck passing) and it works great.

I'm trying to use the two new GA audio models OpenAI shipped on 2026-05-07, and I hit a couple of spots where the SDK can't quite express them yet. I did the digging so this is hopefully a quick read. I'm not sending a PR (I'd rather you own the design on this one), but I'm very happy to test any branch on real hardware, translation especially.

I verified everything below against AIProxySwift 0.153.0 on Xcode 26.x, on a real device with DeviceCheck passing. The line references point at the `0.153.0` tag.

## Where things stand

| Model | Status in 0.153.0 | What I'm asking for |
| --- | --- | --- |
| `gpt-realtime-2` | Works today via `realtimeSession(model:)` | Nothing (optional reasoning-effort knob) |
| `gpt-realtime-whisper` | Partly there | A `delay` field on `InputAudioTranscription`, plus a confirmed transcription-session connect path |
| `gpt-realtime-translate` | Not expressible yet | A dedicated translation session (separate endpoint plus the `session.*` events) |

Docs I leaned on are the [OpenAI announcement](https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/), the [realtime-translation guide](https://developers.openai.com/api/docs/guides/realtime-translation), and the [realtime-transcription guide](https://developers.openai.com/api/docs/guides/realtime-transcription).

## What already works (so you can skip re-checking it)

I confirmed these so they're off your plate:

- `gpt-realtime-2` runs over the existing path. `realtimeSession(model:)` takes a free-form model string and connects to `/v1/realtime?model=…` ([`OpenAIService.swift#L271-L286`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIService.swift#L271-L286)).
- The input transcription model is free-form, so `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-realtime-whisper` can already be set as a session's input transcription model ([`OpenAIRealtimeSessionConfiguration.swift#L120-L129`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeSessionConfiguration.swift#L120-L129)).
- `SessionType.transcription` exists ([`#L545-L548`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeSessionConfiguration.swift#L545-L548)), and the transcript-delta server events already decode (`conversation.item.input_audio_transcription.delta/.completed/.segment`, [`OpenAIRealtimeMessage.swift#L107-L114`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeMessage.swift#L107-L114)).

## 1. A dedicated translation session for `gpt-realtime-translate` (my main ask)

This is the one I'm most excited about. I already have live speech translation (70+ input languages, 13 output) working on the web, and I want to bring it to iOS through AIProxy. It runs on a separate endpoint with its own `session.`-prefixed event family, so the current realtime path can't reach it.

Here's why it can't reach it today:

- The realtime URL is hardcoded to `/v1/realtime?model=\(model)` ([`OpenAIService.swift#L271-L286`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIService.swift#L271-L286)), and translation needs `/v1/realtime/translations?model=gpt-realtime-translate`.
- The input event emits `input_audio_buffer.append` ([`OpenAIRealtimeInputAudioBufferAppend.swift#L10-L11`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeInputAudioBufferAppend.swift#L10-L11)), and translation expects the `session.`-prefixed variant, `session.input_audio_buffer.append`.
- The decoder ([`OpenAIRealtimeMessage.swift#L54-L118`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeMessage.swift#L54-L118)) doesn't model the translation server events.

Here's the protocol from the OpenAI docs ([guide](https://developers.openai.com/api/docs/guides/realtime-translation), [cookbook](https://developers.openai.com/cookbook/examples/voice_solutions/realtime_translation_guide), [model](https://developers.openai.com/api/docs/models/gpt-realtime-translate)):

- Connect to `wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate`.
- Set the target language once the socket opens.
  ```json
  { "type": "session.update",
    "session": { "audio": { "output": { "language": "es" } } } }
  ```
- Client to server uses `session.input_audio_buffer.append` (24 kHz PCM16), plus `session.close` to flush and end.
- Server to client sends `session.output_audio.delta` (200 ms PCM16), `session.output_transcript.delta`, `session.input_transcript.delta`, and `session.closed`.
- No `response.create`, no conversation or turn lifecycle, no tools, voice, or prompt.
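To make the client side of that protocol concrete, here's a rough sketch of the three client events as `Encodable` types. Every type name here is hypothetical (nothing like this exists in 0.153.0), and the `audio` payload key on the append event is my assumption, mirroring the existing `input_audio_buffer.append`; please check it against the guide before relying on it.

```swift
import Foundation

// Hypothetical client events for the translation protocol.
// None of these type names exist in AIProxySwift 0.153.0.

/// Sets the target language: { "type": "session.update", "session": { "audio": { "output": { "language": ... } } } }
struct TranslationSessionUpdate: Encodable {
    struct Session: Encodable {
        struct Audio: Encodable {
            struct Output: Encodable { let language: String }
            let output: Output
        }
        let audio: Audio
    }

    let type = "session.update"
    let session: Session

    init(outputLanguage: String) {
        session = Session(audio: .init(output: .init(language: outputLanguage)))
    }
}

/// Streams one chunk of 24 kHz PCM16 audio to the server.
struct TranslationAudioAppend: Encodable {
    let type = "session.input_audio_buffer.append"
    // Assumption: the payload key is `audio` (base64), as in `input_audio_buffer.append`.
    let audio: String

    init(pcm16: Data) {
        audio = pcm16.base64EncodedString()
    }
}

/// Flushes pending audio and ends the session.
struct TranslationSessionClose: Encodable {
    let type = "session.close"
}

/// Encodes an event to the JSON text that would go over the socket.
/// Sorted keys keep the output stable for logging and comparison.
func encodeEvent<E: Encodable>(_ event: E) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(event), as: UTF8.self)
}
```

Sending `encodeEvent(TranslationSessionUpdate(outputLanguage: "es"))` right after the socket opens would produce the same JSON as the `session.update` example in the list above.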

A minimal surface that would unblock me (just sketching, the shape is your call):

```swift
// New entry point that targets the translations endpoint.
public func realtimeTranslationSession(
    model: String = "gpt-realtime-translate",
    configuration: OpenAIRealtimeTranslationSessionConfiguration,
    logLevel: AIProxyLogLevel
) async throws -> OpenAIRealtimeSession

// Config carries only what translation supports.
public struct OpenAIRealtimeTranslationSessionConfiguration {
    public var outputLanguage: String        // session.audio.output.language
    public var inputAudioTranscription: InputAudioTranscription?  // e.g. gpt-realtime-whisper
}

// New client event: type = "session.input_audio_buffer.append"
// New decoded server events: session.output_audio.delta,
//   session.output_transcript.delta, session.input_transcript.delta, session.closed
```
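For the decoding half, a sketch like the one below is roughly what I have in mind. It only switches on the `type` discriminator, because I don't want to guess at payload field names the docs may spell differently. `TranslationServerEvent` is a made-up name, and the real thing would presumably live alongside (or inside) `OpenAIRealtimeMessage`.

```swift
import Foundation

// Hypothetical decoder for the translation server events.
// It reads only the `type` field; payload fields are deliberately left out
// because their exact names are not confirmed here.
enum TranslationServerEvent: Equatable {
    case outputAudioDelta
    case outputTranscriptDelta
    case inputTranscriptDelta
    case closed
    case unknown(String) // any event type this sketch does not model
}

extension TranslationServerEvent: Decodable {
    private enum CodingKeys: String, CodingKey {
        case type
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let type = try container.decode(String.self, forKey: .type)
        switch type {
        case "session.output_audio.delta":
            self = .outputAudioDelta
        case "session.output_transcript.delta":
            self = .outputTranscriptDelta
        case "session.input_transcript.delta":
            self = .inputTranscriptDelta
        case "session.closed":
            self = .closed
        default:
            // Keep unrecognized events instead of throwing, so new server
            // events do not break an open session.
            self = .unknown(type)
        }
    }
}

/// Convenience wrapper for decoding one text frame from the socket.
func decodeTranslationEvent(_ json: String) throws -> TranslationServerEvent {
    try JSONDecoder().decode(TranslationServerEvent.self, from: Data(json.utf8))
}
```

Falling back to `.unknown` rather than throwing is a design choice I'd lean toward, since the event family is new and may grow.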

## 2. A `delay` knob and transcription connect path for `gpt-realtime-whisper`

This is OpenAI's recommended low-latency streaming transcription model. Most of the plumbing is already there (see "What already works" above), so two gaps remain.

Gap A is the missing `delay`. `InputAudioTranscription` exposes `model`, `language`, and `prompt`, but not `delay` ([`#L120-L129`](https://github.com/lzell/AIProxySwift/blob/0.153.0/Sources/AIProxy/OpenAI/OpenAIRealtimeSessionConfiguration.swift#L120-L129)). `gpt-realtime-whisper` tunes its latency and accuracy through `audio.input.transcription.delay` (`minimal`, `low`, `medium`, `high`, `xhigh`), and right now there's no way to set it ([transcription guide](https://developers.openai.com/api/docs/guides/realtime-transcription), [sessions reference](https://developers.openai.com/api/reference/resources/realtime/subresources/sessions/)).

```swift
public struct InputAudioTranscription {
    public var language: String?
    public var model: String?
    public var prompt: String?
    public var delay: Delay?   // new, encodeIfPresent at audio.input.transcription.delay
    public enum Delay: String, Encodable, Sendable {
        case minimal, low, medium, high, xhigh
    }
}
```
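To show the wire shape I'm after, here's a standalone sketch (not the SDK's actual types) that nests the transcription config at `audio.input.transcription` and leans on synthesized `Encodable`, which omits `nil` optionals. `TranscriptionSketch` and `SessionSketch` are hypothetical names for illustration only.

```swift
import Foundation

// Standalone illustration of where `delay` would land in the session JSON.
// These are not AIProxySwift types.
struct TranscriptionSketch: Encodable {
    enum Delay: String, Encodable {
        case minimal, low, medium, high, xhigh
    }

    var model: String?
    var language: String?
    var delay: Delay? // omitted from the JSON when nil
}

struct SessionSketch: Encodable {
    struct Audio: Encodable {
        struct Input: Encodable {
            let transcription: TranscriptionSketch
        }
        let input: Input
    }

    let audio: Audio

    init(transcription: TranscriptionSketch) {
        audio = Audio(input: .init(transcription: transcription))
    }
}

/// Encodes the session with sorted keys so the output is deterministic.
func encodeSession(_ session: SessionSketch) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(session), as: UTF8.self)
}
```

With `delay: .low` the encoded session carries `"delay":"low"` under `audio.input.transcription`; with `delay: nil` the key is absent, so existing callers would see no change on the wire.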

Gap B is the transcription-only connect path. A pure transcription session is `session.type = "transcription"`, but `realtimeSession(model:)` always appends `?model=…`, whereas OpenAI's transcription sessions connect with `?intent=transcription`. Could you confirm (or expose) the intended connect path for a transcription-only session through AIProxy? Per the docs, `turn_detection` has to be `null` and `prompt` isn't supported for `gpt-realtime-whisper`.
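The three connect paths mentioned in this issue differ only in path and query, so one way to picture it is a small target enum that builds the URL. This is a sketch against OpenAI's host directly; `RealtimeConnectTarget` is a hypothetical name, and the AIProxy-side URL would of course be whatever your backend expects.

```swift
import Foundation

// Hypothetical helper that builds the three realtime connect URLs
// discussed in this issue. Not part of AIProxySwift.
enum RealtimeConnectTarget {
    case conversation(model: String) // today's path: /v1/realtime?model=...
    case transcription               // /v1/realtime?intent=transcription
    case translation(model: String)  // /v1/realtime/translations?model=...

    var url: URL? {
        var components = URLComponents()
        components.scheme = "wss"
        components.host = "api.openai.com"
        switch self {
        case .conversation(let model):
            components.path = "/v1/realtime"
            components.queryItems = [URLQueryItem(name: "model", value: model)]
        case .transcription:
            components.path = "/v1/realtime"
            components.queryItems = [URLQueryItem(name: "intent", value: "transcription")]
        case .translation(let model):
            components.path = "/v1/realtime/translations"
            components.queryItems = [URLQueryItem(name: "model", value: model)]
        }
        return components.url
    }
}
```

For example, `RealtimeConnectTarget.transcription.url` yields `wss://api.openai.com/v1/realtime?intent=transcription`, which is the one the current `realtimeSession(model:)` can't produce.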

## 3. Optional reasoning-effort knob for `gpt-realtime-2`

`gpt-realtime-2`'s headline feature is configurable reasoning effort, but `OpenAIRealtimeSessionConfiguration` has no field for it, so callers get the default. This is low priority, only worth it if and when the realtime session schema for it settles ([model](https://developers.openai.com/api/docs/models/gpt-realtime-2)).

## Small README nit (stale realtime example)

The realtime snippet in `README.md` still uses `whisper-1` and the pre-GA `modalities:` arg (the one the SDK now warns about and remaps to `outputModalities`), around `README.md:1385` and `:1388`. The file-transcription example already uses `gpt-4o-mini-transcribe`, so it's just the realtime snippet that's out of step. A quick refresh of those two lines would stop folks copy-pasting the legacy model.

## Happy to help

I've got a working `gpt-realtime-2` + AIProxy app on real hardware, so I can test any of this on-device (translation especially) and send you traces and logs. Glad to validate a branch whenever it's useful; just tell me what would help.

Thanks again,
Ray Fernando

