Aloha Lou,
First, thank you for AIProxySwift and for the GA realtime migration. I've got gpt-realtime-2 running on a real device through it (device to AIProxy to OpenAI, DeviceCheck passing) and it works great.
I'm trying to use the two new GA audio models OpenAI shipped on 2026-05-07, and I hit a couple of spots where the SDK can't quite express them yet. I did the digging so this is hopefully a quick read. I'm not sending a PR (I'd rather you own the design on this one), but I'm very happy to test any branch on real hardware, translation especially.
I verified everything below against AIProxySwift 0.153.0 on Xcode 26.x, on a real device with DeviceCheck passing. The line references point at the 0.153.0 tag.
Where things stand
| Model | Status in 0.153.0 | What I'm asking for |
| --- | --- | --- |
| gpt-realtime-2 | Works today via realtimeSession(model:) | Nothing (optional reasoning-effort knob) |
| gpt-realtime-whisper | Partly there | A delay field on InputAudioTranscription, plus a confirmed transcription-session connect path |
| gpt-realtime-translate | Not expressible yet | A dedicated translation session (separate endpoint plus the session.* events) |
Docs I leaned on are the OpenAI announcement, the realtime-translation guide, and the realtime-transcription guide.
What already works (so you can skip re-checking it)
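I confirmed these so they're off your plate:
- gpt-realtime-2 runs over the existing path. realtimeSession(model:) takes a free-form model string and connects to /v1/realtime?model=… (OpenAIService.swift#L271-L286).
- gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-realtime-whisper can already be set as a session's input transcription model (OpenAIRealtimeSessionConfiguration.swift#L120-L129).
- SessionType.transcription exists (#L545-L548), and the transcript-delta server events already decode (conversation.item.input_audio_transcription.delta / .completed / .segment, OpenAIRealtimeMessage.swift#L107-L114).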
1. A dedicated translation session for gpt-realtime-translate (my main ask)
This is the one I'm most excited about. I already have live speech translation (70+ input languages, 13 output) working on the web, and I want to bring it to iOS through AIProxy. It runs on a separate endpoint with its own session.-prefixed event family, so the current realtime path can't reach it.
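Here's why it can't reach it today:
- realtimeSession(model:) connects to /v1/realtime?model=\(model) (OpenAIService.swift#L271-L286), and translation needs /v1/realtime/translations?model=gpt-realtime-translate.
- The client audio event is input_audio_buffer.append (OpenAIRealtimeInputAudioBufferAppend.swift#L10-L11), and translation uses the session.-prefixed variant.
- The server message decoder (OpenAIRealtimeMessage.swift#L54-L118) doesn't model the translation server events.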
Here's the protocol from the OpenAI docs (guide, cookbook, model):
- Connect to wss://api.openai.com/v1/realtime/translations?model=gpt-realtime-translate.
- Set the target language once the socket opens.
      { "type": "session.update",
        "session": { "audio": { "output": { "language": "es" } } } }
- Client to server uses session.input_audio_buffer.append (24 kHz PCM16), plus session.close to flush and end.
- Server to client sends session.output_audio.delta (200 ms PCM16), session.output_transcript.delta, session.input_transcript.delta, and session.closed.
- No response.create, no conversation or turn lifecycle, no tools, voice, or prompt.
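To make the client side of that protocol concrete, here's a rough standalone sketch of the connect URL and the three client messages. All type and function names are mine and hypothetical (nothing here exists in AIProxySwift), and the "audio" field on the append event is an assumption that it mirrors the existing input_audio_buffer.append payload. Through AIProxy the host would be the proxy's, not api.openai.com.

```swift
import Foundation

// Hypothetical helpers for illustration only; not part of AIProxySwift.

/// Builds the direct-to-OpenAI translations URL described in the guide.
func translationURL(model: String = "gpt-realtime-translate") -> URL? {
    var components = URLComponents()
    components.scheme = "wss"
    components.host = "api.openai.com"
    components.path = "/v1/realtime/translations"
    components.queryItems = [URLQueryItem(name: "model", value: model)]
    return components.url
}

/// session.update that sets session.audio.output.language.
struct TranslationSessionUpdate: Encodable {
    struct Session: Encodable {
        struct Audio: Encodable {
            struct Output: Encodable { let language: String }
            let output: Output
        }
        let audio: Audio
    }
    let type = "session.update"
    let session: Session

    init(outputLanguage: String) {
        session = Session(audio: .init(output: .init(language: outputLanguage)))
    }
}

/// session.input_audio_buffer.append carrying 24 kHz PCM16 audio.
/// ASSUMPTION: the payload key is "audio" with base64 content, as in the
/// non-prefixed input_audio_buffer.append event.
struct TranslationAudioAppend: Encodable {
    let type = "session.input_audio_buffer.append"
    let audio: String

    init(pcm16: Data) { audio = pcm16.base64EncodedString() }
}

/// session.close, which flushes and ends the session.
struct TranslationSessionClose: Encodable {
    let type = "session.close"
}

/// Encodes with sorted keys so the output is stable for logging and tests.
func jsonString<T: Encodable>(_ value: T) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(value), as: UTF8.self)
}
```

Encoding TranslationSessionUpdate(outputLanguage: "es") yields the same session.update body as the bullet above, just with keys sorted.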
A minimal surface that would unblock me (just sketching, the shape is your call):
    // New entry point that targets the translations endpoint.
    public func realtimeTranslationSession(
        model: String = "gpt-realtime-translate",
        configuration: OpenAIRealtimeTranslationSessionConfiguration,
        logLevel: AIProxyLogLevel
    ) async throws -> OpenAIRealtimeSession

    // Config carries only what translation supports.
    public struct OpenAIRealtimeTranslationSessionConfiguration {
        public var outputLanguage: String // session.audio.output.language
        public var inputAudioTranscription: InputAudioTranscription? // e.g. gpt-realtime-whisper
    }

    // New client event: type = "session.input_audio_buffer.append"
    // New decoded server events: session.output_audio.delta,
    // session.output_transcript.delta, session.input_transcript.delta, session.closed
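For the server side, here's one way the four translation events could decode, again just a sketch. The enum name is hypothetical, and I'm assuming each delta event carries its payload in a "delta" string field (base64 for audio), which I haven't confirmed against the reference. The unknown case keeps the decoder from throwing on event types it doesn't model.

```swift
import Foundation

// Hypothetical event model for illustration; not the SDK's OpenAIRealtimeMessage.
enum TranslationServerEvent: Decodable, Equatable {
    case outputAudioDelta(String)       // ASSUMPTION: base64 PCM16 in "delta"
    case outputTranscriptDelta(String)  // ASSUMPTION: text in "delta"
    case inputTranscriptDelta(String)   // ASSUMPTION: text in "delta"
    case closed
    case unknown(String)                // any event type not modeled here

    private enum CodingKeys: String, CodingKey { case type, delta }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: CodingKeys.self)
        let type = try container.decode(String.self, forKey: .type)
        switch type {
        case "session.output_audio.delta":
            self = .outputAudioDelta(try container.decode(String.self, forKey: .delta))
        case "session.output_transcript.delta":
            self = .outputTranscriptDelta(try container.decode(String.self, forKey: .delta))
        case "session.input_transcript.delta":
            self = .inputTranscriptDelta(try container.decode(String.self, forKey: .delta))
        case "session.closed":
            self = .closed
        default:
            self = .unknown(type)
        }
    }
}

/// Decodes one websocket text frame into an event.
func decodeTranslationEvent(_ json: String) throws -> TranslationServerEvent {
    try JSONDecoder().decode(TranslationServerEvent.self, from: Data(json.utf8))
}
```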
2. A delay knob and transcription connect path for gpt-realtime-whisper
This is OpenAI's recommended low-latency streaming transcription model. Most of the plumbing is already there (see "what already works"), so two gaps remain.
Gap A is the missing delay. InputAudioTranscription exposes model, language, and prompt, but not delay (#L120-L129). gpt-realtime-whisper tunes its latency and accuracy through audio.input.transcription.delay (minimal, low, medium, high, xhigh), and right now there's no way to set it (transcription guide, sessions reference).
    public struct InputAudioTranscription {
        public var language: String?
        public var model: String?
        public var prompt: String?
        public var delay: Delay? // new, encodeIfPresent at audio.input.transcription.delay

        public enum Delay: String, Encodable, Sendable {
            case minimal, low, medium, high, xhigh
        }
    }
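Here's a standalone mirror of that shape showing the encodeIfPresent behavior I have in mind: delay appears in the JSON only when it's set, so existing callers would send the same payload as before. This is a sketch, not the SDK's actual type, and the jsonString helper is hypothetical.

```swift
import Foundation

// Standalone mirror of the proposed shape; not the SDK's real InputAudioTranscription.
struct InputAudioTranscription: Encodable {
    enum Delay: String, Encodable, Sendable {
        case minimal, low, medium, high, xhigh
    }

    var language: String?
    var model: String?
    var prompt: String?
    var delay: Delay?

    private enum CodingKeys: String, CodingKey {
        case language, model, prompt, delay
    }

    // Nil fields are omitted entirely rather than encoded as null.
    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        try container.encodeIfPresent(language, forKey: .language)
        try container.encodeIfPresent(model, forKey: .model)
        try container.encodeIfPresent(prompt, forKey: .prompt)
        try container.encodeIfPresent(delay, forKey: .delay)
    }
}

/// Encodes with sorted keys so the output is stable.
func jsonString<T: Encodable>(_ value: T) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return String(decoding: try encoder.encode(value), as: UTF8.self)
}

let withDelay = InputAudioTranscription(model: "gpt-realtime-whisper", delay: .low)
let withoutDelay = InputAudioTranscription(model: "gpt-realtime-whisper")
```

In a real session.update this object would sit at session.audio.input.transcription.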
Gap B is the transcription-only connect path. A pure transcription session is session.type = "transcription", but realtimeSession(model:) always appends ?model=…, whereas OpenAI's transcription sessions connect with ?intent=transcription. Could you confirm (or expose) the intended connect path for a transcription-only session through AIProxy? Per the docs, turn_detection has to be null and prompt isn't supported for gpt-realtime-whisper.
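To show the difference I mean, here's a tiny sketch of the two query strings. The enum and function are hypothetical, and how AIProxy would actually route a transcription-only connection is exactly the part I'm asking about.

```swift
import Foundation

// Hypothetical connect modes for illustration; not an AIProxySwift API.
enum RealtimeConnectMode {
    case conversation(model: String)  // what realtimeSession(model:) does today
    case transcription                // transcription-only session
}

/// Appends the query string for the given mode to a realtime base URL.
func realtimeURL(base: URL, mode: RealtimeConnectMode) -> URL? {
    guard var components = URLComponents(url: base, resolvingAgainstBaseURL: false) else {
        return nil
    }
    switch mode {
    case .conversation(let model):
        components.queryItems = [URLQueryItem(name: "model", value: model)]
    case .transcription:
        components.queryItems = [URLQueryItem(name: "intent", value: "transcription")]
    }
    return components.url
}

let realtimeBase = URL(string: "wss://api.openai.com/v1/realtime")!
```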
3. Optional reasoning-effort knob for gpt-realtime-2
gpt-realtime-2's headline feature is configurable reasoning effort, but OpenAIRealtimeSessionConfiguration has no field for it, so callers get the default. This is low priority, only worth it if and when the realtime session schema for it settles (model).
Small README nit (stale realtime example)
The realtime snippet in README.md still uses whisper-1 and the pre-GA modalities: arg (the one the SDK now warns about and remaps to outputModalities), around README.md:1385 and :1388. The file-transcription example already uses gpt-4o-mini-transcribe, so it's just the realtime snippet that's out of step. A one-line refresh would stop folks copy-pasting the legacy model.
Happy to help
I've got a working gpt-realtime-2 + AIProxy app on real hardware, so I can test any of this on-device (translation especially) and send you traces and logs. Glad to validate a branch whenever it's useful; just tell me what would help.
Thanks again,
Ray Fernando