feat(bidi/openai): support image input via BidiImageInputEvent #2327
cagataycali wants to merge 1 commit into
OpenAI's Realtime API supports image inputs on gpt-realtime models via input_image content blocks on conversation.item.create. The Strands BidiOpenAIRealtimeModel previously raised ValueError for BidiImageInputEvent; only Gemini Live handled images. This change adds parity with Gemini Live by:
- Dispatching BidiImageInputEvent in send() to a new _send_image_content() method.
- Encoding the image as a data: URL (data:<mime>;base64,<b64>) and emitting a single conversation.item.create event with role=user and an input_image content block.
- Not auto-triggering response.create; the caller commits when ready (matching audio/buffered semantics rather than text/auto-reply).

Updates the existing test_send_edge_cases case that previously asserted ValueError to instead verify the wire format of the conversation.item.create event and confirm no implicit response.create is emitted.
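The encoding step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `build_image_item_event` is a hypothetical helper name, and it assumes the caller holds raw image bytes (the real event may already carry a base64 string).

```python
import base64
import json


def build_image_item_event(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build one conversation.item.create event with a single input_image block.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # Base64-encode the raw bytes and wrap them in a data: URL.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": f"data:{mime_type};base64,{b64}"}
            ],
        },
    }


# Three bytes of a JPEG header stand in for a real camera frame.
event = build_image_item_event(b"\xff\xd8\xff", "image/jpeg")
print(json.dumps(event, indent=2))
```

Note that nothing here emits response.create; under the semantics described above, that decision stays with the caller.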
/strands review
Codecov Report: ✅ All modified and coverable lines are covered by tests.
image_calls = [json.loads(call[0][0]) for call in mock_ws.send.call_args_list]
image_creates = [m for m in image_calls if m.get("type") == "conversation.item.create"]
assert len(image_creates) == 1, "expected exactly one conversation.item.create for image"
image_item = image_creates[0].get("item", {})
Issue: Per-field assertions on image_item could miss unexpected or regressed fields. Since the entire dict is deterministic, a single equality check would be more robust.
Suggestion:
assert image_item == {
    "type": "message",
    "role": "user",
    "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
}
This catches any unexpected fields that might be added in the future and is more concise.
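A self-contained sketch of the whole-dict assertion might look like the following. It is only an approximation of the real test: `send_image` is a hypothetical stand-in for the model's image path, and a synchronous `MagicMock` is assumed in place of whatever websocket mock the test suite actually uses.

```python
import json
from unittest.mock import MagicMock


def send_image(ws, image_b64: str, mime_type: str = "image/jpeg") -> None:
    """Hypothetical stand-in for the model's image send path (not the SDK code)."""
    ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": f"data:{mime_type};base64,{image_b64}"}
            ],
        },
    }))


mock_ws = MagicMock()
image_b64 = "/9j/"  # placeholder payload, not a complete image
send_image(mock_ws, image_b64)

# Decode every message that was written to the mocked websocket.
sent = [json.loads(call[0][0]) for call in mock_ws.send.call_args_list]
image_creates = [m for m in sent if m.get("type") == "conversation.item.create"]

assert len(image_creates) == 1, "expected exactly one conversation.item.create for image"
# A single equality check fails on any missing, changed, or extra field.
assert image_creates[0]["item"] == {
    "type": "message",
    "role": "user",
    "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
}
# No implicit response.create should have been sent.
assert not any(m.get("type") == "response.create" for m in sent)
```

The trade-off is that a full equality check also fails when a field is intentionally added later, which is usually the desired signal for a wire-format test.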
Assessment: Approve
Clean, well-scoped addition that brings OpenAI Realtime to parity with Gemini Live for image input. The implementation follows existing patterns exactly ( Minor suggestion
Nice contribution: fills a real gap with minimal, focused code. 👍
Summary
Adds image-input support to BidiOpenAIRealtimeModel, bringing it to parity with BidiGeminiLiveModel, which already handles BidiImageInputEvent.
Previously, sending a BidiImageInputEvent to the OpenAI realtime model raised ValueError: content not supported. OpenAI's Realtime API on gpt-realtime models does accept image inputs as input_image content blocks on conversation.item.create; this PR wires that up.
Motivation
We're building voice agents where the user says "look at me" / "what's on my whiteboard?" and we want to inject a captured camera frame straight into the realtime model's multimodal context, then let it reply in audio.
Gemini Live works out of the box. For OpenAI we had to monkeypatch BidiOpenAIRealtimeModel.send() from outside the SDK; this PR upstreams the fix.
Changes
src/strands/experimental/bidi/models/openai_realtime.py
- Dispatches BidiImageInputEvent in send() to a new _send_image_content() method.
- _send_image_content() encodes the image as a data: URL (data:<mime>;base64,<b64>) and emits a single conversation.item.create event with role=user and an input_image content block.
- Does not auto-trigger response.create; the caller decides when to commit (this mirrors the audio/buffered semantics rather than text/auto-reply, which matches typical streaming-vision use cases where you send N frames and then a follow-up text/audio prompt).
tests/strands/experimental/bidi/models/test_openai_realtime.py
- The existing case asserting that BidiImageInputEvent raises ValueError is replaced with one that verifies the new wire format and that no implicit response.create is sent.
Wire format
{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      { "type": "input_image", "image_url": "data:image/jpeg;base64,<...>" }
    ]
  }
}
This matches OpenAI's documented format for image inputs on gpt-realtime.
Tests
The broader tests/strands/experimental/bidi/ suite passes except for unrelated pre-existing failures from optional deps (aws_sdk_bedrock_runtime for Nova Sonic, pyaudio for audio I/O); none are touched by this change.
Tenets alignment
- agent.send(BidiImageInputEvent(...)) Just Works on OpenAI now.
- input_image block + data: URL.
Checklist
- _send_audio_content/_send_text_content).
- hatch fmt --formatter) clean.
- hatch fmt --linter) clean.
- ValueError.
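The streaming-vision pattern mentioned under Changes (send N frames, then a follow-up prompt, then commit) could look roughly like this at the wire level. This is a sketch under assumptions: `image_item`, `text_item`, and `frames_then_prompt` are hypothetical helpers, and the explicit response.create at the end reflects raw Realtime events rather than how the Strands text path triggers replies.

```python
import json


def image_item(image_b64: str, mime_type: str = "image/jpeg") -> dict:
    """Hypothetical helper: one user message holding one input_image block."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": f"data:{mime_type};base64,{image_b64}"}
            ],
        },
    }


def text_item(text: str) -> dict:
    """Hypothetical helper: one user message holding one input_text block."""
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }


def frames_then_prompt(frames_b64: list, prompt: str) -> list:
    """Queue every frame first, then the prompt, then one explicit response.create."""
    events = [image_item(frame) for frame in frames_b64]
    events.append(text_item(prompt))
    # Assumption: the caller commits explicitly once all context is queued.
    events.append({"type": "response.create"})
    return events


events = frames_then_prompt(["AAAA", "BBBB"], "What's on my whiteboard?")
for e in events:
    print(json.dumps(e))
```

Because images do not trigger a reply on their own, the model sees every frame before it is asked to respond.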