feat(bidi/openai): support image input via BidiImageInputEvent by cagataycali · Pull Request #2327 · strands-agents/sdk-python

cagataycali · 2026-05-26T00:03:13Z

Summary

Adds image-input support to BidiOpenAIRealtimeModel, bringing it to parity with BidiGeminiLiveModel, which already handles BidiImageInputEvent.

Previously, sending a BidiImageInputEvent to the OpenAI realtime model raised ValueError: content not supported. OpenAI's Realtime API does accept image inputs on gpt-realtime models, as input_image content blocks on conversation.item.create; this PR wires that up.

Motivation

We're building voice agents where the user says "look at me" / "what's on my whiteboard?" and we want to inject a captured camera frame straight into the realtime model's multimodal context, then let it reply in audio.

Gemini Live works out of the box. For OpenAI we had to monkeypatch BidiOpenAIRealtimeModel.send() from outside the SDK; this PR upstreams the fix.

Changes

src/strands/experimental/bidi/models/openai_realtime.py

Import BidiImageInputEvent.
Dispatch it in send() to a new _send_image_content() method.
_send_image_content() encodes the image as a data: URL (data:<mime>;base64,<b64>) and emits a single conversation.item.create event with role=user and an input_image content block.
Does not auto-trigger response.create; the caller decides when to commit. This mirrors the audio/buffered semantics rather than text/auto-reply, and fits typical streaming-vision use cases where you send N frames and then a follow-up text or audio prompt.
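
The dispatch and commit behavior listed above can be sketched roughly as follows. This is a standalone illustration, not the SDK's code: ImageInput, TextInput, and SketchRealtimeModel are hypothetical stand-ins, and the assumption that the image event carries an already base64-encoded payload plus a MIME type is ours. The sketch records outgoing messages in a list instead of writing to a WebSocket.

```python
import asyncio
import json
from dataclasses import dataclass, field


@dataclass
class ImageInput:
    """Hypothetical stand-in for BidiImageInputEvent (field names are assumed)."""
    image_b64: str
    mime_type: str


@dataclass
class TextInput:
    """Hypothetical stand-in for a text input event."""
    text: str


@dataclass
class SketchRealtimeModel:
    """Records outgoing events instead of sending them over a real connection."""
    sent: list = field(default_factory=list)

    async def _emit(self, event: dict) -> None:
        self.sent.append(json.dumps(event))

    async def send(self, content) -> None:
        # Dispatch on event type; unknown content is rejected.
        if isinstance(content, ImageInput):
            await self._send_image_content(content)
        elif isinstance(content, TextInput):
            await self._send_text_content(content)
        else:
            raise ValueError("content not supported")

    async def _send_image_content(self, image: ImageInput) -> None:
        data_url = f"data:{image.mime_type};base64,{image.image_b64}"
        await self._emit({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_image", "image_url": data_url}],
            },
        })
        # Deliberately no response.create here: the caller decides when to commit.

    async def _send_text_content(self, text: TextInput) -> None:
        await self._emit({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": text.text}],
            },
        })
        # Text follows auto-reply semantics in this sketch.
        await self._emit({"type": "response.create"})


async def demo() -> list:
    model = SketchRealtimeModel()
    # Send two frames, then a follow-up prompt that triggers the reply.
    await model.send(ImageInput("AAAA", "image/jpeg"))
    await model.send(ImageInput("BBBB", "image/jpeg"))
    await model.send(TextInput("What changed between these frames?"))
    return [json.loads(m)["type"] for m in model.sent]


event_types = asyncio.run(demo())
```

In this sketch the two frames each produce one conversation.item.create, and the only response.create comes from the follow-up text prompt.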

tests/strands/experimental/bidi/models/test_openai_realtime.py

The existing case that asserted BidiImageInputEvent raises ValueError is replaced with one that verifies the new wire format and that no implicit response.create is sent.

Wire format

{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_image",
        "image_url": "data:image/jpeg;base64,<...>"
      }
    ]
  }
}

This matches OpenAI's documented format for image inputs on gpt-realtime.
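
As a rough illustration of how a captured frame could be turned into the payload above, the following sketch base64-encodes raw bytes into a data: URL and serializes the event. The helper name image_item_event and its parameters are hypothetical and not part of the SDK; the three bytes used are just the leading bytes of a JPEG file, not a complete image.

```python
import base64
import json


def image_item_event(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Serialize a conversation.item.create event carrying one input_image block."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {"type": "input_image", "image_url": f"data:{mime_type};base64,{b64}"}
            ],
        },
    }
    return json.dumps(event)


# Placeholder payload: the first three bytes of a JPEG file.
payload = image_item_event(b"\xff\xd8\xff")
decoded = json.loads(payload)
image_url = decoded["item"]["content"][0]["image_url"]
```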

Tests

hatch fmt --formatter
hatch fmt --linter           # All checks passed!
hatch test tests/strands/experimental/bidi/models/test_openai_realtime.py
# 27 passed in 5.21s

The broader tests/strands/experimental/bidi/ suite passes except for unrelated pre-existing failures from optional deps (aws_sdk_bedrock_runtime for Nova Sonic, pyaudio for audio I/O) — none touched by this change.

Tenets alignment

Composability: matches Gemini Live's behavior, same event type, same shape.
The obvious path is the happy path: agent.send(BidiImageInputEvent(...)) Just Works on OpenAI now.
Embrace common standards: uses OpenAI's own documented input_image block + data: URL.

Checklist

Code follows existing patterns (mirrors _send_audio_content / _send_text_content).
Formatter (hatch fmt --formatter) clean.
Linter + mypy (hatch fmt --linter) clean.
Unit tests updated and passing.
No public API changes other than removing the previous ValueError.

OpenAI's Realtime API supports image inputs on gpt-realtime models via input_image content blocks on conversation.item.create. The Strands BidiOpenAIRealtimeModel previously raised ValueError for BidiImageInputEvent; only Gemini Live handled images. This change adds parity with Gemini Live by:

- Dispatching BidiImageInputEvent in send() to a new _send_image_content() method.
- Encoding the image as a data: URL (data:<mime>;base64,<b64>) and emitting a single conversation.item.create event with role=user and an input_image content block.
- Not auto-triggering response.create; the caller commits when ready (matching audio/buffered semantics rather than text/auto-reply).

Updates the existing test_send_edge_cases case that previously asserted ValueError to instead verify the wire format of the conversation.item.create event and confirm no implicit response.create is emitted.

mkmeral · 2026-05-26T01:20:58Z

/strands review

codecov · 2026-05-26T01:21:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-05-26T01:23:50Z

+    image_calls = [json.loads(call[0][0]) for call in mock_ws.send.call_args_list]
+    image_creates = [m for m in image_calls if m.get("type") == "conversation.item.create"]
+    assert len(image_creates) == 1, "expected exactly one conversation.item.create for image"
+    image_item = image_creates[0].get("item", {})


Issue: Per-field assertions on image_item could miss unexpected or regressed fields. Since the entire dict is deterministic, a single equality check would be more robust.

Suggestion:

assert image_item == {
    "type": "message",
    "role": "user",
    "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
}

This catches any unexpected fields that might be added in the future and is more concise.
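
To make the difference concrete, here is a small sketch comparing the two assertion styles against an item that has gained an extra field. The status field and the per_field_ok helper are hypothetical and exist only to simulate a regression; image_b64 is a placeholder value.

```python
image_b64 = "AAAA"  # placeholder base64 payload

expected = {
    "type": "message",
    "role": "user",
    "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
}

# Simulate a regression that adds an unexpected field to the item.
regressed = dict(expected, status="completed")


def per_field_ok(item: dict) -> bool:
    """Per-field checks only look at the keys they name."""
    return (
        item.get("type") == "message"
        and item.get("role") == "user"
        and item.get("content") == expected["content"]
    )


per_field_passes = per_field_ok(regressed)  # extra field goes unnoticed
equality_passes = regressed == expected     # extra field breaks equality
```

The per-field checks still pass on the regressed item, while the whole-dict comparison fails, which is the robustness gain the suggestion is after.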

github-actions · 2026-05-26T01:23:51Z

Assessment: Approve

Clean, well-scoped addition that brings OpenAI Realtime to parity with Gemini Live for image input. The implementation follows existing patterns exactly (_send_audio_content / _send_text_content) and the behavioral choice of not auto-triggering response.create is consistent with Gemini and well-motivated for streaming vision use cases.

Minor suggestion

Testing: One inline comment suggesting collapsing per-field assertions into a single dict equality check for robustness against unexpected fields.

Nice contribution — fills a real gap with minimal, focused code. 👍

github-actions Bot added the size/s label May 26, 2026

cagataycali requested a deployment to manual-approval May 26, 2026 00:03 — with GitHub Actions Waiting

github-actions Bot added the strands-running label May 26, 2026

github-actions Bot reviewed May 26, 2026

github-actions Bot removed the strands-running label May 26, 2026

mkmeral approved these changes May 26, 2026

mkmeral enabled auto-merge (squash) May 26, 2026 01:28

cagataycali wants to merge 1 commit into strands-agents:main from cagataycali:feat/openai-realtime-image-input
