
feat(bidi/openai): support image input via BidiImageInputEvent #2327

Open
cagataycali wants to merge 1 commit into strands-agents:main from cagataycali:feat/openai-realtime-image-input


@cagataycali
Contributor

@cagataycali cagataycali commented May 26, 2026

Summary

Adds image-input support to BidiOpenAIRealtimeModel, bringing it to parity with BidiGeminiLiveModel which already handles BidiImageInputEvent.

Previously, sending a BidiImageInputEvent to the OpenAI realtime model raised ValueError: content not supported. OpenAI's Realtime API on gpt-realtime models does accept image inputs as input_image content blocks on conversation.item.create, and this PR wires that up.

Motivation

We're building voice agents where the user says "look at me" / "what's on my whiteboard?" and we want to inject a captured camera frame straight into the realtime model's multimodal context, then let it reply in audio.

Gemini Live works out of the box. For OpenAI we had to monkeypatch BidiOpenAIRealtimeModel.send() from outside the SDK; this PR upstreams the fix.

Changes

src/strands/experimental/bidi/models/openai_realtime.py

  • Import BidiImageInputEvent.
  • Dispatch it in send() to a new _send_image_content() method.
  • _send_image_content() encodes the image as a data: URL (data:<mime>;base64,<b64>) and emits a single conversation.item.create event with role=user and an input_image content block.
  • Does not auto-trigger response.create; the caller decides when to request a response. This mirrors the audio/buffered semantics rather than text/auto-reply, and suits typical streaming-vision use cases where you send N frames and then a follow-up text or audio prompt.
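
For illustration, the buffered semantics in the last bullet could look roughly like this at the wire level. This is a minimal sketch, not the SDK's code: build_turn is a hypothetical helper, and it assumes the frames are already encoded as data: URLs and that the follow-up prompt is sent as an input_text block followed by an explicit response.create.

```python
def build_turn(frame_data_urls: list[str], prompt: str) -> list[dict]:
    """Hypothetical helper: N image items, one text item, then one response.create."""
    events: list[dict] = []

    # One conversation.item.create per frame; none of these asks for a reply.
    for url in frame_data_urls:
        events.append(
            {
                "type": "conversation.item.create",
                "item": {
                    "type": "message",
                    "role": "user",
                    "content": [{"type": "input_image", "image_url": url}],
                },
            }
        )

    # The follow-up prompt that refers to the frames sent above.
    events.append(
        {
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": prompt}],
            },
        }
    )

    # Only now does the caller ask the model to respond.
    events.append({"type": "response.create"})
    return events
```

The point of the design is that the response request appears exactly once, after all frames and the prompt, so the model sees the whole context before it starts speaking.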

tests/strands/experimental/bidi/models/test_openai_realtime.py

  • The existing case that asserted BidiImageInputEvent raises ValueError is replaced with one that verifies the new wire format and that no implicit response.create is sent.

Wire format

{
  "type": "conversation.item.create",
  "item": {
    "type": "message",
    "role": "user",
    "content": [
      {
        "type": "input_image",
        "image_url": "data:image/jpeg;base64,<...>"
      }
    ]
  }
}

This matches OpenAI's documented format for image inputs on gpt-realtime.
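
As a rough sketch of how such an event can be assembled from raw image bytes: build_image_item_event below is a hypothetical standalone helper written for this description, not the actual _send_image_content() method, and the default MIME type is an assumption.

```python
import base64


def build_image_item_event(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Hypothetical helper: wrap raw image bytes in a conversation.item.create event."""
    # Base64-encode the bytes and embed them in a data: URL (data:<mime>;base64,<b64>).
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": f"data:{mime_type};base64,{image_b64}",
                }
            ],
        },
    }
```

Serializing the returned dict with json.dumps would give a payload of the shape shown above.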

Tests

hatch fmt --formatter
hatch fmt --linter           # All checks passed!
hatch test tests/strands/experimental/bidi/models/test_openai_realtime.py
# 27 passed in 5.21s

The broader tests/strands/experimental/bidi/ suite passes except for unrelated pre-existing failures from optional deps (aws_sdk_bedrock_runtime for Nova Sonic, pyaudio for audio I/O) — none touched by this change.

Tenets alignment

  • Composability: matches Gemini Live's behavior, same event type, same shape.
  • The obvious path is the happy path: agent.send(BidiImageInputEvent(...)) Just Works on OpenAI now.
  • Embrace common standards: uses OpenAI's own documented input_image block + data: URL.

Checklist

  • Code follows existing patterns (mirrors _send_audio_content / _send_text_content).
  • Formatter (hatch fmt --formatter) clean.
  • Linter + mypy (hatch fmt --linter) clean.
  • Unit tests updated and passing.
  • No public API changes other than removing the previous ValueError.

OpenAI's Realtime API supports image inputs on gpt-realtime models via
input_image content blocks on conversation.item.create. The Strands
BidiOpenAIRealtimeModel previously raised ValueError for
BidiImageInputEvent — only Gemini Live handled images.

This change adds parity with Gemini Live by:
- Dispatching BidiImageInputEvent in send() to a new
  _send_image_content() method.
- Encoding the image as a data: URL (data:<mime>;base64,<b64>) and
  emitting a single conversation.item.create event with role=user and
  an input_image content block.
- Not auto-triggering response.create — the caller commits when ready
  (matching audio/buffered semantics rather than text/auto-reply).

Updates the existing test_send_edge_cases case that previously asserted
ValueError to instead verify the wire format of the conversation.item.create
event and confirm no implicit response.create is emitted.
@mkmeral
Contributor

mkmeral commented May 26, 2026

/strands review

@codecov

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


image_calls = [json.loads(call[0][0]) for call in mock_ws.send.call_args_list]
image_creates = [m for m in image_calls if m.get("type") == "conversation.item.create"]
assert len(image_creates) == 1, "expected exactly one conversation.item.create for image"
image_item = image_creates[0].get("item", {})

Issue: Per-field assertions on image_item could miss unexpected or regressed fields. Since the entire dict is deterministic, a single equality check would be more robust.

Suggestion:

assert image_item == {
    "type": "message",
    "role": "user",
    "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
}

This catches any unexpected fields that might be added in the future and is more concise.
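
Put together, a self-contained version of such a test might look like the sketch below. It does not use the real model class: send_image is a hypothetical stand-in for the image send path, and the websocket is an AsyncMock, so the names and setup are assumptions rather than the PR's test code.

```python
import asyncio
import base64
import json
from unittest.mock import AsyncMock


async def send_image(ws, image_b64: str, mime_type: str) -> None:
    """Hypothetical stand-in for the model's image send path."""
    event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_image", "image_url": f"data:{mime_type};base64,{image_b64}"}],
        },
    }
    await ws.send(json.dumps(event))


def check_image_wire_format() -> list[dict]:
    mock_ws = AsyncMock()
    image_b64 = base64.b64encode(b"fake-jpeg-bytes").decode("ascii")
    asyncio.run(send_image(mock_ws, image_b64, "image/jpeg"))

    # Decode every payload that was sent over the mocked websocket.
    sent = [json.loads(call[0][0]) for call in mock_ws.send.call_args_list]
    creates = [m for m in sent if m.get("type") == "conversation.item.create"]
    assert len(creates) == 1, "expected exactly one conversation.item.create for image"

    # Whole-dict equality: any extra or regressed field fails the test.
    assert creates[0].get("item", {}) == {
        "type": "message",
        "role": "user",
        "content": [{"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_b64}"}],
    }

    # No implicit response.create should accompany the image.
    assert not any(m.get("type") == "response.create" for m in sent)
    return sent
```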

@github-actions

Assessment: Approve

Clean, well-scoped addition that brings OpenAI Realtime to parity with Gemini Live for image input. The implementation follows existing patterns exactly (_send_audio_content / _send_text_content) and the behavioral choice of not auto-triggering response.create is consistent with Gemini and well-motivated for streaming vision use cases.

Minor suggestion
  • Testing: One inline comment suggesting collapsing per-field assertions into a single dict equality check for robustness against unexpected fields.

Nice contribution — fills a real gap with minimal, focused code. 👍

@mkmeral mkmeral enabled auto-merge (squash) May 26, 2026 01:28