fix(browser_planner): gate multimodal content on per-model vision support#251
fix(browser_planner): gate multimodal content on per-model vision support#251haroldship wants to merge 4 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughGates multimodal prompts using a per-model capability check; BrowserPlannerAgent receives an effective vision flag, conditionally attaches images, detects vision-rejection responses, disables vision, and retries text-only. Tests cover model- and global-level gating and prompt formatting. ChangesPer-model vision capability detection and gating
Sequence Diagram(s)sequenceDiagram
participant Factory as create()
participant Settings as settings
participant ModelCheck as model_supports_vision()
participant Agent as BrowserPlannerAgent
participant LLM as chain.ainvoke
Factory->>Settings: read advanced_features.use_vision
Factory->>ModelCheck: check dyna_model capability
ModelCheck->>Factory: return enable_vision?
Factory->>Agent: __init__(use_vision_effective=...)
Agent->>Agent: store use_vision_effective
Agent->>LLM: chain.ainvoke(data with img if use_vision_effective)
LLM-->>Agent: returns or raises vision rejection
Agent->>Agent: on rejection set use_vision_effective=False, remove img
Agent->>LLM: chain.ainvoke(text-only data)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cuga/sdk_core/tests/test_sdk_token_usage_tracker.py`:
- Around line 86-118: The test currently creates a local ActivityTracker and
compares it with the tracker inside TokenUsageTracker, but the SDK should expose
a shared tracker on the agent; update the test to use the agent's shared tracker
instead of creating a new one. Specifically, remove the local `tracker =
ActivityTracker()` and instead obtain the tracker from the agent (e.g., `tracker
= agent._activity_tracker` or after building callbacks assert
`token_tracker.tracker is agent._activity_tracker`), reset `tracker.prompts =
[]`, then proceed to call `tracker.collect_prompt(...)` and assert prompts; also
replace the identity assertion `token_tracker.tracker is tracker` with
`token_tracker.tracker is agent._activity_tracker` (or equivalent) to match the
fixed SDK behavior.
In `@src/cuga/sdk.py`:
- Around line 1346-1366: The ActivityTracker is recreated each time in
_build_callbacks(), preventing accumulation across calls; initialize
self._activity_tracker once in __init__ (create ActivityTracker() there) and
then change _build_callbacks() to use that instance (e.g.,
TokenUsageTracker(self._activity_tracker)) so all invocations from
_apply_callbacks(), invoke(), stream() and graph creation share the same
tracker; if needed, lazily create self._activity_tracker inside
_build_callbacks() only when it is None to preserve backward compatibility.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aeaf6429-3498-49e7-814c-ae256d66bb68
📒 Files selected for processing (5)
src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.pysrc/cuga/backend/llm/utils/helpers.pysrc/cuga/sdk.pysrc/cuga/sdk_core/tests/test_sdk_token_usage_tracker.pytests/unit/test_load_prompt_with_image_vision_gate.py
…n support When the global `advanced_features.use_vision = true` setting is on, `load_prompt_with_image` would unconditionally wrap the browser planner's human message as a list-typed multimodal `[image_url, text]` block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with `400 messages[1].content must be a string`, killing hybrid-mode web-agent runs before the first step (issue #214). Add a per-model `enable_vision` flag (resolved by `model_supports_vision`, defaulting to `True` to preserve existing behavior for vision-capable models) and gate both the prompt template structure and the runtime image attachment on `use_vision AND model_supports_vision`. Operators whose LLM endpoint rejects multimodal content can set `enable_vision = false` on the relevant `agent.*.model` config to force string-only user content. Closes #214
819d5b5 to
9b7bad8
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
| """ | ||
| if model_config is None: | ||
| return True | ||
| return bool(getattr(model_config, 'enable_vision', True)) |
There was a problem hiding this comment.
so this checks if the enable_vision is given from model config? not like based on model e.g. claude 4.6 supports vision why do we need to add flag
There was a problem hiding this comment.
Good question — the short answer is the bug isn't actually about model capability; it's about endpoint capability.
The 400 in #214 came from a setup where the model itself does support vision (Llama 4 Maverick served via WatsonX, which is multimodal-capable), but the openai-compatible gateway in front of it doesn't accept multimodal content blocks — it rejects messages[1].content if it's a list. So a name-based allowlist ("claude-4.6 → multimodal OK") would have given the wrong answer here: it would have said yes, sent the list-typed payload, and produced the same 400.
The reason for the model name failing as a signal:
- Same model, multiple endpoints — e.g. WatsonX-via-openai-compatible vs WatsonX native. One accepts multimodal, the other doesn't.
- Provider names drift fast (litellm prefixes, openrouter aliases, custom
model_namestrings). A registry of "these names support vision" would need constant upkeep and would still miss the gateway dimension. - Silently downgrading vision when we think the model doesn't support it would mask configuration mistakes — e.g. someone pointing at the wrong endpoint and not noticing because we quietly stripped images.
The enable_vision flag defaults to True, so anyone whose endpoint already works keeps working with no change. The only people who need to set it are those with a non-multimodal gateway — and for them, an explicit opt-out is more discoverable than "why did my screenshots stop being sent?". Open to a follow-up that adds a name-based default if that's useful for the common case, but I'd want it to be additive (a hint, not authoritative) so we can still override per endpoint.
There was a problem hiding this comment.
so can we try catch error when image doesn't work and simply then try without
There was a problem hiding this comment.
Good idea — implemented in 24ef14f. BrowserPlannerAgent.run() now attempts the multimodal call first; if it raises with a vision-rejection signature (content must be a string, image_url, vision, etc.) and we actually attached an image, it logs a warning, flips self.use_vision_effective = False so the next turn doesn't repeat the rejection, and retries the current turn text-only (also flipping the use_vision template var so the prompt's image block stays in sync). Unrelated exceptions re-raise unchanged. The static enable_vision flag stays as a fast opt-out for configs where you already know the endpoint is text-only — but it's no longer required for correctness.
…ent-multimodal-content
|
Merged latest
|
|
Local stability test ( |
…al payload Per review feedback, the per-model ``enable_vision`` flag isn't always known ahead of time (especially when the same model is fronted by different openai-compatible gateways). Catch vision-rejection errors at call time, log a warning, disable vision on this agent for subsequent calls, and retry the current invocation with a text-only payload. Detection is a string-based heuristic against the exception message (``content must be a string``, ``image_url``, ``vision``, etc.); unrelated exceptions are re-raised unchanged.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py (1)
24-31: Consider monitoring vision-rejection patterns in production logs.The marker set is intentionally heuristic but may not cover all provider-specific rejection messages. Monitor warnings from line 88-91 in production to identify new patterns and expand this tuple as needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py` around lines 24 - 31, The tuple _VISION_REJECTION_MARKERS is a heuristic list of vision-rejection phrases and should be monitored and expanded over time; update the code around the warning emission that references _VISION_REJECTION_MARKERS (the warning block that currently logs when a vision rejection is detected) to emit a structured production metric/event (and include the raw provider message) whenever a match occurs so you can aggregate and inspect new rejection patterns in logs/monitoring; use that telemetry to periodically extend _VISION_REJECTION_MARKERS with newly observed phrases.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py`:
- Around line 88-91: The warning in browser_planner_agent.py currently logs only
the exception type in the logger.warning call in BrowserPlannerAgent (the
logger.warning call around the vision rejection handling); update that log to
include the exception message as well (e.g., include str(exc) or format as
f"{type(exc).__name__}: {exc}") so the actual rejection reason/API error is
captured alongside the type for diagnosing entries added to
_VISION_REJECTION_MARKERS. Ensure you modify the logger.warning invocation that
currently uses type(exc).__name__ to include the exception message in the same
structured/logged message.
- Around line 34-37: Change the function signature for
_looks_like_vision_rejection to accept Exception instead of BaseException to
avoid catching system-level exceptions; update the parameter type in the def
line and any internal references (msg = str(exc).lower()) remains the same, and
ensure uses of _VISION_REJECTION_MARKERS are unchanged so the heuristic still
checks markers in the exception message.
- Around line 75-80: ActivityTracker is a singleton with a shared mutable list
images that's read in browser_planner_agent (the tracker variable) without
synchronization, risking race conditions between len(tracker.images) and
tracker.images[-1]; fix by adding thread-safety inside ActivityTracker:
introduce a threading.Lock (or RLock) and wrap all mutations (e.g.,
collect_image) and reads with that lock, and expose a safe accessor like
ActivityTracker.get_last_image() or ActivityTracker.snapshot_images() that
returns a stable copy or the last image under the lock; then update
browser_planner_agent to call that accessor (e.g., img =
tracker.get_last_image()) and only attach data['img'] if the accessor returns a
non-None value.
---
Nitpick comments:
In
`@src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py`:
- Around line 24-31: The tuple _VISION_REJECTION_MARKERS is a heuristic list of
vision-rejection phrases and should be monitored and expanded over time; update
the code around the warning emission that references _VISION_REJECTION_MARKERS
(the warning block that currently logs when a vision rejection is detected) to
emit a structured production metric/event (and include the raw provider message)
whenever a match occurs so you can aggregate and inspect new rejection patterns
in logs/monitoring; use that telemetry to periodically extend
_VISION_REJECTION_MARKERS with newly observed phrases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d821a190-00a6-443f-9bbe-d4c1c8100d73
📒 Files selected for processing (1)
src/cuga/backend/cuga_graph/nodes/browser/browser_planner_agent/browser_planner_agent.py
sami-marreed
left a comment
There was a problem hiding this comment.
Thanks for the #214 fix and the rebase.
Scope looks correct (helpers + browser planner + unit tests). The old CodeRabbit threads on sdk.py and test_sdk_token_usage_tracker.py are out of scope for this PR — fine to leave resolved.
Design LGTM: per-model enable_vision plus runtime multimodal→text-only fallback is the right approach for gateways that reject list-typed content. The retry in BrowserPlannerAgent.run() addresses the try/catch concern.
Requesting two small fixes on browser_planner_agent.py (inline). Once those are in, good to merge from my side.
| """ | ||
| if model_config is None: | ||
| return True | ||
| return bool(getattr(model_config, 'enable_vision', True)) |
There was a problem hiding this comment.
Design LGTM. Explicit enable_vision on the model config is the right signal when the failure is the endpoint/gateway rejecting multimodal content, not the model lacking vision. Default True keeps existing setups unchanged.
No change needed in helpers.py — the runtime fallback in BrowserPlannerAgent.run() covers the try/catch / retry-without-image case.
There was a problem hiding this comment.
Thanks — agreed, helpers.py stays as the fast opt-out via enable_vision = false, and the runtime try/catch in BrowserPlannerAgent.run() handles the case where the gateway/endpoint rejects multimodal content despite the model supporting it.
| def _looks_like_vision_rejection(exc: BaseException) -> bool: | ||
| """Heuristic: did the endpoint reject the request because of image content?""" | ||
| msg = str(exc).lower() | ||
| return any(marker in msg for marker in _VISION_REJECTION_MARKERS) |
There was a problem hiding this comment.
Please change _looks_like_vision_rejection(exc: BaseException) to exc: Exception. run() already catches Exception; BaseException is too broad (e.g. KeyboardInterrupt, SystemExit) and shouldn't be part of this path.
There was a problem hiding this comment.
Done in be8a969 — signature is now _looks_like_vision_rejection(exc: Exception).
| logger.warning( | ||
| "Vision request rejected ({}); disabling vision for this agent and retrying text-only.", | ||
| type(exc).__name__, | ||
| ) |
There was a problem hiding this comment.
Please include str(exc) in the vision-rejection warning, not only type(exc).__name__. We need the actual API error text to validate and extend _VISION_REJECTION_MARKERS over time.
There was a problem hiding this comment.
Done in be8a969 — exc is now interpolated into the warning alongside the type name, so the raw API error text is captured for future updates to _VISION_REJECTION_MARKERS.
…ge read Review feedback follow-ups: - ``_looks_like_vision_rejection`` now takes ``exc: Exception`` (was ``BaseException``) — interpreter-control exceptions (``KeyboardInterrupt``, ``SystemExit``) were never meant to flow through this heuristic, and ``run()`` already catches ``Exception``. - Vision-rejection warning now includes ``str(exc)`` alongside the type, so the actual API error text is captured for validating/extending ``_VISION_REJECTION_MARKERS`` over time. - Image attach is now an atomic try/except snapshot on ``tracker.images[-1]`` instead of length-check-then-index, so concurrent ``collect_image`` calls on the shared ``ActivityTracker`` singleton can't cause an ``IndexError`` or stale-index attachment between the two reads.
Bug fix
Fixes #214
Summary
When
advanced_features.use_vision = true(the default),load_prompt_with_imageunconditionally builds the browser planner's user message as a list-typed multimodal[image_url, text]block. LLM endpoints that don't accept multimodal content (e.g. WatsonX served via an openai-compatible gateway) then reject the request with400 messages[1].content must be a string, killing hybrid-mode web-agent runs before the first step.This change introduces a per-model
enable_visionflag (resolved by the newmodel_supports_visionhelper, defaulting toTrueto preserve existing behavior for vision-capable models). Both the prompt template structure (load_prompt_with_image) and the runtime image attachment (BrowserPlannerAgent.run) now gate onsettings.advanced_features.use_vision AND model_supports_vision(model_config). Operators whose LLM endpoint rejects multimodal content can setenable_vision = falseon the relevantagent.*.modelconfig (e.g.agent.planner.model) to force string-only user content.Why a per-model flag
The bug surfaces when the endpoint (not the model) cannot deserialize multimodal content blocks. Detecting endpoint capabilities at runtime is brittle, and silently downgrading vision would mask misconfigurations. An explicit
enable_vision = falseopt-out is small, discoverable, and additive — existing model configs continue to work unchanged because the default isTrue.Testing
Rebased onto
main(69b7e40a) — branch contains a single commit9b7bad82.New regression tests in
tests/unit/test_load_prompt_with_image_vision_gate.py(7 tests):model_supports_visiondefaults:Noneconfig and missing-attribute config both default toTrue; explicitenable_vision=Falseandenable_vision=Trueare honored.use_vision=True(the regression for [Bug]: Web agent in hybrid mode fails with "messages[1].content must be a string" (400 invalid_request_error) #214).image_url+textitems) when both global setting and per-model flag areTrue.Falseregardless of per-model flag.Test results (against this branch, rebased on
main)tests/unit/test_load_prompt_with_image_vision_gate.py(new)src/cuga/backend/tools_env/registry/tests/src/cuga/backend/cuga_graph/nodes/api/variables_manager/tests/src/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/(non-e2b)tests/unit/test_task_analyzer_app_matching.pytests/unit/test_cuga_lite_bind_tools.pysrc/cuga/backend/cuga_graph/nodes/cuga_lite/executors/tests/test_e2b_lite.pyRuntimeError: e2b-code-interpreter package not installed(env-only; CI runs these withuv sync --extra e2bpersrc/scripts/run_tests.sh)src/cuga/sdk_core/tests/(live LLM integration)ruff check(touched files)ruff format --check(touched files)--type committed --base main)Aggregate: 211 passing across the unit-test subset relevant to this change, with the only failures being env/network issues that also fail on
main.Summary by CodeRabbit