
Add Runloop rollout processor integration #454

Draft
tode-rl wants to merge 4 commits into eval-protocol:main from tode-rl:runloop-rollout-processor

tode-rl commented Jun 2, 2026

Summary

  • Add RunloopRolloutProcessor as a first-class pytest rollout processor that provisions or attaches to a Runloop Devbox, exposes a tunnel, starts a remote rollout server, and delegates row processing to RemoteRolloutProcessor.
  • Add the optional runloop extra for runloop-api-client without changing core dependencies.
  • Add unit tests, documentation, .env guidance, and a minimal Runloop remote rollout example.
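
The optional-extra pattern in the second bullet can be sketched as a guarded import. This is a minimal illustration, not the PR's code: `require_optional` is a hypothetical helper, and the assumption that the client imports as `runloop_api_client` is inferred from the package name rather than shown in this thread.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str, package: str = "eval-protocol") -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper: the name and message format are illustrative only.
    """
    # find_spec returns None for a missing top-level module without importing it.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is not installed. "
            f"Install it with: pip install '{package}[{extra}]'"
        )
    return importlib.import_module(module_name)


def load_runloop_client() -> ModuleType:
    # Assumption: the runloop-api-client distribution imports as `runloop_api_client`.
    return require_optional("runloop_api_client", extra="runloop")
```

Deferring the import to first use means core installs never touch the extra, and users who reach the Runloop path without it get an actionable error.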

Notes

  • No changes to default rollout behavior.
  • The processor reuses the existing RemoteRolloutProcessor and Fireworks tracing protocol after Runloop setup.
  • Live Runloop smoke testing is optional and is not required for normal CI.

Verification

  • uv sync --extra runloop --extra dev
  • uv run pytest tests/pytest/test_runloop_rollout_processor.py
  • uv run ruff check eval_protocol/pytest/runloop_rollout_processor.py tests/pytest/test_runloop_rollout_processor.py examples/runloop_remote_rollout/server.py examples/runloop_remote_rollout/test_eval.py eval_protocol/pytest/__init__.py eval_protocol/__init__.py
  • uv run pytest --collect-only examples/runloop_remote_rollout/test_eval.py

Live smoke test not run: FIREWORKS_API_KEY and RUNLOOP_BLUEPRINT_ID are not set in this environment.


Note

Low Risk
New optional integration behind an extra dependency; rollout execution still delegates to RemoteRolloutProcessor without changing default processors.

Overview
Adds Runloop as an optional rollout path: a new RunloopRolloutProcessor provisions or attaches to a Runloop Devbox, tunnels a user-supplied server command, then hands off to the existing RemoteRolloutProcessor so /init and Fireworks tracing behavior stay unchanged.

Install via eval-protocol[runloop] (runloop-api-client). Public exports, .env.example, integration docs, unit tests, and a minimal FastAPI example (optional live smoke when RUNLOOP_BLUEPRINT_ID is set) document the flow. Devboxes created from a blueprint can shut down on cleanup; attached devboxes are left running.
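
The "set up once, then hand off" shape described in this overview might look roughly like the sketch below. Everything here is a stand-in: `DelegatingProcessor`, `prepare`, and `make_inner` are hypothetical names, and the real RunloopRolloutProcessor and RemoteRolloutProcessor interfaces are not reproduced.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

Row = Dict[str, str]
RowHandler = Callable[[List[Row]], Awaitable[List[Row]]]


class DelegatingProcessor:
    """Hypothetical wrapper: prepare a remote environment once, then delegate rows."""

    def __init__(self, prepare: Callable[[], str], make_inner: Callable[[str], RowHandler]):
        self._prepare = prepare        # stand-in for devbox + tunnel + server startup
        self._make_inner = make_inner  # stand-in for building the remote processor
        self._inner: Optional[RowHandler] = None

    def setup(self) -> None:
        # Idempotent: a second call reuses the already-built inner handler.
        if self._inner is not None:
            return
        base_url = self._prepare()
        self._inner = self._make_inner(base_url)

    async def __call__(self, rows: List[Row]) -> List[Row]:
        if self._inner is None:
            raise RuntimeError("setup() must run before processing rows")
        return await self._inner(rows)


def run_demo() -> List[Row]:
    def prepare() -> str:
        return "https://tunnel.example.invalid"  # placeholder URL, not a real tunnel

    def make_inner(base_url: str) -> RowHandler:
        async def handle(rows: List[Row]) -> List[Row]:
            return [{**row, "served_by": base_url} for row in rows]
        return handle

    processor = DelegatingProcessor(prepare, make_inner)
    processor.setup()
    return asyncio.run(processor([{"id": "row-1"}]))
```

Because the wrapper only produces a base URL and then steps aside, the inner processor's request and tracing behavior is untouched, which is the property the overview relies on.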

Reviewed by Cursor Bugbot for commit d0d1ce2. Bugbot is set up for automated code reviews on this repo.

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

poll_interval=self._poll_interval,
timeout_seconds=self._timeout_seconds,
include_payloads=self._include_payloads,
)

Partial setup leaks Devboxes

Medium Severity

setup only skips work when _remote_processor is set, so a failed run can leave an owned Devbox running, and a retry can call create_from_blueprint_id again. @evaluation_test calls setup outside the try block whose finally runs acleanup, so startup errors (e.g. TimeoutError from _wait_for_server_startup) often never trigger cleanup.

response.read(1)
return
except urllib.error.HTTPError:
return

Startup treats HTTP errors as ready

Medium Severity

_wait_for_server_startup returns immediately on any urllib.error.HTTPError, including 502/503 from the Runloop tunnel before the rollout server responds. Setup can finish while the app is still down, so later /init calls fail after the startup wait already succeeded.
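
One way to avoid this failure mode is to classify the status code instead of returning on any HTTPError. The sketch below assumes that a 5xx means the tunnel or server is not up yet and that any other status means an app is answering. `wait_for_server`, `is_ready_status`, and the injectable `opener` and `sleep` parameters are hypothetical and are not the PR's `_wait_for_server_startup`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_ready_status(status: int) -> bool:
    """5xx: gateway or server not up yet. Anything else: an app answered."""
    return not 500 <= status <= 599


def wait_for_server(
    url: str,
    timeout_seconds: float = 60.0,
    poll_interval: float = 1.0,
    opener: Callable = urllib.request.urlopen,   # injectable for tests
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> None:
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            with opener(url, timeout=5) as response:
                response.read(1)
            return  # urlopen only returns normally for non-error responses
        except urllib.error.HTTPError as exc:
            # e.g. 404 or 405 from the app counts as ready; 502/503 does not.
            if is_ready_status(exc.code):
                return
        except OSError:
            pass  # connection refused, DNS failure, socket timeout: keep polling
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server at {url} not ready after {timeout_seconds}s")
        sleep(poll_interval)
```

HTTPError must be caught before OSError because it is a subclass of it; reversing the order would silently retry on every HTTP status.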

@tode-rl tode-rl marked this pull request as draft June 2, 2026 21:46

tode-rl commented Jun 2, 2026

Scope

Runloop rollout processor integration, focused on Cursor Bugbot findings and the Runloop blueprint setup path.

Fixed

  • Made setup transactional: owned Devboxes are shut down if setup fails before RemoteRolloutProcessor is created, while attached existing Devboxes are preserved. [eval_protocol/pytest/runloop_rollout_processor.py]
  • Changed startup polling to retry 5xx tunnel/server responses instead of treating them as ready; non-5xx responses still indicate an app is answering. [eval_protocol/pytest/runloop_rollout_processor.py]
  • Added a blueprint creation helper and docs for empty Runloop accounts that do not yet have a blueprint. [examples/runloop_remote_rollout/create_blueprint.py]
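
The transactional setup in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in runloop_rollout_processor.py: the `Devbox` protocol, `SetupResult`, and the three callables are hypothetical stand-ins for the Runloop client calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


class Devbox(Protocol):
    def shutdown(self) -> None: ...


@dataclass
class SetupResult:
    devbox: Devbox
    owned: bool      # True only when this run created the devbox
    base_url: str


def transactional_setup(
    create_devbox: Callable[[], Devbox],
    attach_devbox: Callable[[str], Devbox],
    start_server: Callable[[Devbox], str],
    existing_id: Optional[str] = None,
) -> SetupResult:
    owned = existing_id is None
    devbox = create_devbox() if existing_id is None else attach_devbox(existing_id)
    try:
        base_url = start_server(devbox)  # may raise, e.g. TimeoutError on startup
    except BaseException:
        # Roll back only what this run created; attached devboxes stay up.
        # BaseException so Ctrl-C during startup does not leak a devbox either.
        if owned:
            devbox.shutdown()
        raise
    return SetupResult(devbox=devbox, owned=owned, base_url=base_url)
```

Keeping the `owned` flag on the result lets the later cleanup path apply the same rule: shut down what was created, leave what was attached.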

Tests added

  • Added regression coverage for owned Devbox cleanup on startup failure, existing Devbox preservation, and 5xx startup retry behavior. [tests/pytest/test_runloop_rollout_processor.py]

Verification:

  • typecheck: uv run basedpyright eval_protocol/pytest/runloop_rollout_processor.py examples/runloop_remote_rollout/create_blueprint.py
  • tests: uv run pytest tests/pytest/test_runloop_rollout_processor.py (10 passed)
  • lint: uv run ruff check eval_protocol/pytest/runloop_rollout_processor.py tests/pytest/test_runloop_rollout_processor.py examples/runloop_remote_rollout/server.py examples/runloop_remote_rollout/test_eval.py examples/runloop_remote_rollout/create_blueprint.py eval_protocol/pytest/__init__.py eval_protocol/__init__.py
  • example collection: uv run pytest --collect-only examples/runloop_remote_rollout/test_eval.py
  • helper smoke: uv run python examples/runloop_remote_rollout/create_blueprint.py --help

Live Runloop smoke still not run because FIREWORKS_API_KEY and RUNLOOP_BLUEPRINT_ID are not set in this environment.

tode-rl and others added 2 commits June 2, 2026 15:02
Use the correct blueprint build_context type and add narrowing assertions in new test coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the branch focused on Runloop rollout changes only.

Co-authored-by: Cursor <cursoragent@cursor.com>

tode-rl commented Jun 2, 2026

Verification:

  • typecheck: pass (uv run basedpyright eval_protocol/pytest/runloop_rollout_processor.py tests/pytest/test_runloop_rollout_processor.py examples/runloop_remote_rollout/create_blueprint.py tests/pytest/test_runloop_remote_rollout_blueprint.py)
  • tests: 15 passed / 0 failed (uv run pytest tests/pytest/test_runloop_rollout_processor.py tests/pytest/test_runloop_remote_rollout_blueprint.py); 1 example smoke test collected
  • lint: clean (uv run ruff check ... on touched Python files)
  • prettier: clean / Python formatter clean (uv run ruff format --check ... after formatting 1 in-scope file)

Scope

Add Runloop rollout processor integration (from PR #454 title/body).

Fixed

  • Replaced the example blueprint Docker build-context path with Runloop blueprint file_mounts for /home/user/rollout/server.py plus setup commands, so the fixed-name blueprint contains the local GET / and POST /init server used by the example. [examples/runloop_remote_rollout/create_blueprint.py:104]
  • Fixed fixed-name blueprint reuse by resolving SDK blueprint wrappers through get_info() before checking status/create time. [examples/runloop_remote_rollout/create_blueprint.py:65]
  • Aligned the example server command/docs with the mounted blueprint path. [examples/runloop_remote_rollout/constants.py:6]
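
The fixed-name reuse in the second bullet can be illustrated with a small get-or-create sketch. Only `get_info()` comes from this thread. The `BlueprintInfo` fields, the `"build_complete"` status value, and the `create` callable are assumptions made for illustration and may not match the Runloop SDK.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Protocol


@dataclass
class BlueprintInfo:
    # Hypothetical field names standing in for the SDK's blueprint view.
    id: str
    name: str
    status: str
    create_time_ms: int


class BlueprintWrapper(Protocol):
    def get_info(self) -> BlueprintInfo: ...


def find_reusable(
    wrappers: Iterable[BlueprintWrapper],
    name: str,
    ready_status: str = "build_complete",  # assumed status value
) -> Optional[BlueprintInfo]:
    # Resolve each wrapper first: status and create time live on the info object.
    infos = [wrapper.get_info() for wrapper in wrappers]
    candidates = [i for i in infos if i.name == name and i.status == ready_status]
    # Prefer the most recently created ready blueprint with this name.
    return max(candidates, key=lambda i: i.create_time_ms, default=None)


def get_or_create(
    wrappers: Iterable[BlueprintWrapper],
    name: str,
    create: Callable[[str], BlueprintWrapper],
) -> BlueprintInfo:
    existing = find_reusable(wrappers, name)
    return existing if existing is not None else create(name).get_info()
```

Running the helper twice with the same name then returns the same blueprint instead of building a second one, which is the idempotency the new tests check.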

Tests added

  • Added blueprint helper tests for mounted / and /init server contents, wrapper-based idempotency, existing blueprint reuse, and file_mounts creation parameters. [tests/pytest/test_runloop_remote_rollout_blueprint.py:39]

Skipped / deferred

  • Live Runloop smoke test was not run; it requires real RUNLOOP_API_KEY and FIREWORKS_API_KEY credentials.
