Design: model NeMo Gym environments as an RL system — Environment / Agent-Harness / Sandbox contracts

## Summary

Converge on **system-architecture contracts** for NeMo Gym so environments match how RL systems are modeled, and **benchmarks and agent servers compose by addition** — not via per-pair wrappers like `anyswe`.

Three roles on three orthogonal axes:

- **Environment** — MDP authority (spec, state, reward). Usually realized by one **resources server** (`seed_session → SessionDescriptor`, `verify`, optional MCP tools / `step`, per-task state), but can be **formed from several** — e.g. a task/verify RS plus a sandbox broker RS, or a facade RS that delegates. Environment is the role; resources server(s) are the implementation.
- **Agent Server** — hosts an **Agent**: policy model, tools, and orchestration (planning, value/reward models, nested control, blackbox CLIs, …). Implemented as a **`responses_api_agents/` server** (e.g. `claude_code_agent`, `simple_agent`). **Independent of which Environment** it is wired to in YAML. Distinct server type from Environment and Sandbox — not a resources server.
- **Sandbox** — substrate and isolation. The `nemo_gym/sandbox/` provider seam today; optionally a **sandbox resources server (broker)** that holds live handles and hands out serializable references.
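
As a rough illustration of how the roles compose by addition, the sketch below types the Environment role as a Python `Protocol` exposing the two calls named above, so an agent depends only on the role and not on a particular benchmark. `EchoEnvironment`, `ConstantAgent`, the dict-shaped descriptor, and the in-process calls are assumptions made for this example; in NeMo Gym these are separate HTTP servers wired together in YAML.

```python
from typing import Any, Protocol


class Environment(Protocol):
    """Role contract: the MDP authority (spec, state, reward)."""

    def seed_session(self, task: dict[str, Any]) -> dict[str, Any]: ...

    def verify(self, task: dict[str, Any], output: str) -> float: ...


class EchoEnvironment:
    """Hypothetical toy environment: reward 1.0 when the output matches the task."""

    def seed_session(self, task: dict[str, Any]) -> dict[str, Any]:
        # Topology A: nothing is sandboxed, the agent runs on the host.
        return {"placement": {"topology": "A"}}

    def verify(self, task: dict[str, Any], output: str) -> float:
        return 1.0 if output == task["expected"] else 0.0


class ConstantAgent:
    """Hypothetical agent that always answers the same string."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    def run(self, env: Environment, task: dict[str, Any]) -> float:
        # The agent binds from the descriptor, never from benchmark-specific code.
        descriptor = env.seed_session(task)
        if descriptor["placement"]["topology"] != "A":
            raise ValueError("this toy agent only supports topology A")
        return env.verify(task, self.answer)
```

Any object satisfying `Environment` can be passed to `ConstantAgent.run`, so adding a benchmark or an agent adds one class rather than one wrapper per pair.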

The motivating mismatch: PR #1738 parks SWE **grading and world spec inside the agent server** (`responses_api_agents/swe_env/`, inline `verify_task`, no `resources_servers/swe_bench/`). The target restores the Environment as a first-class resources server while keeping **blackbox agent servers like `claude_code_agent` where they belong** — connected to the Environment through an **interface-agnostic `seed_session` descriptor**, not an `anyswe`-style wrapper.

## Motivation / context

- Parent: #1249 (decouple SWE infra from agent harnesses).
- PR #1738 converges `anyswe` onto SWE grading code but **grades inline in the agent server**, **deletes `resources_servers/swe_env/`**, and keeps that code under `responses_api_agents/swe_env/` instead of the Environment RS. That removes both the Environment server and the `verified:` marker.
- Complementary (out of scope here): ansubramania's *Blackbox agent integration* design doc — training/RL-data half (token-id capture, trajectory stitching, capture store).

## Key design decisions

- **Agent servers stay under `responses_api_agents/`.** Blackbox CLIs (Claude Code, Codex, …) are hosted by agent servers (e.g. `claude_code_agent`). They connect to an Environment via **`seed_session` / `verify`**, not by being re-hosted inside a generic run-in-box wrapper agent.
- **`seed_session` returns a SessionDescriptor** — placement topology, optional `SandboxSpec` / box reference, optional MCP connection info, optional model egress hints. The agent server binds to the Environment from this descriptor; it does not hardcode docker, OpenSandbox, or SWE-bench wiring.
- **Only data crosses HTTP.** The sandbox **spec** and, when brokered, a **box reference/token** may cross the wire; a live handle (`SandboxHandle.raw`) never does.
- **Environment = semantics + substrate (when the task has a world).** The resources server owns reward and task definition; the sandbox realizes state and transitions (repo/filesystem for SWE). Reward stays in `verify`; the box carries no reward authority.
- **Hermetic grading (the twin box).** `verify()` always grades in its **own fresh eval box**, a twin of the working box, regardless of where the agent server ran the episode.
- **Composition, not a cross-product.** A run names agent server × environment × sandbox backend × topology; combinations add. Drop the `anyswe` wrapper-per-(agent × benchmark) pattern.
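
To make the "only data crosses HTTP" rule concrete, here is a minimal sketch of a descriptor that round-trips through JSON. The top-level field names follow the work item for the SessionDescriptor contract (`placement.topology`, `sandbox.{spec, ref}`, `mcp`, `egress`); the dataclass layout, the dict-typed payloads, and the `image` key are assumptions, not the actual NeMo Gym schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class Placement:
    topology: str  # one of "A", "B", "C", "D"


@dataclass
class Sandbox:
    spec: Optional[dict[str, Any]] = None  # declarative box description
    ref: Optional[str] = None  # opaque broker reference, never a live handle


@dataclass
class SessionDescriptor:
    placement: Placement
    sandbox: Optional[Sandbox] = None
    mcp: Optional[dict[str, Any]] = None
    egress: Optional[dict[str, Any]] = None

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the result is plain data.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "SessionDescriptor":
        raw = json.loads(payload)
        sandbox = Sandbox(**raw["sandbox"]) if raw.get("sandbox") else None
        return cls(
            placement=Placement(**raw["placement"]),
            sandbox=sandbox,
            mcp=raw.get("mcp"),
            egress=raw.get("egress"),
        )
```

Because every field is a string, dict, or `None`, nothing in the descriptor can smuggle a process-local handle across the wire.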

### Sandboxing topologies

| Topology | What's isolated | Typical attachment | Example |
|---|---|---|---|
| **A — None** | nothing | agent server on host; terminal `verify` | math, MCQ |
| **B — Env-sandboxed** | env world/state | agent server outside; **MCP tools** into env box | MCP weather, env-owned DB |
| **C — Agent-in-env** | world + agent execution, one box | agent server **in-box** per descriptor | SWE-bench + Claude Code |
| **D — Whole interaction** | full episode boundary | outer orchestrator | untrusted composite agent |

Reference pair: **`claude_code_agent` + `swe_bench`** — Environment returns topology **C** + per-instance `SandboxSpec`; agent server runs Claude Code inside the box, then POSTs to `/verify`. Same agent server + **`example_mcp_weather`** demonstrates topology **B** (MCP from descriptor).
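
One way an agent server might pick its binding from the descriptor is sketched below. The topology letters and the host / MCP / in-box bindings come from this note; the `Topology` enum, the `select_binding` function, and the dict layout of the descriptor are hypothetical.

```python
from enum import Enum
from typing import Any


class Topology(str, Enum):
    NONE = "A"
    ENV_SANDBOXED = "B"
    AGENT_IN_ENV = "C"
    WHOLE_INTERACTION = "D"


def select_binding(descriptor: dict[str, Any]) -> str:
    """Map a descriptor to a placement binding: "host", "mcp", or "in-box"."""
    topology = Topology(descriptor["placement"]["topology"])  # ValueError if unknown
    if topology is Topology.NONE:
        return "host"
    if topology is Topology.ENV_SANDBOXED:
        if not descriptor.get("mcp"):
            raise ValueError("topology B needs MCP connection info")
        return "mcp"
    if topology is Topology.AGENT_IN_ENV:
        sandbox = descriptor.get("sandbox") or {}
        if not (sandbox.get("spec") or sandbox.get("ref")):
            raise ValueError("topology C needs a sandbox spec or box reference")
        return "in-box"
    # Topology D is driven by an outer orchestrator, not by the agent server.
    raise NotImplementedError("topology D is not bound by the agent server")
```

The same agent server then serves `swe_bench` (in-box) and `example_mcp_weather` (MCP) without benchmark-specific branches; only the descriptor differs.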

## Tracked work items

- [ ] **SessionDescriptor contract** — extend `seed_session` response: `placement.topology`, optional `sandbox.{spec, ref}`, optional `mcp`, optional `egress`. Document agent-server placement bindings (host / MCP / in-box).
- [ ] **`resources_servers/swe_bench`** — Environment RS: `build_spec` from task row, `seed_session`, `verify` (fresh eval box). Recover `verified:` marker and `/verify` endpoint.
- [ ] **Relocate SWE domain code to Environment RS** — move harnesses, parsing, `verify_task`, etc. out of `responses_api_agents/swe_env/` into **`resources_servers/swe_bench/`** (private modules: `harnesses/`, `parsing/`, `verify_task.py`, …). No top-level `nemo_gym/swe/` unless a second RS or non-HTTP consumer appears later.
- [ ] **In-box binding in agent server** — topology C in `claude_code_agent` first: acquire box via `nemo_gym/sandbox`, apply descriptor egress, exec agent, pass harvest to `/verify`. Agent-specific (CLI vs loopback HTTP differs per agent). Extract shared helpers into `nemo_gym/sandbox/` only if a second agent duplicates the same pattern — no separate `nemo_gym/placement/` package for the prototype.
- [ ] **Sandbox broker RS (optional)** — `resources_servers/sandbox/` (or equivalent): hold handles, expose acquire/exec/upload/download/release by reference; backends = local docker/apptainer + remote OpenSandbox/cloud. Client-side `RemoteSandboxProvider` implementing `SandboxProvider`.
- [ ] **Configuration & placement** — document WHO/WHAT/WHERE: env supplies world spec; agent server declares intrinsic defaults; driver sets deployment policy (backend, topology override, limits). Precedence: run config > agent defaults; world spec = env ⊕ agent default.
- [ ] **Topology D / nesting** — nested-container capability as broker/provider property validated against spec.
- [ ] **Publish design note in `fern/`** after team review.
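
The precedence rule in the configuration item can be sketched as two small functions. This note does not say which side wins when `env ⊕ agent default` conflict; the sketch assumes the agent default only fills keys the environment leaves unset. The function names and example keys are hypothetical.

```python
from typing import Any, Mapping


def resolve_setting(
    key: str,
    run_config: Mapping[str, Any],
    agent_defaults: Mapping[str, Any],
) -> Any:
    """Run config > agent defaults. Raises KeyError if neither sets the key."""
    if key in run_config:
        return run_config[key]
    return agent_defaults[key]


def merge_world_spec(
    env_spec: Mapping[str, Any],
    agent_default: Mapping[str, Any],
) -> dict[str, Any]:
    """world spec = env (+) agent default.

    ASSUMPTION: on conflicting keys the environment's value wins, and the
    agent default only contributes keys the environment did not set.
    """
    merged = dict(agent_default)
    merged.update(env_spec)
    return merged
```

For example, a run config that sets `backend` overrides the agent's default backend, while an agent-declared tool list survives the merge as long as the environment does not set one itself.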

## Open questions

- **`step()` vs terminal-only?** First-class `step()` for env-driven benchmarks, or terminal `verify` + side channels? Blackbox in-box agent servers force terminal grading; native agent servers may use `step`.
- **Box lifecycle owner for topology C with a broker.** Environment-owned state-box + agent server drives by reference (RL-canonical) vs agent-owned box (today's #1738, no broker). Prefer env-owned when broker exists; agent-owned as no-broker fallback.
- **Sandbox broker failure domain** — orphan reaping, TTL keyed by rollout id, reconnect if consumer dies mid-rollout.
- **Two-box cost** — working + eval boxes at scale; pool/image cache on broker.
- **One rollout identity** — session cookie, MCP token, box ref, env state, reaper TTL as one first-class object.
- **`anyswe` fate** — stepping-stone only; dissolve into `swe_bench` RS + descriptor-driven `claude_code_agent` (and other agent servers), not the long-term home.
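
For the broker failure-domain question, the following is one possible shape for reference-based leases with TTL reaping, not a proposed implementation. `InMemoryBroker` and its methods are hypothetical, the handle is an arbitrary Python object standing in for a live sandbox handle, and the caller passes `now` explicitly so the behaviour is deterministic.

```python
import uuid
from dataclasses import dataclass
from typing import Any


@dataclass
class _Lease:
    handle: Any  # live handle, stays inside the broker process
    rollout_id: str
    expires_at: float


class InMemoryBroker:
    """Holds live handles and hands out opaque string references."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._leases: dict[str, _Lease] = {}

    def acquire(self, rollout_id: str, handle: Any, now: float) -> str:
        ref = uuid.uuid4().hex  # serializable, safe to send over HTTP
        self._leases[ref] = _Lease(handle, rollout_id, now + self._ttl)
        return ref

    def touch(self, ref: str, now: float) -> None:
        # A consumer that is still alive extends its lease.
        self._leases[ref].expires_at = now + self._ttl

    def release_rollout(self, rollout_id: str) -> list[str]:
        # Drop every box that belongs to one rollout.
        refs = [r for r, l in self._leases.items() if l.rollout_id == rollout_id]
        for ref in refs:
            del self._leases[ref]
        return refs

    def reap(self, now: float) -> list[str]:
        # Orphan reaping: drop leases whose TTL has passed.
        expired = [r for r, l in self._leases.items() if l.expires_at <= now]
        for ref in expired:
            del self._leases[ref]
        return expired
```

A real broker would also tear down the underlying box on release and persist leases so a restarted consumer can reconnect; both are left out here.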

## Out of scope

Training / RL-data half: model-call recording under a rollout join key, dialect conversion, token-id/logprob capture, capture store, and trajectory stitching (see *Blackbox agent integration*). That work plugs in after these boundaries exist.

## Related

#1249 (parent), #1738 (convergence PR), #1677 / #1572 (swe_env), #1682 (MCP RS), #1377 (sandbox provider factory), #1707 (provider config decoupling).

