
Design: model NeMo Gym environments as an RL system — Environment / Agent-Harness / Sandbox contracts #1792

Description

@ffrujeri

Summary

Converge on system-architecture contracts for NeMo Gym so environments match how RL systems are modeled, and benchmarks and agent servers compose by addition — not via per-pair wrappers like anyswe.

Three roles on three orthogonal axes:

  • Environment — MDP authority (spec, state, reward). Usually realized by one resources server (seed_session → SessionDescriptor, verify, optional MCP tools / step, per-task state), but can be formed from several — e.g. a task/verify RS plus a sandbox broker RS, or a facade RS that delegates. Environment is the role; resources server(s) are the implementation.
  • Agent Server — hosts an Agent: policy model, tools, and orchestration (planning, value/reward models, nested control, blackbox CLIs, …). Implemented as a responses_api_agents/ server (e.g. claude_code_agent, simple_agent). Independent of which Environment it is wired to in YAML. Distinct server type from Environment and Sandbox — not a resources server.
  • Sandbox — substrate and isolation. The nemo_gym/sandbox/ provider seam today; optionally a sandbox resources server (broker) that holds live handles and hands out serializable references.
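As a rough illustration of the role split above, the following sketch models the Environment as a small interface and runs one terminal-verify episode against it. The `Environment` protocol, `ExactMatchEnvironment`, `run_episode`, and the `prompt` key are hypothetical names chosen for this example, not existing NeMo Gym APIs; only `seed_session` and `verify` come from the proposal.

```python
from typing import Any, Callable, Protocol


class Environment(Protocol):
    """MDP authority: owns task spec, state, and reward (hypothetical interface)."""

    def seed_session(self, task: dict[str, Any]) -> dict[str, Any]: ...

    def verify(self, task: dict[str, Any], response: str) -> float: ...


class ExactMatchEnvironment:
    """Toy environment with no sandbox: terminal verify only."""

    def seed_session(self, task: dict[str, Any]) -> dict[str, Any]:
        # The descriptor is plain data; "prompt" is an illustrative field.
        return {"placement": {"topology": "A"}, "prompt": task["question"]}

    def verify(self, task: dict[str, Any], response: str) -> float:
        # Reward is decided here, not by the agent server.
        return 1.0 if response.strip() == task["answer"] else 0.0


def run_episode(env: Environment, policy: Callable[[str], str], task: dict[str, Any]) -> float:
    """Agent-server side: bind from the descriptor, act, then ask the env to grade."""
    descriptor = env.seed_session(task)
    response = policy(descriptor["prompt"])
    return env.verify(task, response)
```

Because `run_episode` depends only on the interface, the same agent-side loop could be pointed at a different Environment without a per-pair wrapper.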

The motivating mismatch: PR #1738 parks SWE grading and world spec inside the agent server (responses_api_agents/swe_env/, inline verify_task, no resources_servers/swe_bench/). The target restores the Environment as a first-class resources server while keeping blackbox agent servers like claude_code_agent where they belong — connected to the Environment through an interface-agnostic seed_session descriptor, not an anyswe-style wrapper.

Motivation / context

Key design decisions

  • Agent servers stay under responses_api_agents/. Blackbox CLIs (Claude Code, Codex, …) are hosted by agent servers (e.g. claude_code_agent). They connect to an Environment via seed_session / verify, not by being re-hosted inside a generic run-in-box wrapper agent.
  • seed_session returns a SessionDescriptor — placement topology, optional SandboxSpec / box reference, optional MCP connection info, optional model egress hints. The agent server binds to the Environment from this descriptor; it does not hardcode docker, OpenSandbox, or SWE-bench wiring.
  • Only data crosses HTTP. Sandbox spec and (when brokered) a box reference/token — never a live handle (SandboxHandle.raw).
  • Environment = semantics + substrate (when the task has a world). The resources server owns reward and task definition; the sandbox realizes state and transitions (repo/filesystem for SWE). Reward stays in verify; the box carries no reward authority.
  • Hermetic grading (the twin). verify() always grades in its own fresh box, independent of where the agent server ran the episode.
  • Composition, not a cross-product. A run names agent server × environment × sandbox backend × topology; combinations add. Drop the anyswe wrapper-per-(agent × benchmark) pattern.
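To make the "only data crosses HTTP" decision concrete, here is a minimal sketch of what a serializable descriptor could look like. The field names and the `SandboxSpec` fields (`image`, `workdir`, `env`) are assumptions for illustration; the actual SessionDescriptor contract is still a tracked work item.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class SandboxSpec:
    """Serializable description of the box an episode needs (hypothetical fields)."""
    image: str
    workdir: str = "/workspace"
    env: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class SessionDescriptor:
    """What seed_session could return: data only, never a live handle."""
    topology: str                               # "A", "B", "C", or "D"
    sandbox_spec: Optional[SandboxSpec] = None  # how to build a box
    sandbox_ref: Optional[str] = None           # opaque broker token, if brokered
    mcp: Optional[dict[str, Any]] = None        # MCP connection info
    egress: Optional[dict[str, Any]] = None     # model egress hints

    def to_json(self) -> str:
        # asdict recurses into the nested SandboxSpec dataclass.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "SessionDescriptor":
        raw = json.loads(payload)
        spec = raw.pop("sandbox_spec")
        return cls(sandbox_spec=SandboxSpec(**spec) if spec else None, **raw)
```

Everything in the descriptor survives a JSON round trip, which is the property that keeps something like `SandboxHandle.raw` out of the wire format.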

Sandboxing topologies

| Topology | What's isolated | Typical attachment | Example |
|---|---|---|---|
| A: None | nothing | agent server on host; terminal verify | math, MCQ |
| B: Env-sandboxed | env world/state | agent server outside; MCP tools into env box | MCP weather, env-owned DB |
| C: Agent-in-env | world + agent execution, one box | agent server in-box per descriptor | SWE-bench + Claude Code |
| D: Whole interaction | full episode boundary | outer orchestrator | untrusted composite agent |

Reference pair: claude_code_agent + swe_bench — Environment returns topology C + per-instance SandboxSpec; agent server runs Claude Code inside the box, then POSTs to /verify. Same agent server + example_mcp_weather demonstrates topology B (MCP from descriptor).
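One way an agent server might choose its placement binding from the descriptor alone is sketched below. The nested key layout (`placement.topology`, `sandbox.spec`, `sandbox.ref`, `mcp`) follows the names used in this issue, but `Topology`, `select_binding`, and the returned strings are hypothetical.

```python
from enum import Enum
from typing import Any


class Topology(str, Enum):
    NONE = "A"
    ENV_SANDBOXED = "B"
    AGENT_IN_ENV = "C"
    WHOLE_INTERACTION = "D"


def select_binding(descriptor: dict[str, Any]) -> str:
    """Pick a placement binding (host / mcp / in-box) from descriptor data."""
    topology = Topology(descriptor["placement"]["topology"])
    if topology is Topology.NONE:
        return "host"
    if topology is Topology.ENV_SANDBOXED:
        if "mcp" not in descriptor:
            raise ValueError("topology B requires MCP connection info")
        return "mcp"
    if topology is Topology.AGENT_IN_ENV:
        sandbox = descriptor.get("sandbox", {})
        if "spec" not in sandbox and "ref" not in sandbox:
            raise ValueError("topology C requires sandbox.spec or sandbox.ref")
        return "in-box"
    # Topology D wraps the full episode, so an outer orchestrator owns it.
    raise NotImplementedError("topology D is owned by the outer orchestrator")
```

With this shape, the same agent server would handle the weather MCP case and the SWE-bench case by reading different descriptors rather than by hardcoding either wiring.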

Tracked work items

  • SessionDescriptor contract — extend seed_session response: placement.topology, optional sandbox.{spec, ref}, optional mcp, optional egress. Document agent-server placement bindings (host / MCP / in-box).
  • resources_servers/swe_bench — Environment RS: build_spec from task row, seed_session, verify (fresh eval box). Recover verified: marker and /verify endpoint.
  • Relocate SWE domain code to Environment RS — move harnesses, parsing, verify_task, etc. out of responses_api_agents/swe_env/ into resources_servers/swe_bench/ (private modules: harnesses/, parsing/, verify_task.py, …). No top-level nemo_gym/swe/ unless a second RS or non-HTTP consumer appears later.
  • In-box binding in agent server — topology C in claude_code_agent first: acquire box via nemo_gym/sandbox, apply descriptor egress, exec agent, pass harvest to /verify. Agent-specific (CLI vs loopback HTTP differs per agent). Extract shared helpers into nemo_gym/sandbox/ only if a second agent duplicates the same pattern — no separate nemo_gym/placement/ package for the prototype.
  • Sandbox broker RS (optional): resources_servers/sandbox/ (or equivalent) that holds handles and exposes acquire/exec/upload/download/release by reference; backends = local docker/apptainer + remote OpenSandbox/cloud. Client-side RemoteSandboxProvider implementing SandboxProvider.
  • Configuration & placement — document WHO/WHAT/WHERE: env supplies world spec; agent server declares intrinsic defaults; driver sets deployment policy (backend, topology override, limits). Precedence: run config > agent defaults; world spec = env ⊕ agent default.
  • Topology D / nesting — nested-container capability as broker/provider property validated against spec.
  • Publish design note in fern/ after team review.
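The precedence rule in the configuration item can be sketched as two overlay merges. This example assumes that `⊕` means a recursive merge in which the environment's world spec wins over agent defaults on conflict; that reading, the `world` / `deployment` split, and the `merge` and `resolve` helpers are assumptions of this sketch, not settled design.

```python
from typing import Any, Mapping


def merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Recursively overlay `override` on `base`; `override` wins on conflicts."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(out.get(key), Mapping):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out


def resolve(
    env_world: Mapping[str, Any],
    agent_defaults: Mapping[str, Any],
    run_config: Mapping[str, Any],
) -> dict[str, Any]:
    """Combine the three sources into one effective configuration."""
    # World spec = env merged over agent defaults (assumed meaning of the operator).
    world = merge(agent_defaults.get("world", {}), env_world)
    # Deployment policy: run config > agent defaults.
    deployment = merge(agent_defaults.get("deployment", {}), run_config)
    return {"world": world, "deployment": deployment}
```

For example, an agent default of `{"backend": "docker", "topology": "C"}` combined with a run config of `{"backend": "opensandbox"}` keeps topology C but switches the backend.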

Open questions

  • step() vs terminal-only? First-class step() for env-driven benchmarks, or terminal verify + side channels? Blackbox in-box agent servers force terminal grading; native agent servers may use step.
  • Box lifecycle owner for topology C with a broker. Environment-owned state-box + agent server drives by reference (RL-canonical) vs agent-owned box (today's #1738, "Converge #1677 swe_env decoupling into anyswe + verify on SWE-bench Verified (#1249)", no broker). Prefer env-owned when broker exists; agent-owned as no-broker fallback.
  • Sandbox broker failure domain — orphan reaping, TTL keyed by rollout id, reconnect if consumer dies mid-rollout.
  • Two-box cost — working + eval boxes at scale; pool/image cache on broker.
  • One rollout identity — session cookie, MCP token, box ref, env state, reaper TTL as one first-class object.
  • anyswe fate — stepping-stone only; dissolve into swe_bench RS + descriptor-driven claude_code_agent (and other agent servers), not the long-term home.
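The failure-domain and rollout-identity questions above could be explored with a sketch like the following, which ties the per-rollout pieces into one record and reaps box references whose TTL has lapsed. `RolloutIdentity`, `LeaseTable`, the heartbeat model, and the default TTL are all hypothetical; the clock is passed in explicitly so the behavior is easy to test.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutIdentity:
    """One record tying together everything scoped to a rollout (hypothetical)."""
    rollout_id: str
    session_cookie: str
    mcp_token: Optional[str] = None
    box_ref: Optional[str] = None   # serializable reference, never a live handle
    ttl_seconds: float = 600.0


class LeaseTable:
    """Tracks a deadline per rollout id and reports boxes that should be reaped."""

    def __init__(self) -> None:
        self._leases: dict[str, tuple[RolloutIdentity, float]] = {}

    def heartbeat(self, identity: RolloutIdentity, now: float) -> None:
        # Each heartbeat pushes the deadline out by the rollout's TTL.
        self._leases[identity.rollout_id] = (identity, now + identity.ttl_seconds)

    def reap(self, now: float) -> list[str]:
        """Drop expired leases and return the box refs the broker should release."""
        expired = [rid for rid, (_, deadline) in self._leases.items() if deadline <= now]
        orphaned = []
        for rid in expired:
            identity, _ = self._leases.pop(rid)
            if identity.box_ref is not None:
                orphaned.append(identity.box_ref)
        return orphaned
```

A consumer that dies mid-rollout simply stops sending heartbeats, and a consumer that reconnects with the same rollout id renews the lease instead of acquiring a second box.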

Out of scope

Training / RL-data half: model-call recording under rollout join key, dialect conversion, token-id/logprob capture, capture store, trajectory stitching — see Blackbox agent integration. Plugs in after these boundaries exist.

Related

#1249 (parent), #1738 (convergence PR), #1677 / #1572 (swe_env), #1682 (MCP RS), #1377 (sandbox provider factory), #1707 (provider config decoupling).

Metadata

Labels

agents, core-infra (Helpful infrastructure), env-infra (Infra related to creating new environments), needs-design (Goal is clear, but needs input on technical design)
