Converge on system-architecture contracts for NeMo Gym so environments match how RL systems are modeled, and benchmarks and agent servers compose by addition — not via per-pair wrappers like anyswe.
Three roles on three orthogonal axes:
Environment — MDP authority (spec, state, reward). Usually realized by one resources server (seed_session → SessionDescriptor, verify, optional MCP tools / step, per-task state), but can be formed from several — e.g. a task/verify RS plus a sandbox broker RS, or a facade RS that delegates. Environment is the role; resources server(s) are the implementation.
Agent Server — hosts an Agent: policy model, tools, and orchestration (planning, value/reward models, nested control, blackbox CLIs, …). Implemented as a responses_api_agents/ server (e.g. claude_code_agent, simple_agent). Independent of which Environment it is wired to in YAML. Distinct server type from Environment and Sandbox — not a resources server.
Sandbox — substrate and isolation. The nemo_gym/sandbox/ provider seam today; optionally a sandbox resources server (broker) that holds live handles and hands out serializable references.
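As a rough illustration of the role split above, the following minimal sketch models the Environment as a structural interface that any agent server can be wired to. The class names, method signatures, and the in-process call style are assumptions made for illustration; in NeMo Gym these roles are separate servers and seed_session / verify are HTTP endpoints.

```python
from typing import Any, Protocol


class Environment(Protocol):
    """MDP authority: owns the task spec, state, and reward (hypothetical interface)."""

    def seed_session(self, task: dict[str, Any]) -> dict[str, Any]: ...

    def verify(self, task: dict[str, Any], response: dict[str, Any]) -> float: ...


class ExactMatchEnv:
    """Toy Environment: reward is 1.0 when the answer equals the task's expected value."""

    def seed_session(self, task):
        # Topology A: nothing is sandboxed for this toy task.
        return {"placement": {"topology": "A"}}

    def verify(self, task, response):
        return 1.0 if response.get("answer") == task["expected"] else 0.0


class FixedAnswerAgent:
    """Toy agent server: it only sees the Environment interface, never a concrete benchmark."""

    def __init__(self, answer: str):
        self.answer = answer

    def run(self, env: Environment, task: dict[str, Any]) -> float:
        env.seed_session(task)  # bind to the Environment first
        return env.verify(task, {"answer": self.answer})  # reward comes from the Environment
```

Because the agent depends only on the interface, swapping in a different Environment needs no wrapper per agent and benchmark pair.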
The motivating mismatch: PR #1738 parks SWE grading and world spec inside the agent server (responses_api_agents/swe_env/, inline verify_task, no resources_servers/swe_bench/). The target restores the Environment as a first-class resources server while keeping blackbox agent servers like claude_code_agent where they belong — connected to the Environment through an interface-agnostic seed_session descriptor, not an anyswe-style wrapper.
Agent servers stay under responses_api_agents/. Blackbox CLIs (Claude Code, Codex, …) are hosted by agent servers (e.g. claude_code_agent). They connect to an Environment via seed_session / verify, not by being re-hosted inside a generic run-in-box wrapper agent.
seed_session returns a SessionDescriptor — placement topology, optional SandboxSpec / box reference, optional MCP connection info, optional model egress hints. The agent server binds to the Environment from this descriptor; it does not hardcode docker, OpenSandbox, or SWE-bench wiring.
Only data crosses HTTP. Sandbox spec and (when brokered) a box reference/token — never a live handle (SandboxHandle.raw).
Environment = semantics + substrate (when the task has a world). The resources server owns reward and task definition; the sandbox realizes state and transitions (repo/filesystem for SWE). Reward stays in verify; the box carries no reward authority.
Hermetic grading (the twin). verify() always grades in its own fresh box, independent of where the agent server ran the episode.
Composition, not a cross-product. A run names agent server × environment × sandbox backend × topology; combinations add. Drop the anyswe wrapper-per-(agent × benchmark) pattern.
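The descriptor and the "only data crosses HTTP" rule can be sketched as plain dataclasses that round-trip through JSON. This is a minimal sketch, not the actual schema: the field names, the SandboxSpec fields, and the image and token values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SandboxSpec:
    # Hypothetical fields; the real SandboxSpec may differ.
    image: str
    workdir: str = "/workspace"


@dataclass
class SessionDescriptor:
    topology: str                              # "A", "B", "C", or "D"
    sandbox_spec: Optional[SandboxSpec] = None
    sandbox_ref: Optional[str] = None          # opaque broker token, never a live handle
    mcp_url: Optional[str] = None
    egress_hints: Optional[dict] = None

    def to_json(self) -> str:
        # asdict() recurses into the nested SandboxSpec, so the result is pure data.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "SessionDescriptor":
        data = json.loads(payload)
        spec = data.pop("sandbox_spec")
        return cls(sandbox_spec=SandboxSpec(**spec) if spec else None, **data)


descriptor = SessionDescriptor(
    topology="C",
    sandbox_spec=SandboxSpec(image="example/swe-task:latest"),
    sandbox_ref="box-token-123",
)
wire = descriptor.to_json()
restored = SessionDescriptor.from_json(wire)
```

Everything in the descriptor is serializable, so a live object such as SandboxHandle.raw has no place to hide in it; a brokered box appears only as a string reference.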
Sandboxing topologies
| Topology | What's isolated | Typical attachment | Example |
| --- | --- | --- | --- |
| A: None | nothing | agent server on host; terminal verify | math, MCQ |
| B: Env-sandboxed | env world/state | agent server outside; MCP tools into env box | MCP weather, env-owned DB |
| C: Agent-in-env | world + agent execution, one box | agent server in-box per descriptor | SWE-bench + Claude Code |
| D: Whole interaction | full episode boundary | outer orchestrator | untrusted composite agent |
Reference pair: claude_code_agent + swe_bench — Environment returns topology C + per-instance SandboxSpec; agent server runs Claude Code inside the box, then POSTs to /verify. Same agent server + example_mcp_weather demonstrates topology B (MCP from descriptor).
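One way an agent server might turn the descriptor's topology into a binding plan is sketched below. The descriptor keys (placement.topology, sandbox.spec, mcp) follow the seed_session response shape named in this issue; the function name, the "url" key, and the step strings are illustrative assumptions.

```python
from enum import Enum


class Topology(Enum):
    NONE = "A"
    ENV_SANDBOXED = "B"
    AGENT_IN_ENV = "C"
    WHOLE_INTERACTION = "D"


def plan_binding(descriptor: dict) -> list[str]:
    """Return the ordered steps an agent server would take for this descriptor."""
    topology = Topology(descriptor["placement"]["topology"])
    if topology is Topology.NONE:
        return ["run agent on host", "POST /verify"]
    if topology is Topology.ENV_SANDBOXED:
        if "mcp" not in descriptor:
            raise ValueError("topology B needs MCP connection info in the descriptor")
        return ["run agent on host", f"connect MCP tools at {descriptor['mcp']['url']}", "POST /verify"]
    if topology is Topology.AGENT_IN_ENV:
        if "spec" not in descriptor.get("sandbox", {}):
            raise ValueError("topology C needs a sandbox spec in the descriptor")
        return ["acquire box from sandbox spec", "apply egress hints", "exec agent in box", "POST /verify"]
    # Topology D is driven by an outer orchestrator, not by the agent server itself.
    raise NotImplementedError("topology D is bound by an outer orchestrator")
```

The same agent server code path handles both reference pairs; only the descriptor returned by the Environment changes.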
resources_servers/swe_bench — Environment RS: build_spec from task row, seed_session, verify (fresh eval box). Recover verified: marker and /verify endpoint.
Relocate SWE domain code to Environment RS — move harnesses, parsing, verify_task, etc. out of responses_api_agents/swe_env/ into resources_servers/swe_bench/ (private modules: harnesses/, parsing/, verify_task.py, …). No top-level nemo_gym/swe/ unless a second RS or non-HTTP consumer appears later.
In-box binding in agent server — topology C in claude_code_agent first: acquire box via nemo_gym/sandbox, apply descriptor egress, exec agent, pass harvest to /verify. Agent-specific (CLI vs loopback HTTP differs per agent). Extract shared helpers into nemo_gym/sandbox/ only if a second agent duplicates the same pattern — no separate nemo_gym/placement/ package for the prototype.
Sandbox broker RS (optional) — resources_servers/sandbox/ (or equivalent): hold handles, expose acquire/exec/upload/download/release by reference; backends = local docker/apptainer + remote OpenSandbox/cloud. Client-side RemoteSandboxProvider implementing SandboxProvider.
Configuration & placement — document WHO/WHAT/WHERE: env supplies world spec; agent server declares intrinsic defaults; driver sets deployment policy (backend, topology override, limits). Precedence: run config > agent defaults; world spec = env ⊕ agent default.
Topology D / nesting — nested-container capability as broker/provider property validated against spec.
Publish design note in fern/ after team review.
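The precedence rule in the configuration item could be sketched as two small merges. This is only an illustration: the key names are hypothetical, and the reading of ⊕ as "env-supplied keys win, agent defaults fill the gaps" is an assumption, since the issue does not define the operator.

```python
from typing import Any, Mapping


def merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return base updated with every non-None entry of override."""
    merged = dict(base)
    merged.update({k: v for k, v in override.items() if v is not None})
    return merged


def resolve(env_spec: Mapping[str, Any],
            agent_defaults: Mapping[str, Any],
            run_config: Mapping[str, Any]) -> dict[str, Any]:
    # World spec = env (+) agent default. Assumed here: env wins on conflicts.
    world = merge(agent_defaults.get("world", {}), env_spec)
    # Deployment policy: run config > agent defaults.
    deployment = merge(agent_defaults.get("deployment", {}), run_config)
    return {"world": world, "deployment": deployment}


resolved = resolve(
    env_spec={"image": "example/task:1"},
    agent_defaults={
        "world": {"image": "example/base:1", "cpus": 2},
        "deployment": {"backend": "docker", "topology": "C"},
    },
    run_config={"backend": "apptainer", "topology": None},  # None means "no override"
)
```

Here the driver overrides the backend but leaves the topology to the agent default, while the Environment's image replaces the agent's default image.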
Open questions
step() vs terminal-only? First-class step() for env-driven benchmarks, or terminal verify + side channels? Blackbox in-box agent servers force terminal grading; native agent servers may use step.
Sandbox broker failure domain — orphan reaping, TTL keyed by rollout id, reconnect if consumer dies mid-rollout.
Two-box cost — working + eval boxes at scale; pool/image cache on broker.
One rollout identity — session cookie, MCP token, box ref, env state, reaper TTL as one first-class object.
anyswe fate — stepping-stone only; dissolve into swe_bench RS + descriptor-driven claude_code_agent (and other agent servers), not the long-term home.
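To make the broker failure-domain question concrete, here is a minimal in-memory sketch of reference-based leases with a TTL keyed by rollout id. The class, its methods, and the TTL policy are hypothetical; a real broker would also have to tear down the underlying box when reaping.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class _Lease:
    handle: object      # live handle stays inside the broker process
    rollout_id: str
    expires_at: float


class SandboxBroker:
    """Holds live handles and hands out opaque string references (hypothetical)."""

    def __init__(self, ttl_seconds: float = 600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._leases: dict[str, _Lease] = {}

    def acquire(self, rollout_id: str, handle: object) -> str:
        ref = uuid.uuid4().hex
        self._leases[ref] = _Lease(handle, rollout_id, self._clock() + self._ttl)
        return ref

    def touch(self, ref: str) -> None:
        # A live consumer extends its lease; a dead one simply stops calling this.
        self._leases[ref].expires_at = self._clock() + self._ttl

    def release(self, ref: str) -> None:
        self._leases.pop(ref, None)

    def reap(self) -> list[str]:
        """Drop expired leases and return the rollout ids that were orphaned."""
        now = self._clock()
        expired = [ref for ref, lease in self._leases.items() if lease.expires_at <= now]
        orphaned = sorted({self._leases[ref].rollout_id for ref in expired})
        for ref in expired:
            del self._leases[ref]
        return orphaned

    def live_refs(self) -> list[str]:
        return list(self._leases)
```

Injecting the clock keeps the reaper testable without sleeping, and keying the result by rollout id hints at how a single rollout identity could tie the box reference to the other per-rollout state.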
Out of scope
Training / RL-data half: model-call recording under rollout join key, dialect conversion, token-id/logprob capture, capture store, trajectory stitching — see Blackbox agent integration. Plugs in after these boundaries exist.
Motivation / context
PR #1738 converges anyswe onto SWE grading code but grades inline in the agent server, deletes resources_servers/swe_env/, and keeps that code under responses_api_agents/swe_env/ instead of the Environment RS. That dissolves the Environment server and the verified: marker.
Tracked work items
seed_session response: placement.topology, optional sandbox.{spec, ref}, optional mcp, optional egress. Document agent-server placement bindings (host / MCP / in-box).
Related
#1249 (parent), #1738 (convergence PR), #1677 / #1572 (swe_env), #1682 (MCP RS), #1377 (sandbox provider factory), #1707 (provider config decoupling).