diff --git a/fern/versions/latest/pages/about/concepts/key-terminology.mdx b/fern/versions/latest/pages/about/concepts/key-terminology.mdx
index 0a90334638..c69943494d 100644
--- a/fern/versions/latest/pages/about/concepts/key-terminology.mdx
+++ b/fern/versions/latest/pages/about/concepts/key-terminology.mdx
@@ -63,7 +63,11 @@ The FastAPI service (under `resources_servers/`) that holds per-task state, expo
**Agent Server (Responses API Agent)**
-The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the harness that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own.
+The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the **agent harness** that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own.
+
+**Harness (disambiguation)**
+
+“Harness” is overloaded in the agent-eval community. In Gym docs it usually means **agent harness** (orchestration in an agent server). In [SWE-bench](https://www.swebench.com/SWE-bench/reference/harness/), it often means the **grading pipeline** (`swebench.harness`). See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology).
**Model Server (Responses API Model)**
diff --git a/fern/versions/latest/pages/evaluation/index.mdx b/fern/versions/latest/pages/evaluation/index.mdx
index 850ddd0cff..412c359025 100644
--- a/fern/versions/latest/pages/evaluation/index.mdx
+++ b/fern/versions/latest/pages/evaluation/index.mdx
@@ -40,6 +40,10 @@ Harness changes often affect metrics, and some models are better tuned to use sp
NeMo Gym treats an agent as model plus harness. The model server stays stateless; the agent server owns the loop that calls the model, routes tool calls, manages conversation state, and asks the resources server to verify the final attempt.
+
+Outside Gym, “harness” often means the SWE-bench **grading pipeline** (`swebench.harness`) rather than agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology).
+
+
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx
new file mode 100644
index 0000000000..c99929d2f6
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx
@@ -0,0 +1,250 @@
+---
+title: "Claude Code Agent — Protocol Stack & Data Contracts"
+description: "How claude_code_agent, model servers, PR #1627 /v1/messages, and NeMoGymResponse fit together."
+---
+This engineering note summarizes the protocol stack behind the `claude_code_agent` harness: which Gym entities participate in a rollout, which data contracts apply at each hop, what [PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added, and where RL metadata lives.
+
+## The four Gym server types
+
+An environment decomposes into four concepts. Three map to a FastAPI server type; the dataset is plain JSONL rows:
+
+| Concept | Component | Key endpoints |
+| --- | --- | --- |
+| Dataset | JSONL rows | `responses_create_params` per task |
+| Agent harness | `responses_api_agents/` | `POST /run`, `POST /v1/responses` |
+| Verifier + state | `resources_servers/` | `POST /seed_session`, `POST /verify` |
+| Model | `responses_api_models/` | `POST /v1/responses`, `/v1/chat/completions`, `/v1/messages` |
+
+The **Claude Code CLI** (`claude -p`) is *not* a Gym server. It is a black-box subprocess spawned by `claude_code_agent` that only speaks the **Anthropic Messages API**.
+
+## Gym's canonical data contracts
+
+Gym standardizes on the **OpenAI Responses API shape** (with NeMo-specific extensions). Two types form the request/response pair:
+
+| Type | Role |
+| --- | --- |
+| `NeMoGymResponseCreateParamsNonStreaming` | **Request** — input messages, tools, sampling params (from dataset JSONL) |
+| `NeMoGymResponse` | **Response** — accumulated trajectory in `output[]`, plus `usage` |
+
+`NeMoGymResponse` is **not** owned exclusively by models or agents. It is Gym's **shared trajectory contract**:
+
+- **Model servers** produce it on `POST /v1/responses`
+- **Agents** produce it on `POST /v1/responses`
+- **Resources servers** consume it in `POST /verify` (`BaseVerifyRequest.response`)
+- **Rollout collection** (`ng_collect_rollouts`) reads it from agent `/run` results
+
+The trajectory building block is **`NeMoGymResponse.output[]`** — a sequence of messages, tool calls, tool results, and reasoning items.
+
+### Training variants (`*ForTraining`)
+
+RL metadata lives on **individual output items**, not on the top-level response envelope:
+
+```python
+class TokenIDLogProbMixin(BaseModel):
+ prompt_token_ids: List[int]
+ generation_token_ids: List[int]
+ generation_log_probs: List[float]
+```
+
+Training subclasses (`NeMoGymResponseOutputMessageForTraining`, etc.) mix this in. A rollout JSONL row with RL data looks like:
+
+```json
+{
+ "output": [
+ {
+ "type": "message",
+ "role": "assistant",
+ "content": [...],
+ "prompt_token_ids": [1, 2, 3],
+ "generation_token_ids": [4, 5, 6],
+ "generation_log_probs": [-0.1, -0.2, -0.3]
+ }
+ ]
+}
+```
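As a rough illustration of how a consumer might pick out those per-item fields, the sketch below loads one rollout row and checks that each training item carries one log-prob per generated token. The helper names (`training_items`, `is_aligned`) are hypothetical, and the length check is an assumption about how the lists relate, not something the schema above states.

```python
import json

# Field names taken from TokenIDLogProbMixin above.
RL_FIELDS = ("prompt_token_ids", "generation_token_ids", "generation_log_probs")


def training_items(row: dict) -> list:
    """Return the output items that carry all three RL fields."""
    return [item for item in row.get("output", []) if all(f in item for f in RL_FIELDS)]


def is_aligned(item: dict) -> bool:
    """Assumed invariant: one log-prob per generated token."""
    return len(item["generation_token_ids"]) == len(item["generation_log_probs"])


row = json.loads("""
{
  "output": [
    {"type": "message", "role": "assistant", "content": [],
     "prompt_token_ids": [1, 2, 3],
     "generation_token_ids": [4, 5, 6],
     "generation_log_probs": [-0.1, -0.2, -0.3]},
    {"type": "message", "role": "assistant", "content": []}
  ]
}
""")

items = training_items(row)
print(len(items), all(is_aligned(item) for item in items))
```

The second item has no RL fields, so it is treated as a plain (non-training) message and skipped.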
+
+## Alternate wire formats (not the Gym trajectory contract)
+
+Model servers expose **three** HTTP endpoints. Only one returns `NeMoGymResponse` on the wire:
+
+| Endpoint | Wire format | Gym trajectory? |
+| --- | --- | --- |
+| `POST /v1/responses` | `NeMoGymResponse` JSON | **Yes** |
+| `POST /v1/chat/completions` | `NeMoGymChatCompletion` JSON | No — one chat turn; converted internally or by caller |
+| `POST /v1/messages` | Anthropic Message JSON or SSE | No — foreign protocol adapter ([PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627)) |
+
+`NeMoGymChatCompletion` is a **backend/wire format** (one assistant turn in `choices[0].message`). `vllm_model` uses it internally: `responses()` converts to chat params, calls `chat_completions()`, then converts back to `NeMoGymResponse.output[]`.
+
+Agents like `simple_agent` call model `POST /v1/responses` directly and never see a chat completion. `harbor_agent` calls chat completions directly and converts its trajectory to `NeMoGymResponse` output items at the end.
+
+## What PR #1627 added (and did not add)
+
+[PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added a **third spoke** on model servers — not changes to the agent:
+
+**Before:** `SimpleResponsesAPIModel` exposed `/v1/chat/completions` and `/v1/responses` only.
+
+**After:** Every model server also exposes `POST /v1/messages` with a default handler that:
+
+1. Converts Anthropic request → `NeMoGymResponseCreateParams`
+2. Delegates to the server's own `responses()` → internal `NeMoGymResponse`
+3. Converts `NeMoGymResponse` → Anthropic response (JSON or synthesized SSE)
+
+**Not in PR #1627:**
+
+- `claude_code_agent` itself (from #1336) — already had `model_server` ref and `_resolve_base_url()`
+- Default `reasoning_gym_claude_code_agent.yaml` — still points at real Anthropic API
+- RL side-channel plumbing — converter explicitly drops token IDs before Anthropic conversion
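The three-step default handler described above can be sketched with plain dicts. This is a simplified illustration, not the code from the PR: the function names are hypothetical, message content is assumed to be a plain string, and only text output is converted. It also shows why RL fields on an output item do not survive the Anthropic conversion.

```python
from typing import Callable


def anthropic_to_responses_params(request: dict) -> dict:
    """Step 1: map an Anthropic Messages request onto a Responses-style request."""
    params = {
        "input": [{"role": m["role"], "content": m["content"]} for m in request["messages"]],
        "max_output_tokens": request["max_tokens"],
    }
    if "system" in request:
        params["instructions"] = request["system"]
    return params


def responses_to_anthropic(response: dict) -> dict:
    """Step 3: keep only text content, so per-item RL fields are dropped here."""
    blocks = []
    for item in response["output"]:
        if item.get("type") != "message":
            continue
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                blocks.append({"type": "text", "text": part["text"]})
    return {"type": "message", "role": "assistant", "content": blocks, "stop_reason": "end_turn"}


def handle_messages(request: dict, responses: Callable[[dict], dict]) -> dict:
    """Convert, delegate to the server's own responses() (step 2), convert back."""
    return responses_to_anthropic(responses(anthropic_to_responses_params(request)))


def fake_responses(params: dict) -> dict:
    # Stand-in for a model server's responses(); echoes the last user message.
    text = "echo: " + params["input"][-1]["content"]
    return {"output": [{"type": "message", "role": "assistant",
                        "content": [{"type": "output_text", "text": text}],
                        "generation_token_ids": [7, 8]}]}


reply = handle_messages(
    {"model": "any", "max_tokens": 64, "messages": [{"role": "user", "content": "hi"}]},
    fake_responses,
)
print(reply["content"][0]["text"])
```

Passing `responses` in as a callable mirrors the idea that every model server can reuse one default handler on top of its own `responses()` implementation.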
+
+## End-to-end rollout flow (linear)
+
+This is the full stack when using `reasoning_gym_claude_code_agent_model_server.yaml` + a model server (e.g. `vllm_model`):
+
+```mermaid
+flowchart TD
+ RC["ng_collect_rollouts"]
+ RUN["agent POST /run"]
+ SEED["resources_server POST /seed_session"]
+ RESP["agent POST /v1/responses"]
+ CLI["claude -p subprocess"]
+ MSG["model_server POST /v1/messages"]
+ MSRESP["model_server responses()"]
+ BACKEND["inference backend POST /v1/chat/completions"]
+ NOTE["↻ repeat for each Claude LLM turn"]
+ PARSE["agent builds NeMoGymResponse from stream-json"]
+ VERIFY["resources_server POST /verify"]
+ OUT["rollout JSONL"]
+
+ RC -->|"NeMoGymResponseCreateParams"| RUN
+ RUN --> SEED --> RESP
+ RESP -->|"shell + ANTHROPIC_BASE_URL"| CLI
+ CLI -->|"Anthropic Messages SSE"| MSG
+ MSG --> MSRESP --> BACKEND
+ BACKEND --> MSG --> CLI
+ CLI --> NOTE
+ NOTE --> PARSE
+ PARSE -->|"NeMoGymResponse"| VERIFY
+ VERIFY --> OUT
+```
+
+### Message types at each hop
+
+| Step | From → To | Format |
+| --- | --- | --- |
+| 1 | Rollout → agent `/run` | Task row with `responses_create_params` |
+| 2 | Agent `/run` → resources `/seed_session` | Same task row |
+| 3 | Agent `/run` → agent `/v1/responses` | `NeMoGymResponseCreateParamsNonStreaming` |
+| 4 | Agent → Claude subprocess | Shell env (`ANTHROPIC_BASE_URL`, etc.) |
+| 5 | Claude ↔ model `/v1/messages` | **Anthropic Messages** (many turns) |
+| 6 | Inside model server | Anthropic → `NeMoGymResponse` → Anthropic (internal) |
+| 7 | Model server ↔ vLLM | OpenAI Chat Completions (internal) |
+| 8 | Claude → agent | **stream-json stdout** (full session) |
+| 9 | Agent `/v1/responses` return | **`NeMoGymResponse`** (episode-level, for scoring) |
+| 10 | Agent → resources `/verify` | Task row + `NeMoGymResponse` |
+
+## Two NeMoGymResponse lifetimes (model-server path)
+
+This is a common source of confusion. On the model-server path there are **two separate** `NeMoGymResponse` objects:
+
+### Per-turn (internal, inside model server)
+
+Each Claude LLM call triggers:
+
+```
+Anthropic request → NeMoGymResponseCreateParams → responses() → NeMoGymResponse
+ → stripped → Anthropic response back to Claude
+```
+
+This object can carry RL fields when `return_token_id_information=True`, but Claude never sees them and Gym rollouts do not receive them today.
+
+### Episode-level (what Gym scoring uses)
+
+After Claude finishes the full session, the agent parses stream-json and **constructs one** `NeMoGymResponse` in `claude_code_agent.responses()`. That is what `/verify` reads.
+
+Today this episode-level response uses plain `NeMoGymResponseOutputMessage` items — **no RL fields**, even if the model server produced them per turn.
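A minimal sketch of that episode-level construction follows. The event shapes are assumptions about Claude's stream-json output (one JSON object per line, with `assistant` events wrapping an Anthropic-style message), and `build_episode_output` is a hypothetical name; the real `parse_stream_json()` may handle more event types, such as tool calls.

```python
import json


def build_episode_output(stream_lines: list) -> list:
    """Collect assistant text from stream-json events into plain message items."""
    output = []
    for line in stream_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "assistant":
            continue  # this sketch ignores system, user, and result events
        for block in event["message"].get("content", []):
            if block.get("type") == "text":
                output.append({
                    "type": "message",
                    "role": "assistant",
                    "content": [{"type": "output_text", "text": block["text"]}],
                })
    return output


stdout = [
    '{"type": "system", "subtype": "init"}',
    '{"type": "assistant", "message": {"role": "assistant", '
    '"content": [{"type": "text", "text": "Answer: 42"}]}}',
    '{"type": "result", "subtype": "success"}',
]
episode = {"output": build_episode_output(stdout)}
print(episode["output"][0]["content"][0]["text"])
```

Because the items are rebuilt from text alone, nothing in this path can carry token IDs or log-probs, which matches the "no RL fields" behaviour described above.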
+
+## Direct Anthropic path (shorter)
+
+With the default `reasoning_gym_claude_code_agent.yaml` (`anthropic_base_url: null`), steps involving the Gym model server drop out:
+
+```mermaid
+flowchart TD
+ RC["ng_collect_rollouts"]
+ RUN["agent POST /run"]
+ SEED["resources_server POST /seed_session"]
+ RESP["agent POST /v1/responses"]
+ CLI["claude -p subprocess"]
+ ANTH["api.anthropic.com POST /v1/messages"]
+ VERIFY["resources_server POST /verify"]
+ OUT["rollout JSONL"]
+
+ RC --> RUN --> SEED --> RESP --> CLI
+ CLI <-->|"Anthropic Messages"| ANTH
+ CLI -->|"stream-json"| RESP
+ RESP -->|"NeMoGymResponse"| VERIFY --> OUT
+```
+
+PR #1627 is **invisible** on this path.
+
+## Why Anthropic format for Claude?
+
+Claude Code CLI is hard-wired to `POST /v1/messages`. It cannot call Gym's `/v1/responses` or OpenAI chat completions. When you point Claude at a Gym model server, the server must **speak Anthropic on the wire** even though it implements `responses()` internally.
+
+Think of `/v1/messages` as a **protocol adapter**:
+
+```
+Claude (USB-C / Anthropic) ↔ Gym model server adapter ↔ vLLM (HDMI / Chat Completions)
+```
+
+Gym's rollout pipeline only cares about the final **`NeMoGymResponse`** the agent builds from stream-json — not the per-turn Anthropic exchanges.
+
+## RL metadata: where it exists and where it is lost
+
+| Location | RL fields present? |
+| --- | --- |
+| `vllm_model` internal `NeMoGymResponse` (`return_token_id_information=True`) | Yes — on `*ForTraining` output items |
+| Model server `POST /v1/responses` wire response | Yes (when configured) |
+| Model server `POST /v1/messages` wire response | **No** — stripped in `responses_to_anthropic_response()` |
+| `claude_code_agent` episode `NeMoGymResponse` | **No** — `parse_stream_json()` builds plain messages |
+| Resources server `/verify` | Reads text from `NeMoGymResponse.output[]`; RL fields unused for scoring |
+
+The planned RL path (not yet wired for Claude Code) would pass per-turn token IDs through a **side channel** from the model server's internal `NeMoGymResponse` and merge them into the agent's episode-level `NeMoGymResponse` using the existing `*ForTraining` types, rather than inventing a new schema.
+
+## Protocol layers on a model server
+
+```mermaid
+flowchart TD
+ subgraph gym["Gym trajectory layer"]
+ NR["NeMoGymResponse\n(building block for rollouts / verify)"]
+ end
+
+ subgraph convert["Conversion (inside model server or agent)"]
+ C["AnthropicConverter / VLLMConverter"]
+ end
+
+ subgraph wire["Backend wire formats"]
+ CC["NeMoGymChatCompletion\n/v1/chat/completions"]
+ AM["Anthropic Message\n/v1/messages"]
+ OR["OpenAI native Response\nopenai_model upstream"]
+ end
+
+ NR --> C
+ C --> CC
+ C --> AM
+ C --> OR
+```
+
+## Config cheat sheet
+
+| Config | Model path | PR #1627 involved? |
+| --- | --- | --- |
+| `claude_code_agent/configs/claude_code_agent.yaml` (template) | Direct Anthropic | No |
+| `reasoning_gym_claude_code_agent.yaml` | Direct Anthropic via env vars | No |
+| `reasoning_gym_claude_code_agent_model_server.yaml` + `vllm_model.yaml` | Claude → Gym model `/v1/messages` → vLLM | **Yes** |
+
+## Key takeaways
+
+1. **`NeMoGymResponse.output[]` is the trajectory building block** — shared across models, agents, verifiers, and rollouts.
+2. **`POST /v1/responses` is the Gym contract boundary** — not `/v1/messages` or `/v1/chat/completions`.
+3. **PR #1627 adds an Anthropic adapter on model servers** so Claude Code can target any Gym backend without a separate proxy process.
+4. **The agent wraps Claude as a black box** — Gym HTTP stops at `/v1/responses`; Claude's multi-turn loop uses Anthropic internally.
+5. **RL metadata is schema-ready** (`*ForTraining` types) but **not yet plumbed** through the Claude Code + `/v1/messages` path.
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx
new file mode 100644
index 0000000000..2c43260968
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx
@@ -0,0 +1,103 @@
+---
+title: "Harness Terminology"
+description: "How the agent-eval and SWE-bench communities use “harness,” and how NeMo Gym disambiguates agent orchestration from benchmark grading."
+---
+**“Harness” is overloaded.** In agentic coding and SWE-bench discussions, the same word often means either (a) the agent-side orchestration that runs a model on tasks, or (b) the benchmark-side pipeline that grades patches. There is no single community-wide definition — context disambiguates. This note maps the dominant usages and the vocabulary NeMo Gym uses to keep them separate.
+
+## Three meanings in the wild
+
+| Context | “Harness” usually means | Agent included? | Example |
+| --- | --- | --- | --- |
+| **SWE-bench docs / `swebench.harness`** | Grading pipeline: Docker, apply patch, run tests, report resolved | No | `python -m swebench.harness.run_evaluation` |
+| **SWE-bench leaderboard / blogs** | Agent orchestration: prompt, tool loop, patch extraction | Yes | “mini-SWE-agent + Claude 4.5 Opus” |
+| **NeMo Gym docs** | Agent server orchestration around the model | Yes | `claude_code_agent`, OpenHands, `simple_agent` |
+| **Generic ML eval** | Benchmark runner infrastructure | Maybe | “evaluation harnesses” in NeMo Evaluator |
+
+The collision is sharpest in SWE-bench: **official docs** call the grader “the harness,” while **leaderboard rows** read like “harness + model” where harness is the agent stack.
+
+## SWE-bench: two halves of one eval
+
+The [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) documents **`swebench.harness`** — the module that:
+
+1. Prepares per-instance Docker images
+2. Applies a `model_patch` from a predictions file
+3. Runs the repository test suite
+4. Grades pass/fail and aggregates `% resolved`
+
+The submission contract is intentionally narrow: produce JSONL with `instance_id` and `model_patch`; the harness scores it. Community tutorials often describe this as a unit-test-shaped split:
+
+- **Arrange** — harness prepares the instance environment (images, checkout, deps)
+- **Act** — *your* agent edits the repo and emits a patch (not part of `swebench.harness`)
+- **Assert** — harness applies the patch, runs hidden tests, reports resolved/unresolved
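To make the narrow submission contract concrete, here is a small sketch that writes a predictions file with the two keys named above. The helper name and the toy patch are illustrative only, and upstream tooling may expect additional keys, so check the SWE-bench reference before submitting real results.

```python
import io
import json


def write_predictions(patches: dict, fh) -> int:
    """Write one JSON object per line with instance_id and model_patch."""
    count = 0
    for instance_id, model_patch in patches.items():
        fh.write(json.dumps({"instance_id": instance_id, "model_patch": model_patch}) + "\n")
        count += 1
    return count


# Toy patch text; a real agent would emit a unified diff against the repo.
patches = {
    "django__django-13741": "--- a/x.py\n+++ b/x.py\n@@ -1 +1 @@\n-a\n+b\n",
}
buffer = io.StringIO()  # swap for open("predictions.jsonl", "w") to write a file
written = write_predictions(patches, buffer)
rows = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(written, rows[0]["instance_id"])
```

`json.dumps` escapes the newlines inside the patch, so each prediction stays on a single JSONL line.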
+
+So in SWE-bench **technical** vocabulary, harness ≈ **Environment + grading authority**. The agent is external.
+
+What upstream bundles (without always naming it cleanly) is **task world + grading** inside one pipeline. What it does **not** model is **how** the agent produced the patch — but leaderboard prose often treats “harness + model” as the evaluated product anyway.
+
+## Leaderboard and product language
+
+The [SWE-bench leaderboard](https://www.swebench.com/) reports results as combinations like:
+
+- **“bash only”** — a specific agent-side setup across models
+- **“mini-SWE-agent + Claude 4.5 Opus”** — harness (orchestration) + model
+
+That usage matches NeMo Gym’s [SWE RL case study](/infrastructure/engineering-notes/swe-rl-case-study): *a harness is a system prompt plus orchestration to execute one attempt at the task.* Here harness ≈ **agent harness**, not `run_evaluation`.
+
+When reading papers, blog posts, or vendor announcements, assume **harness = agent-side** unless the text explicitly points at Docker grading or `swebench.harness`.
+
+## NeMo Gym vocabulary (intentional split)
+
+Gym separates roles that colloquial “harness” often merges:
+
+| Gym term | Role | SWE-bench analogue |
+| --- | --- | --- |
+| **Task** | One dataset row / instance (`SweTask`, `TaskPublic`) | `django__django-13741` |
+| **Benchmark** | Fixed eval product: split, metric, protocol, baselines | SWE-bench Verified |
+| **Environment** | Resources server: `seed_session`, `verify`, state, tools | `swe_bench` resources server |
+| **Agent harness** | Agent server: multi-step loop, tools, when to stop | Claude Code, OpenHands, mini-SWE-agent |
+| **Model** | Stateless inference | vLLM / OpenAI endpoint |
+
+See [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks) and the [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) note for how `SessionDescriptor`, topology C, and hermetic `verify` fit in.
+
+### The awkward `harnesses/` directory
+
+Under [`resources_servers/swe_bench/harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses), **`harness` means something else again**: benchmark-**family eval plugins** owned by the Environment — provision recipes and grading adapters keyed by dataset family (e.g. SWE-bench vs multilingual). They wrap pieces of upstream `swebench.harness`; they are **not** agent orchestration.
+
+| Name in repo | Meaning | Prefer saying |
+| --- | --- | --- |
+| Agent server / `responses_api_agents/` | Agent harness | **agent harness** or **agent server** |
+| `swebench.harness` (upstream) | Official grading pipeline | **SWE-bench eval harness** or **grading pipeline** |
+| `swe_bench/harnesses/` | Environment eval plugins | **benchmark-family plugin** or **eval plugin** (when disambiguation matters) |
+| `swe_bench` resources server overall | MDP authority for SWE tasks | **Environment** or **resources server** |
+
+We keep `harness.py` / `harnesses/` in `swe_bench` to align with upstream module naming (`swebench.harness`) and prior `swe_env` convention — not because Gym equates “harness” with agent orchestration.
+
+## Practical guidance
+
+**When writing docs or PRs**
+
+- Say **agent harness** (or **agent server**) for orchestration in `responses_api_agents/`.
+- Say **SWE-bench eval harness** or **grading pipeline** for `swebench.harness.run_evaluation`.
+- Say **Environment** / **`swe_bench` resources server** for `seed_session` + `verify` + hermetic grading.
+- Say **benchmark** for published eval products (Verified, Lite), not for the server binary.
+- Say **task** / **instance** for one problem row — not “environment” and not “harness.”
+
+**When comparing to leaderboard numbers**
+
+- Identify both **model** and **agent harness** (leaderboard row).
+- Confirm **benchmark split** and **grading protocol** (Environment config, `verified:` marker).
+- Do not conflate upstream `swebench.harness` with the agent named on the leaderboard.
+
+**When designing new environments**
+
+- Keep **grading authority** on the resources server (Environment).
+- Keep **orchestration** on the agent server (agent harness).
+- Use **benchmark-family plugins** only for dataset-specific provision/grade logic — not for agent loops.
+
+## Related reading
+
+- [Evaluation — agent harness](/evaluation#agent-harness) — Gym’s primary “harness” definition
+- [Key Terminology — Agent Server](/about/concepts/key-terminology#architecture-terms) — architecture glossary
+- [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) — Task / Benchmark / Environment split for SWE
+- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — harness + model on the leaderboard
+- [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) — upstream grading pipeline
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
index ce48f79b15..eb92e8bf31 100644
--- a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
@@ -25,6 +25,24 @@ Infrastructure challenges and deployment topology for SWE RL training.
swe-rl case-study
+
+How “harness” is used in SWE-bench vs agent eval vs NeMo Gym — and recommended naming.
+
+terminology swe-bench
+
+
+
+Environment resources server for SWE-bench: session descriptors, topology C, and hermetic verify.
+
+swe-bench architecture
+
+
+
+Responses API, `/v1/messages`, and rollout data contracts for `claude_code_agent`.
+
+claude-code api-design
+
+
Why NeMo Gym uses aiohttp instead of httpx for async HTTP.
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx
new file mode 100644
index 0000000000..b3a97cc340
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx
@@ -0,0 +1,267 @@
+---
+title: "SWE-bench Environment Server"
+description: "Restoring the Environment as a resources server — seed_session descriptors, topology C, and hermetic verify for SWE-bench."
+---
+This engineering note documents the **`swe_bench` resources server**: what problem it solves, how it differs from earlier SWE integrations in Gym, and how to run evaluation with a black-box agent server such as `claude_code_agent`.
+
+
+This server ships with `verified: false` — it is a working prototype, not yet baselined on gold patches. See [Adding a Benchmark](/contribute/environments/adding-a-benchmark) for the path to `verified: true`.
+
+
+## Background: why a separate Environment server?
+
+Earlier SWE convergence work ([PR #1738](https://github.com/NVIDIA-NeMo/Gym/pull/1738)) moved grading and sandbox spec **into the agent server** (`responses_api_agents/swe_env/`, inline `verify_task`). That pattern works for a single bundled agent, but it breaks composability:
+
+- **Black-box agent servers** (Claude Code, OpenHands, Harbor, …) should not import SWE grading code or choose docker vs OpenSandbox themselves.
+- **The Environment** should own task authority: sandbox spec, benchmark grading, and the `verified:` marker on the resources server.
+- **Agents** should connect through a small HTTP contract (`seed_session` → run → `verify`), not an `anyswe`-style wrapper per agent.
+
+The `swe_bench` resources server restores that boundary. Grading harnesses, parsing, and `verify_task` live as **private modules** under [`resources_servers/swe_bench/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench) — not under `responses_api_agents/`.
+
+For cluster-scale SWE RL training topology (Apptainer, CPU sizing), see the older [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study). This note focuses on the **Environment server + agent-server wiring** pattern.
+
+## Three roles (orthogonal)
+
+| Role | Gym component | SWE-bench example |
+| --- | --- | --- |
+| **Environment** | Resources server | `swe_bench` — `seed_session`, `verify`, benchmark harnesses |
+| **Agent server** | `responses_api_agents/` | `claude_code_agent` — runs Claude in the instance sandbox |
+| **Sandbox runtime** | `nemo_gym/sandbox/` | Docker provider (OpenSandbox / Apptainer as needed) |
+
+
+**“Harness” overload.** In Gym docs, *agent harness* means orchestration inside an agent server. In SWE-bench, `swebench.harness` is the upstream eval stack. Under `swe_bench`, **`harness.py` / `harnesses/`** are **benchmark-family plugins** (provision + grade recipes keyed by `task.benchmark`). They are Environment-owned, not agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology) for the full map.
+
+
+## Environment vs Benchmark vs Task
+
+These names refer to different layers (see also [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks)):
+
+| Layer | SWE-bench example | What it is |
+| --- | --- | --- |
+| **Benchmark** | *SWE-bench Verified* | Fixed eval product: 500-task test split, `% resolved` metric, comparison protocol, leaderboard baselines |
+| **Environment** | `swe_bench` resources server | Executable engine: `seed_session`, `verify`, harness registry, hermetic grading |
+| **Task** | `django__django-13741` | One problem instance in the benchmark (prompt + privileged grading metadata) |
+
+- **`swe_bench` is the Environment** — you can train, dev, or eval with it; it is not the same as the published benchmark.
+- **SWE-bench Verified is a Benchmark** built on that Environment (frozen JSONL + eval config + reporting).
+- **`verified: true`** on the resources server means this Environment configuration is **benchmark-grade** (gold-patch baseline, protocol locked), not merely that the server exists.
+
+One Environment supports multiple benchmarks (Verified, Lite, Multilingual) by swapping **tasks** (dataset) and harness keys — no new resources server per publication.
+
+## What `swe_bench` exposes
+
+The HTTP surface is intentionally thin ([`app.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/app.py)). Heavy logic stays in private modules.
+
+| Endpoint | Responsibility |
+| --- | --- |
+| `POST /seed_session` | Build a **`SessionDescriptor`**: placement topology, per-instance `SandboxSpec`, merged `verifier_metadata` |
+| `POST /verify` | Grade `verifier_metadata.model_patch` in a **fresh** eval sandbox (hermetic twin) |
+
+### SessionDescriptor (response shape)
+
+`seed_session` returns:
+
+```json
+{
+ "placement": { "topology": "agent_in_env" },
+ "sandbox": { "spec": { "image": "swebench/sweb.eval.x86_64....", "workdir": "/testbed", ... } },
+ "egress": { "env": {} },
+ "verifier_metadata": {
+ "instance_id": "django__django-13741",
+ "benchmark": "swe-bench",
+ "dataset_name": "princeton-nlp/SWE-bench_Verified",
+ "flat_eval": true
+ }
+}
+```
+
+The agent server reads **`placement.topology`** and **`sandbox.spec`** — it never imports `swe_bench.harness` or picks a provider on its own (beyond what its config already declares).
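One way an agent server might consume that response is sketched below. The `Descriptor` dataclass, `parse_descriptor`, and `plan_run` are hypothetical names for illustration, and the image value is a placeholder rather than a real tag; the point is only that the agent branches on `placement.topology` and takes the image and workdir from `sandbox.spec`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    topology: str
    image: str
    workdir: str
    verifier_metadata: dict


def parse_descriptor(payload: dict) -> Descriptor:
    """Pull out the fields the agent needs from a seed_session response."""
    spec = payload.get("sandbox", {}).get("spec", {})
    return Descriptor(
        topology=payload["placement"]["topology"],
        image=spec.get("image", ""),
        workdir=spec.get("workdir", ""),
        verifier_metadata=dict(payload.get("verifier_metadata", {})),
    )


def plan_run(desc: Descriptor) -> str:
    """Describe what the agent would do; only agent_in_env is handled here."""
    if desc.topology == "agent_in_env":
        return f"start sandbox {desc.image} and run the agent in {desc.workdir}"
    raise NotImplementedError(f"topology {desc.topology!r} is not handled in this sketch")


payload = json.loads("""
{
  "placement": {"topology": "agent_in_env"},
  "sandbox": {"spec": {"image": "example/placeholder-image", "workdir": "/testbed"}},
  "egress": {"env": {}},
  "verifier_metadata": {"instance_id": "django__django-13741", "benchmark": "swe-bench"}
}
""")

desc = parse_descriptor(payload)
print(plan_run(desc))
```

Keeping the parse step separate from the dispatch step means the agent never needs to know which benchmark family produced the spec.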
+
+### Topology C (`agent_in_env`)
+
+| Topology | Who owns the working sandbox | Typical agent |
+| --- | --- | --- |
+| `none` | No in-box work; MCP / host-side tools | Default Claude Code + MCP resources |
+| `agent_in_env` | Agent starts the descriptor's sandbox and runs inside it | **`claude_code_swe_bench`** |
+| `env_sandboxed` | Environment brokers box lifecycle (future broker RS) | Planned |
+| `whole_interaction` | Single box for agent + eval (legacy) | `swe_agents` style |
+
+**Topology C** is the target for SWE-bench Verified with Claude Code:
+
+1. Environment returns image + workdir from the benchmark harness.
+2. Agent server starts that sandbox, runs `claude -p` **inside** the instance image.
+3. Agent harvests `git diff --cached` as `model_patch`.
+4. Environment grades the patch in a **separate fresh container** (no agent pollution).
+
+```mermaid
+flowchart TD
+ RC["gym eval run"]
+ RUN["agent POST /run"]
+ SEED["swe_bench POST /seed_session"]
+ BOX["Agent: AsyncSandbox from descriptor"]
+ CLAUDE["claude -p in /testbed"]
+ PATCH["git diff --cached → model_patch"]
+ VERIFY["swe_bench POST /verify"]
+ FRESH["verify_task: fresh sandbox"]
+ OUT["rollout JSONL reward 0/1"]
+
+ RC --> RUN --> SEED
+ SEED -->|"topology=agent_in_env, sandbox.spec"| BOX
+ BOX --> CLAUDE --> PATCH
+ PATCH --> VERIFY
+ VERIFY --> FRESH --> OUT
+```
+
+## Benchmark harness layer (private)
+
+Each SWE dataset family registers a harness under [`harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses):
+
+| Registry key | Class | Notes |
+| --- | --- | --- |
+| `swe-bench` | `SweBenchHarness("swe-bench")` | Uses upstream `swebench` `make_test_spec` + `get_logs_eval` |
+| `swe-bench-multilingual` | `SweBenchHarness("swe-bench-multilingual")` | Same class, different family name |
+| `swe-bench-ext` | `SweBenchExtHarness` | Extended / fuzzy parsers |
+| `swe-rebench` | `SweRebenchHarness` | SWE-rebench family |
+| `r2e-gym` | `R2EGymHarness` | R2E-Gym |
+| `nv-internal-1` | `NVInternalHarness` | Internal NV format |
+
+The harness contract ([`harness.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harness.py)) splits provisioning from grading:
+
+- **Agent-visible:** `build_spec`, `supports_provider`, `materialize`
+- **Verifier-only:** `reset_repo`, `run_eval`, `grade` (called only from `verify_task`)
+
+For official SWE-bench instances, grading delegates to the external [`swebench`](https://github.com/SWE-bench/SWE-bench) package — Gym runs the official per-instance `eval_script` in the sandbox and parses logs with `swebench.harness.grading.get_logs_eval`.
+
+## Dataset format
+
+Each JSONL row needs SWE instance metadata in **`verifier_metadata`** (and typically mirrored in `responses_create_params.metadata`):
+
+| Field | Purpose |
+| --- | --- |
+| `instance_id` | SWE-bench instance key (e.g. `django__django-13741`) |
+| `dataset_name` | HuggingFace dataset id (selects harness family) |
+| `split` | Usually `test` |
+| `problem_statement` | User message / issue text for the agent |
+| `instance_dict` | Full SWE-bench instance record (JSON string or object) — required for faithful grading |
+
+Optional per-row `container_formatter` overrides the server default image template.
+
+### Prepare SWE-bench Verified rows
+
+```bash
+python resources_servers/swe_bench/prepare.py --limit 5 --no-images
+```
+
+This writes `resources_servers/swe_bench/data/swebench_verified.jsonl`. Use `--no-images` for dataset-only smoke tests; full eval needs Docker images `swebench/sweb.eval.x86_64.{tag}` (see `prepare.py` for tag normalization: `__` → `_1776_`, lowercased).
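The tag normalization mentioned above (`__` to `_1776_`, then lowercase) can be sketched as follows. The function names are hypothetical and the actual `prepare.py` may differ in detail; the template string is the one quoted in the text.

```python
def normalize_tag(instance_id: str) -> str:
    """Apply the two documented steps: replace '__' with '_1776_', then lowercase."""
    return instance_id.replace("__", "_1776_").lower()


def image_name(instance_id: str, template: str = "swebench/sweb.eval.x86_64.{tag}") -> str:
    """Fill the image template with the normalized tag."""
    return template.format(tag=normalize_tag(instance_id))


print(image_name("django__django-13741"))
# swebench/sweb.eval.x86_64.django_1776_django-13741
```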
+
+## Configuration
+
+Server config: [`resources_servers/swe_bench/configs/swe_bench.yaml`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/configs/swe_bench.yaml)
+
+```yaml
+swe_bench:
+  resources_servers:
+    swe_bench:
+      sandbox_provider:
+        docker: {}
+      container_formatter: swebench/sweb.eval.x86_64.{instance_id}
+      eval_timeout_s: 1800
+      flat_eval: true
+      default_topology: agent_in_env
+
+claude_code_swe_bench:
+  responses_api_agents:
+    claude_code_agent:
+      resources_server:
+        type: resources_servers
+        name: swe_bench
+      sandbox_provider:
+        docker: {}
+      in_box_timeout_s: 1800
+      bare: true
+```
+
+Key knobs:
+
+| Config field | Effect |
+| --- | --- |
+| `sandbox_provider` | Passed to `verify_task` and agent in-box binding |
+| `container_formatter` | Docker image template for instance sandboxes |
+| `flat_eval` | Host-side grading (runs on any exec-capable provider) |
+| `default_topology` | Returned from `seed_session` (`agent_in_env` for topology C) |
+| `in_box_timeout_s` | Agent-side Claude run timeout inside the sandbox |
+
+## Quickstart: evaluation rollouts
+
+**1. Install and test the server** (unit tests use a fake sandbox — no Docker required):
+
+```bash
+gym env test --resources-server swe_bench
+```
+
+**2. Start servers** (requires an Anthropic API key for Claude Code):
+
+```bash
+gym env start \
+ --resources-server swe_bench \
+ --agent claude_code_swe_bench \
+ --model-type openai_model
+```
+
+**3. Run rollouts** on prepared JSONL:
+
+```bash
+gym eval run --no-serve --agent claude_code_swe_bench \
+ --input resources_servers/swe_bench/data/swebench_verified.jsonl \
+ --output results/swe_bench_rollouts.jsonl
+```
+
+The agent passes **`verifier_metadata.model_patch`** (unified diff) on `POST /verify`. The server returns `reward` ∈ `{0.0, 1.0}`, plus `resolved`, `patch_exists`, and optional `error_kind` / `mask_sample` for infra failures.
+
+## Hermetic verify
+
+`verify` **never** reuses the agent's working sandbox. `verify_task`:
+
+1. Selects the harness for `task.benchmark`
+2. Acquires a **fresh** sandbox via `acquire_sandbox` (always torn down afterwards)
+3. Runs `reset_repo` → `materialize(model_patch)` → `run_eval` → `grade`
+4. Maps the report to reward (`1.0` if resolved and no `error_kind`)
+
+This mirrors SWE-bench's separation between “agent edits” and “official eval script in a clean tree,” and prevents agent artifacts from affecting the score.
+
+If the submission carries no patch, `verify` short-circuits with `patch_exists=false`, `resolved=false`, and `reward=0.0`, without spinning up an eval sandbox.
+
+[`responses_api_agents/swe_agents`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/swe_agents) still shells out to `swebench.harness.run_local_evaluation` inside Apptainer-oriented rollouts. That path bundles the agent and grading together. **`swe_bench` + `claude_code_agent`** is the composable replacement: one Environment resources server wired to many agent servers via the descriptor contract, without per-agent SWE wrappers.
+
+## Module map
+
+```text
+resources_servers/swe_bench/
+├── app.py            # HTTP: seed_session → SessionDescriptor, verify
+├── task.py           # First-class Task (SweTask, TaskPublic, parse helpers)
+├── session.py        # SessionDescriptor wire models
+├── harness.py        # SweTaskHarness ABC, registry, compute_resolved
+├── harnesses/        # Per-family grading plugins
+├── verify_task.py    # Fresh-sandbox grading orchestrator
+├── sandbox.py        # AsyncSweEnvironment + acquire_sandbox
+├── prepare.py        # HF dataset → Gym JSONL
+└── configs/swe_bench.yaml
+```
+
+## Key takeaways
+
+1. **`swe_bench` is the Environment** — it owns benchmark authority, not the agent server.
+2. **`seed_session` returns a descriptor**, not opaque session state — agents bind sandboxes from `placement` + `sandbox.spec`.
+3. **Topology C** runs Claude inside the instance image; **verify** always uses a hermetic twin sandbox.
+4. **`harnesses/`** are benchmark eval plugins aligned with upstream `swebench.harness` — distinct from Gym “agent harness” orchestration.
+5. **Any agent server** that implements `/run` → `seed_session` → work → `verify` with `model_patch` can plug in; no SWE-specific wrapper required.
+
+## Related docs
+
+- [Claude Code Agent — Protocol Stack](/infrastructure/engineering-notes/claude-code-agent-protocol-stack) — Responses API, `/v1/messages`, and rollout data contracts
+- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — training-scale Apptainer topology
+- [Real-World Environment tutorial](/environment-tutorials/real-world-environment/resources-server-implementation) — `seed_session` / `verify` patterns for resources servers
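The hermetic verify flow described in the page above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `verify_task.py`: `Report`, `FakeSandbox`, and `verify_sketch` are hypothetical names, and treating a blank `model_patch` as the short-circuit condition is an assumption.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Report:
    """Hypothetical stand-in for the graded report."""

    resolved: bool = False
    patch_exists: bool = False
    error_kind: str | None = None


class FakeSandbox:
    """Hypothetical sandbox that only records whether it was torn down."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def verify_sketch(
    model_patch: str,
    make_sandbox: Callable[[], FakeSandbox],
    grade: Callable[[FakeSandbox, str], Awaitable[Report]],
) -> tuple[float, Report]:
    # Assumed short-circuit: no patch means no eval sandbox is created.
    if not model_patch.strip():
        return 0.0, Report(patch_exists=False)
    # Always grade in a fresh sandbox, never the agent's working one.
    sandbox = make_sandbox()
    try:
        report = await grade(sandbox, model_patch)
    finally:
        # Teardown runs even when grading raises.
        await sandbox.close()
    # Reward is 1.0 only for a resolved task with no infra error.
    reward = 1.0 if report.resolved and report.error_kind is None else 0.0
    return reward, report
```

The `grade` callable stands in for the `reset_repo` → `materialize` → `run_eval` → `grade` chain; keeping it injectable is only a convenience for the sketch.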
diff --git a/nemo_gym/sandbox/providers/apptainer/provider.py b/nemo_gym/sandbox/providers/apptainer/provider.py
index 0605b1a805..bc2b70c499 100644
--- a/nemo_gym/sandbox/providers/apptainer/provider.py
+++ b/nemo_gym/sandbox/providers/apptainer/provider.py
@@ -148,6 +148,7 @@ class _ApptainerInstance:
     mount_point: str  # where the folder shows up inside
     image: str  # what it was built from
     env: dict[str, str] = field(default_factory=dict)
+    overlay_dir: Path | None = None  # per-instance disk overlay (cleaned on close)
 
 
 def _resource_flags(resources: SandboxResources) -> list[str]:
@@ -386,6 +387,14 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
         for key, value in spec.env.items():
             argv += ["--env", f"{key}={value}"]
         start_args = list(self._create_config.extra_start_args)
+        # --writable-tmpfs caps the writable layer at apptainer's `sessiondir max size`
+        # (default 64 MiB), which ENOSPCs for repos that rebuild on apply/eval (e.g. astropy's
+        # C extensions). Swap it for a per-instance DISK-backed overlay (bounded by host disk).
+        overlay_dir: Path | None = None
+        if "--writable-tmpfs" in start_args:
+            start_args = [a for a in start_args if a != "--writable-tmpfs"]
+            overlay_dir = Path(tempfile.mkdtemp(prefix="nemo-gym-apptainer-ovl-"))
+            start_args += ["--overlay", str(overlay_dir)]
         resource_limit_flags = _resource_limit_flags(spec.resources)
         if resource_limit_flags and self._create_config.apply_resource_limits:
             if "--fakeroot" in start_args:
@@ -403,9 +412,13 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
             code, _out, err = await self._run(argv, timeout_s=self._create_config.start_timeout_s, daemonize=True)
         except TimeoutError as e:
             shutil.rmtree(staging_dir, ignore_errors=True)
+            if overlay_dir:
+                shutil.rmtree(overlay_dir, ignore_errors=True)
             raise ApptainerCreateError(f"apptainer instance start timed out for image={image!r}: {e}") from e
         if code != 0:
             shutil.rmtree(staging_dir, ignore_errors=True)
+            if overlay_dir:
+                shutil.rmtree(overlay_dir, ignore_errors=True)
             raise ApptainerCreateError(
                 f"apptainer instance start failed (code={code}) for image={image!r}: {err.strip()}"
             )
@@ -420,6 +433,7 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
mount_point=mount_point,
image=image,
env=dict(spec.env),
+ overlay_dir=overlay_dir,
),
)
@@ -485,6 +499,8 @@ async def _cleanup_failed_create_handle(self, handle: SandboxHandle) -> None:
             timeout_s=self._exec_config.default_timeout_s,
         )
         shutil.rmtree(inst.staging_dir, ignore_errors=True)
+        if inst.overlay_dir:
+            shutil.rmtree(inst.overlay_dir, ignore_errors=True)
 
     async def exec(
         self,
@@ -667,6 +683,8 @@ async def close(self, handle: SandboxHandle) -> None:
shutil.rmtree(inst.staging_dir, ignore_errors=False)
except OSError as e:
LOGGER.warning("failed to remove staging dir %s: %s", inst.staging_dir, e)
+ if inst.overlay_dir:
+ shutil.rmtree(inst.overlay_dir, ignore_errors=True)
if stop_error is not None:
raise stop_error
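The flag swap in the hunks above can be isolated into a small pure function. This is a sketch assuming the same behavior as the patch (drop `--writable-tmpfs`, append `--overlay <dir>`); the name `swap_tmpfs_for_overlay` and the directory prefix are hypothetical and do not appear in the provider.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def swap_tmpfs_for_overlay(start_args: list[str]) -> tuple[list[str], Path | None]:
    """Replace --writable-tmpfs with a disk-backed --overlay directory.

    Returns the new argument list and the overlay directory to remove on
    teardown (None when no swap happened).
    """
    if "--writable-tmpfs" not in start_args:
        return list(start_args), None
    # Hypothetical prefix; the patch uses its own.
    overlay_dir = Path(tempfile.mkdtemp(prefix="overlay-sketch-"))
    new_args = [a for a in start_args if a != "--writable-tmpfs"]
    new_args += ["--overlay", str(overlay_dir)]
    return new_args, overlay_dir


if __name__ == "__main__":
    args, overlay = swap_tmpfs_for_overlay(["--fakeroot", "--writable-tmpfs"])
    print(args)
    if overlay is not None:
        # The caller owns cleanup, mirroring the rmtree calls in the patch.
        shutil.rmtree(overlay, ignore_errors=True)
```

Returning the directory alongside the arguments keeps ownership explicit: every failure path and the normal `close` path need to remove it, which is what the added `rmtree` calls in the diff do.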
diff --git a/nemo_gym/sandbox/providers/docker/__init__.py b/nemo_gym/sandbox/providers/docker/__init__.py
new file mode 100644
index 0000000000..a339158b99
--- /dev/null
+++ b/nemo_gym/sandbox/providers/docker/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Docker sandbox provider package."""
+
+from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider
+
+
+__all__ = ["DockerSandboxProvider"]
diff --git a/nemo_gym/sandbox/providers/docker/provider.py b/nemo_gym/sandbox/providers/docker/provider.py
new file mode 100644
index 0000000000..7af8fe6ffa
--- /dev/null
+++ b/nemo_gym/sandbox/providers/docker/provider.py
@@ -0,0 +1,324 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Local Docker-backed ``SandboxProvider`` implementation.
+
+Implements the ``nemo_gym.sandbox`` provider Protocol via the ``docker`` CLI so
+SWE environments can be provisioned and graded on any machine with Docker
+installed, making end-to-end SWE-bench verification runnable on a single
+workstation.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import posixpath
+import shlex
+import uuid
+from collections.abc import Mapping
+from pathlib import Path
+from typing import Any
+
+from nemo_gym.sandbox import (
+ SandboxCreateError,
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxResources,
+ SandboxSpec,
+ SandboxStatus,
+)
+
+
+class DockerSandboxProvider:
+    """Run sandboxes as long-lived Docker containers via the ``docker`` CLI."""
+
+    name = "docker"
+
+    def __init__(
+        self,
+        *,
+        docker_bin: str = "docker",
+        default_user: str | int | None = None,
+        network: str | None = None,
+        run_args: list[str] | None = None,
+        keep_alive_command: str = "sleep infinity",
+        concurrency: int = 32,
+        **_: Any,
+    ) -> None:
+        """Configure the Docker sandbox provider.
+
+        Args:
+            docker_bin: Name or path of the ``docker`` executable to invoke.
+            default_user: Default user (name or UID) to run ``exec`` commands as
+                when no per-call user is given; None leaves the image default.
+            network: Docker network to attach containers to; None uses the
+                Docker default.
+            run_args: Extra arguments appended to every ``docker run``
+                invocation.
+            keep_alive_command: Shell command passed to ``bash -c`` as the
+                container's command to keep it alive for subsequent ``exec`` calls.
+            concurrency: Maximum number of concurrent ``docker`` CLI subprocesses,
+                bounded by a shared semaphore (matches the apptainer provider).
+            **_: Additional keyword arguments are accepted and ignored.
+
+        Raises:
+            ValueError: If ``concurrency`` is less than 1.
+        """
+        if concurrency < 1:
+            raise ValueError("concurrency must be >= 1")
+        self._bin = docker_bin
+        self._default_user = default_user
+        self._network = network
+        self._run_args = list(run_args or [])
+        self._keep_alive = keep_alive_command
+        self._semaphore = asyncio.Semaphore(concurrency)
+
+    async def _run(self, *args: str, timeout_s: int | float | None = None) -> tuple[int, str, str]:
+        """Run the ``docker`` CLI with the given arguments and capture output.
+
+        Concurrency is bounded by the provider's shared semaphore so a busy SWE hot
+        path (one sandbox per rollout, many ``exec`` each) cannot spawn unbounded
+        ``docker`` subprocesses.
+
+        Args:
+            *args: Arguments passed to the ``docker`` executable.
+            timeout_s: Optional timeout in seconds; the process is killed and the
+                timeout error re-raised if it is exceeded.
+
+        Returns:
+            A tuple of ``(return_code, stdout, stderr)`` with output decoded as
+            text using ``errors="replace"``.
+        """
+        async with self._semaphore:
+            proc = await asyncio.create_subprocess_exec(
+                self._bin,
+                *args,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            try:
+                out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
+            except (asyncio.TimeoutError, TimeoutError):
+                proc.kill()
+                await proc.wait()
+                raise
+            return (
+                proc.returncode if proc.returncode is not None else -1,
+                out.decode(errors="replace"),
+                err.decode(errors="replace"),
+            )
+
+    @staticmethod
+    def _resources(spec: SandboxSpec) -> SandboxResources:
+        """Coerce a spec's resource request into a ``SandboxResources``.
+
+        Args:
+            spec: Sandbox spec whose ``resources`` field is a
+                ``SandboxResources`` or a mapping.
+
+        Returns:
+            The spec's ``SandboxResources`` if already one, otherwise a
+            ``SandboxResources`` built from the mapping (or empty defaults).
+        """
+        if isinstance(spec.resources, SandboxResources):
+            return spec.resources
+        return SandboxResources.from_mapping(spec.resources if isinstance(spec.resources, Mapping) else {})
+
+    async def create(self, spec: SandboxSpec) -> SandboxHandle:
+        """Start a detached container and return a handle to it.
+
+        Applies resource limits, network, working directory, environment, and
+        extra run args from the spec, then launches the image running the
+        keep-alive command so the container persists for later ``exec`` calls.
+
+        Args:
+            spec: Sandbox spec describing the image, resources, workdir, env, and
+                readiness timeout.
+
+        Returns:
+            A ``SandboxHandle`` whose ``sandbox_id`` is the container id.
+
+        Raises:
+            SandboxCreateError: If no image is given, ``docker run`` times out or
+                fails, or no container id is returned.
+        """
+        if not spec.image:
+            raise SandboxCreateError("DockerSandboxProvider requires spec.image")
+        # Pre-assign a unique name so a container the daemon may have started can still be reaped
+        # if the CLI client dies (e.g. on timeout) before we capture its id (mirrors apptainer's
+        # uuid-named instances).
+        name = f"nemo-gym-{uuid.uuid4().hex}"
+        args = ["run", "-d", "--init", "--name", name]
+        if self._network:
+            args += ["--network", self._network]
+        res = self._resources(spec)
+        if res.memory_mib:
+            args.append(f"--memory={int(res.memory_mib)}m")
+        if res.cpu:
+            args.append(f"--cpus={res.cpu}")
+        if res.gpu:
+            args.append("--gpus=all")
+        if spec.workdir:
+            args += ["-w", spec.workdir]
+        for key, value in (spec.env or {}).items():
+            args += ["-e", f"{key}={value}"]
+        args += self._run_args
+        args += [spec.image, "bash", "-c", self._keep_alive]
+        try:
+            rc, out, err = await self._run(*args, timeout_s=spec.ready_timeout_s or 600)
+        except (asyncio.TimeoutError, TimeoutError) as exc:
+            await self._reap_orphan(name)
+            raise SandboxCreateError(f"docker run timed out for image {spec.image!r}") from exc
+        if rc != 0:
+            await self._reap_orphan(name)
+            raise SandboxCreateError(f"docker run failed (rc={rc}) for {spec.image!r}: {err.strip() or out.strip()}")
+        lines = out.strip().splitlines()
+        container_id = lines[-1].strip() if lines else ""
+        if not container_id:
+            await self._reap_orphan(name)
+            raise SandboxCreateError("docker run did not return a container id")
+        return SandboxHandle(
+            sandbox_id=container_id,
+            provider_name=self.name,
+            raw={"image": spec.image, "workdir": spec.workdir},
+        )
+
+    async def _reap_orphan(self, name: str) -> None:
+        """Best-effort force-remove a container by its pre-assigned name.
+
+        Used to clean up a ``docker run`` that may have started a container on the daemon even
+        though the CLI client failed (timeout / non-zero rc / no id returned) before a handle was
+        captured. Swallows all errors and bounds itself with a short timeout; a missing or
+        already-gone container is fine.
+
+        Args:
+            name: The pre-assigned ``--name`` of the container to remove.
+        """
+        try:
+            await self._run("rm", "-f", name, timeout_s=30)
+        except Exception:
+            pass
+
+    async def exec(
+        self,
+        handle: SandboxHandle,
+        command: str,
+        *,
+        cwd: str | None = None,
+        env: dict[str, str] | None = None,
+        timeout_s: int | float | None = None,
+        user: str | int | None = None,
+    ) -> SandboxExecResult:
+        """Run a shell command inside the container.
+
+        Args:
+            handle: Handle identifying the target container.
+            command: Shell command executed via ``bash -c``.
+            cwd: Working directory for the command; falls back to the workdir
+                recorded at create time.
+            env: Extra environment variables for the command.
+            timeout_s: Optional timeout in seconds; on expiry a result with
+                return code 124 and ``error_type="timeout"`` is returned.
+            user: User (name or UID) to run as; falls back to the provider's
+                default user.
+
+        Returns:
+            A ``SandboxExecResult`` with stdout, stderr, return code, and an
+            ``error_type`` of ``"sandbox"`` for docker-level failures (125/126/
+            127 with no stdout), ``"timeout"`` on timeout, or None otherwise.
+        """
+        args = ["exec"]
+        workdir = cwd or handle.raw.get("workdir")
+        if workdir:
+            args += ["-w", workdir]
+        eff_user = user if user is not None else self._default_user
+        if eff_user is not None:
+            args += ["-u", str(eff_user)]
+        for key, value in (env or {}).items():
+            args += ["-e", f"{key}={value}"]
+        args += [handle.sandbox_id, "bash", "-c", command]
+        try:
+            rc, out, err = await self._run(*args, timeout_s=timeout_s)
+        except (asyncio.TimeoutError, TimeoutError):
+            return SandboxExecResult(
+                stdout=None,
+                stderr=f"command timed out after {timeout_s}s",
+                return_code=124,
+                error_type="timeout",
+            )
+        # docker exec returns 125/126/127 for docker-level failures (container gone, not executable).
+        error_type = "sandbox" if rc in (125, 126, 127) and not out else None
+        return SandboxExecResult(stdout=out, stderr=err, return_code=rc, error_type=error_type)
+
+    async def upload_file(self, handle: SandboxHandle, source_path: Path, target_path: str) -> None:
+        """Copy a host file into the container, creating parent dirs as needed.
+
+        Args:
+            handle: Handle identifying the target container.
+            source_path: Path to the file on the host.
+            target_path: Destination path inside the container.
+
+        Raises:
+            RuntimeError: If the ``docker cp`` upload fails.
+        """
+        parent = posixpath.dirname(target_path)
+        if parent:
+            await self.exec(handle, f"mkdir -p {shlex.quote(parent)}")
+        rc, out, err = await self._run("cp", str(source_path), f"{handle.sandbox_id}:{target_path}")
+        if rc != 0:
+            raise RuntimeError(f"docker cp upload failed: {err.strip() or out.strip()}")
+
+    async def download_file(self, handle: SandboxHandle, source_path: str, target_path: Path) -> None:
+        """Copy a file out of the container to the host.
+
+        Args:
+            handle: Handle identifying the source container.
+            source_path: Path to the file inside the container.
+            target_path: Destination path on the host; parent dirs are created.
+
+        Raises:
+            RuntimeError: If the ``docker cp`` download fails.
+        """
+        target = Path(target_path)
+        target.parent.mkdir(parents=True, exist_ok=True)
+        rc, out, err = await self._run("cp", f"{handle.sandbox_id}:{source_path}", str(target))
+        if rc != 0:
+            raise RuntimeError(f"docker cp download failed: {err.strip() or out.strip()}")
+
+    async def status(self, handle: SandboxHandle) -> SandboxStatus:
+        """Report whether the container is running.
+
+        Args:
+            handle: Handle identifying the container to inspect.
+
+        Returns:
+            ``RUNNING`` or ``STOPPED`` based on the container's running state,
+            or ``UNKNOWN`` if the inspect command fails.
+        """
+        rc, out, _ = await self._run("inspect", "-f", "{{.State.Running}}", handle.sandbox_id)
+        if rc != 0:
+            return SandboxStatus.UNKNOWN
+        return SandboxStatus.RUNNING if out.strip() == "true" else SandboxStatus.STOPPED
+
+    async def close(self, handle: SandboxHandle) -> None:
+        """Force-remove the container.
+
+        Args:
+            handle: Handle identifying the container to remove.
+        """
+        await self._run("rm", "-f", handle.sandbox_id)
+
+    async def aclose(self) -> None:
+        """Release provider-level resources; this provider holds none."""
+        return None
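The exit-code heuristic used by `exec` above can be looked at in isolation. The sketch below reproduces the same rule as a standalone function; `classify_exec` is a hypothetical name, not part of the provider. One caveat worth keeping in mind: `bash` itself exits with 127 when a command is not found and 126 when it is found but not executable, so a failing command inside a healthy container that prints nothing to stdout would also be labeled `"sandbox"` under this rule.

```python
from __future__ import annotations


def classify_exec(return_code: int, stdout: str) -> str | None:
    """Mirror the provider's rule: 125/126/127 with empty stdout means 'sandbox'."""
    if return_code in (125, 126, 127) and not stdout:
        return "sandbox"
    return None


if __name__ == "__main__":
    # A vanished container and a missing binary inside a healthy container
    # are indistinguishable here when stdout is empty.
    print(classify_exec(125, ""))
    print(classify_exec(127, ""))
    print(classify_exec(127, "partial output"))
    print(classify_exec(1, ""))
```

If the distinction matters for masking samples, inspecting stderr or checking container status after such an exit code may be a more reliable signal; that is a suggestion, not something the patch does.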
diff --git a/nemo_gym/sandbox/providers/registry.py b/nemo_gym/sandbox/providers/registry.py
index 8c4e39e577..451056d470 100644
--- a/nemo_gym/sandbox/providers/registry.py
+++ b/nemo_gym/sandbox/providers/registry.py
@@ -75,11 +75,18 @@ def _load_opensandbox_provider() -> ProviderClass:
     return OpenSandboxProvider
 
 
+def _load_docker_provider() -> ProviderClass:
+    from nemo_gym.sandbox.providers.docker import DockerSandboxProvider
+
+    return DockerSandboxProvider
+
+
 def _load_apptainer_provider() -> ProviderClass:
     from nemo_gym.sandbox.providers.apptainer import ApptainerProvider
 
     return ApptainerProvider
 
 
-_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider
 _BUILTIN_PROVIDER_LOADERS["opensandbox"] = _load_opensandbox_provider
+_BUILTIN_PROVIDER_LOADERS["docker"] = _load_docker_provider
+_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider
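The registry change above follows a lazy-loader pattern: each entry is a zero-argument function that performs the import only when the provider is requested. A minimal sketch of that pattern, using a standard-library class as a stand-in provider, might look like this. The names `_LOADERS`, `register_loader`, and `load_provider` are hypothetical and are not the registry's real API.

```python
from typing import Callable

# Maps a provider name to a function that imports and returns its class.
_LOADERS: dict[str, Callable[[], type]] = {}


def register_loader(name: str, loader: Callable[[], type]) -> None:
    """Register a deferred loader under a provider name."""
    _LOADERS[name] = loader


def load_provider(name: str) -> type:
    """Resolve a provider class, importing its module only on first use."""
    try:
        loader = _LOADERS[name]
    except KeyError:
        raise KeyError(f"unknown provider {name!r}; known: {sorted(_LOADERS)}") from None
    return loader()


def _load_ordered_dict() -> type:
    # Stand-in for a heavy provider import; deferred until requested.
    from collections import OrderedDict

    return OrderedDict


register_loader("ordered", _load_ordered_dict)
```

Deferring the import keeps optional dependencies out of the import path for users who never select that provider, which is presumably why the Docker loader is written the same way.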
diff --git a/resources_servers/swe_bench/README.md b/resources_servers/swe_bench/README.md
new file mode 100644
index 0000000000..b60972b80a
--- /dev/null
+++ b/resources_servers/swe_bench/README.md
@@ -0,0 +1,51 @@
+# swe_bench resources server
+
+SWE-bench **Environment** resources server: `seed_session` returns a `SessionDescriptor` (topology **C**, per-instance sandbox spec); `verify` grades a model patch in a **fresh** eval sandbox (hermetic twin).
+
+Grading eval harnesses, parsing, and `verify_task` live as **private modules** under this directory (relocated from `responses_api_agents/swe_env/`).
+
+Key modules:
+
+- `task.py` — first-class **Task** (`SweTask`, `TaskPublic`, parse helpers)
+- `session.py` — **SessionDescriptor** returned from `seed_session`
+- `app.py` — thin HTTP surface (`seed_session`, `verify`)
+- `harness.py` / `harnesses/` — benchmark-family grading plugins
+
+## Wiring
+
+```yaml
+responses_api_agents:
+  claude_code_agent:
+    resources_server:
+      type: resources_servers
+      name: swe_bench
+```
+
+## Tests
+
+```bash
+gym env test --resources-server swe_bench
+```
+
+Unit tests use a fake sandbox provider (no Docker required).
+
+## Dataset
+
+Prepare SWE-bench Verified rows with `verifier_metadata` (see `prepare.py`):
+
+```bash
+python resources_servers/swe_bench/prepare.py --limit 5 --no-images
+```
+
+Each JSONL row includes `verifier_metadata.instance_id`, `instance_dict`, `dataset_name`, and optional `container_formatter`.
+
+## Rollouts
+
+```bash
+gym env start --resources-server swe_bench --agent claude_code_swe_bench --model-type openai_model
+gym eval run --no-serve --agent claude_code_swe_bench \
+ --input resources_servers/swe_bench/data/swebench_verified.jsonl \
+ --output results/swe_bench_rollouts.jsonl
+```
+
+Agent servers pass `verifier_metadata.model_patch` (git unified diff) on `POST /verify`.
diff --git a/resources_servers/swe_bench/__init__.py b/resources_servers/swe_bench/__init__.py
new file mode 100644
index 0000000000..ffd5d25501
--- /dev/null
+++ b/resources_servers/swe_bench/__init__.py
@@ -0,0 +1,58 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE-bench Environment resources server modules.
+
+Grading harnesses, parsing, and verify_task implement the Environment MDP
+authority. Agent servers connect via HTTP ``seed_session`` / ``verify`` only.
+"""
+
+from resources_servers.swe_bench.harness import (
+    EvalArtifacts,
+    SweEvalReport,
+    SweTaskHarness,
+    compute_resolved,
+    get_harness,
+    list_harnesses,
+    register_harness,
+    reward_from_report,
+)
+from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+from resources_servers.swe_bench.session import SessionDescriptor
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    TaskPublic,
+    TaskSubmission,
+    parse_task_from_request,
+)
+
+
+__all__ = [
+    "AsyncSweEnvironment",
+    "ENVIRONMENT_NAME",
+    "EvalArtifacts",
+    "SessionDescriptor",
+    "SweEvalReport",
+    "SweTask",
+    "SweTaskHarness",
+    "TaskPublic",
+    "TaskSubmission",
+    "compute_resolved",
+    "get_harness",
+    "list_harnesses",
+    "parse_task_from_request",
+    "register_harness",
+    "reward_from_report",
+]
diff --git a/resources_servers/swe_bench/app.py b/resources_servers/swe_bench/app.py
new file mode 100644
index 0000000000..214189c99e
--- /dev/null
+++ b/resources_servers/swe_bench/app.py
@@ -0,0 +1,113 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""SWE-bench Environment resources server."""
+
+from __future__ import annotations
+
+import dataclasses
+from typing import Any, Literal
+
+from pydantic import Field
+
+import resources_servers.swe_bench.harnesses  # noqa: F401
+from nemo_gym.base_resources_server import (
+    BaseResourcesServerConfig,
+    SimpleResourcesServer,
+)
+from nemo_gym.sandbox import SandboxSpec
+from resources_servers.swe_bench.harness import get_harness
+from resources_servers.swe_bench.session import (
+    EgressDescriptor,
+    PlacementDescriptor,
+    SandboxDescriptor,
+    SessionDescriptor,
+    SweBenchSeedSessionRequest,
+    SweBenchVerifyRequest,
+    SweBenchVerifyResponse,
+)
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    parse_submission,
+    parse_task_from_request,
+)
+from resources_servers.swe_bench.verify_task import report_to_reward, verify_task
+
+
+Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"]
+
+
+class SweBenchResourcesServerConfig(BaseResourcesServerConfig):
+    sandbox_provider: dict[str, Any] = Field(default_factory=lambda: {"docker": {}})
+    container_formatter: str = "swebench/sweb.eval.x86_64.{instance_id}"
+    eval_timeout_s: float = 1800.0
+    flat_eval: bool = True
+    default_topology: Topology = "agent_in_env"
+
+
+def _spec_to_dict(spec: SandboxSpec) -> dict[str, Any]:
+    payload = dataclasses.asdict(spec)
+    resources = payload.get("resources")
+    if resources is not None and hasattr(resources, "__dataclass_fields__"):
+        payload["resources"] = dataclasses.asdict(resources)
+    return payload
+
+
+class SweBenchResourcesServer(SimpleResourcesServer):
+    config: SweBenchResourcesServerConfig
+
+    def _parse_task(self, body: SweBenchSeedSessionRequest | SweBenchVerifyRequest) -> SweTask:
+        return parse_task_from_request(
+            body,
+            container_formatter=self.config.container_formatter,
+            flat_eval=self.config.flat_eval,
+            environment=ENVIRONMENT_NAME,
+        )
+
+    async def seed_session(self, body: SweBenchSeedSessionRequest) -> SessionDescriptor:
+        task = self._parse_task(body)
+        harness = get_harness(task.harness_family)
+        if self.config.flat_eval and hasattr(harness, "with_flat_eval"):
+            harness = harness.with_flat_eval()
+        spec = harness.build_spec(task)
+
+        verifier_metadata = task.privileged_verifier_metadata(flat_eval=self.config.flat_eval)
+        if body.verifier_metadata:
+            verifier_metadata = {**body.verifier_metadata, **verifier_metadata}
+
+        return SessionDescriptor(
+            environment=ENVIRONMENT_NAME,
+            task=task.public_view(environment=ENVIRONMENT_NAME),
+            placement=PlacementDescriptor(topology=self.config.default_topology),
+            sandbox=SandboxDescriptor(spec=_spec_to_dict(spec)),
+            egress=EgressDescriptor(env={}),
+            verifier_metadata=verifier_metadata,
+        )
+
+    async def verify(self, body: SweBenchVerifyRequest) -> SweBenchVerifyResponse:
+        task = self._parse_task(body)
+        task = task.with_submission(parse_submission(body.verifier_metadata))
+
+        report = await verify_task(
+            self.config.sandbox_provider,
+            task,
+            eval_timeout_s=self.config.eval_timeout_s,
+        )
+        reward = report_to_reward(report)
+        masked = report.error_kind is not None
+
+        return SweBenchVerifyResponse(
+            **body.model_dump(),
+            task_id=task.task_id,
+            environment=ENVIRONMENT_NAME,
+            reward=reward,
+            resolved=report.resolved,
+            patch_exists=report.patch_exists,
+            mask_sample=masked,
+            error_kind=report.error_kind,
+        )
+
+
+if __name__ == "__main__":
+    SweBenchResourcesServer.run_webserver()
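In `seed_session` above, the merge `{**body.verifier_metadata, **verifier_metadata}` gives the server-derived values precedence over whatever the client sent, while keeping any extra client keys. The precedence rule can be checked with a small sketch; `merge_verifier_metadata` is a hypothetical helper name, and the sample keys are illustrative only.

```python
from __future__ import annotations

from typing import Any


def merge_verifier_metadata(
    client: dict[str, Any] | None,
    server: dict[str, Any],
) -> dict[str, Any]:
    """Merge metadata so that server-derived keys override client-sent ones."""
    # Later unpacking wins: the server dict is unpacked last.
    return {**(client or {}), **server}


if __name__ == "__main__":
    client = {"instance_id": "client-value", "note": "kept"}
    server = {"instance_id": "django__django-13741"}
    print(merge_verifier_metadata(client, server))
```

Unpacking the server dict last is what keeps the Environment authoritative over fields such as the instance identity, even when a row supplies its own copy.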
diff --git a/resources_servers/swe_bench/configs/swe_bench.yaml b/resources_servers/swe_bench/configs/swe_bench.yaml
new file mode 100644
index 0000000000..f3a60d11c2
--- /dev/null
+++ b/resources_servers/swe_bench/configs/swe_bench.yaml
@@ -0,0 +1,33 @@
+swe_bench:
+  resources_servers:
+    swe_bench:
+      entrypoint: app.py
+      domain: coding
+      verified: false
+      description: SWE-bench Environment (seed_session + hermetic verify)
+      sandbox_provider:
+        docker: {}
+      container_formatter: swebench/sweb.eval.x86_64.{instance_id}
+      eval_timeout_s: 1800
+      flat_eval: true
+      default_topology: agent_in_env
+
+claude_code_swe_bench:
+  responses_api_agents:
+    claude_code_agent:
+      entrypoint: app.py
+      resources_server:
+        type: resources_servers
+        name: swe_bench
+      concurrency: 16
+      model: claude-sonnet-4-6
+      anthropic_api_key: ${anthropic_api_key}
+      anthropic_base_url: null
+      max_turns: 30
+      timeout: 1800
+      in_box_timeout_s: 1800
+      sandbox_provider:
+        docker: {}
+      bare: true
+      mcp_config: null
+      settings: null
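To make the `container_formatter` template in the config above concrete, here is a sketch of how it might expand into an image name. It assumes the tag normalization described in the docs page (`__` → `_1776_`, then lowercased) is applied to the instance id before substitution into the `{instance_id}` placeholder; the real logic lives in `prepare.py` and the task parser and may differ. `expand_container_formatter` is a hypothetical name.

```python
def expand_container_formatter(template: str, instance_id: str) -> str:
    """Expand an image template for one SWE-bench instance.

    Assumption: the instance id is normalized ("__" -> "_1776_", lowercased)
    before being substituted for the {instance_id} placeholder.
    """
    tag = instance_id.replace("__", "_1776_").lower()
    return template.format(instance_id=tag)


if __name__ == "__main__":
    image = expand_container_formatter(
        "swebench/sweb.eval.x86_64.{instance_id}",
        "django__django-13741",
    )
    print(image)
```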
diff --git a/resources_servers/swe_bench/data/.gitignore b/resources_servers/swe_bench/data/.gitignore
new file mode 100644
index 0000000000..ac481ac55b
--- /dev/null
+++ b/resources_servers/swe_bench/data/.gitignore
@@ -0,0 +1,3 @@
+data/
+__pycache__/
+*.pyc
diff --git a/resources_servers/swe_bench/harness.py b/resources_servers/swe_bench/harness.py
new file mode 100644
index 0000000000..59467432bb
--- /dev/null
+++ b/resources_servers/swe_bench/harness.py
@@ -0,0 +1,365 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Task model and harness contract for the SWE environment library.
+
+The first-class **Task** value lives in ``task.py`` (``SweTask``). This module holds
+the harness registry and grading helpers.
+
+The harness contract is intentionally split across a trust boundary:
+
+* ``build_spec`` / ``supports_provider`` / ``materialize`` are **provisioning**
+  methods imported and called by *agents* (and the verifier).
+* ``reset_repo`` / ``run_eval`` / ``grade`` are **grading** methods used
+  **only** by the grader (``verify_task``). A test asserts agent adapters never
+  reference them.
+
+This module also holds the name->harness registry
+(``register_harness``/``get_harness``/``list_harnesses``) and the pure grading
+helpers (``compute_resolved``/``reward_from_report``), merged here so the harness
+contract, its dispatch, and its scoring live in one place.
+"""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from collections.abc import Iterable
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+from nemo_gym.sandbox import SandboxSpec
+from resources_servers.swe_bench.task import SweTask
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class GraderDependencyError(RuntimeError):
+    """A required grading dependency is unavailable for a task the harness must grade exactly.
+
+    Raised by a harness when it cannot grade an instance faithfully (e.g. ``swebench`` is
+    missing for a SWE-bench instance) and degrading to a generic parser would silently skew
+    the result. ``verify_task`` propagates this rather than swallowing it into an unmasked
+    reward-0, so a misconfigured grader fails loudly instead of quietly degrading scores.
+    """
+
+
+@dataclass
+class EvalArtifacts:
+    """Raw evaluation output retrieved from the sandbox, before grading."""
+
+    test_output: str = ""
+    return_code: int = 0
+    patch_applied: bool = False
+    raw: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class SweEvalReport:
+    """Graded result of a single task. ``error_kind`` masks a sample.
+
+    ``error_kind`` is ``None`` for a clean grade. A non-``None`` value (e.g.
+    ``"sandbox"`` / ``"eval_error"``) marks an infra failure: the sample is
+    masked via this flag and ``reward_from_report`` returns ``0.0``, **never**
+    ``None`` (the wire ``reward`` field is a non-nullable ``float``).
+    """
+
+    instance_id: str
+    resolved: bool = False
+    patch_applied: bool = False
+    patch_exists: bool = False
+    error_kind: str | None = None
+    tests_status: dict[str, Any] = field(default_factory=dict)
+
+
+class SweTaskHarness(ABC):
+    """Per-family provisioning + (server-private) grading recipe."""
+
+    #: registry key, e.g. ``"swe-bench-ext"``.
+    name: str = ""
+    #: ``"flat-host-grade"`` (parse host-side) or ``"nested-harness"`` (in-container grader).
+    grade_strategy: str = "flat-host-grade"
+
+    # --- provisioning (agent-facing + verifier) ------------------------------
+
+    @abstractmethod
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for a task.
+
+        Args:
+            task (SweTask): The task to provision a sandbox for.
+
+        Returns:
+            SandboxSpec: The spec describing image, workdir, env, ttl, and
+                provider options for the task.
+        """
+
+    def supports_provider(self, provider_name: str) -> bool:
+        """Report whether this harness can run on the named provider.
+
+        The base harness accepts every provider; flat host-graded families work on any
+        exec-capable provider.
+
+        Args:
+            provider_name (str): The name of the sandbox provider.
+
+        Returns:
+            bool: ``True`` if the provider is supported.
+        """
+        return True
+
+    def with_flat_eval(self) -> "SweTaskHarness":
+        """Return a variant that grades host-side (flat) on any exec-capable provider.
+
+ All families already grade host-side, so the base implementation returns ``self``.
+
+ Returns:
+ SweTaskHarness: A harness whose grading runs host-side.
+ """
+ return self
+
+ async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Upload the model patch and test patch into the started sandbox.
+
+ Args:
+ env (AsyncSweEnvironment): The started environment to write into.
+ task (SweTask): The task whose patches are uploaded.
+ """
+ if task.model_patch:
+ await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+ if task.test_patch:
+ await env.write_text("/root/test_patch.diff", _ensure_trailing_newline(task.test_patch))
+
+ # --- server-private grading (verifier only) ------------------------------
+
+ async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Reset the in-sandbox checkout to ``base_commit`` for hermetic grading.
+
+ Uses only ``git reset --hard``, never ``git clean -fdx``: verification
+ runs in a fresh sandbox (no agent edits to scrub), and a clean would
+ delete the image's prebuilt artifacts (compiled C extensions, installed
+ environment) and break the tests.
+
+ Args:
+ env (AsyncSweEnvironment): The started environment to reset.
+ task (SweTask): The task whose ``base_commit`` and ``repo_workdir``
+ are used.
+ """
+ if task.base_commit:
+ await env.execute(f"git reset --hard {task.base_commit}", cwd=task.repo_workdir)
+
+ @abstractmethod
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Apply the patches and run the evaluation, returning raw artifacts.
+
+ Args:
+ env (AsyncSweEnvironment): The started environment to evaluate in.
+ task (SweTask): The task being evaluated.
+
+ Returns:
+ EvalArtifacts: The raw evaluation output retrieved from the sandbox.
+ """
+
+ @abstractmethod
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Parse raw artifacts host-side into a graded report.
+
+ Args:
+ task (SweTask): The task that was evaluated.
+ artifacts (EvalArtifacts): The raw evaluation output to parse.
+
+ Returns:
+ SweEvalReport: The graded result for the task.
+ """
+
+
+def _ensure_trailing_newline(text: str) -> str:
+ """Return the text, guaranteed to end with a newline.
+
+ Args:
+ text (str): The input text.
+
+ Returns:
+ str: The text unchanged if it already ends in a newline, otherwise the
+ text with a newline appended.
+ """
+ return text if text.endswith("\n") else text + "\n"
+
+
+# --- name->harness registry ----------------------------
+
+_HARNESSES: dict[str, SweTaskHarness] = {}
+
+
+def register_harness(harness: SweTaskHarness, *, override: bool = False) -> None:
+ """Register a harness under its ``name``.
+
+ Args:
+ harness (SweTaskHarness): The harness to register. Its ``name`` must be
+ non-empty.
+ override (bool): If ``True``, replace an existing harness with the same
+ name instead of raising.
+
+ Raises:
+ ValueError: If the harness name is empty, or a harness with the same name
+ is already registered and ``override`` is ``False``.
+ """
+ if not harness.name:
+ raise ValueError("Harness must define a non-empty 'name'")
+ if not override and harness.name in _HARNESSES:
+ raise ValueError(f"Harness {harness.name!r} is already registered")
+ _HARNESSES[harness.name] = harness
+
+
+# HuggingFace dataset names don't match registry keys; map by substring (most-specific first)
+# so callers can pass a raw ``dataset_name`` (e.g. "princeton-nlp/SWE-bench_Verified").
+_HF_NAME_ALIASES: list[tuple[str, str]] = [
+ ("SWE-bench_Multilingual", "swe-bench-multilingual"),
+ ("R2E-Gym", "r2e-gym"),
+ ("SWE-rebench", "swe-rebench"),
+ ("SWE-bench", "swe-bench"),
+]
+
+
+def _ensure_registered() -> None:
+ """Lazily register the built-in harnesses if the registry is empty.
+
+ Importing ``resources_servers.swe_bench.harnesses`` registers all families, but a fresh
+ process (e.g. a Ray worker running the decoupled agent) may call ``get_harness`` before that
+ import has run. Registering on demand keeps lookups robust regardless of import order.
+ """
+ if _HARNESSES:
+ return
+ from resources_servers.swe_bench.harnesses import register_builtin_harnesses
+
+ register_builtin_harnesses()
+
+
+def get_harness(name: str) -> SweTaskHarness:
+ """Look up a harness by registry key, or by HuggingFace dataset-name substring.
+
+ Built-in harnesses are registered on first use (robust to import order). An exact key
+ match wins; otherwise a HuggingFace ``dataset_name`` substring is resolved to its key (e.g.
+ ``"princeton-nlp/SWE-bench_Verified"`` -> ``"swe-bench"``).
+
+ Args:
+ name (str): The registry key, or a HuggingFace dataset name.
+
+ Returns:
+ SweTaskHarness: The registered harness.
+
+ Raises:
+ KeyError: If no harness matches ``name``.
+ """
+ _ensure_registered()
+ if name in _HARNESSES:
+ return _HARNESSES[name]
+ for needle, key in _HF_NAME_ALIASES:
+ if needle in name and key in _HARNESSES:
+ return _HARNESSES[key]
+ available = ", ".join(sorted(_HARNESSES)) or "(none)"
+ raise KeyError(f"Unknown SWE harness {name!r}. Registered: {available}")
+
+
+def list_harnesses() -> list[str]:
+ """List the names of all registered harnesses.
+
+ Returns:
+ list[str]: The registered harness names, sorted alphabetically.
+ """
+ return sorted(_HARNESSES)
+
+
+# --- pure grading helpers -------------------------------
+
+
+def compute_resolved(
+ *,
+ fail_to_pass: Iterable[str],
+ pass_to_pass: Iterable[str],
+ passed: Iterable[str],
+ eval_type: str = "pass_and_fail",
+ status_map: dict[str, str] | None = None,
+) -> bool:
+ """Apply the SWE-bench resolution rule.
+
+ Two eval types are supported, mirroring swebench's per-repo selection
+ (``swebench.harness.grading.get_eval_report`` /
+ ``get_eval_tests_report`` + ``get_resolution_status``):
+
+ * ``"pass_and_fail"`` (default): mirrors swebench's ``check_pass_and_fail``
+ classification combined with the ratio-based ``get_resolution_status``. When a
+ ``status_map`` is supplied, each required test is a **success** when present and
+ PASSED/XFAIL (``test_passed``), a **failure** when absent or FAILED/ERROR
+ (``test_failed``), and **neutral** (excluded from both counts) for any other
+ status (e.g. SKIPPED/XPASS). A task is resolved only when there are zero
+ failures across FAIL_TO_PASS and PASS_TO_PASS (each ratio ``== 1``; an
+ all-neutral category with total ``0`` counts as ``1``). Without a
+ ``status_map`` it falls back to plain ``passed``-set membership.
+ * ``"fail_only"``: used for the JS multilingual repos in swebench's
+ ``FAIL_ONLY_REPOS`` (chartjs/Chart.js, processing/p5.js, markedjs/marked). A
+ required test counts as success **unless** it is present in ``status_map``
+ **and** its status is ``FAILED``. This mirrors swebench's ``check_fail_only``.
+
+ Args:
+ fail_to_pass (Iterable[str]): Tests that must transition from failing to
+ passing.
+ pass_to_pass (Iterable[str]): Tests that must remain passing.
+ passed (Iterable[str]): The tests that actually passed.
+ eval_type (str): ``"pass_and_fail"`` or ``"fail_only"`` (selected by the
+ caller from ``test_spec.repo``).
+ status_map (dict[str, str] | None): Full per-test status map. The
+ ``"fail_only"`` rule needs it to detect a present-and-FAILED required test
+ (without it every required test counts as a success); ``"pass_and_fail"``
+ uses it to exclude neutral-status required tests exactly as swebench does.
+
+ Returns:
+ bool: ``True`` if no required test counts as a failure under the selected
+ rule, ``False`` if there are no required tests or any required test
+ fails.
+ """
+ required = list(fail_to_pass) + list(pass_to_pass)
+ if not required:
+ return False
+ if eval_type == "fail_only":
+ sm = status_map or {}
+ # Mirror swebench's check_fail_only: a required test is a failure only when
+ # present in the status map AND explicitly FAILED; anything else is success.
+ return all(not (test in sm and sm[test] == "FAILED") for test in required)
+ if status_map is not None:
+ # Mirror swebench's check_pass_and_fail + get_resolution_status: a required
+ # test is a failure only when it is absent or its status is FAILED/ERROR;
+ # PASSED/XFAIL are successes and any other status (SKIPPED/XPASS) is neutral
+ # (excluded). Resolution requires zero failures in BOTH categories.
+ return all(not (test not in status_map or status_map[test] in ("FAILED", "ERROR")) for test in required)
+ passed_set = set(passed)
+ return all(test in passed_set for test in required)
+
+
+def reward_from_report(report: SweEvalReport) -> float:
+ """Map a graded report to a reward.
+
+ An infra or eval failure (``error_kind`` set) yields ``0.0`` and is masked
+ via the flag downstream; the result is always a ``float`` and never ``None``.
+
+ Args:
+ report (SweEvalReport): The graded result to convert.
+
+ Returns:
+ float: ``1.0`` if the task resolved with no error, otherwise ``0.0``.
+ """
+ if report.error_kind is not None:
+ return 0.0
+ return 1.0 if report.resolved else 0.0
diff --git a/resources_servers/swe_bench/harnesses/__init__.py b/resources_servers/swe_bench/harnesses/__init__.py
new file mode 100644
index 0000000000..e55c86e926
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/__init__.py
@@ -0,0 +1,64 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE dataset-family harnesses. Importing this package registers all families.
+
+Every built-in family is flat and host-graded: it runs the instance's evaluation
+inside a single sandbox, parses the output host-side, and works on any
+exec-capable provider (including docker). The registered families are
+``swe-bench-ext``, ``nv-internal-1``, ``swe-rebench``, ``swe-bench``,
+``swe-bench-multilingual``, and ``r2e-gym``. (The previously apptainer-only nested
+grading for ``swe-bench``/``swe-bench-multilingual``/``r2e-gym`` was removed when
+PR #1694 took ownership of the apptainer provider.)
+"""
+
+from resources_servers.swe_bench.harness import list_harnesses, register_harness
+from resources_servers.swe_bench.harnesses.nv_internal import NVInternalHarness
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.harnesses.swe_rebench import SweRebenchHarness
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+def register_builtin_harnesses() -> None:
+ """Register every built-in SWE dataset-family harness.
+
+ Constructs each built-in harness and registers it under its name, skipping
+ any name that is already registered so the call is safe to run more than once.
+ """
+ builtins = [
+ SweBenchExtHarness(),
+ NVInternalHarness(),
+ SweRebenchHarness(),
+ SweBenchHarness("swe-bench"),
+ SweBenchHarness("swe-bench-multilingual"),
+ R2EGymHarness(),
+ ]
+ existing = set(list_harnesses())
+ for harness in builtins:
+ if harness.name not in existing:
+ register_harness(harness)
+
+
+register_builtin_harnesses()
+
+
+__all__ = [
+ "NVInternalHarness",
+ "R2EGymHarness",
+ "SweBenchExtHarness",
+ "SweBenchHarness",
+ "SweRebenchHarness",
+ "register_builtin_harnesses",
+]
diff --git a/resources_servers/swe_bench/harnesses/flat_eval.py b/resources_servers/swe_bench/harnesses/flat_eval.py
new file mode 100644
index 0000000000..37a6a1727d
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/flat_eval.py
@@ -0,0 +1,280 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Flat (host-graded) eval-script mode for SWE dataset families.
+
+Flat mode runs an instance's eval script directly in the sandbox and parses the
+produced log host-side, computing ``resolved`` from ``FAIL_TO_PASS`` /
+``PASS_TO_PASS`` via :func:`compute_resolved`. Because there is no nested
+container, this runs on any exec-capable provider (docker / opensandbox).
+
+The eval script resets the repo, applies the gold/model patch plus the test
+patch, runs the repo's test command, and wraps the test output between two
+sentinel markers::
+
+ >>>>> Start Test Output
+ ... per-test "PASSED node_id" / "FAILED node_id" lines ...
+ >>>>> End Test Output
+
+It also emits patch-apply / reset / timeout status codes
+(``>>>>> Applied Patch`` etc.). The host-side parser in this module recognises
+these markers and per-test status tokens without importing ``swebench``, so
+grading can run in environments where that package (and its Docker
+dependencies) is absent.
+
+``flat_eval_enabled`` reports whether flat mode applies to a task, namely when the harness
+selects it or the task opts in via ``SweTask.metadata["flat_eval"]``. The verifier
+honors that per-task key by calling ``SweTaskHarness.with_flat_eval()`` — a no-op for
+the built-in families, which already grade host-side. (A previously apptainer-only
+nested grading path for swe-bench / r2e-gym was removed in PR #1694.)
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, compute_resolved
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# SWE-bench eval-log sentinels, kept here so we never import swebench at grade
+# time.
+APPLY_PATCH_FAIL = ">>>>> Patch Apply Failed"
+APPLY_PATCH_PASS = ">>>>> Applied Patch"
+RESET_FAILED = ">>>>> Reset Failed"
+TESTS_ERROR = ">>>>> Tests Errored"
+TESTS_TIMEOUT = ">>>>> Tests Timed Out"
+START_TEST_OUTPUT = ">>>>> Start Test Output"
+END_TEST_OUTPUT = ">>>>> End Test Output"
+
+# Codes that mean the harness/patch/test setup failed before tests could be
+# trusted; their presence forces an empty status map + patch_applied=False.
+_BAD_CODES = (APPLY_PATCH_FAIL, RESET_FAILED, TESTS_ERROR, TESTS_TIMEOUT)
+
+# Per-test status tokens a pytest-style test runner emits at the start of a line
+# ("PASSED tests/test_x.py::test_a"). XFAIL counts as a pass.
+_PASS_TOKENS = ("PASSED", "XFAIL")
+_FAIL_TOKENS = ("FAILED", "ERROR")
+_STATUS_TOKENS = _PASS_TOKENS + _FAIL_TOKENS + ("SKIPPED",)
+
+# Where the flat path writes the eval script and its captured log inside the
+# sandbox.
+EVAL_SCRIPT_PATH = "/root/eval.sh"
+EVAL_LOG_PATH = "/root/eval_output.log"
+
+
+def parse_eval_log(log: str) -> tuple[dict[str, str], bool]:
+ """Parse a SWE-bench eval-script log host-side.
+
+ For the common pytest-style runner:
+
+ 1. If any "bad code" (patch-apply / reset / tests-error / timeout) is
+ present, the run is untrustworthy -> return ``({}, False)``.
+ 2. If the ``Start``/``End`` test-output markers are missing, the test patch
+ never applied -> return ``({}, False)``.
+ 3. Otherwise extract the slice between the markers and parse per-test
+ ``"STATUS node_id"`` lines into a ``{node_id: STATUS}`` map. As a
+ fallback (output sometimes escapes the markers, e.g. to stderr) the whole
+ log is scanned when the slice yields nothing.
+
+ Args:
+ log: The combined stdout/stderr captured from running the eval script.
+
+ Returns:
+ A tuple ``(status_map, patch_applied)``. ``status_map`` maps each test
+ node id to its status token. ``patch_applied`` is ``True`` only when the
+ markers were found and no bad code fired.
+ """
+ if any(code in log for code in _BAD_CODES):
+ return {}, False
+ if START_TEST_OUTPUT not in log or END_TEST_OUTPUT not in log:
+ return {}, False
+
+ between = log.split(START_TEST_OUTPUT, 1)[1].split(END_TEST_OUTPUT, 1)[0]
+ status_map = _parse_pytest_status_lines(between)
+ if not status_map:
+ # Fallback: some runners emit per-test lines outside the markers.
+ status_map = _parse_pytest_status_lines(log)
+ return status_map, True
+
+
+def _parse_pytest_status_lines(text: str) -> dict[str, str]:
+ """Parse ``"STATUS node_id"`` pytest-style lines into a status map.
+
+ A status line starts with one of the recognised status tokens, and the node
+ id is the second whitespace field. FAILED lines may read
+ ``"FAILED node_id - reason"``; the trailing reason is stripped by rewriting
+ ``" - "`` to ``" "``.
+
+ Args:
+ text: Text containing zero or more per-test status lines.
+
+ Returns:
+ A mapping from each test node id to its status token. When a node id
+ appears more than once, the last occurrence wins.
+ """
+ status_map: dict[str, str] = {}
+ for raw_line in text.split("\n"):
+ line = raw_line.strip()
+ token = next((t for t in _STATUS_TOKENS if line.startswith(t)), None)
+ if token is None:
+ continue
+ if token == "FAILED":
+ line = line.replace(" - ", " ")
+ fields = line.split()
+ if len(fields) <= 1:
+ continue
+ node_id = fields[1]
+ # Last status wins for a duplicated node id: a later line overwrites an
+ # earlier one, so a runner that re-reports a node (e.g. a rerun plugin)
+ # ends up with its final status.
+ status_map[node_id] = fields[0]
+ return status_map
+
+
+def passed_tests(status_map: dict[str, str]) -> list[str]:
+ """Return node ids whose status counts as a pass (PASSED or XFAIL).
+
+ Args:
+ status_map: A mapping from test node id to its status token.
+
+ Returns:
+ The list of node ids whose status is a passing token.
+ """
+ return [node for node, status in status_map.items() if status in _PASS_TOKENS]
+
+
+async def flat_run_eval(env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Run the instance's eval script in the sandbox and capture its log.
+
+ The eval script must be supplied on the task via
+ ``task.metadata["eval_script"]``. It is written into the sandbox and run,
+ teeing its combined output to :data:`EVAL_LOG_PATH`; the captured
+ stdout/stderr already contain the ``>>>>>`` markers, so ``test_output`` is
+ graded directly. The log file is read back as a fallback when the streamed
+ output is empty.
+
+ Args:
+ env: The SWE environment used to write files and execute commands in the
+ sandbox.
+ task: The task whose ``metadata["eval_script"]`` is run.
+
+ Returns:
+ An :class:`EvalArtifacts` holding the captured test output, the script's
+ return code, whether a model patch existed, and raw metadata. When no
+ eval script is present the artifacts carry an ``eval_error``.
+ """
+ eval_script = task.metadata.get("eval_script", "")
+ if not eval_script:
+ # No script to run -> flag an eval_error; flat_grade leaves it unmasked (reward 0).
+ return EvalArtifacts(
+ test_output="",
+ return_code=1,
+ patch_applied=False,
+ raw={"error_type": "eval_error", "flat": True},
+ )
+
+ await env.write_text(EVAL_SCRIPT_PATH, eval_script if eval_script.endswith("\n") else eval_script + "\n")
+ # The script is self-contained (it resets + applies patches + runs tests).
+ # Combined output is tee'd to a log file so grade() can parse per-test status
+ # even on a non-zero test exit; PIPESTATUS[0] preserves the script's exit code.
+ result = await env.execute(
+ f"bash {EVAL_SCRIPT_PATH} 2>&1 | tee {EVAL_LOG_PATH}; exit ${{PIPESTATUS[0]}}",
+ cwd=task.repo_workdir,
+ is_eval=True,
+ timeout_s=task.metadata.get("tests_timeout"),
+ )
+ log_text = result["output"]
+ if not log_text.strip() and result.get("error_type") not in {"sandbox", "timeout"}:
+ # Streamed output was empty; fall back to the tee'd log file.
+ cat = await env.execute(f"cat {EVAL_LOG_PATH}", cwd=task.repo_workdir)
+ if cat["returncode"] == 0:
+ log_text = cat["output"]
+
+ return EvalArtifacts(
+ test_output=log_text,
+ return_code=result["returncode"],
+ patch_applied=bool(task.model_patch),
+ raw={"error_type": result.get("error_type"), "flat": True},
+ )
+
+
+def flat_grade(task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade a flat eval-script log host-side.
+
+ Only genuine infra failures (sandbox/timeout) are masked via ``error_kind``.
+ An unbuildable / missing / empty eval spec (``error_type == "eval_error"``) is
+ NOT masked: it falls through to the parser, which finds no markers and grades
+ unmasked ``resolved=False`` (reward 0), matching main's behavior. A log with a
+ bad code or missing markers likewise grades as unresolved with
+ ``patch_applied`` set from the parse, since a failed setup is a legitimate
+ unresolved rather than an infra mask.
+
+ Args:
+ task: The task being graded, supplying the instance id, expected
+ ``fail_to_pass`` / ``pass_to_pass`` tests, and model patch.
+ artifacts: The eval artifacts produced by :func:`flat_run_eval`.
+
+ Returns:
+ A :class:`SweEvalReport` describing whether the task was resolved,
+ whether the patch applied and existed, any masking ``error_kind``, and
+ the per-test status breakdown.
+ """
+ if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=artifacts.raw["error_type"],
+ )
+
+ status_map, log_patch_applied = parse_eval_log(artifacts.test_output)
+ passed = passed_tests(status_map)
+ # Thread the full status_map so compute_resolved mirrors swebench's
+ # get_eval_tests_report semantics: a required test counts as a failure only when
+ # absent or FAILED/ERROR, while neutral statuses (SKIPPED/XPASS) are excluded
+ # rather than treated as failures (which a bare passed-set membership check would).
+ resolved = log_patch_applied and compute_resolved(
+ fail_to_pass=task.fail_to_pass,
+ pass_to_pass=task.pass_to_pass,
+ passed=passed,
+ status_map=status_map,
+ )
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=resolved,
+ patch_applied=log_patch_applied,
+ patch_exists=bool(task.model_patch),
+ tests_status={"passed": passed, "all": status_map},
+ )
+
+
+def flat_eval_enabled(harness_flag: bool, task: SweTask) -> bool:
+ """Return whether flat (host-side) mode should be used for this task.
+
+ Flat mode applies when the harness flag selects it or the task opts in via
+ ``metadata["flat_eval"]``. This is a pure predicate; it neither swaps the
+ harness nor changes provider support.
+
+ Args:
+ harness_flag: Whether the harness itself selects flat grading.
+ task: The task whose ``metadata["flat_eval"]`` is consulted.
+
+ Returns:
+ ``True`` when either source selects flat mode, otherwise ``False``.
+ """
+ return bool(harness_flag) or bool(task.metadata.get("flat_eval", False))
diff --git a/resources_servers/swe_bench/harnesses/nv_internal.py b/resources_servers/swe_bench/harnesses/nv_internal.py
new file mode 100644
index 0000000000..3eb7fac4f0
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/nv_internal.py
@@ -0,0 +1,426 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""nv-internal-1 harness: flat, host-graded NVIDIA-internal family.
+
+This family does not run any in-container grading harness: it ships a per-instance
+``run_script.sh`` + ``parsing_script.py`` that emit a structured ``output.json``
+test report. The recipe is a 3-hop sequence:
+
+ 1. ``bash run_script.sh > stdout.log 2> stderr.log`` (keep streams separate)
+ 2. ``python parsing_script.py stdout.log stderr.log output.json`` (parse to JSON report)
+ 3. read ``output.json`` back host-side
+
+Grading is then a pure host-side parse of that report's ``{tests: [{name, status}]}``
+shape. Because the family is flat and host-graded, it runs on any exec-capable
+provider (e.g. docker). The run script, parsing script, and model patch are
+uploaded by ``materialize``.
+"""
+
+from __future__ import annotations
+
+import ast
+import json
+import re
+from typing import TYPE_CHECKING, Any
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+ EvalArtifacts,
+ SweEvalReport,
+ SweTask,
+ SweTaskHarness,
+ _ensure_trailing_newline,
+ compute_resolved,
+)
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+#: nv-internal default working directory.
+NV_DEFAULT_WORKDIR = "/app"
+#: The generic ``build_task`` default workdir; means "the row didn't set one".
+_GENERIC_DEFAULT_WORKDIR = "/testbed"
+
+
+def _nv_workdir(task: SweTask) -> str:
+ """Resolve the working directory for nv-internal hops.
+
+ The generic ``build_task`` defaults ``repo_workdir`` to ``/testbed``, which is
+ not the nv-internal convention. A row that explicitly sets a non-default
+ ``repo_workdir`` is honored; otherwise the nv-internal default ``/app`` is used.
+
+ Args:
+ task: The task whose ``repo_workdir`` is consulted.
+
+ Returns:
+ The working directory path (str) to run every nv-internal hop in.
+ """
+ workdir = task.repo_workdir
+ if not workdir or workdir == _GENERIC_DEFAULT_WORKDIR:
+ return NV_DEFAULT_WORKDIR
+ return workdir
+
+
+def parse_passed_tests(report: dict[str, Any]) -> list[str]:
+ """Extract PASSED test names from a parsing_script ``output.json`` report.
+
+ The report shape is ``{"tests": [{"name": ..., "status": "PASSED"|...}, ...]}``.
+
+ Args:
+ report: The parsed ``output.json`` report mapping.
+
+ Returns:
+ The list of test names (list[str]) whose status is ``"PASSED"``.
+ """
+ return [
+ test["name"]
+ for test in report.get("tests", [])
+ if isinstance(test, dict) and test.get("status") == "PASSED" and "name" in test
+ ]
+
+
+class NVInternalHarness(SweTaskHarness):
+ """Flat, host-graded harness for the NVIDIA-internal task family.
+
+ Tasks ship their own ``run_script.sh`` and ``parsing_script.py`` that produce
+ a structured ``output.json`` report, which is graded entirely host-side. The
+ harness runs on any exec-capable provider.
+ """
+
+ name = "nv-internal-1"
+ grade_strategy = "flat-host-grade"
+
+ def build_spec(self, task: SweTask) -> SandboxSpec:
+ """Build the sandbox spec for an nv-internal task.
+
+ Environment variables parsed from the task's dockerfiles are injected into
+ ``spec.env`` so the provider applies them to every exec hop. This is a
+ no-op when the dataset does not carry the dockerfiles.
+
+ Args:
+ task: The task to build a sandbox spec for.
+
+ Returns:
+ A :class:`SandboxSpec` describing the image, workdir, timeouts,
+ environment, metadata, resources, and provider options.
+ """
+ env = {"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"}
+ env.update(_parse_dockerfile_env(task))
+ return SandboxSpec(
+ image=task.image,
+ workdir=_nv_workdir(task),
+ ttl_s=task.metadata.get("ttl_s", 1800),
+ ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+ env=env,
+ metadata={
+ "instance_id": task.instance_id[:63],
+ "benchmark": task.benchmark,
+ "harness": self.name,
+ },
+ resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+ provider_options=task.metadata.get("provider_options", {}),
+ )
+
+ def supports_provider(self, provider_name: str) -> bool:
+ """Report whether this harness supports the named provider.
+
+ The family is flat and host-graded, so every exec-capable provider is
+ supported.
+
+ Args:
+ provider_name: The provider name being checked.
+
+ Returns:
+ ``True`` for every provider.
+ """
+ return True # flat, host-graded: works on any exec-capable provider
+
+ async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Upload run_script.sh, parsing_script.py, and the model patch.
+
+ The scripts live in ``task.metadata``. The dataset stores them under
+ dotted keys (``"run_script.sh"`` / ``"parsing_script.py"``), which are read
+ first, falling back to the extensionless keys only if the dotted ones are
+ absent.
+
+ Args:
+ env: The environment used to write files into the sandbox.
+ task: The task carrying the patch and scripts to upload.
+ """
+ if task.model_patch:
+ await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+ run_script = task.metadata.get("run_script.sh") or task.metadata.get("run_script", "")
+ parsing_script = task.metadata.get("parsing_script.py") or task.metadata.get("parsing_script", "")
+ if run_script:
+ await env.write_text("/root/run_script.sh", _ensure_trailing_newline(run_script))
+ if parsing_script:
+ await env.write_text("/root/parsing_script.py", _ensure_trailing_newline(parsing_script))
+
+ async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Reset the checkout to ``base_commit``.
+
+ Runs ``git reset --hard`` followed by ``git checkout`` of the base commit
+ (not ``git clean``) in the nv-internal working directory.
+
+ Args:
+ env: The environment used to execute commands in the sandbox.
+ task: The task carrying the ``base_commit`` to reset to.
+ """
+ if task.base_commit:
+ await env.execute(
+ f"git reset --hard {task.base_commit} && git checkout {task.base_commit}",
+ cwd=_nv_workdir(task),
+ )
+
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Run the 3-hop evaluation recipe and collect its artifacts.
+
+ Applies the model patch, runs the optional per-instance repo setup hook,
+ then executes the run/parse/read sequence. Sandbox or timeout failures in
+ any hop short-circuit and are surfaced via ``raw["error_type"]``.
+
+ Args:
+ env: The environment used to execute commands in the sandbox.
+ task: The task being evaluated.
+
+ Returns:
+ An :class:`EvalArtifacts` holding the report output, return code,
+ whether the patch applied cleanly, and any infra error type.
+ """
+ workdir = _nv_workdir(task)
+ # Apply the model patch best-effort: `--reject` applies the hunks that fit and
+ # writes .rej files for the rest. A non-zero exit only clears patch_applied.
+ patch_applied = True
+ if task.model_patch:
+ applied = await env.execute(
+ "git apply --ignore-space-change --ignore-whitespace --reject -v /root/patch.diff",
+ cwd=workdir,
+ )
+ patch_applied = applied["returncode"] == 0
+
+ # Optional per-instance repo setup hook; only its last line is executed.
+ repo_cmd = task.metadata.get("before_repo_set_cmd", "").strip()
+ if repo_cmd:
+ repo_cmd = repo_cmd.split("\n")[-1]
+ setup = await env.execute(repo_cmd, cwd=workdir, is_eval=True)
+ if setup.get("error_type") in {"sandbox", "timeout"}:
+ return EvalArtifacts(
+ test_output=setup["output"],
+ return_code=setup["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": setup.get("error_type")},
+ )
+
+ # Hop 1: run the per-instance script, keeping stdout/stderr separate.
+ # The selected test files are passed positionally.
+ test_files = _format_test_files(task.metadata.get("selected_test_files_to_run", []))
+ run = await env.execute(
+ f"bash /root/run_script.sh {test_files} > /root/stdout.log 2> /root/stderr.log || true",
+ cwd=workdir,
+ is_eval=True,
+ )
+ if run.get("error_type") in {"sandbox", "timeout"}:
+ return EvalArtifacts(
+ test_output=run["output"],
+ return_code=run["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": run.get("error_type")},
+ )
+
+ # Hop 2: parse the logs into a JSON report.
+ parse = await env.execute(
+ "python /root/parsing_script.py /root/stdout.log /root/stderr.log /root/output.json",
+ cwd=workdir,
+ is_eval=True,
+ )
+ if parse.get("error_type") in {"sandbox", "timeout"}:
+ return EvalArtifacts(
+ test_output=parse["output"],
+ return_code=parse["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": parse.get("error_type")},
+ )
+
+ # Hop 3: read the report back host-side.
+ report = await env.execute("cat /root/output.json", cwd=workdir, is_eval=True)
+ return EvalArtifacts(
+ test_output=report["output"],
+ return_code=report["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": report.get("error_type")},
+ )
+
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade the evaluation artifacts into a report.
+
+ Parses the host-side ``output.json`` report, extracts PASSED tests, and
+ derives resolution from the required FAIL_TO_PASS / PASS_TO_PASS sets. An
+ infra failure (sandbox or timeout) is masked via ``error_kind`` rather than
+ scored as unresolved.
+
+ Args:
+ task: The task being graded.
+ artifacts: The artifacts produced by ``run_eval``.
+
+ Returns:
+ A :class:`SweEvalReport` with resolution status, patch flags, and the
+ parsed test report.
+ """
+ # Infra failure → mask via error_kind (never scored as "unresolved").
+ if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=artifacts.raw["error_type"],
+ )
+ try:
+ report = json.loads(artifacts.test_output) if artifacts.test_output.strip() else {}
+ except (ValueError, TypeError):
+ report = {}
+ passed = parse_passed_tests(report)
+ f2p, p2p = _resolve_required_tests(task)
+ # Resolution is derived from tests alone and never gated on patch-apply rc.
+ # An empty report or no required tests → unresolved (compute_resolved
+ # returns False).
+ resolved = compute_resolved(
+ fail_to_pass=f2p,
+ pass_to_pass=p2p,
+ passed=passed,
+ )
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=resolved,
+ patch_applied=artifacts.patch_applied,
+ patch_exists=bool(task.model_patch),
+ tests_status={"passed": passed, "report": report},
+ )
+
+
+def _format_test_files(test_files: Any) -> str:
+ """Build the comma-joined test-files argument.
+
+ Accepts a list, or a string that is either a comma-joined value or a
+ ``repr``-style list. A stringified list may use single quotes
+ (``['a', 'b']``) which ``json.loads`` rejects, so ``ast.literal_eval`` is used
+ (handling single-quoted and native lists) with a safe fallback to the raw
+ string.
+
+ Args:
+ test_files: A list/tuple of names, or a string holding a comma-joined
+ value or a stringified list.
+
+ Returns:
+ The comma-joined test-files argument (str); empty for unsupported inputs.
+ """
+ if isinstance(test_files, (list, tuple)):
+ return ",".join(str(item) for item in test_files)
+ if isinstance(test_files, str):
+ stripped = test_files.strip()
+ if stripped.startswith("[") and stripped.endswith("]"):
+ try:
+ parsed = ast.literal_eval(stripped)
+ if isinstance(parsed, (list, tuple)):
+ return ",".join(str(item) for item in parsed)
+ except (ValueError, SyntaxError):
+ pass
+ return stripped
+ return ""
+
+
+def _resolve_required_tests(task: SweTask) -> tuple[list[str], list[str]]:
+ """Resolve the FAIL_TO_PASS / PASS_TO_PASS required-test sets.
+
+ The ``fail_to_pass_select`` / ``pass_to_pass_select`` keys on ``task.metadata``
+ take precedence when present; otherwise the plain ``task.fail_to_pass`` /
+ ``task.pass_to_pass`` are used. Values may be lists or stringified lists.
+
+ Args:
+ task: The task whose required-test sets are resolved.
+
+ Returns:
+ A ``(fail_to_pass, pass_to_pass)`` tuple of test-name lists.
+ """
+ f2p = task.metadata.get("fail_to_pass_select")
+ f2p = _coerce_test_list(f2p) if f2p is not None else list(task.fail_to_pass)
+ p2p = task.metadata.get("pass_to_pass_select")
+ p2p = _coerce_test_list(p2p) if p2p is not None else list(task.pass_to_pass)
+ return f2p, p2p
+
+
+def _coerce_test_list(value: Any) -> list[str]:
+ """Coerce a test-list value (list or stringified list) into a list of names.
+
+ Args:
+ value: A list/tuple of names, or a string holding a stringified list.
+
+ Returns:
+ The list of test names (list[str]); empty for unsupported inputs.
+ """
+ if isinstance(value, (list, tuple)):
+ return [str(item) for item in value]
+ if isinstance(value, str):
+ stripped = value.strip()
+ if stripped.startswith("[") and stripped.endswith("]"):
+ try:
+ parsed = ast.literal_eval(stripped)
+ if isinstance(parsed, (list, tuple)):
+ return [str(item) for item in parsed]
+ except (ValueError, SyntaxError):
+ pass
+ return []
+
+
+def _parse_dockerfile_env(task: SweTask) -> dict[str, str]:
+ """Parse ``ENV`` lines from the task's dockerfiles into a name->value mapping.
+
+ Scans ``base_dockerfile + instance_dockerfile`` for ``ENV`` directives and
+ converts them to environment variables. Handles both Docker forms:
+
+ ENV KEY=VALUE (equals)
+ ENV KEY VALUE (space-separated)
+
+ Returns ``{}`` when the dockerfiles are absent from metadata.
+
+ Args:
+ task: The task whose dockerfile metadata is scanned.
+
+ Returns:
+ A mapping (dict[str, str]) of environment variable names to values.
+ """
+ base_dockerfile = str(task.metadata.get("base_dockerfile", "") or "")
+ instance_dockerfile = str(task.metadata.get("instance_dockerfile", "") or "")
+ env: dict[str, str] = {}
+ for raw_line in (base_dockerfile + "\n" + instance_dockerfile).split("\n"):
+ line = raw_line.strip()
+ if not line.startswith("ENV "):
+ continue
+ body = line[len("ENV ") :].strip()
+ if "=" in body:
+ # Format: ENV KEY=VALUE -> normalize spaces around the first `=`.
+ key, _, value = body.partition("=")
+ key = re.sub(r"\s+", "", key)
+ value = value.strip()
+ else:
+ # Format: ENV KEY VALUE -> split into key + remainder value.
+ parts = body.split(None, 1)
+ if len(parts) < 2:
+ continue
+ key, value = parts[0], parts[1]
+ if key:
+ env[key] = value
+ return env
diff --git a/resources_servers/swe_bench/harnesses/r2egym.py b/resources_servers/swe_bench/harnesses/r2egym.py
new file mode 100644
index 0000000000..9b4f42f24c
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/r2egym.py
@@ -0,0 +1,174 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""r2e-gym harness — host-side (flat) graded.
+
+Runs the instance's eval script in the sandbox and parses the log host-side via the shared
+flat-eval path, so it runs on any exec-capable provider.
+
+NOTE: the apptainer-only nested ``run_local_evaluation`` path (which produced r2e-gym's own
+``report.json`` in-container) was removed when PR #1694 took ownership of the apptainer
+provider. Re-wiring r2e-gym's nested grading + ``.sif``/mounts onto #1694's provider is tracked
+for a follow-up PR (see APPTAINER_PR3_TRACKER.md); until then r2e-gym grades flat (it needs an
+``eval_script`` in task metadata, else the flat grader masks the sample as an eval error).
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+ EvalArtifacts,
+ SweEvalReport,
+ SweTask,
+ SweTaskHarness,
+ _ensure_trailing_newline,
+ compute_resolved,
+)
+from resources_servers.swe_bench.harnesses import flat_eval
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class R2EGymHarness(SweTaskHarness):
+ """Harness for the r2e-gym family of SWE tasks (host-side / flat graded)."""
+
+ name = "r2e-gym"
+ grade_strategy = "flat-host-grade"
+
+ def build_spec(self, task: SweTask) -> SandboxSpec:
+ """Build the sandbox spec for an r2e-gym task.
+
+ Args:
+ task: The SWE task whose metadata, image, and workdir describe the sandbox.
+
+ Returns:
+ SandboxSpec: The populated sandbox spec (image, workdir, TTL, env, metadata,
+ resources, and any provider options carried on the task).
+ """
+ return SandboxSpec(
+ image=task.image,
+ workdir=task.repo_workdir,
+ ttl_s=task.metadata.get("ttl_s", 1800),
+ ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+ env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+ metadata={
+ "instance_id": task.instance_id[:63],
+ "benchmark": task.benchmark,
+ "harness": self.name,
+ },
+ resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+ provider_options=dict(task.metadata.get("provider_options", {})),
+ )
+
+ async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Write the bare ``/root/patch.diff`` the eval script applies.
+
+ Args:
+ env: The active SWE environment used to write files into the sandbox.
+ task: The SWE task supplying the model patch (newline-normalized).
+ """
+ if task.model_patch:
+ await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+
+ async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Reset the repository checkout (no-op for r2e-gym).
+
+ Args:
+ env: The active SWE environment (unused).
+ task: The SWE task (unused).
+ """
+ return None
+
+ def hide_eval_tests_commands(self) -> list[str]:
+ """Build shell commands that strip the held-out eval tests from the agent's checkout.
+
+ ``/r2e_tests`` holds the evaluation tests the agent must not see; ``run_tests.sh``
+ launches them. ``run_tests.sh`` is deleted only when it references ``r2e_tests``
+ (substring guard). The agent adapter runs these after ``materialize``.
+
+ Returns:
+ list[str]: One command per checkout root (``""`` meaning ``/``, ``/root``, ``/testbed``).
+ """
+ commands: list[str] = []
+ for root_dir in ["", "/root", "/testbed"]:
+ commands.append(
+ f"rm -rf {root_dir}/r2e_tests && "
+ f"if grep -qs r2e_tests {root_dir}/run_tests.sh; then rm -rf {root_dir}/run_tests.sh; fi"
+ )
+ return commands
+
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Run the instance's eval script in-sandbox and grade the log host-side.
+
+ Args:
+ env: The active SWE environment used to execute commands in the sandbox.
+ task: The SWE task whose ``metadata['eval_script']`` is run.
+
+ Returns:
+ EvalArtifacts: The captured test output, return code, patch existence, and flat
+ markers (masked as ``eval_error`` when no eval script is present).
+ """
+ return await flat_eval.flat_run_eval(env, task)
+
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade an r2e-gym task from its evaluation artifacts (host-side, flat).
+
+ Unlike the SWE-bench flat grader, this path does NOT gate ``resolved`` on the
+ SWE-bench ``>>>>> Start/End Test Output`` marker pair: r2e-gym's ``run_tests.sh``
+ does not emit those swebench sentinels, so requiring them would mask every r2e-gym
+ sample as unresolved. Per-test status lines are parsed from the whole log and the
+ node-ids are matched directly against the required ``fail_to_pass`` / ``pass_to_pass``
+ sets (R2E-Gym uses pytest node-ids verbatim). Only genuine infra failures
+ (sandbox/timeout) are masked.
+
+ Args:
+ task: The SWE task being graded.
+ artifacts: The evaluation artifacts produced by ``run_eval``.
+
+ Returns:
+ SweEvalReport: The resolved/unresolved verdict with patch state and any error kind.
+ """
+ if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=artifacts.raw["error_type"],
+ )
+ # Parse per-test status lines from the whole log (no swebench-marker gate). An
+ # unbuildable or empty log yields an empty status map, so no required test passes
+ # and the sample is scored unresolved (not masked). compute_resolved also returns
+ # False for an empty required set (the edge validated by main).
+ status_map = flat_eval._parse_pytest_status_lines(artifacts.test_output)
+ passed = flat_eval.passed_tests(status_map)
+ # Thread the full status_map so compute_resolved mirrors swebench's
+ # get_eval_tests_report semantics: neutral-status required tests (SKIPPED/XPASS)
+ # are excluded rather than treated as failures.
+ resolved = compute_resolved(
+ fail_to_pass=task.fail_to_pass,
+ pass_to_pass=task.pass_to_pass,
+ passed=passed,
+ status_map=status_map,
+ )
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=resolved,
+ patch_applied=bool(status_map),
+ patch_exists=bool(task.model_patch),
+ tests_status={"passed": passed, "all": status_map},
+ )
diff --git a/resources_servers/swe_bench/harnesses/swe_bench_ext.py b/resources_servers/swe_bench/harnesses/swe_bench_ext.py
new file mode 100644
index 0000000000..7c925c1264
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swe_bench_ext.py
@@ -0,0 +1,311 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-bench-ext harness: flat, host-graded reference family.
+
+Applies the model patch (and test patch) against the repository checkout, runs
+the framework test command, and grades host-side with the parser
+(:func:`resources_servers.swe_bench.parsing.parse_and_check_tests`).
+
+Grading delegates the full per-framework logic to ``parse_and_check_tests``:
+junit-xml parsing, test-id normalization, the fuzzy matcher, the framework
+dispatch, the ``::build``/``::compile`` synthetic-PASS injection, and
+build-failed-package propagation.
+
+``resolved`` is taken from the parser's verdict (all FAIL_TO_PASS passed AND all
+PASS_TO_PASS passed). It does not depend on ``patch_applied``: the model and test
+patches are applied best-effort and grading is on the tests only.
+``patch_applied`` is still recorded for information.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness
+from resources_servers.swe_bench.parsing import (
+ get_framework_config,
+ get_test_command_with_output,
+ parse_and_check_tests,
+)
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# Default checkout locations probed (in order) when locating the repo. The first three mirror
+# main's ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null``
+# ladder in SweBenchExtDatasetProcessor's eval script; ``/root/repo`` is an extra probe here.
+_REPO_WORKDIR_LADDER = ("/testbed", "/workspace/repo", "/app", "/root/repo")
+
+
+# Output markers the parser (parse_and_check_tests) extracts content between.
+_TEST_OUTPUT_START = "<<>>"
+_TEST_OUTPUT_END = "<<>>"
+_RESULT_FILE_START = "<<>>"
+_RESULT_FILE_END = "<<>>"
+
+
+class SweBenchExtHarness(SweTaskHarness):
+ """Flat, host-graded harness for the swe-bench-ext task family.
+
+ Runs the task's framework test command inside a single sandbox and grades the
+ captured output on the host. Works on any exec-capable sandbox provider.
+ """
+
+ name = "swe-bench-ext"
+ grade_strategy = "flat-host-grade"
+
+ def build_spec(self, task: SweTask) -> SandboxSpec:
+ """Build the sandbox specification for a task.
+
+ Args:
+ task: The SWE task describing the image, working directory, and
+ per-task metadata (timeouts, resources, provider options).
+
+ Returns:
+ SandboxSpec: The sandbox spec used to launch the task's container.
+ """
+ return SandboxSpec(
+ image=task.image,
+ workdir=task.repo_workdir,
+ ttl_s=task.metadata.get("ttl_s", 1800),
+ ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+ env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+ metadata={
+ "instance_id": task.instance_id[:63],
+ "benchmark": task.benchmark,
+ "harness": self.name,
+ },
+ resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+ provider_options=task.metadata.get("provider_options", {}),
+ )
+
+ def supports_provider(self, provider_name: str) -> bool:
+ """Report whether this harness supports a sandbox provider.
+
+ Being flat and host-graded, it works on any exec-capable provider.
+
+ Args:
+ provider_name: The name of the sandbox provider.
+
+ Returns:
+ bool: Always ``True``.
+ """
+ return True
+
+ async def _resolve_repo_workdir(self, env: "AsyncSweEnvironment", task: SweTask) -> str:
+ """Locate the repository checkout, mirroring main's ``cd`` fallback ladder.
+
+ Main's ``SweBenchExtDatasetProcessor`` eval script runs
+ ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null``
+ so a repo that is not at ``/testbed`` is still found. This reproduces that
+ host-side: a row-provided ``repo_workdir`` that differs from the default and holds a
+ ``.git`` checkout wins; otherwise the ladder (``/testbed``, ``/workspace/repo``,
+ ``/app``, ``/root/repo``) is probed for a ``.git`` directory. If nothing matches the
+ task's ``repo_workdir`` is returned unchanged (preserving prior behavior).
+
+ Args:
+ env: The async environment used to probe the sandbox.
+ task: The SWE task whose ``repo_workdir`` is the preferred/default location.
+
+ Returns:
+ str: The resolved repository working directory inside the sandbox.
+ """
+ # Prefer an explicit, non-default row workdir holding a checkout.
+ candidates: list[str] = []
+ if task.repo_workdir and task.repo_workdir != "/testbed":
+ candidates.append(task.repo_workdir)
+ candidates.extend(d for d in _REPO_WORKDIR_LADDER if d not in candidates)
+ for candidate in candidates:
+ probe = await env.execute(f'test -d "{candidate}/.git"', cwd="/")
+ if probe["returncode"] == 0:
+ return candidate
+ return task.repo_workdir
+
+ async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Reset the located checkout to ``base_commit`` for hermetic grading.
+
+ Resolves the repo workdir via ``_resolve_repo_workdir`` (so a non-``/testbed``
+ checkout is found), then runs ``git reset --hard <base_commit>`` there.
+
+ Args:
+ env: The started environment to reset.
+ task: The task whose ``base_commit`` is restored.
+ """
+ if task.base_commit:
+ workdir = await self._resolve_repo_workdir(env, task)
+ await env.execute(f"git reset --hard {task.base_commit}", cwd=workdir)
+
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Apply patches, run the test command, and capture the evaluation output.
+
+ Applies the model patch (and test patch) best-effort, then runs the
+ framework test command wrapped between output markers so the parser can
+ extract the structured result file or marked stdout.
+
+ Args:
+ env: The async environment used to execute commands in the sandbox.
+ task: The SWE task providing the patches, test command, and framework.
+
+ Returns:
+ EvalArtifacts: The captured test output, return code, whether the
+ model patch applied, and the execution error type if any.
+ """
+ # Resolve the checkout via main's cd ladder so a non-/testbed repo is found.
+ workdir = await self._resolve_repo_workdir(env, task)
+ patch_applied = True
+ # Best-effort apply: a bad apply never fails the run (grading is on the
+ # tests only); we still record whether the model patch applied for info.
+ apply_flags = "--reject --recount --ignore-space-change --ignore-whitespace"
+ if task.model_patch:
+ applied = await env.execute(
+ f"git apply {apply_flags} /root/patch.diff",
+ cwd=workdir,
+ )
+ patch_applied = applied["returncode"] == 0
+ if task.test_patch:
+ await env.execute(
+ f"git apply {apply_flags} /root/test_patch.diff",
+ cwd=workdir,
+ )
+ # Wrap the command's output: add structured-output flags (--junitxml/--json)
+ # via get_test_command_with_output, run it between the markers, and dump the
+ # framework result file so parse_and_check_tests receives junit-xml (preferred)
+ # or the marked stdout.
+ #
+ # The framework is passed through verbatim. An empty framework must NOT be
+ # coerced to "pytest": for a non-pytest instance whose framework is absent, the
+ # parser's auto-detect path is what grades correctly, and the default framework
+ # config adds no flags and no result file. grade() reuses this SAME value via
+ # _resolve_framework so the two stay in lockstep.
+ framework = self._resolve_framework(task)
+ # Use the row's test command verbatim, with NO default runner. Main's
+ # SweBenchExtDatasetProcessor uses ``inst.get("test_command", "")`` (empty when
+ # absent): a command-less row runs no runner and grades unresolved. Injecting a
+ # default ``python -m pytest`` here would diverge from main by fabricating results.
+ base_command = task.test_command
+ test_cmd = get_test_command_with_output(base_command, framework)
+ result_file = (get_framework_config(framework, base_command) or {}).get("result_file")
+ result = await env.execute(self._wrap_eval_command(test_cmd, result_file), cwd=workdir, is_eval=True)
+ return EvalArtifacts(
+ test_output=result["output"],
+ return_code=result["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": result.get("error_type")},
+ )
+
+ @staticmethod
+ def _resolve_framework(task: SweTask) -> str:
+ """Return the framework value used by both ``run_eval`` and ``grade``.
+
+ Returns the task's framework verbatim. An empty or unknown value is
+ intentionally passed through unchanged: coercing it to ``"pytest"`` would
+ mis-dispatch the parser for non-pytest instances that ship no framework.
+ Centralizing this guarantees ``run_eval`` (which selects the
+ structured-output flag and result file) and ``grade`` (which parses the
+ output) agree on the framework.
+
+ Args:
+ task: The SWE task whose framework value is returned.
+
+ Returns:
+ str: The task's test framework name (possibly empty).
+ """
+ return task.test_framework
+
+ @staticmethod
+ def _wrap_eval_command(test_cmd: str, result_file: str | None) -> str:
+ """Wrap the eval command in the output markers and a result-file dump.
+
+ The parser prefers the junit/json result file (emitted between the
+ RESULT_FILE markers) and falls back to the marked stdout. The ``mkdir -p``
+ ensures ``/workspace/test-results`` exists first, since some frameworks
+ (e.g. junit/gradle, xctest) write their result file there.
+
+ Args:
+ test_cmd: The test command to run inside the markers.
+ result_file: Path or glob of the framework result file to dump, or
+ ``None`` when the framework produces no result file.
+
+ Returns:
+ str: A shell script that runs the test command and emits the marked
+ output and result-file blocks.
+ """
+ mkdir_block = "mkdir -p /workspace/test-results\n"
+ if result_file and "*" in result_file:
+ result_block = (
+ f'echo "{_RESULT_FILE_START}"\n'
+ f"for f in {result_file}; do\n"
+ f' if [ -f "$f" ]; then echo "=== FILE: $f ==="; cat "$f"; echo ""; fi\n'
+ f"done 2>/dev/null || true\n"
+ f'echo "{_RESULT_FILE_END}"\n'
+ )
+ elif result_file:
+ result_block = (
+ f'echo "{_RESULT_FILE_START}"\n'
+ f'if [ -f "{result_file}" ]; then cat "{result_file}"; fi\n'
+ f'echo "{_RESULT_FILE_END}"\n'
+ )
+ else:
+ result_block = ""
+ return f'{mkdir_block}echo "{_TEST_OUTPUT_START}"\n{test_cmd}\n{result_block}echo "{_TEST_OUTPUT_END}"\n'
+
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade captured evaluation artifacts into a report.
+
+ Infrastructure failures are masked via ``error_kind`` and never scored as
+ unresolved. Otherwise the test output is handed to ``parse_and_check_tests``
+ and ``resolved`` is taken from the parser's verdict.
+
+ Args:
+ task: The SWE task providing the expected test sets and framework.
+ artifacts: The captured test output, return code, and error type.
+
+ Returns:
+ SweEvalReport: The grading report, including ``resolved``,
+ ``patch_applied``, ``patch_exists``, and the parsed test status (or
+ ``error_kind`` on infrastructure failure).
+ """
+ # Infra failure: mask via error_kind (never scored as "unresolved").
+ if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=artifacts.raw["error_type"],
+ )
+ # Delegate to the parser, passing the framework verbatim via the SAME
+ # _resolve_framework value run_eval used. An empty/unknown framework falls
+ # through to the parser's auto-detect path; coercing it to "pytest" here would
+ # mis-grade non-pytest instances.
+ test_framework = self._resolve_framework(task)
+ result = parse_and_check_tests(
+ test_output=artifacts.test_output,
+ test_framework=test_framework,
+ fail_to_pass=task.fail_to_pass,
+ pass_to_pass=task.pass_to_pass,
+ instance_id=task.instance_id,
+ )
+ # resolved is the parser's verdict (all F2P passed AND all P2P passed); it
+ # does NOT gate on patch_applied (grading is on tests only).
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=bool(result["resolved"]),
+ patch_applied=artifacts.patch_applied,
+ patch_exists=bool(task.model_patch),
+ tests_status=result,
+ )
diff --git a/resources_servers/swe_bench/harnesses/swe_rebench.py b/resources_servers/swe_bench/harnesses/swe_rebench.py
new file mode 100644
index 0000000000..68b863182b
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swe_rebench.py
@@ -0,0 +1,375 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-rebench harness: a flat, host-graded family with a vendored log parser.
+
+The flow is: reset to base, apply the model patch and test patch, run the
+install/test commands, then parse the test log host-side.
+
+Two things distinguish swe-rebench:
+
+* **JAVA env** — SWE-rebench tasks need
+ ``_JAVA_OPTIONS=-Djava.net.preferIPv6Addresses=false``, surfaced via
+ ``build_spec.env`` so it is set for the whole sandbox session.
+* **Dynamic log parser** — swe-rebench has no single uniform pytest summary; the
+ correct per-test PASSED/FAILED status comes from a repo-specific parser keyed
+ by ``log_parser`` and shipped in the cloned ``SWE-rebench-V2`` repo
+ (``lib/agent/log_parsers.py`` or ``agent/log_parsers.py``). It is imported
+ dynamically, guarded by try/except.
+
+The cloned ``SWE-rebench-V2`` directory must be provisioned out-of-band. When it
+is absent or the named parser cannot be resolved, ``grade`` masks the sample via
+``error_kind`` rather than scoring a misleading ``unresolved``.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import json
+import re
+import sys
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Callable
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# JVM option (passed via ``_JAVA_OPTIONS``) required for every SWE-rebench task.
+_JAVA_OPTIONS = "-Djava.net.preferIPv6Addresses=false"
+
+# Patch-apply flags shared by the model and test patches. ``--reject`` applies the
+# hunks that fit, so a partially failed apply is non-fatal and the tests still run.
+_APPLY_FLAGS = "--reject --recount --ignore-space-change --whitespace=nowarn"
+
+# Timing/duration suffixes some test runners append to node names; stripped so
+# the parser output lines up with the (already-normalized) expected node ids.
+_REBENCH_TIMING_NORMALIZE_RES = [
+ re.compile(r"\s*\[\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\]\s*$", re.IGNORECASE),
+ re.compile(r"\s+in\s+\d+(?:\.\d+)?\s+(?:msec|sec)\b", re.IGNORECASE),
+ re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\)\s*$", re.IGNORECASE),
+]
+
+
+def _normalize_test_name(name: str) -> str:
+ """Strip trailing timing annotations from a test node name.
+
+ Args:
+ name (str): The raw test node name, possibly carrying a trailing timing
+ or duration annotation.
+
+ Returns:
+ str: The node name with any timing suffix removed and surrounding
+ whitespace stripped.
+ """
+ for pattern in _REBENCH_TIMING_NORMALIZE_RES:
+ name = pattern.sub("", name)
+ return name.strip()
+
+
+def _load_rebench_log_parsers(rebench_repo_dir: Path):
+ """Dynamically import the cloned SWE-rebench-V2 ``log_parsers`` module.
+
+ Prefers ``lib/agent/log_parsers.py`` and falls back to
+ ``agent/log_parsers.py``, temporarily prepending the repo (and its ``lib``
+ directory) to ``sys.path`` so the module's intra-repo imports resolve.
+
+ Args:
+ rebench_repo_dir (Path): Path to the cloned SWE-rebench-V2 repository.
+
+ Returns:
+ ModuleType: The imported ``log_parsers`` module.
+
+ Raises:
+ FileNotFoundError: If the cloned directory has not been provisioned and
+ no ``log_parsers.py`` can be located.
+ """
+ lp_path = rebench_repo_dir / "lib" / "agent" / "log_parsers.py"
+ if not lp_path.exists():
+ lp_path = rebench_repo_dir / "agent" / "log_parsers.py"
+ if not lp_path.exists():
+ raise FileNotFoundError(
+ f"SWE-rebench-V2 log_parsers not found under {rebench_repo_dir}; "
+ "provision the clone via setup_scripts/swe_rebench.sh"
+ )
+
+ extra_paths = [str(rebench_repo_dir), str(rebench_repo_dir / "lib")]
+ added: list[str] = []
+ for p in extra_paths:
+ if p not in sys.path:
+ sys.path.insert(0, p)
+ added.append(p)
+ try:
+ spec = importlib.util.spec_from_file_location("_rebench_log_parsers", str(lp_path))
+ mod = importlib.util.module_from_spec(spec)
+ spec.loader.exec_module(mod)
+ return mod
+ finally:
+ for p in added:
+ try:
+ sys.path.remove(p)
+ except ValueError:
+ pass
+
+
+def _resolve_parser(log_parsers, log_parser_name: str) -> Callable[[str], dict[str, str]] | None:
+ """Resolve a parser callable from the loaded module.
+
+ Looks up the name in the module's ``NAME_TO_PARSER`` mapping first, then
+ falls back to a module-level attribute of the same name.
+
+ Args:
+ log_parsers: The imported ``log_parsers`` module.
+ log_parser_name (str): The name of the parser to resolve.
+
+ Returns:
+ Callable[[str], dict[str, str]] | None: The resolved parser callable, or
+ ``None`` if no parser matches the name.
+ """
+ name_to_parser = getattr(log_parsers, "NAME_TO_PARSER", {}) or {}
+ return name_to_parser.get(log_parser_name) or getattr(log_parsers, log_parser_name, None)
+
+
+def _as_list(value: Any) -> list[str]:
+ """Coerce a test-command/install/list field to a list of strings.
+
+ Accepts the value as a JSON-encoded string, a bare string, or a list. A
+ JSON-encoded string is parsed and coerced recursively; a bare string that
+ fails to parse is wrapped in a single-element list.
+
+ Args:
+ value (Any): The field value to coerce. May be ``None``, a string, a
+ list, a tuple, or any other type.
+
+ Returns:
+ list[str]: The value normalized to a list of strings. An empty list is
+ returned for ``None`` or an empty string.
+ """
+ if value is None:
+ return []
+ if isinstance(value, str):
+ text = value.strip()
+ if not text:
+ return []
+ if text[0] in "[{":
+ try:
+ parsed = json.loads(text)
+ except (ValueError, TypeError):
+ return [value]
+ return _as_list(parsed)
+ return [value]
+ if isinstance(value, (list, tuple)):
+ return [str(v) for v in value]
+ return [str(value)]
+
+
+class SweRebenchHarness(SweTaskHarness):
+ """Flat, host-graded harness for the swe-rebench benchmark family.
+
+ Applies the model and test patches, runs the install/test commands, then
+ parses the test log host-side using a repo-specific parser loaded
+ dynamically from the cloned SWE-rebench-V2 repository.
+ """
+
+ name = "swe-rebench"
+ grade_strategy = "flat-host-grade"
+
+ def build_spec(self, task: SweTask) -> SandboxSpec:
+ """Build the sandbox spec for a swe-rebench task.
+
+ Sets the git and ``_JAVA_OPTIONS`` environment variables, merges any
+ task-provided env, and forwards TTL, readiness timeout, resources, and
+ provider options from the task metadata.
+
+ Args:
+ task (SweTask): The task to build a sandbox specification for.
+
+ Returns:
+ SandboxSpec: The sandbox specification for running the task.
+ """
+ # _JAVA_OPTIONS makes the JVM prefer IPv4 addresses for SWE-rebench tasks.
+ env = {
+ "GIT_CONFIG_GLOBAL": "/dev/null",
+ "GIT_PAGER": "cat",
+ "_JAVA_OPTIONS": _JAVA_OPTIONS,
+ }
+ env.update(task.metadata.get("env", {}))
+ return SandboxSpec(
+ image=task.image,
+ workdir=task.repo_workdir,
+ ttl_s=task.metadata.get("ttl_s", 1800),
+ ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+ env=env,
+ metadata={
+ "instance_id": task.instance_id[:63],
+ "benchmark": task.benchmark,
+ "harness": self.name,
+ },
+ resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+ provider_options=dict(task.metadata.get("provider_options", {})),
+ )
+
+ def supports_provider(self, provider_name: str) -> bool:
+ """Report whether the harness supports a given sandbox provider.
+
+ Being flat and host-graded, it works on any exec-capable provider.
+
+ Args:
+ provider_name (str): The name of the sandbox provider.
+
+ Returns:
+ bool: Always ``True``.
+ """
+ return True # flat, host-graded: works on any exec-capable provider
+
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Apply patches, run install and test commands, and collect artifacts.
+
+ Applies the model patch then the test patch (both best-effort), runs the
+ non-fatal install commands, then runs the test block with the eval
+ timeout. Records whether the model patch applied for informational
+ purposes only; grading does not gate on it.
+
+ Args:
+ env (AsyncSweEnvironment): The environment used to execute commands
+ inside the sandbox.
+ task (SweTask): The task being evaluated.
+
+ Returns:
+ EvalArtifacts: The captured test output, return code, model-patch
+ application status, and raw error metadata.
+ """
+ workdir = task.repo_workdir
+ install_config = task.metadata.get("install_config", {}) or {}
+ install_cmds = _as_list(install_config.get("install"))
+ test_cmds = _as_list(install_config.get("test_cmd")) or ([task.test_command] if task.test_command else [])
+
+ # Apply the model patch first, then the test patch. Both are best-effort:
+ # a failed apply still runs the tests; model-patch application is recorded
+ # for info only (grading does not gate on it).
+ patch_applied = True
+ if task.model_patch:
+ applied = await env.execute(
+ f"git apply {_APPLY_FLAGS} /root/patch.diff",
+ cwd=workdir,
+ )
+ patch_applied = applied["returncode"] == 0
+ if task.test_patch:
+ await env.execute(f"git apply {_APPLY_FLAGS} /root/test_patch.diff", cwd=workdir)
+
+ # Install commands are non-fatal; failures there should not abort the
+ # test run.
+ for cmd in install_cmds:
+ await env.execute(cmd, cwd=workdir)
+
+ test_block = "\n".join(test_cmds) if test_cmds else "python -m pytest -rA -q"
+ # Thread the eval timeout into the test exec, defaulting to 1800s so a
+ # stuck swe-rebench run is bounded. A row that explicitly carries a
+ # ``tests_timeout`` overrides the default.
+ result = await env.execute(
+ test_block,
+ cwd=workdir,
+ is_eval=True,
+ timeout_s=task.metadata.get("tests_timeout", 1800),
+ )
+ return EvalArtifacts(
+ test_output=result["output"],
+ return_code=result["returncode"],
+ patch_applied=patch_applied,
+ raw={"error_type": result.get("error_type")},
+ )
+
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade a swe-rebench task from its evaluation artifacts.
+
+ Masks infra failures (sandbox/timeout) and grading errors (missing clone,
+ unknown parser, parser crash) via ``error_kind`` rather than scoring them.
+ Otherwise parses the test output with the resolved repo-specific parser
+ and marks the task resolved when every FAIL_TO_PASS and PASS_TO_PASS test
+ is in the passed set.
+
+ Args:
+ task (SweTask): The task being graded.
+ artifacts (EvalArtifacts): The artifacts captured during evaluation.
+
+ Returns:
+ SweEvalReport: The grading report, with ``resolved`` set on success
+ or ``error_kind`` set when the sample is masked.
+ """
+ # Infra failure -> mask via error_kind (never scored as "unresolved").
+ if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=artifacts.raw["error_type"],
+ )
+
+ install_config = task.metadata.get("install_config", {}) or {}
+ log_parser_name = install_config.get("log_parser", "")
+ # The cloned SWE-rebench-V2 dir is provisioned out-of-band; its absence,
+ # an unknown parser name, or a parser crash all mask the sample via
+ # ``error_kind`` rather than mis-scoring it.
+ rebench_repo_dir = task.metadata.get("rebench_repo_dir")
+ if not rebench_repo_dir:
+ return self._masked(task, artifacts, "eval_error")
+ try:
+ log_parsers = _load_rebench_log_parsers(Path(rebench_repo_dir))
+ parser = _resolve_parser(log_parsers, log_parser_name)
+ if parser is None:
+ return self._masked(task, artifacts, "eval_error")
+ results = parser(artifacts.test_output)
+ except Exception:
+ return self._masked(task, artifacts, "eval_error")
+
+ results = {_normalize_test_name(k): v for k, v in (results or {}).items()}
+ passed_set = {k for k, v in results.items() if v == "PASSED"}
+ fail_to_pass_set = {_normalize_test_name(n) for n in task.fail_to_pass}
+ pass_to_pass_set = {_normalize_test_name(n) for n in task.pass_to_pass}
+
+ # Resolution rule: every FAIL_TO_PASS and PASS_TO_PASS test must be in the
+ # passed set. Resolution is not gated on patch application, and the
+ # F2P/P2P sets are not required to be non-empty (an empty set is a subset
+ # of any set).
+ resolved = (fail_to_pass_set <= passed_set) and (pass_to_pass_set <= passed_set)
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=resolved,
+ patch_applied=artifacts.patch_applied,
+ patch_exists=bool(task.model_patch),
+ tests_status={"passed": sorted(passed_set), "all": results},
+ )
+
+ @staticmethod
+ def _masked(task: SweTask, artifacts: EvalArtifacts, kind: str) -> SweEvalReport:
+ """Build a masked report that records a grading error instead of a score.
+
+ Args:
+ task (SweTask): The task being graded.
+ artifacts (EvalArtifacts): The artifacts captured during evaluation.
+ kind (str): The error kind to record on the report.
+
+ Returns:
+ SweEvalReport: A report with ``error_kind`` set and no resolution.
+ """
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=kind,
+ )
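Reviewer sketch (not part of the patch): the swe-rebench resolution rule above can be exercised in isolation. The regexes are copied from `_REBENCH_TIMING_NORMALIZE_RES`; the names `TIMING_PATTERNS`, `normalize`, `is_resolved`, and the sample `parser_output` are illustrative assumptions, not identifiers from the module.

```python
import re

# The three timing-suffix patterns, copied from _REBENCH_TIMING_NORMALIZE_RES.
TIMING_PATTERNS = [
    re.compile(r"\s*\[\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\]\s*$", re.IGNORECASE),
    re.compile(r"\s+in\s+\d+(?:\.\d+)?\s+(?:msec|sec)\b", re.IGNORECASE),
    re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\)\s*$", re.IGNORECASE),
]


def normalize(name: str) -> str:
    """Strip trailing timing annotations, then surrounding whitespace."""
    for pattern in TIMING_PATTERNS:
        name = pattern.sub("", name)
    return name.strip()


def is_resolved(results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Every F2P and P2P test must be in the passed set (empty sets pass vacuously)."""
    normalized = {normalize(k): v for k, v in results.items()}
    passed = {k for k, v in normalized.items() if v == "PASSED"}
    f2p = {normalize(n) for n in fail_to_pass}
    p2p = {normalize(n) for n in pass_to_pass}
    return f2p <= passed and p2p <= passed


# Hypothetical parser output carrying each of the three suffix styles.
parser_output = {
    "test_a [12.5 ms]": "PASSED",
    "test_b in 3 msec": "PASSED",
    "test_c (0.02s)": "FAILED",
}
```

With this sample, `is_resolved(parser_output, ["test_a"], ["test_b"])` holds, while listing `test_c` in either set does not. Because an empty set is a subset of any set, a task with no F2P/P2P ids resolves even when nothing passed, which is worth keeping in mind when reading resolve rates.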
diff --git a/resources_servers/swe_bench/harnesses/swebench.py b/resources_servers/swe_bench/harnesses/swebench.py
new file mode 100644
index 0000000000..563c6ae614
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swebench.py
@@ -0,0 +1,274 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-bench / swe-bench-multilingual harness — host-side (flat) grading.
+
+A single parametrized class serves both families. It runs the instance's official SWE-bench
+eval script (``swebench.make_test_spec(...).eval_script``) inside the sandbox and grades the
+produced log host-side with swebench's per-repo log parser, so it runs on any exec-capable
+provider (docker / opensandbox).
+
+NOTE: the apptainer-only nested ``run_local_evaluation`` path was removed when PR #1694 took
+ownership of the apptainer provider. The swe_env-specific nested-apptainer grading (mounts/.sif
+wiring + run_local_evaluation) is tracked for a follow-up PR (see APPTAINER_PR3_TRACKER.md).
+"""
+
+from __future__ import annotations
+
+import dataclasses
+import os
+import tempfile
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+ EvalArtifacts,
+ GraderDependencyError,
+ SweEvalReport,
+ SweTask,
+ SweTaskHarness,
+ _ensure_trailing_newline,
+ compute_resolved,
+)
+from resources_servers.swe_bench.harnesses import flat_eval
+
+
+if TYPE_CHECKING:
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# Per-test status tokens swebench's repo parsers emit that count as a pass.
+_SWEBENCH_PASS_STATUSES = frozenset({"PASSED", "XFAIL"})
+
+# swe-bench families this harness serves.
+_VALID_NAMES = frozenset({"swe-bench", "swe-bench-multilingual"})
+
+
+class SweBenchHarness(SweTaskHarness):
+ """SWE-bench (and multilingual) harness, host-side (flat) graded.
+
+ Runs the instance's official eval script in the sandbox and parses the log host-side with
+ swebench's per-repo parser. Construct one instance per family
+ (``SweBenchHarness("swe-bench")`` / ``SweBenchHarness("swe-bench-multilingual")``).
+ """
+
+ grade_strategy = "flat-host-grade"
+
+ def __init__(self, name: str = "swe-bench") -> None:
+ """Initialize the harness for a given swe-bench family.
+
+ Args:
+ name: The swe-bench family to serve (``"swe-bench"`` or ``"swe-bench-multilingual"``).
+
+ Raises:
+ ValueError: If ``name`` is not a known swe-bench family.
+ """
+ if name not in _VALID_NAMES:
+ raise ValueError(f"Unknown swe-bench family: {name!r} (expected one of {sorted(_VALID_NAMES)})")
+ self.name = name
+
+ # --- provisioning --------------------------------------------------------
+
+ def build_spec(self, task: SweTask) -> SandboxSpec:
+ """Build the sandbox spec for a task.
+
+ Args:
+ task: The task to provision a sandbox for.
+
+ Returns:
+ A ``SandboxSpec`` describing the image, workdir, environment, and any provider
+ options carried on the task. Flat grading runs the eval script directly in the
+ instance image, so no host harness/venv mounts are needed.
+ """
+ return SandboxSpec(
+ image=task.image,
+ workdir=task.repo_workdir,
+ ttl_s=task.metadata.get("ttl_s", 1800),
+ ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+ env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+ metadata={
+ "instance_id": task.instance_id[:63],
+ "benchmark": task.benchmark,
+ "harness": self.name,
+ },
+ resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+ provider_options=dict(task.metadata.get("provider_options", {})),
+ )
+
+ async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+ """Write the bare ``/root/patch.diff`` the eval script applies.
+
+ Args:
+ env: The environment used to write files into the sandbox.
+ task: The task whose model patch is staged for the eval script (newline-normalized
+ so the upstream ``git apply`` succeeds).
+ """
+ if task.model_patch:
+ await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+
+ def _flat_eval_script(self, task: SweTask) -> str:
+ """Build the official SWE-bench eval script for host-side (flat) grading.
+
+ Uses the ``swebench`` library's ``make_test_spec(...).eval_script`` (the per-repo recipe),
+ prefixed with a step that applies the model patch from ``/root/patch.diff``. Returns an
+ empty string if the instance dict is unavailable or the spec cannot be built, in which
+ case the flat grader masks the sample as an eval error rather than scoring 0.
+
+ Args:
+ task: The task whose ``metadata['instance_dict']`` describes the SWE-bench instance.
+
+ Returns:
+ The eval-script text, or ``""`` when it cannot be constructed.
+ """
+ instance = task.metadata.get("instance_dict")
+ if not instance:
+ return ""
+ try:
+ from swebench.harness.test_spec.test_spec import make_test_spec
+
+ spec = make_test_spec(instance, namespace="swebench")
+ except Exception:
+ return ""
+ # Mirror main's GIT_APPLY ladder (swebench/harness/run_evaluation.py GIT_APPLY_CMDS):
+ # try each apply command in order, breaking on the first rc==0, and never write
+ # conflict markers into the tree (no --3way). The trailing `echo` only fires when
+ # every command failed.
+ apply_model = (
+ "cd /testbed && "
+ "(git apply --verbose /root/patch.diff || "
+ "git apply --verbose --reject /root/patch.diff || "
+ "patch --batch --fuzz=5 -p1 -i /root/patch.diff || "
+ "echo 'NEMO_GYM_PATCH_APPLY_FAILED')\n"
+ )
+ return apply_model + spec.eval_script
+
+ # --- server-private grading ----------------------------------------------
+
+ async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+ """Run the instance's eval script in-sandbox and collect its log.
+
+ Args:
+ env: The environment used to execute commands in the sandbox.
+ task: The task to evaluate.
+
+ Returns:
+ An ``EvalArtifacts`` carrying the captured test output, return code, whether a patch
+ existed, and the flat-eval markers.
+ """
+ if not task.metadata.get("eval_script"):
+ task = dataclasses.replace(task, metadata={**task.metadata, "eval_script": self._flat_eval_script(task)})
+ return await flat_eval.flat_run_eval(env, task)
+
+ def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+ """Grade a task from its evaluation artifacts (host-side, flat).
+
+ The SWE-bench family spans repos with different test runners (pytest, django's unittest
+ runner, etc.). The generic flat parser is pytest-only and silently scores non-pytest
+ repos (e.g. django) unresolved — even the gold patch. Grade with swebench's official
+ per-repo log parser; if ``swebench`` cannot be imported for a real SWE-bench instance
+ this raises ``GraderDependencyError`` (fail loud) rather than silently mis-scoring. The
+ generic parser is used only for the legitimate cases where there is no instance dict or
+ the eval spec cannot be built (matching main's behavior for unbuildable instances).
+
+ Args:
+ task: The task being graded.
+ artifacts: The evaluation artifacts produced by ``run_eval``.
+
+ Returns:
+ A ``SweEvalReport`` recording resolution, patch state, and any error kind.
+
+ Raises:
+ GraderDependencyError: If ``swebench`` is unavailable for a real SWE-bench instance.
+ """
+ report = self._swebench_flat_grade(task, artifacts)
+ return report if report is not None else flat_eval.flat_grade(task, artifacts)
+
+ def _swebench_flat_grade(self, task: SweTask, artifacts: EvalArtifacts) -> "SweEvalReport | None":
+ """Grade a flat eval log with swebench's official per-repo log parser.
+
+ The generic :func:`flat_eval.flat_grade` parser only recognises pytest-style
+ ``PASSED `` lines, so repos with other test runners (e.g. django's unittest
+ runner) parse as zero passing tests and grade unresolved — even for the gold patch.
+ This path uses ``swebench.harness.grading.get_logs_eval`` (the same per-repo parser the
+ nested harness uses), keeping docker flat grading faithful to the official result.
+
+ Args:
+ task: The task being graded (supplies the instance dict + fail/pass test ids).
+ artifacts: The artifacts produced by :func:`flat_eval.flat_run_eval`.
+
+ Returns:
+ A ``SweEvalReport`` with the official verdict, or ``None`` when there is no instance
+ dict or the eval spec cannot be built (caller falls back to the generic parser).
+
+ Raises:
+ GraderDependencyError: If ``swebench`` cannot be imported for a real SWE-bench
+ instance (fail loud rather than silently degrading to the generic parser).
+ """
+ # Mirror flat_grade's infra masks so a genuine sandbox/timeout never scores 0. An
+ # unbuildable/empty eval spec is NOT masked here (it grades unmasked unresolved via
+ # the generic parser fallback below), matching main's behavior.
+ error_type = artifacts.raw.get("error_type")
+ if error_type in {"sandbox", "timeout"}:
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ patch_applied=artifacts.patch_applied,
+ error_kind=error_type,
+ )
+ instance = task.metadata.get("instance_dict")
+ if not instance:
+ return None
+ try:
+ from swebench.harness.constants import FAIL_ONLY_REPOS
+ from swebench.harness.grading import get_logs_eval
+ from swebench.harness.test_spec.test_spec import make_test_spec
+ except Exception as exc:
+ # Fail loud instead of degrading to the generic pytest-only parser, which mis-scores
+ # non-pytest repos (e.g. django) as unresolved even for a correct patch. swebench is a
+ # pinned hard dependency (requirements.txt: swebench==4.1.0); a missing/broken install
+ # is a misconfiguration that must surface, not silently skew the SWE-bench resolve rate.
+ raise GraderDependencyError(
+ "swebench is required to grade SWE-bench instances faithfully (per-repo log "
+ "parsers) but could not be imported; install the pinned 'swebench==4.1.0'."
+ ) from exc
+ log_fp = None
+ try:
+ spec = make_test_spec(instance, namespace="swebench")
+ with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as handle:
+ handle.write(artifacts.test_output or "")
+ log_fp = handle.name
+ status_map, markers_found = get_logs_eval(spec, log_fp)
+ except Exception:
+ return None
+ finally:
+ if log_fp is not None and os.path.exists(log_fp):
+ os.unlink(log_fp)
+ passed = [node for node, status in status_map.items() if status in _SWEBENCH_PASS_STATUSES]
+ # Select the eval type per-repo exactly as swebench.harness.grading.get_eval_report:
+ # FAIL_ONLY_REPOS (the JS multilingual repos) use the fail-only resolution rule.
+ eval_type = "fail_only" if spec.repo in FAIL_ONLY_REPOS else "pass_and_fail"
+ resolved = bool(markers_found) and compute_resolved(
+ fail_to_pass=task.fail_to_pass,
+ pass_to_pass=task.pass_to_pass,
+ passed=passed,
+ eval_type=eval_type,
+ status_map=status_map,
+ )
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ resolved=resolved,
+ patch_applied=bool(markers_found),
+ patch_exists=bool(task.model_patch),
+ tests_status={"passed": passed, "all": status_map},
+ )
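Reviewer sketch (not part of the patch): `_swebench_flat_grade` hands an in-memory log to a path-based parser by writing a temporary file and removing it in `finally`. The pattern can be tried on its own as below; `count_nonempty_lines` is a hypothetical stand-in for a path-only parser such as `get_logs_eval`, and `parse_text_via_tempfile` is likewise an illustrative name.

```python
import os
import tempfile


def count_nonempty_lines(path: str) -> int:
    """Hypothetical stand-in for a parser that only accepts a file path."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def parse_text_via_tempfile(text: str) -> int:
    """Write text to a temp file, parse it by path, and always clean up."""
    log_fp = None
    try:
        # delete=False keeps the file on disk after the handle closes, so the
        # parser can reopen it by name; cleanup is then done explicitly.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".log", delete=False, encoding="utf-8"
        ) as handle:
            handle.write(text or "")
            log_fp = handle.name
        return count_nonempty_lines(log_fp)
    finally:
        if log_fp is not None and os.path.exists(log_fp):
            os.unlink(log_fp)
```

Tracking `log_fp` outside the `with` block means the `finally` clause removes the file whether the parser returns or raises, which is the same shape the harness uses.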
diff --git a/resources_servers/swe_bench/parsing/__init__.py b/resources_servers/swe_bench/parsing/__init__.py
new file mode 100644
index 0000000000..a9de18198d
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/__init__.py
@@ -0,0 +1,52 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE-Bench-Ext test-output parser.
+
+Provides the per-framework parsers, framework output config, and the
+resolution helper used by SWE harnesses for host-side grading. This
+``__init__`` re-exports the public symbols so callers can import them from a
+single location, e.g.::
+
+ from resources_servers.swe_bench.parsing import (
+ parse_and_check_tests,
+ get_framework_config,
+ get_test_command_with_output,
+ )
+"""
+
+from resources_servers.swe_bench.parsing.frameworks import (
+ FRAMEWORK_CONFIGS,
+ get_framework_config,
+ get_test_command_with_output,
+)
+from resources_servers.swe_bench.parsing.parsing import (
+ normalize_test_id,
+ parse_test_output,
+)
+from resources_servers.swe_bench.parsing.utils import parse_and_check_tests
+
+
+__all__ = [
+ # High-level grading entry point (F2P/P2P resolution).
+ "parse_and_check_tests",
+ # Framework output config + command augmentation.
+ "FRAMEWORK_CONFIGS",
+ "get_framework_config",
+ "get_test_command_with_output",
+ # Framework dispatcher + test-id normalization.
+ "parse_test_output",
+ "normalize_test_id",
+]
diff --git a/resources_servers/swe_bench/parsing/frameworks.py b/resources_servers/swe_bench/parsing/frameworks.py
new file mode 100644
index 0000000000..7de570c491
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/frameworks.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Test framework output configuration mapping."""
+
+from typing import Dict
+
+
+FRAMEWORK_CONFIGS: Dict[str, Dict] = {
+ "pytest": {
+ "output_flag": "--junitxml=/workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "unittest": {
+ "output_flag": "--junitxml=/workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "go": {
+ "output_flag": "-json",
+ "result_file": None,
+ },
+ "jest": {
+ "output_flag": "--json --outputFile=/workspace/test-results/output.json",
+ "result_file": "/workspace/test-results/output.json",
+ },
+ "vitest": {
+ "output_flag": "--reporter=json --outputFile=/workspace/test-results/output.json",
+ "result_file": "/workspace/test-results/output.json",
+ },
+ "mocha": {
+ "output_flag": "--reporter json --reporter-options output=/workspace/test-results/output.json",
+ "result_file": "/workspace/test-results/output.json",
+ },
+ "bun": {
+ "output_flag": None, # Bun has no structured JSON output flag by default
+ "result_file": None, # Parse from stdout
+ },
+ "junit": {
+ "output_flag": None,
+ "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml",
+ },
+ "maven": {
+ "output_flag": None,
+ "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml",
+ },
+ "gtest": {
+ "output_flag": "--gtest_output=json:/workspace/test-results/output.json",
+ "result_file": "/workspace/test-results/output.json",
+ },
+ "cargo-nextest": {
+ "output_flag": None, # Profile is already in test_command
+ "result_file": None, # JUnit XML is output to repo/junit.xml by profile config
+ },
+ "ctest": {
+ "output_flag": "--output-on-failure --output-junit /workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "xctest": {
+ # For SwiftPM with XCTest framework
+ "output_flag": "--parallel --num-workers=1 --xunit-output /workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "testing": {
+ # For SwiftPM with new Swift Testing framework (Swift 6+)
+ "output_flag": "--disable-xctest --parallel --xunit-output /workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "cppunit": {
+ "output_flag": None,
+ "result_file": None,
+ },
+ # Lua test frameworks - Tier 1 (Standard XML output)
+ "busted": {
+ "output_flag": "--output=junit",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ "luaunit": {
+ "output_flag": "-o junit -n /workspace/test-results/output.xml",
+ "result_file": "/workspace/test-results/output.xml",
+ },
+ # Lua test frameworks - Tier 2 (Custom parsers)
+ "telescope": {
+ "output_flag": None,
+ "result_file": None,
+ },
+ "lust": {
+ "output_flag": None,
+ "result_file": None,
+ },
+ "minitest": {
+ "output_flag": None,
+ "result_file": None,
+ },
+ "bespoke_libgeos": {
+ "output_flag": None,
+ "result_file": None,
+ },
+ # TAP (Test Anything Protocol) - used by tape, node-tap
+ "tap": {
+ "output_flag": None, # TAP outputs to stdout
+ "result_file": None, # Parse from stdout
+ },
+ "tape": {
+ "output_flag": None, # tape outputs TAP to stdout
+ "result_file": None, # Parse from stdout
+ },
+ # Hardhat (Solidity) - uses Mocha under the hood
+ "hardhat": {
+ "output_flag": None, # Uses Mocha console reporter by default
+ "result_file": None, # Parse from stdout
+ },
+}
+
+
+def get_framework_config(framework: str, test_command: str = "") -> Dict:
+ """Get configuration for a test framework.
+
+ Args:
+ framework: Test framework name.
+ test_command: The test command (optional; used to detect Gradle vs. Maven and Swift Testing vs. XCTest).
+ """
+ config = FRAMEWORK_CONFIGS.get(
+ framework,
+ {
+ "output_flag": None,
+ "result_file": None,
+ },
+ )
+
+ # Special handling for JUnit: detect Gradle vs Maven from command
+ if framework == "junit" and test_command:
+ if "gradlew" in test_command or "gradle " in test_command:
+ # Gradle uses different output location than Maven
+ # The trailing * in */build/test-results* matches both standard Gradle (test/) and Android (testDebugUnitTest/)
+ config = {
+ "output_flag": None,
+ "result_file": "find:/workspace/repo:*/build/test-results*:TEST-*.xml",
+ }
+
+ # Special handling for xctest: detect Swift Testing vs XCTest from command
+ # When --disable-xctest is used, the task is using Swift Testing, not XCTest
+ # Use the 'testing' framework config to avoid adding XCTest-only flags like --num-workers
+ if framework == "xctest" and test_command:
+ if "--disable-xctest" in test_command:
+ config = FRAMEWORK_CONFIGS.get("testing", config)
+
+ return config
+
+
+def get_test_command_with_output(base_command: str, framework: str) -> str:
+ """
+ Add structured output flags to test command.
+
+ Returns: The command with the framework's output flag appended, or the base command unchanged when the framework has none.
+ """
+ config = get_framework_config(framework, base_command)
+ output_flag = config.get("output_flag")
+
+ enhanced = f"{base_command} {output_flag}" if output_flag else base_command
+
+ return enhanced
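Reviewer sketch (not part of the patch): the lookup and command-augmentation logic in `frameworks.py` can be checked with a trimmed table. `CONFIGS` below copies only the `go` and `junit` entries from `FRAMEWORK_CONFIGS`; the function names `framework_config` and `command_with_output` are illustrative stand-ins for `get_framework_config` and `get_test_command_with_output`, and the XCTest branch is omitted.

```python
from typing import Dict, Optional

# Two entries copied from FRAMEWORK_CONFIGS, for illustration only.
CONFIGS: Dict[str, Dict[str, Optional[str]]] = {
    "go": {"output_flag": "-json", "result_file": None},
    "junit": {
        "output_flag": None,
        "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml",
    },
}


def framework_config(framework: str, test_command: str = "") -> Dict[str, Optional[str]]:
    """Look up a framework, falling back to an empty config for unknown names."""
    config = CONFIGS.get(framework, {"output_flag": None, "result_file": None})
    # JUnit under Gradle writes results to build/test-results, not surefire-reports.
    if framework == "junit" and ("gradlew" in test_command or "gradle " in test_command):
        config = {
            "output_flag": None,
            "result_file": "find:/workspace/repo:*/build/test-results*:TEST-*.xml",
        }
    return config


def command_with_output(base_command: str, framework: str) -> str:
    """Append the framework's structured-output flag when it has one."""
    flag = framework_config(framework, base_command).get("output_flag")
    return f"{base_command} {flag}" if flag else base_command
```

Under these assumptions a Go command gains `-json`, a Gradle-driven JUnit command is returned unchanged but resolves to the `build/test-results*` location, and an unknown framework yields a config with both fields set to `None`.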
diff --git a/resources_servers/swe_bench/parsing/parsing.py b/resources_servers/swe_bench/parsing/parsing.py
new file mode 100644
index 0000000000..800586adad
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/parsing.py
@@ -0,0 +1,1606 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Test parsing utilities for build.py.
+
+Helper functions for:
+- Separating test and gold patches
+- Parsing JUnit XML and JSON test outputs
+"""
+
+import json
+import re
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Dict, Optional, Tuple
+
+
+def read_patch(path: Path, skip_binary: bool = False) -> str:
+ """
+ Read the text content of a patch file, optionally skipping binary files.
+ """
+ parts = split_patch(path, skip_binary=skip_binary)
+ return "".join([diff for _, diff in parts])
+
+
+def split_patch(patch_path: Path, skip_binary: bool = False) -> list[Tuple[str, str]]:
+ """
+ Read a patch and partition by file.
+
+ Args:
+ patch_path (Path) - The patch file to split
+ skip_binary (bool) - Whether to exclude binary files
+
+ Returns: List of (filename, patch content) tuples
+ """
+ content = patch_path.read_text()
+ parts = []
+
+ # Split by file changes (each starts with "diff --git")
+ file_diffs = re.split(r"(diff --git.*?)(?=diff --git|\Z)", content, flags=re.DOTALL)
+
+ for i in range(0, len(file_diffs), 2):
+ if i + 1 >= len(file_diffs):
+ continue
+
+ header = file_diffs[i]
+ content = file_diffs[i + 1]
+ full_diff = header + content
+
+ # Extract filename from diff header
+ file_match = re.search(r"diff --git a/(.*?) b/", full_diff)
+ if not file_match:
+ continue
+
+ filepath = file_match.group(1)
+
+ if skip_binary:
+ binary_match = re.search(r"^GIT binary patch$", full_diff, flags=re.MULTILINE)
+ if binary_match:
+ continue
+
+ parts.append((filepath, full_diff))
+
+ return parts
+
+
+def _parse_embedded_test_results(text_output: str, test_prefix: str = "") -> Dict[str, str]:
+ """Parse embedded test results from system-out text.
+
+ This handles cases like wolfssl where a single ctest testcase runs many individual tests
+    and outputs them in a specific format within <system-out>.
+
+ Expected formats:
+ - " 1: test_name : passed ( 0.00016)"
+ - " 2: test_name : failed ( 0.00016)"
+ - " 3: test_name : skipped"
+ - "HMAC-MD5 test passed!"
+ - "RSA test failed!"
+
+ Args:
+        text_output: The text content from <system-out>
+ test_prefix: Prefix to add to test names (usually the testcase name)
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ """
+ results = {}
+
+ # Pattern 1: Numbered test format (wolfssl API tests)
+ # Format: " 1: test_name : passed ( 0.00016)"
+ numbered_pattern = re.compile(
+ r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*?)\s*:\s*(passed|failed|skipped)", re.MULTILINE | re.IGNORECASE
+ )
+
+ for match in numbered_pattern.finditer(text_output):
+ test_name = match.group(1).strip()
+ status = match.group(2).lower()
+
+ # Build test ID with prefix
+ if test_prefix:
+ test_id = f"{test_prefix}::{test_name}"
+ else:
+ test_id = test_name
+
+ if status == "passed":
+ results[test_id] = "PASSED"
+ elif status == "failed":
+ results[test_id] = "FAILED"
+ elif status == "skipped":
+ results[test_id] = "SKIPPED"
+
+ # Pattern 2: Unit test format (wolfssl unit tests)
+ # Format: "HMAC-MD5 test passed!"
+ # Only match lines that don't contain '---' (separator lines)
+ # Use [ \t] instead of \s to avoid matching newlines
+ unit_pattern = re.compile(
+ r"^([A-Za-z0-9_\-/]+(?:[ \t]+[A-Za-z0-9_\-/]+){0,5}?)[ \t]+test[ \t]+(passed|failed)!",
+ re.MULTILINE | re.IGNORECASE,
+ )
+
+ for match in unit_pattern.finditer(text_output):
+ test_name = match.group(1).strip()
+ status = match.group(2).lower()
+
+ # Skip if the test name contains special characters indicating it's not a real test
+ if "---" in test_name or len(test_name) > 50:
+ continue
+
+ # Build test ID with prefix
+ if test_prefix:
+ test_id = f"{test_prefix}::{test_name}"
+ else:
+ test_id = test_name
+
+ if status == "passed":
+ results[test_id] = "PASSED"
+ elif status == "failed":
+ results[test_id] = "FAILED"
+
+ # Pattern 3: FAILURES section (wolfssl API tests)
+ # Format: "FAILURES:\n 892: test_wolfSSL_CTX_load_verify_locations"
+ failures_section = re.search(r"FAILURES:\s*\n(.*?)(?:\n\s*End|$)", text_output, re.DOTALL)
+ if failures_section:
+ failure_pattern = re.compile(r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*)", re.MULTILINE)
+ for match in failure_pattern.finditer(failures_section.group(1)):
+ test_name = match.group(1).strip()
+ if test_prefix:
+ test_id = f"{test_prefix}::{test_name}"
+ else:
+ test_id = test_name
+ # Mark as failed (this overrides any previous 'passed' if it exists)
+ results[test_id] = "FAILED"
+
+ return results
+
+
+def parse_junit_xml(xml_content: str) -> Optional[Dict[str, str]]:
+ """Parse JUnit XML to extract test results.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (import errors, syntax errors, etc.), if we find valid
+ XML test results, we parse and return them. We only return None if we're certain
+ the framework didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If test framework failed to run (not the same as tests failing)
+ """
+ results = {}
+ found_any_xml = False
+
+ # PRIORITY 1 & 2: Try to parse XML documents (pure or mixed with other output)
+ # Handle multiple concatenated XML documents (from multiple test result files)
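+    # Split by "<?xml" so that each XML declaration starts its own document
+    for i, doc in enumerate(xml_content.split("<?xml")):
+        if not doc.strip():
+            continue
+        if i > 0:
+            doc = "<?xml" + doc
+
+        # Locate the root element, skipping any non-XML text that precedes it
+        xml_start = doc.find("<testsuites")
+        if xml_start == -1:
+            xml_start = doc.find("<testsuite")
+        if xml_start == -1:
+            continue
+
+        xml_end = doc.find("</testsuites>", xml_start)
+        if xml_end > xml_start:
+            xml_extracted = doc[xml_start : xml_end + len("</testsuites>")]
+        else:
+            xml_end = doc.find("</testsuite>", xml_start)
+            if xml_end > xml_start:
+                xml_extracted = doc[xml_start : xml_end + len("</testsuite>")]
+            else:
+                continue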
+
+ try:
+ tree = ET.fromstring(xml_extracted)
+ found_any_xml = True
+ except ET.ParseError:
+ continue
+
+ # Parse all testcases from this document
+ for testcase in tree.iter("testcase"):
+ classname = testcase.get("classname", "")
+ name = testcase.get("name", "")
+ test_id = f"{classname}::{name}" if classname else name
+
+ # Check if this testcase has system-out with embedded test results
+ # This handles cases like wolfssl where a single ctest executable runs many tests
+ system_out = testcase.find("system-out")
+ embedded_results = {}
+ if system_out is not None and system_out.text:
+ embedded_results = _parse_embedded_test_results(system_out.text, classname or name)
+
+ if embedded_results:
+ # If we found embedded test results, use those instead of the testcase status
+ results.update(embedded_results)
+ elif testcase.find("failure") is not None or testcase.find("error") is not None:
+ results[test_id] = "FAILED"
+ elif testcase.find("skipped") is not None:
+ results[test_id] = "SKIPPED"
+ else:
+ results[test_id] = "PASSED"
+
+ # PRIORITY 3: If we found NO valid XML and NO results, check for error indicators
+ # Only return None if we're certain the framework failed to run
+ if not found_any_xml and not results:
+ error_indicators = [
+ "ERROR: ", # Generic error marker
+ "ImportError:", # Python import errors
+ "ModuleNotFoundError:", # Python module errors
+ "SyntaxError:", # Python syntax errors
+ "FAILED ", # Framework failure markers
+ "INTERNALERROR", # pytest internal errors
+ "collection errors", # pytest collection errors
+ "error: ", # Generic error (C++, Swift, etc.)
+ "fatal error:", # Fatal compilation errors
+ "cannot find symbol", # Java compilation errors
+ "error: build had", # Swift build errors (xctest)
+ "error: terminated", # Swift process crashes (xctest)
+ ]
+ has_errors = any(indicator in xml_content for indicator in error_indicators)
+ # Return None ONLY if: no XML found AND errors present
+ # Return empty dict if: no XML found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_go_json(json_output: str) -> Optional[Dict[str, str]]:
+ """Parse Go test -json output (newline-delimited JSON).
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (module errors, build errors, etc.), if we find valid
+ test results JSON, we parse and return it. We only return None if we're certain
+ the tests didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If Go tests failed to run (not the same as tests failing)
+ """
+ results = {}
+ has_valid_json = False
+
+ # PRIORITY 1: Try to parse newline-delimited JSON (valid test output)
+ for line in json_output.strip().split("\n"):
+ if not line.strip():
+ continue
+ try:
+ event = json.loads(line)
+ has_valid_json = True # Found at least one valid JSON line
+ action = event.get("Action")
+
+ # Handle test-level events
+ if "Test" in event and action in ["pass", "fail", "skip"]:
+ test_name = event.get("Test", "")
+ if test_name:
+ package = event.get("Package", "")
+ test_id = f"{package}::{test_name}" if package else test_name
+
+ if action == "pass":
+ results[test_id] = "PASSED"
+ elif action == "fail":
+ results[test_id] = "FAILED"
+ elif action == "skip":
+ results[test_id] = "SKIPPED"
+
+ # Handle package-level failures (no Test field)
+ elif "Package" in event and "Test" not in event and action == "fail":
+ package = event.get("Package", "")
+ test_id = f"{package}::package"
+ results[test_id] = "FAILED"
+
+ except json.JSONDecodeError:
+ # PRIORITY 2: Handle plaintext build failures (legitimate failures)
+ # When tests can't compile/build, Go outputs plaintext "FAIL package [build failed]"
+ # This is a legitimate test failure, not a parsing error
+ build_fail_match = re.match(r"^FAIL\s+(\S+)\s+\[build failed\]", line)
+ if build_fail_match:
+ package_name = build_fail_match.group(1)
+ results[package_name] = "FAILED"
+ has_valid_json = True # Count build failures as valid results
+
+ # PRIORITY 3: If we found NO valid JSON and NO build failures, check for error indicators
+ if not has_valid_json and not results:
+ error_indicators = [
+ "go: cannot find main module", # Module not found
+ "can't load package", # Package loading errors
+ "pattern matches no packages", # No matching packages
+ "build constraints exclude all Go files", # Build constraints error
+ ]
+ has_errors = any(indicator in json_output for indicator in error_indicators)
+ # Return None ONLY if: no JSON found AND errors present
+ # Return empty dict if: no JSON found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_jest_vitest_json(json_output: str) -> Optional[Dict[str, str]]:
+ """Parse Jest/Vitest JSON output.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (TypeScript, npm, etc.), if we find valid
+ test results JSON, we parse and return it. We only return None if we're
+ certain the framework didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If Jest itself failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+ try:
+ data = json.loads(json_output.strip())
+ # If we got JSON, check if it has test results (even if errors exist elsewhere in output)
+ except json.JSONDecodeError:
+ # PRIORITY 2: Search for JSON markers in mixed output
+ # Even with errors in output, tests might have run and produced JSON
+ json_start = json_output.find('{"numFailed') # Jest format
+ if json_start == -1:
+ json_start = json_output.find('{"numTotalTest') # Vitest format
+ if json_start == -1:
+ json_start = json_output.find('{"test') # Alternative format
+ if json_start == -1:
+ # PRIORITY 3: No JSON found - NOW check if there are error indicators
+ # Only return None if we're sure tests didn't run (no results + errors present)
+ # NOTE: error_indicators are a LAST RESORT - we prefer finding test results
+ error_indicators = [
+ "error TS", # TypeScript compilation errors (e.g., error TS2307:)
+ "ELIFECYCLE", # npm script failures
+ "npm ERR!", # npm errors
+ "Error: Cannot find module", # Module loading errors (like Mocha)
+ "SyntaxError:", # JavaScript/TypeScript syntax errors
+ "Test suite failed to run", # Jest-specific: tests couldn't be loaded
+ "FAIL ", # Jest failure marker without JSON
+ ]
+ has_errors = any(indicator in json_output for indicator in error_indicators)
+ # Return None ONLY if: no JSON found AND errors present
+ # Return empty dict if: no JSON found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ # Try to extract JSON from mixed output
+ decoder = json.JSONDecoder()
+ try:
+ data, _ = decoder.raw_decode(json_output[json_start:])
+ except json.JSONDecodeError:
+ # Could not parse JSON even after finding marker
+ return None
+
+ # At this point, we have successfully parsed JSON
+ # Check if this is Jest's error response format (Jest itself failed, not the tests)
+ # Format: {"error": {"code": 2, "summary": "", "detail": ""}}
+ # This is a structured error response, NOT test results
+    if isinstance(data, dict) and isinstance(data.get("error"), dict) and "code" in data["error"]:
+ # This is an error response from Jest itself, not test results
+ return None
+
+ # Check if we have the expected test results structure
+    # If we have testResults, parse it even if tests failed - those are legitimate test results
+ if "testResults" in data:
+ for test_result in data.get("testResults", []):
+ file_path = test_result.get("name", "")
+ suite_status = test_result.get("status", "")
+ assertions = test_result.get("assertionResults", [])
+
+ # Handle suite-level failures (no assertions ran)
+ if suite_status == "failed" and len(assertions) == 0:
+ test_id = f"{file_path}::suite"
+ results[test_id] = "FAILED"
+ continue
+
+ # Handle individual test assertions
+ for assertion in assertions:
+ full_name = assertion.get("fullName", "")
+ title = assertion.get("title", "")
+ status = assertion.get("status", "")
+ test_id = f"{file_path}::{full_name}" if full_name else f"{file_path}::{title}"
+
+ if status == "passed":
+ results[test_id] = "PASSED"
+ elif status == "failed":
+ results[test_id] = "FAILED"
+ elif status in ["pending", "skipped"]:
+ results[test_id] = "SKIPPED"
+
+ # If we successfully parsed JSON but found no testResults, that's unexpected
+ # Return None to indicate this isn't valid test output
+ # (Valid Jest output should have testResults array, even if empty)
+ if "testResults" not in data:
+ return None
+
+ return results
+
+
+def parse_mocha_json(json_output: str) -> Optional[Dict[str, str]]:
+ """Parse Mocha JSON output.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (module errors, syntax errors, etc.), if we find valid
+ test results JSON, we parse and return it. We only return None if we're certain
+ the framework didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If Mocha itself failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+ try:
+ data = json.loads(json_output.strip())
+ # Validate this is Mocha JSON by checking for 'stats' key
+ if "stats" not in data:
+ data = None
+ except json.JSONDecodeError:
+ data = None
+
+ # PRIORITY 2: If direct parse failed, search for JSON in mixed output
+ if data is None:
+ # Look for stats key in JSON
+ stats_pos = json_output.find('"stats"')
+ if stats_pos == -1:
+ # PRIORITY 3: No JSON found - NOW check if there are error indicators
+ error_indicators = [
+ "Error: Cannot find module", # Module loading errors
+ "SyntaxError:", # JavaScript syntax errors
+ "TypeError:", # Type errors
+ "ReferenceError:", # Reference errors
+ "No test files found", # Mocha-specific: no tests found
+ ]
+ has_errors = any(indicator in json_output for indicator in error_indicators)
+ # Return None ONLY if: no JSON found AND errors present
+ # Return empty dict if: no JSON found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ # Find the opening brace before "stats"
+ json_start = json_output.rfind("{", 0, stats_pos)
+ if json_start == -1:
+ return None
+
+ # Try parsing from this position
+ json_portion = json_output[json_start:]
+
+ # Use json.JSONDecoder to find where the object ends
+ decoder = json.JSONDecoder()
+ try:
+ data, _ = decoder.raw_decode(json_portion)
+ except json.JSONDecodeError:
+ return None
+
+ # Validate extracted JSON has 'stats'
+ if "stats" not in data:
+ return None
+
+ # At this point, we have valid Mocha JSON with 'stats'
+ # Parse test results even if some tests failed - those are legitimate results
+
+ # Process passed tests
+ for test in data.get("passes", []):
+ file_path = test.get("file", "")
+ full_title = test.get("fullTitle", "")
+ test_id = f"{file_path}::{full_title}" if full_title else file_path
+ results[test_id] = "PASSED"
+
+ # Process failed tests
+ for test in data.get("failures", []):
+ file_path = test.get("file", "")
+ full_title = test.get("fullTitle", "")
+ test_id = f"{file_path}::{full_title}" if full_title else file_path
+ results[test_id] = "FAILED"
+
+ # Process pending/skipped tests
+ for test in data.get("pending", []):
+ file_path = test.get("file", "")
+ full_title = test.get("fullTitle", "")
+ test_id = f"{file_path}::{full_title}" if full_title else file_path
+ results[test_id] = "SKIPPED"
+
+ return results
+
+
+def parse_gtest_json(json_output: str) -> Optional[Dict[str, str]]:
+ """Parse Google Test JSON output.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (compilation errors, linking errors, etc.), if we find valid
+ test results JSON, we parse and return it. We only return None if we're certain
+ the tests didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If GTest itself failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+ try:
+ data = json.loads(json_output.strip())
+ # Validate this is GTest JSON by checking for 'testsuites' key
+ if "testsuites" not in data:
+ data = None
+ except json.JSONDecodeError:
+ data = None
+
+ # PRIORITY 2: If direct parse failed, search for JSON in mixed output
+ if data is None:
+ # Try to find JSON in mixed output
+ json_start = json_output.find('{"testsuites"')
+ if json_start == -1:
+ json_start = json_output.find('{\n "testsuites"')
+ if json_start == -1:
+ # PRIORITY 3: No JSON found - NOW check if there are error indicators
+ error_indicators = [
+ "error:", # C++ compilation errors
+ "undefined reference to", # Linking errors
+ "fatal error:", # Fatal compilation errors
+ "cannot find -l", # Linking library errors (e.g., "cannot find -lgtest")
+ ": No such file or directory", # File not found errors
+ ]
+ has_errors = any(indicator in json_output for indicator in error_indicators)
+ # Return None ONLY if: no JSON found AND errors present
+ # Return empty dict if: no JSON found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ # Extract JSON object
+ json_portion = json_output[json_start:]
+ decoder = json.JSONDecoder()
+ try:
+ data, _ = decoder.raw_decode(json_portion)
+ except json.JSONDecodeError:
+ return None
+
+ # Validate extracted JSON has 'testsuites'
+ if "testsuites" not in data:
+ return None
+
+ # At this point, we have valid GTest JSON with 'testsuites'
+ # Parse test results even if some tests failed - those are legitimate results
+
+ # Parse test results from testsuites
+ testsuites = data.get("testsuites", [])
+ if not isinstance(testsuites, list):
+ testsuites = [testsuites] if isinstance(testsuites, dict) else []
+
+ for testsuite in testsuites:
+ suite_name = testsuite.get("name", "")
+
+ # Handle both 'testsuite' (array) and direct test cases
+ test_cases = testsuite.get("testsuite", [])
+ if not test_cases:
+ test_cases = testsuite.get("tests", [])
+
+ for test_case in test_cases:
+ test_name = test_case.get("name", "")
+ classname = test_case.get("classname", suite_name)
+
+ # Build test ID in format: SuiteName::TestName
+ test_id = f"{classname}::{test_name}" if classname else test_name
+
+ # Determine test status
+ status = test_case.get("status", "RUN")
+ result = test_case.get("result", "COMPLETED")
+
+ # Check for failures
+ failures = test_case.get("failures", [])
+            if failures:
+ results[test_id] = "FAILED"
+ elif status == "NOTRUN" or result == "SKIPPED":
+ results[test_id] = "SKIPPED"
+ elif result == "COMPLETED" or status == "RUN":
+ results[test_id] = "PASSED"
+ else:
+ results[test_id] = "FAILED"
+
+ return results
+
+
+def parse_maven_text_output(text_output: str) -> Dict[str, str]:
+ """Parse Maven text output for test results."""
+ results = {}
+
+ # Look for test summary lines like:
+ # Tests run: 5, Failures: 1, Errors: 0, Skipped: 0
+ summary_pattern = r"Tests run: (\d+),\s*Failures: (\d+),\s*Errors: (\d+),\s*Skipped: (\d+)"
+
+ # Check for compilation errors - if tests can't compile, mark them as failed
+ compilation_error_pattern = r"\[ERROR\].*?testCompile.*?Compilation failure"
+ if re.search(compilation_error_pattern, text_output, re.DOTALL | re.IGNORECASE):
+ # Find test files mentioned in compilation errors
+ test_file_pattern = r"/workspace/repo/[^/]+/src/test/java/([\w/]+)\.java"
+ for match in re.finditer(test_file_pattern, text_output):
+ test_class = match.group(1).replace("/", ".")
+ # Mark as failed due to compilation
+ results[f"{test_class}::compile"] = "FAILED"
+ # If we found compilation errors, return early
+ if results:
+ return results
+
+ # Check for BUILD FAILURE
+ if "BUILD FAILURE" in text_output:
+ # If build failed and we haven't found specific test failures, mark as generic failure
+ if not results:
+ results["maven::build"] = "FAILED"
+ return results
+
+ # Parse test run summaries per module
+ lines = text_output.split("\n")
+ current_module = None
+
+ for line in lines:
+ # Track which module we're in
+ if "Building" in line and "[" in line and "]" in line:
+ # Extract module name from lines like "[INFO] Building Docs Web 1.12-SNAPSHOT [4/4]"
+ parts = line.split("Building")
+ if len(parts) > 1:
+ module_parts = parts[1].strip().split()
+ if len(module_parts) > 0:
+ current_module = module_parts[0]
+
+ # Look for test summary
+ summary_match = re.search(summary_pattern, line)
+ if summary_match:
+ total = int(summary_match.group(1))
+ failures = int(summary_match.group(2))
+ errors = int(summary_match.group(3))
+ skipped = int(summary_match.group(4))
+
+ if total > 0:
+ # We have test counts but might not have individual test names
+ # Generate generic test IDs based on the current module
+ module_name = current_module or "unknown"
+ passed = total - failures - errors - skipped
+
+ for j in range(passed):
+ results[f"{module_name}::test_{j + 1}"] = "PASSED"
+ for j in range(failures + errors):
+ results[f"{module_name}::test_failed_{j + 1}"] = "FAILED"
+ for j in range(skipped):
+ results[f"{module_name}::test_skipped_{j + 1}"] = "SKIPPED"
+ return results
+
+
+def parse_cargo_nextest(output: str) -> Optional[Dict[str, str]]:
+ """Parse cargo-nextest text output.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (warnings, etc.), if we find valid test results,
+ we parse and return them. We only return None if we're certain the tests didn't
+ run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If cargo-nextest failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Parse individual test result lines
+ # Format: PASS [ 1.588s] rusty::tests integration::linking::test_name
+ # FAIL [ 5.845s] rusty codegen::tests::parameters_tests::test_name
+ test_line_pattern = re.compile(r"^\s*(PASS|FAIL|SIGKILL|SKIP)\s+\[.*?\]\s+(.+)$", re.MULTILINE)
+
+ for match in test_line_pattern.finditer(output):
+ status = match.group(1)
+ test_name = match.group(2).strip()
+
+ if status == "PASS":
+ results[test_name] = "PASSED"
+ elif status in ("FAIL", "SIGKILL"):
+ results[test_name] = "FAILED"
+ elif status == "SKIP":
+ results[test_name] = "SKIPPED"
+
+ # PRIORITY 2: If we found NO test results, check for error indicators
+ # Only return None if we're certain tests didn't run (compilation/linking errors)
+ if not results:
+ error_indicators = [
+ "error[E", # Rust compiler errors (e.g., error[E0425])
+ "error: could not compile", # Cargo compilation errors
+ "error: linking with", # Linking errors
+ "error: aborting due to", # Compilation aborted
+ ]
+ has_errors = any(indicator in output for indicator in error_indicators)
+ # Return None ONLY if: no results found AND errors present
+ # Return empty dict if: no results found AND no errors (rare but valid - no tests in project)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_bun_text(text_output: str) -> Optional[Dict[str, str]]:
+ """
+ Parse Bun test framework output.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (TypeScript, compilation, etc.), if we find valid
+ test results, we parse and return them. We only return None if we're
+ certain Bun didn't run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If Bun itself failed to run (not the same as tests failing)
+ """
+ results = {}
+ current_file = None
+ current_describe = None
+
+ # PRIORITY 1: Try to parse test results (✓ and ✗ symbols)
+ for line in text_output.split("\n"):
+ # Track current file (lines ending with .ts: or .js:)
+ if re.match(r"^[^\s].*\.(ts|js|tsx|jsx):?\s*$", line.strip()):
+ current_file = line.strip().rstrip(":")
+ current_describe = None
+ continue
+
+ # Track describe blocks (indented text followed by colon, but not test results)
+ describe_match = re.match(r"^\s+([^✓✗\n]+):\s*$", line)
+ if describe_match:
+ current_describe = describe_match.group(1).strip()
+ continue
+
+ # Remove ANSI color codes
+ clean_line = re.sub(r"\x1b\[[0-9;]*m", "", line)
+
+ # Match passed tests: ✓ test_name [time]
+ pass_match = re.match(r"^\s*✓\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line)
+ if pass_match:
+ test_name = pass_match.group(1).strip()
+ # Build test ID with file, describe block, and test name
+ test_id = test_name
+ if current_file:
+ test_id = f"{current_file}::{test_name}"
+ if current_describe:
+ test_id = f"{current_file}::{current_describe} > {test_name}"
+ results[test_id] = "PASSED"
+ continue
+
+ # Match failed tests: ✗ test_name [time]
+ fail_match = re.match(r"^\s*✗\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line)
+ if fail_match:
+ test_name = fail_match.group(1).strip()
+ # Build test ID with file, describe block, and test name
+ test_id = test_name
+ if current_file:
+ test_id = f"{current_file}::{test_name}"
+ if current_describe:
+ test_id = f"{current_file}::{current_describe} > {test_name}"
+ results[test_id] = "FAILED"
+ continue
+
+ # Alternative format: FAIL filepath > describe > test_name
+ alt_fail_match = re.match(r"^\s*FAIL\s+(.+?)\s+>\s+(.+?)\s*$", clean_line)
+ if alt_fail_match:
+ file_path = alt_fail_match.group(1).strip()
+ test_path = alt_fail_match.group(2).strip()
+ test_id = f"{file_path}::{test_path}"
+ results[test_id] = "FAILED"
+ continue
+
+ # PRIORITY 2: If no individual test results found, try parsing summary
+ if not results:
+ # Look for summary like "5 pass, 2 fail" or "X passing (Yms)"
+ summary_match = re.search(r"(\d+)\s+pass(?:ing|ed)?.*?(\d+)\s+fail(?:ing|ed)?", text_output.lower())
+ if summary_match:
+ passed = int(summary_match.group(1))
+ failed = int(summary_match.group(2))
+
+ # Generate generic test IDs
+ for i in range(passed):
+ results[f"test_{i + 1}"] = "PASSED"
+ for i in range(failed):
+ results[f"test_failed_{i + 1}"] = "FAILED"
+
+ # PRIORITY 3: No test results found - NOW check if there are error indicators
+ # Only return None if we're sure tests didn't run (no results + errors present)
+ # NOTE: error_indicators are a LAST RESORT - we prefer finding test results
+ if not results:
+ error_indicators = [
+ "error TS", # TypeScript compilation errors (e.g., error TS2307:)
+ "Error: Cannot find module", # Module loading errors
+ "SyntaxError:", # JavaScript/TypeScript syntax errors
+ "error: ", # Generic Bun errors (lowercase 'error:')
+ "Error:", # Generic errors
+ "ModuleNotFoundError", # Module not found
+ "bun: command not found", # Bun not installed
+ "panicked at", # Bun runtime panics
+ "Segmentation fault", # Critical runtime errors
+ ]
+ has_errors = any(indicator in text_output for indicator in error_indicators)
+ # Return None ONLY if: no test results found AND errors present
+ # Return empty dict if: no test results found AND no errors (rare but valid)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_cppunit_text(text_output: str) -> Optional[Dict[str, str]]:
+ """Parse CppUnit text output for test results.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ Even if output contains errors (warnings, etc.), if we find valid test results,
+ we parse and return them. We only return None if we're certain the tests didn't
+ run (no test results + error indicators).
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If CppUnit failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Parse individual test result lines
+ # Format: TestClassName::testMethodName : OK
+ # TestClassName::testMethodName : FAIL
+ test_line_pattern = re.compile(
+ r"^([A-Za-z_][A-Za-z0-9_]*::[A-Za-z_][A-Za-z0-9_]*)\s*:\s*(OK|FAIL|ERROR)$", re.MULTILINE
+ )
+
+ for match in test_line_pattern.finditer(text_output):
+ test_name = match.group(1).strip()
+ status = match.group(2).strip()
+
+ if status == "OK":
+ results[test_name] = "PASSED"
+ elif status in ["FAIL", "ERROR"]:
+ results[test_name] = "FAILED"
+
+ # PRIORITY 2: If we found NO test results, check for error indicators
+ # Only return None if we're certain tests didn't run (compilation/linking errors)
+ if not results:
+ error_indicators = [
+ "error:", # C++ compilation errors
+ "undefined reference to", # Linking errors
+ "fatal error:", # Fatal compilation errors
+ "ld returned", # Linker errors
+ "cannot find -l", # Library linking errors
+ ]
+ has_errors = any(indicator in text_output for indicator in error_indicators)
+ # Return None ONLY if: no results found AND errors present
+ # Return empty dict if: no results found AND no errors (rare but valid - no tests)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_minitest_text(text_output: str, test_metadata_path: Optional[str] = None) -> Dict[str, str]:
+ """
+ Parse mini.nvim (MiniTest) test framework output.
+
+ MiniTest is used by Neovim plugins for testing.
+ Example output:
+ Total number of cases: 5
+ tests/test_treesitter.lua: ooooo
+
+ Fails (0) and Notes (0)
+
+ Or with failures:
+ FAIL in tests/test_treesitter.lua | wrap_cursor | normal: error message
+ FAIL in tests/test_treesitter.lua | enumerate: error message
+
+ Fails (2) and Notes (0)
+
+    IMPORTANT: MiniTest only outputs individual test names when they FAIL.
+    When all tests pass, only a summary is shown, with no individual test names.
+
+    Intended solution: when all tests pass, read test_metadata.json to get the expected
+    test names and report them as PASSED, so real test names are used consistently.
+    Note that this function does not do that itself: it returns only FAILED entries
+    and never reads test_metadata_path.
+ """
+ results = {}
+
+    # Parse individual test results from FAIL lines (NOTE lines are not parsed)
+ # Format: FAIL in file.lua | group | test_name: error message
+ # Use [^|:]+ to stop at pipe OR colon (prevents capturing error message)
+ fail_pattern = re.compile(
+ r"^(?:\x1b\[\d+(?:;\d+)?m)?FAIL(?:\x1b\[0m)?\s+in\s+([^|]+)\s*\|\s*([^|:]+)(?:\s*\|\s*([^:]+))?:", re.MULTILINE
+ )
+
+ for match in fail_pattern.finditer(text_output):
+ file_path = match.group(1).strip()
+ group = match.group(2).strip()
+ test_name = match.group(3).strip() if match.group(3) else ""
+
+ # Create test ID: file | group | test_name or file | group
+ if test_name:
+ test_id = f"{file_path} | {group} | {test_name}"
+ else:
+ test_id = f"{file_path} | {group}"
+
+ results[test_id] = "FAILED"
+
+ return results
+
+
+def parse_telescope_text(text_output: str) -> Optional[Dict[str, str]]:
+ """
+ Parse telescope test framework output.
+
+ Telescope outputs lines like:
+ ✓ test_name
+ ✗ test_name
+ - test_name (skipped)
+
+ Also handles PlenaryBusted output for Neovim plugins.
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ We only return None if we're certain the framework didn't run.
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If telescope failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ for line in text_output.split("\n"):
+ line = line.strip()
+
+ # Match passed tests: ✓ test_name or "Success: test_name"
+ if "✓" in line:
+ test_name = line.split("✓", 1)[1].strip()
+ if test_name: # Avoid empty test names
+ results[test_name] = "PASSED"
+ elif line.lower().startswith("success:"):
+ test_name = line.split(":", 1)[1].strip()
+ if test_name:
+ results[test_name] = "PASSED"
+
+ # Match failed tests: ✗ test_name or "Failed: test_name"
+ elif "✗" in line:
+ test_name = line.split("✗", 1)[1].strip()
+ if test_name:
+ results[test_name] = "FAILED"
+ elif line.lower().startswith("failed:"):
+ test_name = line.split(":", 1)[1].strip()
+ if test_name:
+ results[test_name] = "FAILED"
+
+ # Match skipped tests: - test_name or "Skipped: test_name"
+ elif line.startswith("- ") and "skip" in line.lower():
+ test_name = line[2:].strip()
+ # Remove "(skipped)" suffix if present
+ test_name = re.sub(r"\s*\(skipped\)\s*$", "", test_name, flags=re.IGNORECASE)
+ if test_name:
+ results[test_name] = "SKIPPED"
+ elif line.lower().startswith("skipped:"):
+ test_name = line.split(":", 1)[1].strip()
+ if test_name:
+ results[test_name] = "SKIPPED"
+
+ # If no results found, try parsing summary line
+ if not results:
+ # Look for summary like "5 passed, 2 failed, 1 skipped"
+ summary_pattern = r"(\d+)\s+passed.*?(\d+)\s+failed"
+ match = re.search(summary_pattern, text_output.lower())
+ if match:
+ passed = int(match.group(1))
+ failed = int(match.group(2))
+
+ # Generate generic test IDs
+ for i in range(passed):
+ results[f"test_{i + 1}"] = "PASSED"
+ for i in range(failed):
+ results[f"test_failed_{i + 1}"] = "FAILED"
+
+ # PRIORITY 2: If we found NO test results, check for error indicators
+ # Only return None if we're certain tests didn't run (Lua/Neovim errors)
+ if not results:
+ error_indicators = [
+ "Error:", # Generic Lua errors
+ "error loading module", # Lua module loading errors
+ "attempt to call", # Lua runtime errors
+ "bad argument", # Lua runtime errors
+ "stack traceback:", # Lua errors with traceback
+ ]
+ has_errors = any(indicator in text_output for indicator in error_indicators)
+ # Return None ONLY if: no results found AND errors present
+ # Return empty dict if: no results found AND no errors (rare but valid - no tests)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_lust_text(text_output: str) -> Dict[str, str]:
+ """
+ Parse lust test framework output.
+
+ Lust outputs test results with dots (.) for pass, F for fail.
+ Example output:
+ ..F.
+ 4 tests, 1 failure
+ test/my_test.lua:15: Expected true but got false
+
+ We parse individual test results when available, or fall back to summary.
+ """
+ results = {}
+
+ # Try to parse individual test results from verbose output
+ # Pattern: " test_name ... ok" or " test_name ... FAILED"
+ test_pattern = re.compile(r"^\s*(.+?)\s+\.\.\.\s+(ok|FAILED|ERROR)", re.MULTILINE)
+ matches = test_pattern.findall(text_output)
+
+ if matches:
+ # Found individual test results
+ for test_name, status in matches:
+ test_name = test_name.strip()
+ if status == "ok":
+ results[test_name] = "PASSED"
+ else:
+ results[test_name] = "FAILED"
+ return results
+
+ # Try to extract test descriptions from failure messages
+ # Pattern: "test_file.lua:line_number: test description"
+ failure_pattern = re.compile(r"^([^\s:]+\.lua):(\d+):\s*(.+)$", re.MULTILINE)
+ failures = failure_pattern.findall(text_output)
+
+ if failures:
+ for filepath, _, description in failures:
+ test_id = f"{filepath}::{description.strip()}"
+ results[test_id] = "FAILED"
+
+ # Parse summary line to get total count: "X tests, Y failures"
+ summary_match = re.search(r"(\d+)\s+tests?,\s+(\d+)\s+failures?", text_output.lower())
+ if summary_match:
+ total_tests = int(summary_match.group(1))
+ failures = int(summary_match.group(2))
+
+ # If we haven't parsed individual tests yet, generate generic ones
+ if not results:
+ passed = total_tests - failures
+ for i in range(passed):
+ results[f"test_{i + 1}"] = "PASSED"
+ for i in range(failures):
+ results[f"test_failed_{i + 1}"] = "FAILED"
+ return results
+
+    # Fallback: check overall success/failure (\b keeps "10 failures" from matching "0 failures")
+    if not results:
+        if re.search(r"\b0\s+(?:failures|errors)\b", text_output.lower()):
+            results["test_suite"] = "PASSED"
+        else:
+            results["test_suite"] = "FAILED"
+
+ return results
+
+
+def parse_bespoke_libgeos(text_output: str) -> Dict[str, str]:
+ """Parse libgeos/GEOS test output format.
+
+ Format:
+ capi::GEOSBoundary: .
+ capi::GEOSBuffer: .....................
+ geos::operation::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F][5=F][6=F]
+ geos::operation::buffer::BufferOp: ..........................[27=X]
+
+ Where:
+ - dots (.) = passing tests
+ - [N=F] = explicit failure markers
+ - [N=X] = exception markers (also failures)
+ - standalone F or X = failure/exception
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ We only return None if we're certain the framework didn't run.
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If libgeos tests failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Parse individual test result lines
+ # Pattern: TestSuite::TestName: followed by dots, Fs, Xs, or [N=F]/[N=X] markers
+ # Example: capi::GEOSBoundary: .
+ # Example: geos::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F]
+ # Example: geos::operation::buffer::BufferOp: ..........................[27=X]
+ test_line_pattern = re.compile(
+ r"^([a-zA-Z_][a-zA-Z0-9_:]*::[a-zA-Z_][a-zA-Z0-9_]*)\s*:\s*(.+?)(?:\n|$)", re.MULTILINE
+ )
+
+ for match in test_line_pattern.finditer(text_output):
+ test_id = match.group(1) # Full name like "capi::GEOSBoundary"
+ test_output_line = match.group(2) # Everything after the colon
+
+ # Check for failure markers:
+ # 1. [N=F] pattern (explicit failure notation)
+ # 2. [N=X] pattern (exception notation)
+ # 3. Standalone F or X characters
+        has_failure = bool(re.search(r"\[.*=[FX]\]|(? …
+
+
+def _normalize_swift_test_name(raw_name: str) -> str:
+ """Normalize XCTest case identifiers from swift test console output."""
+ name = raw_name.strip()
+
+ # Typical format: -[Module.Class testMethod]
+ if name.startswith("-[") and name.endswith("]"):
+ inner = name[2:-1].strip()
+ if " " in inner:
+ class_name, method = inner.split(" ", 1)
+ return f"{class_name}::{method}"
+ return inner
+
+ # Alternate format: Module.Class.testMethod
+ if "." in name:
+ parts = name.split(".", 1)
+ return f"{parts[0]}::{parts[1]}"
+
+ return name
+
+
+def parse_swift_test_text(text_output: str) -> Dict[str, str]:
+ """Parse vanilla `swift test` console output (without --xunit-output)."""
+ results = {}
+ test_case_pattern = re.compile(r"Test Case '([^']+)' (passed|failed|skipped)", re.IGNORECASE)
+
+ for match in test_case_pattern.finditer(text_output):
+ raw_name, status = match.groups()
+ test_id = _normalize_swift_test_name(raw_name)
+ results[test_id] = status.upper()
+
+ if results:
+ return results
+
+ # Fallback: parse summary line to infer aggregate results if per-test lines missing
+ summary_match = re.search(
+ r"Executed\s+(\d+)\s+tests?,\s+with\s+(\d+)\s+failures?",
+ text_output,
+ re.IGNORECASE,
+ )
+ if summary_match:
+ total_tests = int(summary_match.group(1))
+ failures = int(summary_match.group(2))
+ passes = max(total_tests - failures, 0)
+
+ for i in range(passes):
+ results[f"swift_test_pass_{i + 1}"] = "PASSED"
+ for i in range(failures):
+ results[f"swift_test_fail_{i + 1}"] = "FAILED"
+
+ return results
+
+
+def parse_xctest_output(output: str) -> Dict[str, str]:
+ """Parse XCTest results, preferring XML when available."""
+ xml_results = parse_junit_xml(output)
+ if xml_results:
+ return xml_results
+ return parse_swift_test_text(output)
+
+
+def normalize_test_id(test_id: str, framework: str = "") -> str:
+ """Normalize test IDs for stable matching across different formats.
+
+ This function performs several normalizations:
+
+ 1. Removes unstable runtime prefixes that change between runs:
+ - (N/M) - Test execution order (e.g., "(2/5) test_name")
+ - [N/M] - Alternative bracket format
+ - #N - Test number prefix (e.g., "#42 test_name")
+ - N. - Numbered list format (e.g., "1. test_name")
+
+ 2. Removes common file extensions (.py, .js, .ts, .go, etc.) from test paths
+ to allow matching between "test_file.py::test" and "test_file::test"
+
+ 3. Normalizes delimiters (`.`, `::`, `/`) to a canonical form (`::`)
+ when they appear between alphanumeric characters, allowing matching
+ between "testa.testb::testc" and "testa/testb.testc"
+
+ Examples:
+ "(2/5) test_name" -> "test_name"
+ "test_file.py::test_name" -> "test_file::test_name"
+ "testa.testb::testc" -> "testa::testb::testc"
+ "testa/testb.testc" -> "testa::testb::testc"
+ "tests/module.js::describe::it" -> "tests::module::describe::it"
+
+ Args:
+ test_id: Original test ID from parser
+ framework: Test framework name (for future framework-specific rules if needed)
+
+ Returns:
+ Normalized test ID
+ """
+ # Step 1: Remove unstable runtime prefixes
+
+ # Universal pattern: Remove (N/M) or [N/M] prefixes (test execution order)
+ # Matches: "(2/5) test", "[2/5] test", "(123/456) test", "( 1/75) test" (with internal space)
+ normalized = re.sub(r"^[\(\[]?\s*\d+/\d+[\)\]]?\s+", "", test_id)
+
+ # Universal pattern: Remove #N prefix (test numbering)
+ # Matches: "#42 test", "# 42 test"
+ normalized = re.sub(r"^#\s*\d+\s+", "", normalized)
+
+ # Universal pattern: Remove "N. " prefix (numbered list)
+ # Matches: "1. test", "42. test"
+ normalized = re.sub(r"^\d+\.\s+", "", normalized)
+
+ # Step 2: Remove common file extensions before delimiters
+ # This prevents .py from becoming ::py after delimiter normalization
+ # Match extensions like .py, .js, .ts, etc. that appear before :: / . or end of string
+ extensions_pattern = (
+ r"\.(py|pyw|js|mjs|cjs|ts|mts|cts|jsx|tsx|"
+ r"go|java|rb|rs|c|cpp|cc|cxx|h|hpp|hxx|"
+ r"swift|kt|kts|scala|php|cs|fs|"
+ r"ex|exs|erl|hrl|clj|cljs|cljc|"
+ r"lua|pl|pm|t|r|R|m|mm|"
+ r"f|f90|f95|for|vb|pas|pp|"
+ r"d|nim|zig|v|sv|vhd|vhdl|"
+ r"tcl|sh|bash|zsh|fish|ps1|psm1|psd1)"
+ r"(?=::|/|\.|$)"
+ )
+ normalized = re.sub(extensions_pattern, "", normalized, flags=re.IGNORECASE)
+
+ # Step 3: Normalize delimiters (., ::, /) to :: when between word characters
+ # This allows matching "testa.testb::testc" with "testa/testb.testc"
+ delimiter_pattern = r"(?<=\w)(::|\.|/)(?=\w)"
+ normalized = re.sub(delimiter_pattern, "::", normalized)
+
+ return normalized
+
+
+def parse_tap_text(text_output: str) -> Dict[str, str]:
+ """
+ Parse TAP (Test Anything Protocol) output.
+
+ TAP is used by tape, node-tap, and other JavaScript test frameworks.
+
+ Format:
+ TAP version 13
+ # Subtest: Test name
+ 1..N
+ ok 1 - assertion name
+ not ok 2 - assertion name
+ ok 1 - Test name # time=123ms
+ not ok 2 - Test name
+ 1..N
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+ We only return None if we're certain the tests didn't run.
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If TAP tests failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # PRIORITY 1: Parse top-level test results (not indented subtests)
+    # Format: "ok N - Test name" or "not ok N - Test name", optionally "# SKIP reason"
+    # Indented subtest lines are skipped; [ \t] (not \s) keeps each match on a single line
+    tap_test_pattern = re.compile(
+        r"^(not )?ok[ \t]+(\d+)[ \t]*(?:-[ \t]*)?(.*?)(?:[ \t]*#[ \t]*(skip\b.*|todo\b.*|time=.*))?[ \t]*$", re.MULTILINE | re.IGNORECASE
+    )
+
+ for match in tap_test_pattern.finditer(text_output):
+ is_failure = match.group(1) is not None # "not ok" prefix
+ test_num = match.group(2)
+ test_name = match.group(3).strip() if match.group(3) else f"test_{test_num}"
+ directive = match.group(4)
+
+ # Clean up test name (remove timing info like "# time=123ms")
+ test_name = re.sub(r"\s*#\s*time=[\d.]+m?s\s*$", "", test_name, flags=re.IGNORECASE)
+
+ test_id = test_name if test_name else f"test_{test_num}"
+
+ # Check for skip directive
+ if directive and directive.lower().startswith("skip"):
+ results[test_id] = "SKIPPED"
+ elif is_failure:
+ results[test_id] = "FAILED"
+ else:
+ results[test_id] = "PASSED"
+
+ # PRIORITY 2: If no results found, try parsing summary line
+ # Format: "# tests N", "# pass N", "# fail N"
+ if not results:
+ pass_match = re.search(r"#\s*pass\s+(\d+)", text_output, re.IGNORECASE)
+ fail_match = re.search(r"#\s*fail\s+(\d+)", text_output, re.IGNORECASE)
+
+ if pass_match or fail_match:
+ passed = int(pass_match.group(1)) if pass_match else 0
+ failed = int(fail_match.group(1)) if fail_match else 0
+
+ for i in range(passed):
+ results[f"tap_test_passed_{i + 1}"] = "PASSED"
+ for i in range(failed):
+ results[f"tap_test_failed_{i + 1}"] = "FAILED"
+
+ # PRIORITY 3: If no results, check for error indicators
+ if not results:
+ error_indicators = [
+ "npm ERR!", # npm errors
+ "Error: Cannot find module", # Module loading errors
+ "SyntaxError:", # JavaScript syntax errors
+ "TypeError:", # Type errors
+ ]
+ has_errors = any(indicator in text_output for indicator in error_indicators)
+ # Return None ONLY if: no results found AND errors present
+ return None if has_errors else results
+
+ return results
+
+
+def parse_hardhat_mocha_text(text_output: str) -> Dict[str, str]:
+ """
+ Parse Hardhat/Mocha console text output (non-JSON reporter).
+
+ Hardhat uses Mocha under the hood and outputs text like:
+ Contract: FeeSharingProxy:
+ withdrawFees
+ ✓ Shouldn't be able to use zero token address
+ ✓ Shouldn't be able to withdraw second time in period
+ 1) Should fail with specific error
+
+ 5 passing (1s)
+ 1 failing
+
+ IMPORTANT: We prioritize finding valid test results over detecting errors.
+
+ Returns:
+ Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+ None: If tests failed to run (not the same as tests failing)
+ """
+ results = {}
+
+ # Track current context (Contract/describe blocks)
+ current_context = []
+
+ # PRIORITY 1: Parse individual test results
+ for line in text_output.split("\n"):
+ stripped = line.strip()
+
+ # Track Contract: or describe blocks
+ contract_match = re.match(r"^Contract:\s*(.+?):\s*$", stripped)
+ if contract_match:
+ current_context = [contract_match.group(1)]
+ continue
+
+ # Track describe blocks (indented without checkmark/number)
+ if stripped and not stripped.startswith(("✓", "✗", "-")) and not re.match(r"^\d+\)", stripped):
+ # Check if this looks like a describe block (usually followed by test cases)
+ if ":" not in stripped and len(stripped) < 100:
+ # This might be a describe block, but we'll handle it dynamically
+ pass
+
+ # Match passed tests: ✓ test_name or ✔ test_name
+ pass_match = re.match(r"^[✓✔]\s+(.+?)(?:\s+\(\d+m?s\))?$", stripped)
+ if pass_match:
+ test_name = pass_match.group(1).strip()
+ test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+ results[test_id] = "PASSED"
+ continue
+
+ # Match failed tests: N) test_name or ✗ test_name
+ fail_match = re.match(r"^(?:\d+\)|[✗✘])\s*(.+?)$", stripped)
+ if fail_match:
+ test_name = fail_match.group(1).strip()
+ test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+ results[test_id] = "FAILED"
+ continue
+
+ # Match skipped tests: - test_name
+ skip_match = re.match(r"^-\s+(.+?)$", stripped)
+ if skip_match:
+ test_name = skip_match.group(1).strip()
+ test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+ results[test_id] = "SKIPPED"
+ continue
+
+ # PRIORITY 2: Parse summary if no individual results found
+ if not results:
+ # Look for "N passing" and "N failing"
+ pass_match = re.search(r"(\d+)\s+passing", text_output, re.IGNORECASE)
+ fail_match = re.search(r"(\d+)\s+failing", text_output, re.IGNORECASE)
+
+ if pass_match or fail_match:
+ passed = int(pass_match.group(1)) if pass_match else 0
+ failed = int(fail_match.group(1)) if fail_match else 0
+
+ for i in range(passed):
+ results[f"mocha_test_passed_{i + 1}"] = "PASSED"
+ for i in range(failed):
+ results[f"mocha_test_failed_{i + 1}"] = "FAILED"
+
+ # PRIORITY 3: If no results, check for error indicators
+ if not results:
+ error_indicators = [
+ "Error: Cannot find module",
+ "SyntaxError:",
+ "CompilerError:", # Solidity compilation errors
+ "Error: HH", # Hardhat errors
+ ]
+ has_errors = any(indicator in text_output for indicator in error_indicators)
+ return None if has_errors else results
+
+ return results
+
+
+def parse_pytest_text(text_output: str) -> Dict[str, str]:
+ """
+ Parse pytest plain text output (-v flag).
+
+ Pytest outputs lines like:
+ tests/test_foo.py::test_one PASSED
+ tests/test_foo.py::test_two FAILED
+ tests/test_foo.py::test_three SKIPPED
+
+ Or in short form:
+ tests/test_foo.py .F.s
+
+ Also handles summary lines like:
+ ===== 3 passed, 1 failed, 1 skipped in 0.5s =====
+ """
+ results = {}
+
+ # Pattern 1: Verbose output with test names
+ # e.g., "tests/test_foo.py::test_one PASSED"
+ verbose_pattern = re.compile(
+ r"^([\w./]+::\w+(?:::\w+)*)\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)", re.MULTILINE
+ )
+
+ for match in verbose_pattern.finditer(text_output):
+ test_id = match.group(1).strip()
+ status = match.group(2).upper()
+
+ if status in ("PASSED", "XPASS"):
+ results[test_id] = "PASSED"
+ elif status in ("FAILED", "ERROR", "XFAIL"):
+ results[test_id] = "FAILED"
+ elif status == "SKIPPED":
+ results[test_id] = "SKIPPED"
+
+ if results:
+ return results
+
+    # Pattern 2: Short form (. = pass, F/E = fail/error, s = skip, x = xfail, X = xpass)
+    # e.g., "tests/test_foo.py .F.s"; statuses map the same way as in the verbose branch
+    short_pattern = re.compile(r"^([\w./]+\.py)\s+([.FsExX]+)", re.MULTILINE)
+
+    for match in short_pattern.finditer(text_output):
+        file_path = match.group(1)
+        outcomes = match.group(2)
+
+        for i, char in enumerate(outcomes):
+            test_id = f"{file_path}::test_{i + 1}"
+            if char in (".", "X"):
+                results[test_id] = "PASSED"
+            elif char in ("F", "E", "x"):
+                results[test_id] = "FAILED"
+            elif char == "s":
+                results[test_id] = "SKIPPED"
+
+ if results:
+ return results
+
+    # Pattern 3: Summary line fallback (pytest lists failures before passes, so search each count)
+    # e.g., "===== 1 failed, 3 passed, 1 skipped in 0.5s ====="
+    passed_match = re.search(r"(\d+)\s+passed", text_output)
+    failed_match = re.search(r"(\d+)\s+failed", text_output)
+    skipped_match = re.search(r"(\d+)\s+(?:skipped|deselected)", text_output)
+    if passed_match or failed_match or skipped_match:
+        passed = int(passed_match.group(1)) if passed_match else 0
+        failed = int(failed_match.group(1)) if failed_match else 0
+        skipped = int(skipped_match.group(1)) if skipped_match else 0
+        for i in range(passed):
+            results[f"pytest_test_passed_{i + 1}"] = "PASSED"
+        for i in range(failed):
+            results[f"pytest_test_failed_{i + 1}"] = "FAILED"
+        for i in range(skipped):
+            results[f"pytest_test_skipped_{i + 1}"] = "SKIPPED"
+
+ return results
+
+
+def parse_test_output(output: str, framework: str) -> Dict[str, str]:
+ """
+ Parse test output to extract individual test results.
+
+ Returns: {'test_id': 'PASSED'|'FAILED'|'SKIPPED'}
+ """
+ # Direct framework → parser mapping
+ parsers = {
+ "pytest": parse_junit_xml,
+ "unittest": parse_junit_xml,
+ "junit": parse_junit_xml,
+ "maven": parse_maven_text_output,
+ "gtest": parse_gtest_json,
+ "cargo-nextest": parse_cargo_nextest,
+ "go": parse_go_json,
+ "jest": parse_jest_vitest_json,
+ "vitest": parse_jest_vitest_json,
+ "mocha": parse_mocha_json,
+ "bun": parse_bun_text,
+ "ctest": parse_junit_xml,
+ "cppunit": parse_cppunit_text,
+ "bespoke_libgeos": parse_bespoke_libgeos,
+ # XCTest using hybrid approach
+ "xctest": parse_xctest_output,
+ "testing": parse_xctest_output, # New Swift Testing framework (Swift 6+)
+ # Lua frameworks
+ "busted": parse_junit_xml, # Uses JUnit XML output
+ "luaunit": parse_junit_xml, # Uses JUnit XML output
+ "telescope": parse_telescope_text,
+ "lust": parse_lust_text,
+ "minitest": parse_minitest_text, # Neovim mini.nvim test framework
+ # TAP (Test Anything Protocol) - used by tape, node-tap
+ "tap": parse_tap_text,
+ "tape": parse_tap_text,
+ # Hardhat (Solidity) - uses Mocha console output
+ "hardhat": parse_hardhat_mocha_text,
+ }
+
+ parser = parsers.get(framework)
+ if parser:
+ result = parser(output)
+ # Fallback for common frameworks if their primary parser returns None/empty
+ if not result:
+ if framework in ["junit", "maven"]:
+ result = parse_maven_text_output(output)
+ elif framework == "pytest":
+ # Pytest often outputs plain text, not JUnit XML
+ result = parse_pytest_text(output)
+ elif framework == "mocha":
+ # Mocha might output text instead of JSON (console reporter)
+ result = parse_hardhat_mocha_text(output)
+ return result or {}
+
+ # Try auto-detection for unknown frameworks
+ # Check for TAP output
+ if "TAP version" in output or re.search(r"^(?:not )?ok\s+\d+", output, re.MULTILINE):
+ return parse_tap_text(output) or {}
+
+ # Check for Mocha/Hardhat console output
+ if "Contract:" in output or re.search(r"^\s*[✓✔]\s+", output, re.MULTILINE):
+ return parse_hardhat_mocha_text(output) or {}
+
+ return {}
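The three normalization steps in `normalize_test_id` (strip run-order prefixes, drop file extensions, unify delimiters) can be sketched in isolation as below. This is a minimal sketch, not the module's code: `normalize_id_sketch` and `_EXT_SKETCH` are hypothetical names, and the extension list is deliberately shortened for illustration.

```python
import re

# Hypothetical, shortened extension list; the real pattern covers many more.
_EXT_SKETCH = r"\.(py|js|ts|go|lua)(?=::|/|\.|$)"


def normalize_id_sketch(test_id: str) -> str:
    """Illustrative three-step normalization of a test ID."""
    # Step 1: drop prefixes that change between runs: "(2/5) ", "#42 ", "1. "
    out = re.sub(r"^[\(\[]?\s*\d+/\d+[\)\]]?\s+", "", test_id)
    out = re.sub(r"^#\s*\d+\s+", "", out)
    out = re.sub(r"^\d+\.\s+", "", out)
    # Step 2: remove file extensions first, so ".py" does not become "::py"
    out = re.sub(_EXT_SKETCH, "", out, flags=re.IGNORECASE)
    # Step 3: rewrite ".", "/" and "::" between word characters as "::"
    return re.sub(r"(?<=\w)(::|\.|/)(?=\w)", "::", out)


print(normalize_id_sketch("tests/module.js::describe::it"))
```

Both `"testa.testb::testc"` and `"testa/testb.testc"` collapse to the same key under this scheme, which is what lets expected IDs match parsed IDs written in a different delimiter style.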
diff --git a/resources_servers/swe_bench/parsing/utils.py b/resources_servers/swe_bench/parsing/utils.py
new file mode 100644
index 0000000000..4ef21aad79
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/utils.py
@@ -0,0 +1,194 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""SWE-Bench-Ext test output parsing utilities.
+
+Provides the high-level grading entry point that parses raw test output,
+normalizes test IDs, fuzzy-matches the expected FAIL_TO_PASS / PASS_TO_PASS
+tests, and reports whether the task was resolved. Example usage::
+
+ from resources_servers.swe_bench.parsing import parse_and_check_tests
+
+ result = parse_and_check_tests(
+ test_output=log_text,
+ test_framework="pytest",
+ fail_to_pass=["test_a", "test_b"],
+ pass_to_pass=["test_c"],
+ instance_id="my-task-123",
+ )
+ # result["resolved"] -> bool
+"""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+from resources_servers.swe_bench.parsing.parsing import (
+ normalize_test_id,
+ parse_test_output,
+)
+
+
+# Marker strings used to delimit structured output in the raw test log.
+_TEST_OUTPUT_START = "<<>>"
+_TEST_OUTPUT_END = "<<>>"
+_RESULT_FILE_START = "<<>>"
+_RESULT_FILE_END = "<<>>"
+
+
+def _extract_between_markers(text: str, start: str, end: str) -> Optional[str]:
+ """Extract the substring between two marker strings.
+
+ Args:
+ text: Text to search within.
+ start: Opening marker; the result begins after it.
+ end: Closing marker; the result ends before it.
+
+ Returns:
+ The stripped text between the markers, or None if either marker is
+ missing or they appear out of order.
+ """
+    s = text.find(start)
+    e = text.find(end, s + len(start)) if s != -1 else -1  # end marker must follow the start marker
+    if e != -1:
+        return text[s + len(start) : e].strip()
+ return None
+
+
+def _match_test_with_fuzzy(
+ test_id: str,
+ parsed_results: Dict[str, str],
+ build_failed_packages: set,
+) -> str:
+ """Resolve the status of a single test ID against parsed results.
+
+ Tries, in order: a direct lookup, a check for membership in a package that
+ failed to build, a substring match, and a match on the final ``::``
+ component.
+
+ Args:
+ test_id: Normalized test identifier to look up.
+ parsed_results: Mapping of parsed test ID to status string.
+ build_failed_packages: Set of package names whose build failed; any
+ test ID prefixed by one of these is treated as failed.
+
+ Returns:
+ The matched status string, ``"FAILED"`` if the test's package failed to
+ build, or ``"NOT_FOUND"`` if no match is found.
+ """
+ # Direct match
+ if test_id in parsed_results:
+ return parsed_results[test_id]
+
+ # Check if this test belongs to a package that failed to build
+ for pkg in build_failed_packages:
+ if test_id.startswith(pkg):
+ return "FAILED"
+
+ # Substring match (normalized IDs may differ in prefix)
+ for parsed_id, status in parsed_results.items():
+ if test_id in parsed_id or parsed_id in test_id:
+ return status
+
+ # Try matching by last component (after last ::)
+ if "::" in test_id:
+ suffix = test_id.rsplit("::", 1)[-1]
+ for parsed_id, status in parsed_results.items():
+ if "::" in parsed_id and parsed_id.rsplit("::", 1)[-1] == suffix:
+ return status
+
+ return "NOT_FOUND"
+
+
+def parse_and_check_tests(
+ test_output: str,
+ test_framework: str,
+ fail_to_pass: List[str],
+ pass_to_pass: List[str],
+ instance_id: str = "",
+) -> Dict[str, Any]:
+ """Parse test output and check FAIL_TO_PASS / PASS_TO_PASS resolution.
+
+ The pipeline extracts structured output from the result-file markers (if
+ present), parses it with the framework dispatcher, normalizes both parsed
+ and expected test IDs, fuzzy-matches each expected test, and computes
+ ``resolved`` as all FAIL_TO_PASS passing and all PASS_TO_PASS passing.
+
+ Args:
+ test_output: Raw test log to parse.
+ test_framework: Name of the test framework (e.g. ``"pytest"``) used to
+ select the parser and normalize IDs.
+ fail_to_pass: Test IDs expected to transition from failing to passing.
+ pass_to_pass: Test IDs expected to remain passing.
+ instance_id: Optional task identifier, accepted for caller convenience.
+
+ Returns:
+ A report dict containing the overall ``resolved`` flag, per-test
+ FAIL_TO_PASS and PASS_TO_PASS results, pass/total counts for each
+ group, the number of parsed tests, and the framework name.
+ """
+ # Try to extract result file content from the markers.
+ result_file_content = _extract_between_markers(test_output, _RESULT_FILE_START, _RESULT_FILE_END)
+
+ if result_file_content:
+ parsed = parse_test_output(result_file_content, test_framework)
+ if not parsed:
+ parsed = parse_test_output(test_output, test_framework)
+ else:
+ parsed = parse_test_output(test_output, test_framework)
+
+ if parsed is None:
+ parsed = {}
+
+ # Normalize parsed test IDs
+ parsed = {normalize_test_id(tid, test_framework): status for tid, status in parsed.items()}
+
+ # Normalize expected test IDs
+ norm_f2p = [normalize_test_id(tid, test_framework) for tid in fail_to_pass]
+ norm_p2p = [normalize_test_id(tid, test_framework) for tid in pass_to_pass]
+
+ # Handle synthetic build/compile tests
+ for tid in norm_f2p + norm_p2p:
+ if (tid.endswith("::build") or tid.endswith("::compile")) and tid not in parsed:
+ parsed[tid] = "PASSED"
+
+ # Identify packages that failed to build
+ build_failed_packages = {pkg for pkg, status in parsed.items() if status == "FAILED" and "::" not in pkg}
+
+ # Match FAIL_TO_PASS
+ f2p_results = {}
+ for tid in norm_f2p:
+ f2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages)
+
+ # Match PASS_TO_PASS
+ p2p_results = {}
+ for tid in norm_p2p:
+ p2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages)
+
+ all_f2p_passed = all(v == "PASSED" for v in f2p_results.values()) if f2p_results else False
+ all_p2p_passed = all(v == "PASSED" for v in p2p_results.values())
+ resolved = all_f2p_passed and all_p2p_passed
+
+ return {
+ "resolved": resolved,
+ "patch_exists": True,
+ "patch_successfully_applied": True,
+ "fail_to_pass_results": f2p_results,
+ "pass_to_pass_results": p2p_results,
+ "f2p_passed": sum(1 for v in f2p_results.values() if v == "PASSED"),
+ "f2p_total": len(f2p_results),
+ "p2p_passed": sum(1 for v in p2p_results.values() if v == "PASSED"),
+ "p2p_total": len(p2p_results),
+ "parsed_count": len(parsed),
+ "framework": test_framework,
+ }
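The resolution rule at the end of `parse_and_check_tests` can be restated on its own: every FAIL_TO_PASS test must be `PASSED`, every PASS_TO_PASS test must be `PASSED`, and an empty FAIL_TO_PASS set never counts as resolved. A minimal sketch, assuming the per-test status dicts have already been computed; `is_resolved` is a hypothetical helper name, not part of the module.

```python
from typing import Dict


def is_resolved(f2p_results: Dict[str, str], p2p_results: Dict[str, str]) -> bool:
    """Return True only if all expected tests passed (hypothetical helper)."""
    # An empty FAIL_TO_PASS set is treated as unresolved, not vacuously true.
    all_f2p_passed = bool(f2p_results) and all(v == "PASSED" for v in f2p_results.values())
    # An empty PASS_TO_PASS set is fine: all() over nothing is True.
    all_p2p_passed = all(v == "PASSED" for v in p2p_results.values())
    return all_f2p_passed and all_p2p_passed


print(is_resolved({"test_a": "PASSED"}, {"test_c": "PASSED"}))
```

Note that a status of `NOT_FOUND` or `SKIPPED` on any expected test blocks resolution, since only the literal `PASSED` counts.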
diff --git a/resources_servers/swe_bench/prepare.py b/resources_servers/swe_bench/prepare.py
new file mode 100644
index 0000000000..505ad263cc
--- /dev/null
+++ b/resources_servers/swe_bench/prepare.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+ python prepare.py # full SWE-bench Verified + all SIFs
+ python prepare.py --limit 5 # 5 instances + their 5 SIFs (smoke test)
+ python prepare.py --instance-id django__django-13741
+ python prepare.py --no-images # dataset only, skip image builds
+ python prepare.py --no-dataset --sif-dir PATH # build images only
+
+Schema that anyswe_agent expects: each JSONL line has
+`responses_create_params.metadata` with `instance_id`, `dataset_name`, `split`,
+`problem_statement`, and `instance_dict` (the full SWE-bench instance that the
+eval harness needs). Images are Apptainer SIFs named `{instance_id}.sif`, so the
+agent's container_formatter is simply `/{instance_id}.sif`.
+
+Prerequisites for image builds: `apptainer` on PATH and network access to the
+SWE-bench image registry. Each SIF is multiple GB, so building all of SWE-bench
+Verified (500 tasks) needs hundreds of GB of disk. Use --limit to iterate.
+"""
+
+import argparse
+import json
+import subprocess
+import sys
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+
+
+HF_DATASET = "princeton-nlp/SWE-bench_Verified"
+DEFAULT_SPLIT = "test"
+# SWE-bench eval image tags are the instance ID lowercased, with `__` replaced by `_1776_`.
+DOCKER_IMAGE_TMPL = "docker://swebench/sweb.eval.x86_64.{tag}:latest"
+DEFAULT_MODEL = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
+
+_THIS_DIR = Path(__file__).parent
+
+
+def _docker_tag(instance_id: str) -> str:
+ return instance_id.replace("__", "_1776_").lower()
+
+
+def _to_gym_row(inst: dict, split: str, sampling: dict) -> dict:
+ swe_meta = {
+ "instance_id": inst["instance_id"],
+ "dataset_name": HF_DATASET,
+ "split": split,
+ "problem_statement": inst["problem_statement"],
+ "instance_dict": json.dumps(inst),
+ }
+ user_text = inst["problem_statement"]
+ return {
+ "responses_create_params": {
+ "input": [{"role": "user", "content": user_text}],
+ **sampling,
+ "metadata": swe_meta,
+ },
+ "verifier_metadata": swe_meta,
+ }
+
+
+def build_dataset(output: Path, split: str, limit: int | None, instance_id: str | None, sampling: dict) -> list[str]:
+ try:
+ from datasets import load_dataset
+ except ImportError:
+ sys.exit("`datasets` is required for dataset prep: pip install datasets")
+
+ print(f"Loading {HF_DATASET} [{split}]...", flush=True)
+ rows = load_dataset(HF_DATASET, split=split)
+
+ if instance_id:
+ rows = [r for r in rows if r["instance_id"] == instance_id]
+ if not rows:
+ sys.exit(f"instance_id {instance_id!r} not found in {HF_DATASET}")
+ elif limit:
+ rows = rows.select(range(min(limit, len(rows))))
+
+ output.parent.mkdir(parents=True, exist_ok=True)
+ ids: list[str] = []
+ with output.open("w") as f:
+ for inst in rows:
+ inst = dict(inst)
+ f.write(json.dumps(_to_gym_row(inst, split, sampling)) + "\n")
+ ids.append(inst["instance_id"])
+ print(f"Wrote {len(ids)} rows -> {output}", flush=True)
+ return ids
+
+
+def _build_one_sif(instance_id: str, sif_dir: Path, force: bool) -> tuple[str, bool, str]:
+ sif_path = sif_dir / f"{instance_id}.sif"
+ if sif_path.exists() and not force:
+ return instance_id, True, "exists"
+ image = DOCKER_IMAGE_TMPL.format(tag=_docker_tag(instance_id))
+ proc = subprocess.run(
+ ["apptainer", "build", "--force", str(sif_path), image],
+ capture_output=True,
+ text=True,
+ errors="replace",
+ )
+ if proc.returncode != 0:
+ return instance_id, False, proc.stderr.strip()[-500:]
+ return instance_id, True, "built"
+
+
+def build_images(instance_ids: list[str], sif_dir: Path, jobs: int, force: bool) -> None:
+ if not _which("apptainer"):
+ sys.exit("`apptainer` not found on PATH. Install it or pass --no-images")
+ sif_dir.mkdir(parents=True, exist_ok=True)
+ print(f"Building {len(instance_ids)} SIF(s) into {sif_dir} with {jobs} worker(s)...", flush=True)
+ failures: list[str] = []
+ with ThreadPoolExecutor(max_workers=jobs) as pool:
+ futures = {pool.submit(_build_one_sif, iid, sif_dir, force): iid for iid in instance_ids}
+ for done in as_completed(futures):
+ iid, ok, detail = done.result()
+ print(f" [{'ok' if ok else 'FAIL'}] {iid}: {detail}", flush=True)
+ if not ok:
+ failures.append(iid)
+ if failures:
+ print(f"\n{len(failures)} image build(s) failed:", flush=True)
+ for iid in failures:
+ print(f" - {iid}", flush=True)
+ sys.exit(1)
+ print(f"All images ready. Use: container_formatter='{sif_dir}/{{instance_id}}.sif'", flush=True)
+
+
+def _which(name: str) -> bool:
+ from shutil import which
+
+ return which(name) is not None
+
+
+def main() -> None:
+ p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+ p.add_argument("--output", type=Path, default=_THIS_DIR / "data" / "swebench_verified.jsonl")
+ p.add_argument("--split", default=DEFAULT_SPLIT)
+ p.add_argument("--limit", type=int, default=None, help="Only the first N instances (default: all)")
+ p.add_argument("--instance-id", default=None, help="Only this instance")
+ p.add_argument("--sif-dir", type=Path, default=_THIS_DIR / "data" / "sifs")
+ p.add_argument("--no-dataset", action="store_true", help="Skip dataset build")
+ p.add_argument("--no-images", action="store_true", help="Skip image build")
+ p.add_argument("--jobs", type=int, default=4, help="Parallel image builds")
+ p.add_argument("--force", action="store_true", help="Rebuild SIFs that already exist")
+ p.add_argument("--model", default=DEFAULT_MODEL, help="Default model baked into each row")
+ p.add_argument("--temperature", type=float, default=0.7)
+ p.add_argument("--top-p", type=float, default=0.8)
+ p.add_argument("--max-output-tokens", type=int, default=12288)
+ args = p.parse_args()
+
+ sampling = {
+ "model": args.model,
+ "temperature": args.temperature,
+ "top_p": args.top_p,
+ "max_output_tokens": args.max_output_tokens,
+ }
+
+ instance_ids: list[str]
+ if args.no_dataset:
+ if not args.output.exists():
+ sys.exit(f"--no-dataset but {args.output} does not exist")
+ instance_ids = [
+ json.loads(line)["responses_create_params"]["metadata"]["instance_id"]
+ for line in args.output.read_text().splitlines()
+ if line.strip()
+ ]
+ else:
+ instance_ids = build_dataset(args.output, args.split, args.limit, args.instance_id, sampling)
+
+ if not args.no_images:
+ build_images(instance_ids, args.sif_dir, args.jobs, args.force)
+
+
+if __name__ == "__main__":
+ main()
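
Review note: the `--no-dataset` path recovers instance ids by re-reading the JSONL this script wrote, so the row shape and the reader have to agree. The sketch below round-trips two made-up rows through an in-memory buffer. `to_row` and `read_instance_ids` are hypothetical simplifications rather than the functions in this diff, and the sketch assumes `instance_id` is stored in the row's `metadata`, as the reader in `main()` implies.

```python
import io
import json


def to_row(inst: dict, sampling: dict) -> dict:
    # Hypothetical, simplified row builder; assumes instance_id lives in metadata.
    meta = {
        "instance_id": inst["instance_id"],
        "problem_statement": inst["problem_statement"],
    }
    return {
        "responses_create_params": {
            "input": [{"role": "user", "content": inst["problem_statement"]}],
            **sampling,
            "metadata": meta,
        },
        "verifier_metadata": meta,
    }


def read_instance_ids(text: str) -> list[str]:
    # Same lookup path the --no-dataset branch uses; blank lines are skipped.
    return [
        json.loads(line)["responses_create_params"]["metadata"]["instance_id"]
        for line in text.splitlines()
        if line.strip()
    ]


instances = [
    {"instance_id": "a__a-1", "problem_statement": "fix the bug"},
    {"instance_id": "b__b-2", "problem_statement": "add a test"},
]
buf = io.StringIO()
for inst in instances:
    buf.write(json.dumps(to_row(inst, {"temperature": 0.7})) + "\n")
buf.write("\n")  # a trailing blank line must not break the reader

ids = read_instance_ids(buf.getvalue())
```

If the metadata key is ever renamed in the row builder, this reader fails with a `KeyError`, which is the coupling worth keeping in mind when changing either side.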
diff --git a/resources_servers/swe_bench/requirements.txt b/resources_servers/swe_bench/requirements.txt
new file mode 100644
index 0000000000..cef7e1d96d
--- /dev/null
+++ b/resources_servers/swe_bench/requirements.txt
@@ -0,0 +1,2 @@
+swebench
+datasets>=2.14.0
diff --git a/resources_servers/swe_bench/sandbox.py b/resources_servers/swe_bench/sandbox.py
new file mode 100644
index 0000000000..99dafcd50c
--- /dev/null
+++ b/resources_servers/swe_bench/sandbox.py
@@ -0,0 +1,230 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Async SWE sandbox: an environment wrapper plus its acquire/teardown lifecycle.
+
+``AsyncSweEnvironment`` is a thin async wrapper around a started sandbox that any
+agent or the verifier uses to run commands and move files in and out.
+``acquire_sandbox`` starts a fresh sandbox and always tears it down on exit
+(normal return, exception, or cancellation).
+"""
+
+from __future__ import annotations
+
+import os
+import tempfile
+from contextlib import asynccontextmanager
+from pathlib import Path
+from typing import Any, AsyncIterator, Mapping
+
+from nemo_gym.sandbox import AsyncSandbox, SandboxProvider, SandboxSpec
+
+
+class AsyncSweEnvironment:
+ """Thin async wrapper around a started ``AsyncSandbox``.
+
+ Agents drive their own loop with ``execute``/``upload``/``download``; the
+ verifier uses the same surface to run eval recipes. The environment never
+ owns trajectory capture or grading logic — only sandbox I/O.
+ """
+
+ def __init__(self, sandbox: AsyncSandbox) -> None:
+ """Wrap an already-started sandbox.
+
+ Args:
+ sandbox (AsyncSandbox): A started sandbox to drive I/O against.
+ """
+ self._sandbox = sandbox
+ self._closed = False
+
+ @classmethod
+ async def start(
+ cls,
+ provider: Mapping[str, Any] | SandboxProvider,
+ spec: SandboxSpec,
+ ) -> "AsyncSweEnvironment":
+ """Create and start a fresh sandbox and return the environment.
+
+ Args:
+ provider (Mapping[str, Any] | SandboxProvider): The sandbox provider
+ config or instance to launch the sandbox with.
+ spec (SandboxSpec): The sandbox spec describing image, workdir, env,
+ and other launch options.
+
+ Returns:
+ AsyncSweEnvironment: An environment wrapping the started sandbox.
+ """
+ sandbox = AsyncSandbox(provider, spec)
+ await sandbox.start()
+ return cls(sandbox)
+
+ @property
+ def sandbox(self) -> AsyncSandbox:
+ """The wrapped sandbox.
+
+ Returns:
+ AsyncSandbox: The underlying sandbox instance.
+ """
+ return self._sandbox
+
+ @property
+ def sandbox_id(self) -> str | None:
+ """The provider-assigned sandbox identifier.
+
+ Returns:
+ str | None: The sandbox id, or ``None`` if the sandbox has no handle.
+ """
+ handle = getattr(self._sandbox, "_handle", None)
+ return handle.sandbox_id if handle is not None else None
+
+ @property
+ def provider_name(self) -> str | None:
+ """The name of the provider backing the sandbox.
+
+ Returns:
+ str | None: The provider name, or ``None`` if the sandbox has no handle.
+ """
+ handle = getattr(self._sandbox, "_handle", None)
+ return handle.provider_name if handle is not None else None
+
+ async def execute(
+ self,
+ command: str,
+ *,
+ cwd: str | None = None,
+ user: str | int | None = "root",
+ timeout_s: int | float | None = None,
+ is_eval: bool = False,
+ ) -> dict[str, Any]:
+ """Run a command in the sandbox and return a normalized result.
+
+ Args:
+ command (str): The shell command to execute.
+ cwd (str | None): Working directory for the command, or ``None`` to
+ use the sandbox default.
+ user (str | int | None): User to run the command as. Defaults to
+ ``"root"``.
+ timeout_s (int | float | None): Optional timeout in seconds.
+ is_eval (bool): Accepted for caller bookkeeping; it does not affect
+ how the command is executed.
+
+ Returns:
+ dict[str, Any]: A dict with ``output`` (combined stdout and stderr),
+ ``returncode``, ``stdout``, ``stderr``, and ``error_type``.
+ """
+ result = await self._sandbox.exec(command, cwd=cwd, env=None, timeout_s=timeout_s, user=user)
+ stdout = result.stdout or ""
+ stderr = result.stderr or ""
+ output = "\n".join(part for part in (stdout, stderr) if part)
+ return {
+ "output": output,
+ "returncode": result.return_code,
+ "stdout": stdout,
+ "stderr": stderr,
+ "error_type": result.error_type,
+ }
+
+ async def upload(self, local_path: Path | str, remote_path: str) -> None:
+ """Upload a local file into the sandbox.
+
+ Args:
+ local_path (Path | str): Path to the file on the host.
+ remote_path (str): Destination path inside the sandbox.
+ """
+ await self._sandbox.upload(local_path, remote_path)
+
+ async def download(self, remote_path: str, local_path: Path | str) -> None:
+ """Download a file from the sandbox to the host.
+
+ Args:
+ remote_path (str): Source path inside the sandbox.
+ local_path (Path | str): Destination path on the host.
+ """
+ await self._sandbox.download(remote_path, local_path)
+
+ async def write_text(self, remote_path: str, content: str) -> None:
+ """Write a string to a file inside the sandbox via a temporary upload.
+
+ Args:
+ remote_path (str): Destination path inside the sandbox.
+ content (str): The text content to write.
+ """
+ tmp = tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8")
+ try:
+ tmp.write(content)
+ tmp.flush()
+ tmp.close()
+ await self._sandbox.upload(tmp.name, remote_path)
+ finally:
+ os.unlink(tmp.name)
+
+ async def cleanup(self) -> None:
+ """Stop the sandbox. Idempotent: subsequent calls are no-ops."""
+ if self._closed:
+ return
+ self._closed = True
+ await self._sandbox.stop()
+
+ async def __aenter__(self) -> "AsyncSweEnvironment":
+ """Enter the async context manager.
+
+ Returns:
+ AsyncSweEnvironment: This environment instance.
+ """
+ return self
+
+ async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
+ """Exit the async context manager and stop the sandbox.
+
+ Args:
+ exc_type (Any): The exception type, if one was raised.
+ exc_val (Any): The exception instance, if one was raised.
+ exc_tb (Any): The traceback, if an exception was raised.
+ """
+ await self.cleanup()
+
+
+# --- sandbox acquire/teardown lifecycle ---------------
+
+
+@asynccontextmanager
+async def acquire_sandbox(
+ provider: Mapping[str, Any] | SandboxProvider,
+ spec: SandboxSpec,
+ *,
+ instance_id: str = "",
+) -> AsyncIterator[AsyncSweEnvironment]:
+ """Start a fresh sandbox, yield it, and always stop it on exit.
+
+ Args:
+ provider: Either a ``SandboxProvider`` instance or a mapping describing
+ the provider configuration used to create the sandbox.
+ spec: The ``SandboxSpec`` describing how to provision the sandbox.
+ instance_id: Identifier accepted for logging/telemetry; it does not
+ affect behavior.
+
+ Yields:
+ AsyncSweEnvironment: The started environment wrapping the sandbox,
+ which is cleaned up when the context manager exits.
+ """
+ env: AsyncSweEnvironment | None = None
+ try:
+ env = await AsyncSweEnvironment.start(provider, spec)
+ yield env
+ finally:
+ if env is not None:
+ try:
+ await env.cleanup()
+ except Exception:
+ pass
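
Review note: the teardown guarantee of `acquire_sandbox` (cleanup on normal return, exception, or cancellation, combined with an idempotent `cleanup`) can be checked without a real provider. The sketch below is a minimal stand-in, assuming a hypothetical `FakeEnv` in place of `AsyncSweEnvironment` and `acquire_fake` in place of `acquire_sandbox`. It is not the module's code.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeEnv:
    """Hypothetical stand-in for the environment; counts effective cleanups."""

    def __init__(self) -> None:
        self.cleanups = 0
        self._closed = False

    async def cleanup(self) -> None:
        # Idempotent: only the first call does any work.
        if self._closed:
            return
        self._closed = True
        self.cleanups += 1


@asynccontextmanager
async def acquire_fake(env: FakeEnv):
    try:
        yield env
    finally:
        # Runs whether the body returned, raised, or was cancelled.
        await env.cleanup()


async def demo() -> int:
    env = FakeEnv()
    try:
        async with acquire_fake(env):
            raise RuntimeError("agent crashed")
    except RuntimeError:
        pass
    await env.cleanup()  # a second call is a no-op
    return env.cleanups


cleanup_count = asyncio.run(demo())
```

Unlike the real context manager, this sketch does not swallow errors raised by `cleanup` itself. The diff suppresses them so a failed teardown cannot mask the original exception.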
diff --git a/resources_servers/swe_bench/self_drive.py b/resources_servers/swe_bench/self_drive.py
new file mode 100644
index 0000000000..2e00cfac69
--- /dev/null
+++ b/resources_servers/swe_bench/self_drive.py
@@ -0,0 +1,392 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Provider-neutral self-driving scaffolding for SWE agents.
+
+Any agent that runs to completion inside a sandbox (editing the repo at the task's
+working directory) can reuse these helpers: provision a working sandbox via a
+``SandboxProvider``, inject a sandbox-reachable model endpoint and/or extra
+environment for egress, run an opaque agent launch command, and extract the
+resulting unified-diff patch. Grading is decoupled — callers grade the patch
+in-process via :func:`run_self_driving` (or ``verify_task`` directly) in a fresh
+sandbox. The agent launch command, staged files, and patch-output location are
+caller-supplied, so nothing here is specific to any one agent harness.
+
+This module also defines the in-sandbox model-server egress primitive
+(``ModelEndpoint`` / ``resolve``), used to inject a sandbox-reachable endpoint
+into the agent's environment.
+"""
+
+from __future__ import annotations
+
+import dataclasses
+import json
+import shlex
+from collections.abc import Mapping
+from dataclasses import dataclass
+from typing import Any
+
+from nemo_gym.sandbox import SandboxProvider
+from resources_servers.swe_bench.harness import SweTask, get_harness, reward_from_report
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+
+
+def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str:
+ """Return the name of a sandbox provider.
+
+ Args:
+ provider: Either a mapping keyed by provider name, or a ``SandboxProvider``
+ instance with a ``name`` attribute.
+
+ Returns:
+ The provider name, or ``"?"`` if it cannot be determined.
+ """
+ if isinstance(provider, Mapping):
+ return next(iter(provider), "?")
+ return getattr(provider, "name", "?")
+
+
+async def _read_output_jsonl_row(env, output_glob: str) -> dict[str, Any]:
+ """Return the last row of the newest matching ``output.jsonl`` (or ``{}`` if absent).
+
+ Some self-driving harnesses write their result row to an ``output.jsonl`` file under an
+ output directory rather than to the working tree, so a plain ``git diff`` would miss the
+ patch. When several files match (e.g. a re-run left a stale one), the newest by mtime is
+ picked. ``find -printf "%T@ %p"`` emits ``<mtime> <path>`` per match; ``sort -n | tail -1``
+ selects the most-recently-modified, and the leading float timestamp plus single space is
+ stripped back off (so paths containing spaces survive).
+
+ Args:
+ env: The sandbox handle exposing ``execute`` for running shell commands.
+ output_glob: Root path to search for ``output.jsonl`` files (shell-quoted, so wildcards are not expanded).
+
+ Returns:
+ The parsed last JSON row of the newest matching ``output.jsonl`` as a dict, or an
+ empty dict if no file or content is found.
+ """
+ found = await env.execute(
+ f'find {shlex.quote(output_glob)} -name output.jsonl -printf "%T@ %p\\n" 2>/dev/null | sort -n | tail -1'
+ )
+ newest = (found.get("stdout", "") or "").strip()
+ # newest is "<mtime> <path>"; the path may contain spaces, so split only on the first one.
+ path = newest.split(" ", 1)[1].strip() if " " in newest else ""
+ if not path:
+ return {}
+ catted = await env.execute(f"cat {shlex.quote(path)}")
+ raw = (catted.get("stdout", "") or "").strip()
+ if not raw:
+ return {}
+ return json.loads(raw.splitlines()[-1])
+
+
+async def _extract_patch_from_output_jsonl(env, output_glob: str) -> str:
+ """Read the unified-diff patch from the newest matching ``output.jsonl``.
+
+ Args:
+ env: The sandbox handle exposing ``execute`` for running shell commands.
+ output_glob: Root path to search for ``output.jsonl`` files (shell-quoted, so wildcards are not expanded).
+
+ Returns:
+ The patch string from ``row["test_result"]["git_patch"]``, or an empty string if
+ absent.
+ """
+ row = await _read_output_jsonl_row(env, output_glob)
+ return (row.get("test_result") or {}).get("git_patch", "") or ""
+
+
+def _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env):
+ """Build the agent sandbox spec, injecting egress env (model endpoint and/or extra env).
+
+ Args:
+ task: The SWE task whose benchmark selects the harness and seeds the spec.
+ provider: The sandbox provider, used to resolve the model endpoint for egress.
+ model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+ is resolved and merged into the spec's environment.
+ opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+ model endpoint.
+ extra_env: Optional environment variables merged verbatim into the spec.
+
+ Returns:
+ The sandbox spec with egress environment variables applied.
+ """
+ harness = get_harness(task.benchmark)
+ spec = harness.build_spec(task)
+ # Model-server egress: inject only a sandbox-reachable endpoint (never the global dict).
+ if model_server is not None:
+ endpoint = resolve(_provider_name(provider), model_server, opensandbox_service_url=opensandbox_service_url)
+ spec = dataclasses.replace(spec, env={**spec.env, **endpoint.to_sandbox_env()})
+ # Any extra in-sandbox env (e.g. a NeMo-Gym ServerClient config dict, ANTHROPIC_* vars).
+ if extra_env:
+ spec = dataclasses.replace(spec, env={**spec.env, **dict(extra_env)})
+ return spec
+
+
+async def provision_and_collect(
+ task: SweTask,
+ *,
+ provider: Mapping[str, Any] | SandboxProvider,
+ agent_launch_command: str,
+ model_server: Mapping[str, Any] | None = None,
+ opensandbox_service_url: str | None = None,
+ extra_env: Mapping[str, str] | None = None,
+ stage_files: Mapping[str, str] | None = None,
+ patch_output_glob: str | None = None,
+ agent_timeout_s: int | float = 1800,
+) -> dict[str, Any]:
+ """Provision and self-drive the agent, returning the patch and error signals.
+
+ Provisions a writable sandbox from the task image, stages any caller-supplied files,
+ runs the opaque ``agent_launch_command`` at the repo working directory, then extracts the
+ unified-diff patch. No grading happens here.
+
+ Two egress styles are supported and composable:
+
+ * ``model_server`` -> a sandbox-reachable OpenAI ``base_url`` (via ``resolve``),
+ for agents that call the model via a standard OpenAI/litellm client.
+ * ``extra_env`` -> injected verbatim, for agents wired to NeMo Gym's ``ServerClient`` or to
+ a CLI that reads its endpoint from environment variables.
+
+ ``env.execute`` does not raise on timeout; it reports the failure in ``error_type`` instead.
+ Callers must therefore read the returned ``"error_type"`` to set ``agent_timed_out``;
+ otherwise a timed-out run would be graded as a normal sample rather than masked.
+
+ Args:
+ task: The SWE task describing the instance, image, and working directory.
+ provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+ agent_launch_command: The shell command that runs the agent inside the sandbox.
+ model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+ is resolved and injected into the agent's environment.
+ opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+ model endpoint.
+ extra_env: Optional environment variables injected verbatim into the sandbox.
+ stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+ before launch.
+ patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+ path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+ agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+ Returns:
+ A dict with keys ``"patch"`` (the unified-diff string), ``"agent_error"`` (the
+ harness error field or ``None``), and ``"error_type"`` (``"timeout"``, ``"sandbox"``,
+ or ``None``).
+ """
+ spec = _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env)
+ async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env:
+ for remote_path, content in (stage_files or {}).items():
+ await env.write_text(remote_path, content)
+ run = await env.execute(agent_launch_command, cwd=task.repo_workdir, timeout_s=agent_timeout_s)
+ error_type = run.get("error_type")
+ if patch_output_glob:
+ row = await _read_output_jsonl_row(env, patch_output_glob)
+ patch = (row.get("test_result") or {}).get("git_patch", "") or ""
+ return {"patch": patch, "agent_error": row.get("error"), "error_type": error_type}
+ diff = await env.execute(f"cd {task.repo_workdir} && git add -A && git diff --cached", cwd=task.repo_workdir)
+ return {"patch": diff.get("stdout", "") or "", "agent_error": None, "error_type": error_type}
+
+
+async def provision_and_extract_patch(
+ task: SweTask,
+ *,
+ provider: Mapping[str, Any] | SandboxProvider,
+ agent_launch_command: str,
+ model_server: Mapping[str, Any] | None = None,
+ opensandbox_service_url: str | None = None,
+ extra_env: Mapping[str, str] | None = None,
+ stage_files: Mapping[str, str] | None = None,
+ patch_output_glob: str | None = None,
+ agent_timeout_s: int | float = 1800,
+) -> str:
+ """Provision a working sandbox, self-drive the agent, and return the unified-diff patch.
+
+ A thin wrapper over :func:`provision_and_collect` returning only the patch. No grading
+ happens here.
+
+ Args:
+ task: The SWE task describing the instance, image, and working directory.
+ provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+ agent_launch_command: The shell command that runs the agent inside the sandbox.
+ model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+ is resolved and injected into the agent's environment.
+ opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+ model endpoint.
+ extra_env: Optional environment variables injected verbatim into the sandbox.
+ stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+ before launch.
+ patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+ path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+ agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+ Returns:
+ The extracted unified-diff patch as a string (empty if none was produced).
+ """
+ result = await provision_and_collect(
+ task,
+ provider=provider,
+ agent_launch_command=agent_launch_command,
+ model_server=model_server,
+ opensandbox_service_url=opensandbox_service_url,
+ extra_env=extra_env,
+ stage_files=stage_files,
+ patch_output_glob=patch_output_glob,
+ agent_timeout_s=agent_timeout_s,
+ )
+ return result["patch"]
+
+
+async def run_self_driving(
+ task: SweTask,
+ *,
+ provider: Mapping[str, Any] | SandboxProvider,
+ agent_launch_command: str,
+ model_server: Mapping[str, Any] | None = None,
+ opensandbox_service_url: str | None = None,
+ extra_env: Mapping[str, str] | None = None,
+ stage_files: Mapping[str, str] | None = None,
+ patch_output_glob: str | None = None,
+ agent_timeout_s: int | float = 1800,
+) -> dict[str, Any]:
+ """Provision, self-drive, extract the patch, then grade it in-process in a fresh sandbox.
+
+ Bundles provisioning and verification for standalone use and tests. The patch is graded by
+ ``verify_task`` in its OWN fresh sandbox (so grading is hermetic — never the agent's dirtied
+ tree). ``verify_task`` is imported lazily to avoid a circular import between this library and
+ the verifier module.
+
+ Args:
+ task: The SWE task describing the instance, image, and working directory.
+ provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+ agent_launch_command: The shell command that runs the agent inside the sandbox.
+ model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+ is resolved and injected into the agent's environment.
+ opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+ model endpoint.
+ extra_env: Optional environment variables injected verbatim into the sandbox.
+ stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+ before launch.
+ patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+ path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+ agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+ Returns:
+ A dict with the instance id, model patch, resolution status, reward, whether a patch
+ exists, whether the sample is masked, and the verifier's error kind.
+ """
+ from resources_servers.swe_bench.verify_task import verify_task
+
+ patch = await provision_and_extract_patch(
+ task,
+ provider=provider,
+ agent_launch_command=agent_launch_command,
+ model_server=model_server,
+ opensandbox_service_url=opensandbox_service_url,
+ extra_env=extra_env,
+ stage_files=stage_files,
+ patch_output_glob=patch_output_glob,
+ agent_timeout_s=agent_timeout_s,
+ )
+ # Score the patch in the verifier's OWN fresh sandbox (decoupled, hermetic verification).
+ report = await verify_task(provider, dataclasses.replace(task, model_patch=patch))
+ masked = report.error_kind is not None
+ return {
+ "instance_id": task.instance_id,
+ "model_patch": patch,
+ "resolved": report.resolved,
+ "reward": reward_from_report(report),
+ "patch_exists": bool(patch.strip()),
+ "mask_sample": masked,
+ "error_kind": report.error_kind,
+ }
+
+
+# --- in-sandbox model-server egress --------------
+
+
+class ModelEgressUnavailable(RuntimeError):
+ """Raised when no sandbox-reachable model endpoint can be resolved for a provider."""
+
+
+@dataclass(frozen=True)
+class ModelEndpoint:
+ """A sandbox-reachable model-server endpoint.
+
+ Attributes:
+ base_url: The base URL the in-sandbox agent uses to reach the model server.
+ api_key: Optional API key for authenticating to the model server.
+ model: Optional model name to use.
+ """
+
+ base_url: str
+ api_key: str = ""
+ model: str = ""
+
+ def to_sandbox_env(self) -> dict[str, str]:
+ """Build the minimal set of environment variables to inject into the sandbox.
+
+ Returns:
+ dict[str, str]: Environment variables carrying the base URL and,
+ when set, the API key and model name. The global config dict is
+ never included.
+ """
+ env = {"OPENAI_BASE_URL": self.base_url, "NEMO_GYM_MODEL_BASE_URL": self.base_url}
+ if self.api_key:
+ env["OPENAI_API_KEY"] = self.api_key
+ if self.model:
+ env["NEMO_GYM_MODEL"] = self.model
+ return env
+
+
+def resolve(
+ provider_name: str,
+ model_server: Mapping[str, Any],
+ *,
+ host_loopback_url: str = "http://127.0.0.1:8000/v1",
+ opensandbox_service_url: str | None = None,
+) -> ModelEndpoint:
+ """Resolve a sandbox-reachable model endpoint for a sandbox provider.
+
+ Args:
+ provider_name: The sandbox provider name (e.g. ``"apptainer"``,
+ ``"opensandbox"``, ``"docker"``).
+ model_server: Mapping describing the model server, read for the
+ ``api_key``, ``model``, and ``base_url`` keys.
+ host_loopback_url: Fallback URL used when the provider shares the host
+ network namespace and no base URL is configured.
+ opensandbox_service_url: Cluster-reachable Service/ingress URL used for
+ the opensandbox provider when no other base URL is configured.
+
+ Returns:
+ ModelEndpoint: The resolved endpoint carrying the base URL, API key,
+ and model name.
+
+ Raises:
+ ModelEgressUnavailable: If the opensandbox provider cannot resolve a
+ cluster-reachable model-server URL (e.g. only loopback is available).
+ """
+ api_key = str(model_server.get("api_key", "") or "")
+ model = str(model_server.get("model", "") or "")
+ configured_base = str(model_server.get("base_url", "") or "")
+
+ if provider_name == "opensandbox":
+ base_url = opensandbox_service_url or configured_base
+ if not base_url or "127.0.0.1" in base_url or "localhost" in base_url:
+ raise ModelEgressUnavailable(
+ "opensandbox needs a cluster-reachable model-server URL (k8s Service/ingress); "
+ "loopback is unreachable from the pod. Configure 'opensandbox_service_url', or "
+ "run the agent with the docker provider instead."
+ )
+ else:
+ # Any other provider (e.g. docker / local): assumed to share the host network, so host loopback is reachable.
+ base_url = configured_base or host_loopback_url
+
+ return ModelEndpoint(base_url=base_url, api_key=api_key, model=model)
diff --git a/resources_servers/swe_bench/session.py b/resources_servers/swe_bench/session.py
new file mode 100644
index 0000000000..ca3112bc57
--- /dev/null
+++ b/resources_servers/swe_bench/session.py
@@ -0,0 +1,71 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""SessionDescriptor — Environment response after accepting a Task.
+
+The descriptor is **episode context**, not the Task itself: placement topology,
+sandbox spec, egress hints, and a round-trip verifier payload for ``/verify``.
+"""
+
+from __future__ import annotations
+
+from typing import Any, Literal, Optional
+
+from pydantic import BaseModel, ConfigDict, Field
+
+from nemo_gym.base_resources_server import (
+ BaseSeedSessionRequest,
+ BaseSeedSessionResponse,
+ BaseVerifyRequest,
+ BaseVerifyResponse,
+)
+from resources_servers.swe_bench.task import ENVIRONMENT_NAME, TaskPublic
+
+
+Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"]
+
+
+class PlacementDescriptor(BaseModel):
+ topology: Topology
+
+
+class SandboxDescriptor(BaseModel):
+ spec: dict[str, Any]
+
+
+class EgressDescriptor(BaseModel):
+ env: dict[str, str] = Field(default_factory=dict)
+
+
+class SessionDescriptor(BaseSeedSessionResponse):
+ """Environment-owned episode context returned from ``seed_session``."""
+
+ environment: str = ENVIRONMENT_NAME
+ task: TaskPublic
+ placement: PlacementDescriptor
+ sandbox: SandboxDescriptor
+ egress: EgressDescriptor
+ verifier_metadata: dict[str, Any]
+
+
+class SweBenchSeedSessionRequest(BaseSeedSessionRequest):
+ model_config = ConfigDict(extra="allow")
+ verifier_metadata: Optional[dict[str, Any]] = None
+
+
+class SweBenchVerifyRequest(BaseVerifyRequest):
+ model_config = ConfigDict(extra="allow")
+ verifier_metadata: Optional[dict[str, Any]] = None
+
+
+class SweBenchVerifyResponse(BaseVerifyResponse):
+ model_config = ConfigDict(extra="allow")
+ task_id: str = ""
+ environment: str = ENVIRONMENT_NAME
+ resolved: bool = False
+ patch_exists: bool = False
+ mask_sample: bool = False
+ error_kind: Optional[str] = None
+
+
+SweBenchSeedSessionResponse = SessionDescriptor
diff --git a/resources_servers/swe_bench/task.py b/resources_servers/swe_bench/task.py
new file mode 100644
index 0000000000..53d86b506e
--- /dev/null
+++ b/resources_servers/swe_bench/task.py
@@ -0,0 +1,256 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""First-class Task model for the ``swe_bench`` Environment.
+
+A **Task** (τ) is one problem instance from a benchmark's task distribution — not the
+Environment (``swe_bench`` resources server) and not the published benchmark name alone
+(e.g. *SWE-bench Verified*).
+
+Terminology:
+
+* ``task_id`` / ``instance_id`` — unique instance key (``django__django-13741``)
+* ``dataset_name`` — published benchmark product (HuggingFace id)
+* ``harness_family`` / ``benchmark`` — harness registry key inside this Environment
+ (``swe-bench``, ``r2e-gym``, …)
+* ``problem_statement`` — initial observation (user message) for the agent
+* ``metadata`` — privileged grading fields (``instance_dict``, etc.); Environment-only
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field, replace
+from typing import Any, Protocol
+
+from pydantic import BaseModel, ConfigDict
+
+from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming
+
+
+ENVIRONMENT_NAME = "swe_bench"
+
+_HARNESS_FAMILY_ALIASES: list[tuple[str, str]] = [
+ ("R2E-Gym", "r2e-gym"),
+ ("SWE-bench_Multilingual", "swe-bench-multilingual"),
+ ("SWE-bench", "swe-bench"),
+]
+
+
+class TaskRunBody(Protocol):
+ """Minimal run/seed/verify request shape carrying task fields."""
+
+ responses_create_params: NeMoGymResponseCreateParamsNonStreaming | None
+ verifier_metadata: dict[str, Any] | None
+
+
+class TaskPublic(BaseModel):
+ """Agent-visible task identity returned from ``seed_session``."""
+
+ model_config = ConfigDict(extra="forbid")
+
+ task_id: str
+ environment: str = ENVIRONMENT_NAME
+ dataset_name: str = ""
+ harness_family: str = ""
+ split: str = "test"
+
+
+class TaskSubmission(BaseModel):
+ """Agent-produced artifact graded at ``verify`` (Environment-owned scoring)."""
+
+ model_config = ConfigDict(extra="forbid")
+
+ model_patch: str = ""
+
+
+@dataclass
+class SweTask:
+ """One SWE Environment task instance — provisioning + grading input.
+
+ This is the Environment-internal task value. Harnesses consume ``SweTask``;
+ HTTP callers supply dataset rows that parse into this type.
+ """
+
+ instance_id: str
+ image: str | None = None
+ base_commit: str | None = None
+ repo_workdir: str = "/testbed"
+ test_command: str = ""
+ test_framework: str = ""
+ model_patch: str = ""
+ test_patch: str = ""
+ fail_to_pass: list[str] = field(default_factory=list)
+ pass_to_pass: list[str] = field(default_factory=list)
+ benchmark: str = "swe-bench-ext"
+ split: str = "test"
+ dataset_name: str = ""
+ problem_statement: str = ""
+ metadata: dict[str, Any] = field(default_factory=dict)
+
+ @property
+ def task_id(self) -> str:
+ return self.instance_id
+
+ @property
+ def harness_family(self) -> str:
+ return self.benchmark
+
+ def public_view(self, *, environment: str = ENVIRONMENT_NAME) -> TaskPublic:
+ """Return the agent-visible task identity (no privileged metadata)."""
+ return TaskPublic(
+ task_id=self.task_id,
+ environment=environment,
+ dataset_name=self.dataset_name,
+ harness_family=self.harness_family,
+ split=self.split,
+ )
+
+ def privileged_verifier_metadata(self, *, flat_eval: bool) -> dict[str, Any]:
+ """Privileged fields the Environment needs on verify (not for agent logic)."""
+ return {
+ "instance_id": self.instance_id,
+ "dataset_name": self.dataset_name,
+ "split": self.split,
+ "benchmark": self.benchmark,
+ "harness_family": self.harness_family,
+ "problem_statement": self.problem_statement,
+ "flat_eval": flat_eval,
+ "instance_dict": self.metadata.get("instance_dict"),
+ }
+
+ def with_submission(self, submission: TaskSubmission | None) -> SweTask:
+ """Return a copy with the agent's graded submission applied."""
+ patch = (submission.model_patch if submission else "") or ""
+ return replace(self, model_patch=patch)
+
+
+def harness_family_key(dataset_name: str) -> str:
+ """Map a HuggingFace dataset name to a harness registry key."""
+ for needle, key in _HARNESS_FAMILY_ALIASES:
+ if needle in dataset_name:
+ return key
+ return "swe-bench"
+
+
+def instance_image(container_formatter: Any, instance_id: str) -> str:
+ fmt = container_formatter[0] if isinstance(container_formatter, list) else container_formatter
+ fmt = fmt or "swebench/sweb.eval.x86_64.{instance_id}"
+ if fmt.endswith(".sif") or fmt.startswith(("/", ".")):
+ return fmt.format(instance_id=instance_id)
+ if fmt.startswith("docker://"):
+ fmt = fmt[len("docker://") :]
+ tag = instance_id.replace("__", "_1776_").lower()
+ image = fmt.format(instance_id=tag)
+ if ":" not in image.rsplit("/", 1)[-1]:
+ image += ":latest"
+ return image
+
+
+def _as_list(value: Any) -> list[str]:
+ if isinstance(value, str):
+ try:
+ return list(json.loads(value))
+ except (json.JSONDecodeError, TypeError):
+ return [value] if value else []
+ return list(value or [])
+
+
+def merge_row_metadata(
+ verifier_metadata: dict[str, Any] | None,
+ responses_metadata: dict[str, Any] | None,
+) -> dict[str, Any]:
+ """Merge dataset row fields from verifier and responses metadata."""
+ return _merge_row_metadata(verifier_metadata, responses_metadata)
+
+
+def _merge_row_metadata(
+ verifier_metadata: dict[str, Any] | None,
+ responses_metadata: dict[str, Any] | None,
+) -> dict[str, Any]:
+ info: dict[str, Any] = {}
+ if responses_metadata:
+ info.update(responses_metadata)
+ if verifier_metadata:
+ info.update(verifier_metadata)
+ return info
+
+
+def _initial_observation(row: dict[str, Any], responses_metadata: dict[str, Any] | None) -> str:
+ if row.get("problem_statement"):
+ return str(row["problem_statement"])
+    params = row.get("responses_create_params")
+    if isinstance(params, dict):
+        # Dict-shaped params: read ``input`` by key.
+        raw_input = params.get("input")
+    else:
+        # Object-shaped params (e.g. a pydantic model) expose ``input`` as an
+        # attribute. A missing ``responses_create_params`` yields None.
+        # ``responses_metadata`` does not affect the result.
+        raw_input = getattr(params, "input", None)
+ if isinstance(raw_input, str):
+ return raw_input
+ if isinstance(raw_input, list) and raw_input:
+ first = raw_input[0]
+ if isinstance(first, dict):
+ return str(first.get("content", ""))
+ return ""
+
+
+def build_task(
+ row: dict[str, Any],
+ *,
+ container_formatter: str,
+ flat_eval: bool = True,
+ responses_metadata: dict[str, Any] | None = None,
+) -> SweTask:
+ """Build a ``SweTask`` from merged dataset / verifier metadata."""
+ inst_raw = row.get("instance_dict")
+ inst = json.loads(inst_raw) if isinstance(inst_raw, str) else dict(inst_raw or {})
+ dataset_name = str(row.get("dataset_name", ""))
+ instance_id = row["instance_id"]
+ image = instance_image(row.get("container_formatter") or container_formatter, instance_id)
+
+ return SweTask(
+ instance_id=instance_id,
+ image=image,
+ base_commit=inst.get("base_commit"),
+ repo_workdir="/testbed",
+ test_patch=inst.get("test_patch", ""),
+ fail_to_pass=_as_list(inst.get("FAIL_TO_PASS") or inst.get("fail_to_pass")),
+ pass_to_pass=_as_list(inst.get("PASS_TO_PASS") or inst.get("pass_to_pass")),
+ benchmark=harness_family_key(dataset_name),
+ split=str(row.get("split", "test")),
+ dataset_name=dataset_name,
+ problem_statement=_initial_observation(row, responses_metadata),
+ metadata={"instance_dict": inst, "flat_eval": flat_eval, "dataset_name": dataset_name},
+ )
+
+
+def parse_task_from_request(
+ body: TaskRunBody,
+ *,
+ container_formatter: str,
+ flat_eval: bool = True,
+ environment: str = ENVIRONMENT_NAME,
+) -> SweTask:
+ """Parse a first-class Task from an agent ``/run`` or Environment HTTP body."""
+ responses_metadata = (body.responses_create_params.metadata or {}) if body.responses_create_params else {}
+ row = merge_row_metadata(body.verifier_metadata, responses_metadata)
+ if "instance_id" not in row:
+ raise ValueError(
+ "Task requires verifier_metadata.instance_id (or responses_create_params.metadata.instance_id)"
+ )
+ return build_task(
+ row,
+ container_formatter=container_formatter,
+ flat_eval=flat_eval,
+ responses_metadata=responses_metadata,
+ )
+
+
+def parse_submission(verifier_metadata: dict[str, Any] | None) -> TaskSubmission:
+ """Extract the agent submission from verify request metadata."""
+ meta = dict(verifier_metadata or {})
+ patch = meta.get("model_patch") or meta.get("git_patch") or ""
+ return TaskSubmission(model_patch=patch if isinstance(patch, str) else str(patch))
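Review note: two of the pure helpers in `task.py` are easy to misread, so here is a minimal standalone sketch of their behavior. The alias table and the default image format are copied from the diff above. The name `default_image` is hypothetical and covers only the default-formatter branch of `instance_image`, and the dataset names in the usage lines are illustrative inputs, not a list of supported datasets.

```python
from __future__ import annotations

# Alias table copied from the diff. Order matters because matching is by
# substring: "SWE-bench" also occurs inside "SWE-bench_Multilingual".
HARNESS_FAMILY_ALIASES: list[tuple[str, str]] = [
    ("R2E-Gym", "r2e-gym"),
    ("SWE-bench_Multilingual", "swe-bench-multilingual"),
    ("SWE-bench", "swe-bench"),
]


def harness_family_key(dataset_name: str) -> str:
    """Return the key of the first alias whose needle occurs in the name."""
    for needle, key in HARNESS_FAMILY_ALIASES:
        if needle in dataset_name:
            return key
    return "swe-bench"  # fallback used by the diff when nothing matches


def default_image(instance_id: str) -> str:
    """Hypothetical helper: the default-formatter branch of ``instance_image``."""
    fmt = "swebench/sweb.eval.x86_64.{instance_id}"
    # Double underscores are rewritten and the tag is lowercased.
    tag = instance_id.replace("__", "_1776_").lower()
    image = fmt.format(instance_id=tag)
    # Append a tag only when the last path component has none.
    if ":" not in image.rsplit("/", 1)[-1]:
        image += ":latest"
    return image


if __name__ == "__main__":
    print(harness_family_key("princeton-nlp/SWE-bench_Verified"))
    print(harness_family_key("org/SWE-bench_Multilingual"))
    print(default_image("django__django-13741"))
```

Listing the more specific needle first is what keeps a Multilingual dataset from resolving to the plain `swe-bench` key. Reordering the table would silently change the registry lookup.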
diff --git a/resources_servers/swe_bench/task_builder.py b/resources_servers/swe_bench/task_builder.py
new file mode 100644
index 0000000000..3c4df6d181
--- /dev/null
+++ b/resources_servers/swe_bench/task_builder.py
@@ -0,0 +1,20 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Backward-compatible re-exports — prefer ``resources_servers.swe_bench.task``."""
+
+from resources_servers.swe_bench.task import (
+ SweTask,
+)
+from resources_servers.swe_bench.task import (
+ build_task as build_swetask,
+)
+from resources_servers.swe_bench.task import (
+ harness_family_key as benchmark_key,
+)
+from resources_servers.swe_bench.task import (
+ merge_row_metadata as problem_info_from_row,
+)
+
+
+__all__ = ["SweTask", "benchmark_key", "build_swetask", "problem_info_from_row"]
diff --git a/resources_servers/swe_bench/tests/__init__.py b/resources_servers/swe_bench/tests/__init__.py
new file mode 100644
index 0000000000..777f2341ac
--- /dev/null
+++ b/resources_servers/swe_bench/tests/__init__.py
@@ -0,0 +1 @@
+"""Test suite for the swe_bench resources server."""
diff --git a/resources_servers/swe_bench/tests/conftest.py b/resources_servers/swe_bench/tests/conftest.py
new file mode 100644
index 0000000000..5bc774e7c4
--- /dev/null
+++ b/resources_servers/swe_bench/tests/conftest.py
@@ -0,0 +1,26 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Pytest collection guard for the swe_env tests.
+
+The flat-eval parser fixtures are recorded eval logs whose lines begin with the
+SWE-bench ``>>>>>`` sentinels. Under doctest collection those look like
+(malformed) ``>>>`` prompts, so the fixtures directory is excluded from
+collection entirely. It holds only data, never tests.
+"""
+
+from __future__ import annotations
+
+
+collect_ignore_glob = ["fixtures/*"]
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt
new file mode 100644
index 0000000000..bb67958525
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt
@@ -0,0 +1,9 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+error: while searching for:
+ def format_signature(self):
+error: patch failed: sphinx/ext/autodoc/__init__.py:120
+error: sphinx/ext/autodoc/__init__.py: patch does not apply
+>>>>> Patch Apply Failed
++ git checkout abc123 tests/test_ext_autodoc.py
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt
new file mode 100644
index 0000000000..bc8d678e61
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt
@@ -0,0 +1,14 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+collected 3 items
+>>>>> End Test Output
+PASSED tests/test_ext_autodoc.py::test_format_signature
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+=================== 3 passed in 1.92s =========================================
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt
new file mode 100644
index 0000000000..c4f0e56654
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt
@@ -0,0 +1,11 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
++ git apply -v /tmp/test_patch.diff
+error: patch failed: tests/test_ext_autodoc.py:1
+error: tests/test_ext_autodoc.py: patch does not apply
++ python -m pytest tests/test_ext_autodoc.py
+ERROR: file or directory not found: tests/test_ext_autodoc.py
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt
new file mode 100644
index 0000000000..1d0ba6a53a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt
@@ -0,0 +1,25 @@
++ source /opt/miniconda3/bin/activate
++ conda activate testbed
++ git config --global --add safe.directory /testbed
++ cd /testbed
++ git status
++ git restore .
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
++ git apply -v /tmp/test_patch.diff
+Checking patch tests/test_ext_autodoc.py...
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+PASSED tests/test_ext_autodoc.py::test_format_signature
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+SKIPPED tests/test_ext_autodoc.py::test_optional_feature
+=================== 3 passed, 1 skipped in 2.41s ===============================
+>>>>> End Test Output
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt
new file mode 100644
index 0000000000..0a27e668e1
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt
@@ -0,0 +1,10 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+>>>>> Tests Timed Out
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt
new file mode 100644
index 0000000000..59dc10159f
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt
@@ -0,0 +1,16 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Checking patch tests/test_ext_autodoc.py...
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+FAILED tests/test_ext_autodoc.py::test_format_signature - AssertionError: signature mismatch
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+=================== 2 passed, 1 failed in 2.10s ================================
+>>>>> End Test Output
++ git checkout abc123 tests/test_ext_autodoc.py
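Review note: the fixtures above exercise a marker-based log parser. The sketch below is not the PR's `parse_eval_log`. It is a simplified stand-in, under stated assumptions, that reproduces the outcomes the fixtures are named for. The marker strings are copied from the fixture files. The function names, the status list, and the whole-log fallback rule are assumptions made for illustration.

```python
from __future__ import annotations

# Marker strings copied from the recorded fixtures.
APPLY_PATCH_FAIL = ">>>>> Patch Apply Failed"
TESTS_TIMEOUT = ">>>>> Tests Timed Out"
START_TEST_OUTPUT = ">>>>> Start Test Output"
END_TEST_OUTPUT = ">>>>> End Test Output"

# Assumption: only the bad codes that appear in the fixtures are listed here.
BAD_CODES = (APPLY_PATCH_FAIL, TESTS_TIMEOUT)
STATUSES = ("PASSED", "FAILED", "SKIPPED", "ERROR", "XFAIL")


def _status_lines(text: str) -> dict[str, str]:
    """Collect ``STATUS node_id`` lines; a later line for a node overwrites an earlier one."""
    status_map: dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in STATUSES:
            # Anything after the node id (e.g. " - reason") is dropped.
            status_map[parts[1]] = parts[0]
    return status_map


def sketch_parse_eval_log(log: str) -> tuple[dict[str, str], bool]:
    """Return ``(status_map, trusted)`` for one eval log (illustrative only)."""
    if any(code in log for code in BAD_CODES):
        return {}, False
    if START_TEST_OUTPUT not in log or END_TEST_OUTPUT not in log:
        return {}, False
    section = log.split(START_TEST_OUTPUT, 1)[1].split(END_TEST_OUTPUT, 1)[0]
    # Fall back to the whole log when the marked section has no status lines.
    return _status_lines(section) or _status_lines(log), True


if __name__ == "__main__":
    demo = "\n".join(
        [START_TEST_OUTPUT, "FAILED t.py::a - boom", "PASSED t.py::a", END_TEST_OUTPUT]
    )
    print(sketch_parse_eval_log(demo))
```

Returning an empty map whenever the log is untrusted means a grader never sees per-test results next to a failed run.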
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt
new file mode 100644
index 0000000000..5f1200be91
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt
@@ -0,0 +1,6 @@
+{"Time":"2026-06-23T00:00:00Z","Action":"run","Package":"github.com/acme/widget","Test":"TestAlpha"}
+{"Time":"2026-06-23T00:00:00Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestAlpha","Elapsed":0.01}
+{"Time":"2026-06-23T00:00:01Z","Action":"run","Package":"github.com/acme/widget","Test":"TestBeta"}
+{"Time":"2026-06-23T00:00:01Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestBeta","Elapsed":0.02}
+{"Time":"2026-06-23T00:00:02Z","Action":"run","Package":"github.com/acme/widget","Test":"TestGamma"}
+{"Time":"2026-06-23T00:00:02Z","Action":"fail","Package":"github.com/acme/widget","Test":"TestGamma","Elapsed":0.01}
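Review note: the `go_json.txt` fixture is a stream of `go test -json` events, one JSON object per line. As a rough sketch of how such a stream can be folded into a per-test status map, assuming only the `Action` and `Test` fields visible in the fixture, something like the following would do. The function name and the status labels are hypothetical and are not taken from the PR's parser.

```python
from __future__ import annotations

import json

# Assumption: only terminal actions matter; "run" and other events are ignored.
TERMINAL_ACTIONS = {"pass": "PASSED", "fail": "FAILED", "skip": "SKIPPED"}


def go_json_status_map(text: str) -> dict[str, str]:
    """Fold a ``go test -json`` event stream into ``{test_name: status}``."""
    status_map: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # tolerate non-JSON noise between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        test = event.get("Test")  # package-level events carry no "Test" field
        status = TERMINAL_ACTIONS.get(event.get("Action"))
        if test and status:
            status_map[test] = status
    return status_map


if __name__ == "__main__":
    sample = "\n".join(
        [
            '{"Action":"run","Package":"github.com/acme/widget","Test":"TestAlpha"}',
            '{"Action":"pass","Package":"github.com/acme/widget","Test":"TestAlpha"}',
            '{"Action":"fail","Package":"github.com/acme/widget","Test":"TestGamma"}',
        ]
    )
    print(go_json_status_map(sample))
```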
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml
new file mode 100644
index 0000000000..028b436db3
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml
@@ -0,0 +1,10 @@
+
+
+
+
+
+
+ boom
+
+
+
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt
new file mode 100644
index 0000000000..d566714983
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt
@@ -0,0 +1,15 @@
+============================= test session starts ==============================
+platform linux -- Python 3.12.0, pytest-8.0.0
+collected 3 items
+
+src/pkg/tests/test_widget.py::test_alpha PASSED [ 33%]
+src/pkg/tests/test_widget.py::test_beta PASSED [ 66%]
+src/pkg/tests/test_widget.py::test_gamma FAILED [100%]
+
+=================================== FAILURES ===================================
+________________________________ test_gamma ___________________________________
+ assert 1 == 2
+E assert 1 == 2
+=========================== short test summary info ============================
+FAILED src/pkg/tests/test_widget.py::test_gamma - assert 1 == 2
+========================= 2 passed, 1 failed in 0.12s ==========================
diff --git a/resources_servers/swe_bench/tests/test_app.py b/resources_servers/swe_bench/tests/test_app.py
new file mode 100644
index 0000000000..3e50958ab7
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_app.py
@@ -0,0 +1,95 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+from unittest.mock import MagicMock
+
+import pytest
+
+import resources_servers.swe_bench.tests.test_swe_env # noqa: F401 — registers fake-swe provider
+from nemo_gym.openai_utils import NeMoGymResponse, NeMoGymResponseCreateParamsNonStreaming
+from nemo_gym.server_utils import ServerClient
+from resources_servers.swe_bench.app import (
+ SweBenchResourcesServer,
+ SweBenchResourcesServerConfig,
+ SweBenchSeedSessionRequest,
+ SweBenchVerifyRequest,
+)
+
+
+@pytest.fixture
+def server() -> SweBenchResourcesServer:
+ return SweBenchResourcesServer(
+ config=SweBenchResourcesServerConfig(
+ host="127.0.0.1",
+ port=12346,
+ entrypoint="app.py",
+ name="swe_bench",
+ sandbox_provider={"fake-swe": {}},
+ ),
+ server_client=MagicMock(spec=ServerClient),
+ )
+
+
+def _sample_row() -> dict:
+ inst = {
+ "instance_id": "astropy__astropy-12907",
+ "base_commit": "abc123",
+ "test_patch": "",
+ "FAIL_TO_PASS": '["tests/test_x.py::a"]',
+ "PASS_TO_PASS": '["tests/test_x.py::b"]',
+ }
+ meta = {
+ "instance_id": "astropy__astropy-12907",
+ "dataset_name": "princeton-nlp/SWE-bench_Verified",
+ "split": "test",
+ "problem_statement": "Fix the bug.",
+ "instance_dict": json.dumps(inst),
+ }
+ return {
+ "responses_create_params": NeMoGymResponseCreateParamsNonStreaming(
+ input=[{"role": "user", "content": "Fix the bug."}],
+ metadata=meta,
+ ),
+ "verifier_metadata": meta,
+ }
+
+
+@pytest.mark.asyncio
+async def test_seed_session_agent_in_env(server: SweBenchResourcesServer) -> None:
+ body = SweBenchSeedSessionRequest(**_sample_row())
+ resp = await server.seed_session(body)
+ assert resp.environment == "swe_bench"
+ assert resp.placement.topology == "agent_in_env"
+ assert resp.sandbox.spec["image"].startswith("swebench/")
+ assert resp.task.task_id == "astropy__astropy-12907"
+ assert resp.task.harness_family == "swe-bench"
+ assert resp.task.dataset_name == "princeton-nlp/SWE-bench_Verified"
+ assert resp.verifier_metadata["instance_id"] == "astropy__astropy-12907"
+
+
+@pytest.mark.asyncio
+async def test_verify_empty_patch(server: SweBenchResourcesServer) -> None:
+ row = _sample_row()
+ row["verifier_metadata"] = {**row["verifier_metadata"], "model_patch": ""}
+ body = SweBenchVerifyRequest(
+ **row,
+ response=NeMoGymResponse(
+ id="r1",
+ created_at=0,
+ model="m",
+ object="response",
+ output=[],
+ parallel_tool_calls=False,
+ tool_choice="auto",
+ tools=[],
+ ),
+ )
+ resp = await server.verify(body)
+ assert resp.task_id == "astropy__astropy-12907"
+ assert resp.environment == "swe_bench"
+ assert resp.reward == 0.0
+ assert resp.patch_exists is False
+ assert resp.resolved is False
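Review note: the flat-eval tests in the next file describe a resolution rule in their docstrings: a required test fails only when it is absent or reported FAILED/ERROR, a SKIPPED required test is neutral, and XFAIL counts as a pass. A minimal sketch of that rule follows. The names `required_test_failed` and `sketch_resolved` are hypothetical, and the PR's `compute_resolved` / `flat_grade` may differ in details such as how an empty required list is treated.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping

# Statuses that make a required test count as a failure. Every other status
# (PASSED, XFAIL, SKIPPED) is either a pass or neutral.
FAIL_STATUSES = frozenset({"FAILED", "ERROR"})


def required_test_failed(node_id: str, status_map: Mapping[str, str]) -> bool:
    """A required test fails when it is absent or reported FAILED/ERROR."""
    return node_id not in status_map or status_map[node_id] in FAIL_STATUSES


def sketch_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    status_map: Mapping[str, str],
) -> bool:
    """Resolved when no required test (F2P or P2P) counts as a failure."""
    required = [*fail_to_pass, *pass_to_pass]
    return not any(required_test_failed(node, status_map) for node in required)


if __name__ == "__main__":
    statuses = {"t.py::fix": "PASSED", "t.py::keep": "SKIPPED"}
    print(sketch_resolved(["t.py::fix"], ["t.py::keep"], statuses))
```

Treating an absent test as a failure is what makes an empty status map (for example after a failed patch apply) grade as unresolved without a separate patch-applied gate.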
diff --git a/resources_servers/swe_bench/tests/test_flat_eval.py b/resources_servers/swe_bench/tests/test_flat_eval.py
new file mode 100644
index 0000000000..35a00fc580
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_flat_eval.py
@@ -0,0 +1,594 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the opt-in flat (host-graded) eval mode of the nested families.
+
+The suite has two layers:
+
+* Parser unit tests on recorded fixture logs cover the SWE-bench eval-script log
+ parser (``parse_eval_log``) on a success log, a failure log, the bad-code logs
+ (patch-apply-failed / timeout), a no-markers log, and the
+ output-outside-markers fallback. The fixtures use the
+ ``>>>>> Start/End Test Output`` shape the SWE-bench eval script emits.
+
+* Flat run_eval and grade via FakeSandbox drive the flat path of both nested
+ harnesses (``swe-bench``, ``r2e-gym``) end-to-end with a scripted provider that
+ returns a fixture log, asserting ``resolved`` is computed from ``FAIL_TO_PASS``
+ / ``PASS_TO_PASS``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from pathlib import Path
+
+import pytest
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses import flat_eval
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+_FIXTURES = Path(__file__).parent / "fixtures" / "flat_eval"
+
+
+def _fixture(name: str) -> str:
+ """Read a recorded fixture log by name.
+
+    Fixtures are stored with a ``.txt`` suffix; a ``.log`` name falls back to the
+    ``.txt`` file when no ``.log`` file exists, so callers may pass either name.
+
+ Args:
+ name: The fixture file name, with either a ``.log`` or ``.txt`` suffix.
+
+ Returns:
+ The fixture file contents as text.
+ """
+ path = _FIXTURES / name
+ if not path.exists() and path.suffix == ".log":
+ path = path.with_suffix(".txt")
+ return path.read_text()
+
+
+# ---- parser: recorded fixture logs (CI) -------------------------------------
+
+
+def test_parse_success_log_all_pass():
+ """A success log parses to a status map with the expected passed and skipped tests."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("resolved_success.log"))
+ assert applied is True
+ assert status_map == {
+ "tests/test_ext_autodoc.py::test_format_signature": "PASSED",
+ "tests/test_ext_autodoc.py::test_autodoc_inherited": "PASSED",
+ "tests/test_ext_autodoc.py::test_autodoc_exclude_members": "PASSED",
+ "tests/test_ext_autodoc.py::test_optional_feature": "SKIPPED",
+ }
+ assert sorted(flat_eval.passed_tests(status_map)) == [
+ "tests/test_ext_autodoc.py::test_autodoc_exclude_members",
+ "tests/test_ext_autodoc.py::test_autodoc_inherited",
+ "tests/test_ext_autodoc.py::test_format_signature",
+ ]
+
+
+def test_parse_failure_log_strips_failed_reason():
+ """A failure log parses with the failure reason stripped down to the node id."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("unresolved_failure.log"))
+ assert applied is True
+    # The " - reason" suffix of a FAILED line is dropped; only the node id is kept.
+ assert status_map["tests/test_ext_autodoc.py::test_format_signature"] == "FAILED"
+ assert "tests/test_ext_autodoc.py::test_autodoc_inherited" in flat_eval.passed_tests(status_map)
+
+
+def test_parse_apply_patch_failed_is_untrusted():
+ """A patch-apply-failed log yields an empty status map and patch_applied False."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("apply_patch_failed.log"))
+ assert status_map == {}
+ assert applied is False
+
+
+def test_parse_timeout_is_untrusted():
+ """A timeout log yields an empty status map and patch_applied False."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("tests_timeout.log"))
+ assert status_map == {}
+ assert applied is False
+
+
+def test_parse_no_markers_is_untrusted():
+ """A log with no test-output markers yields an empty status map and patch_applied False."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("no_markers.log"))
+ assert status_map == {}
+ assert applied is False
+
+
+def test_parse_fallback_outside_markers():
+ """Per-test lines appearing after the End marker are recovered by the whole-log fallback."""
+ status_map, applied = flat_eval.parse_eval_log(_fixture("fallback_outside_markers.log"))
+ assert applied is True
+ assert len(flat_eval.passed_tests(status_map)) == 3
+
+
+def test_parse_duplicate_node_last_status_wins():
+ """For a duplicated node id the last reported status wins.
+
+ A node first reported FAILED then re-reported PASSED (e.g. via a rerun plugin)
+ ends up PASSED, and vice versa.
+ """
+ log = "\n".join(
+ [
+ flat_eval.APPLY_PATCH_PASS,
+ flat_eval.START_TEST_OUTPUT,
+ "FAILED tests/test_x.py::test_flaky",
+ "PASSED tests/test_x.py::test_flaky",
+ "PASSED tests/test_x.py::test_regressed",
+ "FAILED tests/test_x.py::test_regressed",
+ flat_eval.END_TEST_OUTPUT,
+ ]
+ )
+ status_map, applied = flat_eval.parse_eval_log(log)
+ assert applied is True
+ # Last line wins for each node, not the first.
+ assert status_map["tests/test_x.py::test_flaky"] == "PASSED"
+ assert status_map["tests/test_x.py::test_regressed"] == "FAILED"
+ assert flat_eval.passed_tests(status_map) == ["tests/test_x.py::test_flaky"]
+
+
+def test_parse_xfail_counts_as_pass():
+ """An XFAIL node counts as a passed test."""
+ log = "\n".join(
+ [
+ flat_eval.APPLY_PATCH_PASS,
+ flat_eval.START_TEST_OUTPUT,
+ "XFAIL tests/test_x.py::test_known_bug",
+ "PASSED tests/test_x.py::test_ok",
+ flat_eval.END_TEST_OUTPUT,
+ ]
+ )
+ status_map, applied = flat_eval.parse_eval_log(log)
+ assert applied is True
+ assert set(flat_eval.passed_tests(status_map)) == {
+ "tests/test_x.py::test_known_bug",
+ "tests/test_x.py::test_ok",
+ }
+
+
+# ---- flat_grade over parsed fixtures (CI) -----------------------------------
+
+
+def _task(benchmark: str = "swe-bench", **overrides) -> SweTask:
+ """Build a SweTask with sensible defaults, overridable per keyword.
+
+ Args:
+ benchmark: The benchmark name for the task.
+ **overrides: Field overrides merged onto the default task fields.
+
+ Returns:
+ A SweTask configured for the given benchmark.
+ """
+ base = dict(
+ instance_id="repo__inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ model_patch="diff --git a/x b/x\n",
+ fail_to_pass=["tests/test_ext_autodoc.py::test_format_signature"],
+ pass_to_pass=["tests/test_ext_autodoc.py::test_autodoc_inherited"],
+ benchmark=benchmark,
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+def _flat_artifacts(log: str) -> EvalArtifacts:
+ """Wrap an eval log in flat-eval EvalArtifacts.
+
+ Args:
+ log: The eval-script log text.
+
+ Returns:
+ EvalArtifacts carrying the log with a clean (non-error) flat raw payload.
+ """
+ return EvalArtifacts(test_output=log, return_code=0, patch_applied=True, raw={"error_type": None, "flat": True})
+
+
+def test_flat_grade_resolved_on_success():
+ """Flat grading resolves a success log with reward 1.0."""
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("resolved_success.log")))
+ assert report.resolved is True
+ assert report.patch_applied is True
+ assert report.patch_exists is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_unresolved_on_failure():
+ """Flat grading leaves a failure log unresolved with reward 0.0."""
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("unresolved_failure.log")))
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_unresolved_on_apply_failed():
+ """A failed patch apply grades as a legitimate unresolved, not an infra mask."""
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("apply_patch_failed.log")))
+ assert report.resolved is False
+ assert report.patch_applied is False
+ assert report.error_kind is None
+ assert reward_from_report(report) == 0.0
+
+
+# ---- consistency of flat grading --------------------------------------------
+#
+# Flat grading takes ``resolved`` straight from the parser's verdict (all F2P +
+# all P2P passed) and never re-gates it on ``patch_applied``. The parser's
+# ``log_patch_applied`` flag never changes ``resolved`` relative to a pure
+# ``compute_resolved`` verdict: whenever ``parse_eval_log`` reports
+# ``patch_applied=False`` it also returns an empty status map, so
+# ``compute_resolved`` already yields False. These tests lock in that invariant
+# so a future edit cannot reintroduce a divergent gate.
+
+
+@pytest.mark.parametrize(
+ "fixture_name",
+ [
+ "resolved_success.log",
+ "unresolved_failure.log",
+ "apply_patch_failed.log",
+ "tests_timeout.log",
+ "no_markers.log",
+ "fallback_outside_markers.log",
+ ],
+)
+def test_flat_grade_resolved_matches_ungated_compute_resolved(fixture_name):
+ """``flat_grade``'s resolved verdict agrees with a bare ``compute_resolved`` over the parsed passed-set.
+
+ The patch-applied gate is redundant and never flips the verdict True<->False.
+
+ Args:
+ fixture_name: The recorded fixture log to parse and grade.
+ """
+ from resources_servers.swe_bench.harness import compute_resolved
+
+ task = _task()
+ log = _fixture(fixture_name)
+ status_map, _applied = flat_eval.parse_eval_log(log)
+ ungated = compute_resolved(
+ fail_to_pass=task.fail_to_pass,
+ pass_to_pass=task.pass_to_pass,
+ passed=flat_eval.passed_tests(status_map),
+ )
+ report = flat_eval.flat_grade(task, _flat_artifacts(log))
+ assert report.resolved is ungated
+
+
+@pytest.mark.parametrize(
+ "bad_code_attr",
+ ["APPLY_PATCH_FAIL", "RESET_FAILED", "TESTS_ERROR", "TESTS_TIMEOUT"],
+)
+def test_parse_eval_log_bad_code_empties_status_map_even_with_status_lines(bad_code_attr):
+ """A bad code forces an empty status map and patch_applied False even with per-test status lines.
+
+ This is what makes the flat_grade patch-applied gate redundant: no path yields
+ patch_applied=False together with a non-empty status map.
+
+ Args:
+ bad_code_attr: Name of the bad-code marker attribute on ``flat_eval``.
+ """
+ bad_code = getattr(flat_eval, bad_code_attr)
+ log = "\n".join(
+ [
+ bad_code,
+ flat_eval.START_TEST_OUTPUT,
+ "PASSED tests/test_ext_autodoc.py::test_format_signature",
+ "PASSED tests/test_ext_autodoc.py::test_autodoc_inherited",
+ flat_eval.END_TEST_OUTPUT,
+ ]
+ )
+ status_map, applied = flat_eval.parse_eval_log(log)
+ assert applied is False
+ assert status_map == {}
+ # And it grades as a legitimate unresolved (not an infra mask): error_kind
+ # stays None, resolved False -> reward 0.0, matching the flat families.
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+ assert report.resolved is False
+ assert report.error_kind is None
+ assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_resolved_does_not_gate_on_artifact_patch_applied():
+ """Flat ``resolved`` is the parser's verdict only and ignores the artifact's patch_applied flag.
+
+ Even if the EvalArtifacts carries patch_applied False (e.g. the model patch
+ did not cleanly apply), a passing eval log still resolves, since grading is
+ based on the tests rather than the apply status.
+ """
+ artifacts = EvalArtifacts(
+ test_output=_fixture("resolved_success.log"),
+ return_code=0,
+ patch_applied=False,
+ raw={"error_type": None, "flat": True},
+ )
+ report = flat_eval.flat_grade(_task(), artifacts)
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_neutral_skipped_required_test_is_not_a_failure():
+ """A required test reported SKIPPED is neutral (excluded), not a failure.
+
+ This mirrors swebench's ``get_eval_tests_report`` + ``get_resolution_status``: a
+ required test counts as a failure only when absent or FAILED/ERROR. A neutral
+ status (SKIPPED/XPASS) is excluded from both the success and failure tallies, so a
+ run whose only "non-pass" required test is SKIPPED still resolves. A bare
+ ``passed``-set membership check (the prior behavior) would have treated the
+ SKIPPED test as a failure and wrongly graded it unresolved.
+ """
+ log = "\n".join(
+ [
+ flat_eval.APPLY_PATCH_PASS,
+ flat_eval.START_TEST_OUTPUT,
+ "PASSED tests/test_ext_autodoc.py::test_format_signature",
+ "SKIPPED tests/test_ext_autodoc.py::test_autodoc_inherited",
+ flat_eval.END_TEST_OUTPUT,
+ ]
+ )
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+ # F2P passed; the SKIPPED P2P is neutral (excluded) -> zero failures -> resolved.
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_absent_required_test_is_a_failure():
+ """A required test absent from the status map is a failure (not neutral).
+
+ Per swebench's ``test_failed`` (``case not in sm``), an absent required test counts
+ as a failure, so the run must grade unresolved.
+ """
+ log = "\n".join(
+ [
+ flat_eval.APPLY_PATCH_PASS,
+ flat_eval.START_TEST_OUTPUT,
+ "PASSED tests/test_ext_autodoc.py::test_format_signature",
+ flat_eval.END_TEST_OUTPUT,
+ ]
+ )
+ # P2P (test_autodoc_inherited) is absent from the log -> failure -> unresolved.
+ report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_masks_infra_error():
+ """Flat grading masks an infra timeout to reward 0.0 with a timeout error kind."""
+ artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout", "flat": True})
+ report = flat_eval.flat_grade(_task(), artifacts)
+ assert report.error_kind == "timeout"
+ assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_unbuildable_eval_script_is_unmasked_unresolved():
+ """An unbuildable / missing eval script grades UNMASKED unresolved (reward 0), not eval_error.
+
+ Per main, only genuine sandbox/timeout infra failures are masked; an empty/unbuildable eval
+ spec produces no test markers and so grades as a legitimate unresolved (``error_kind`` None).
+ """
+ artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "eval_error", "flat": True})
+ report = flat_eval.flat_grade(_task(), artifacts)
+ assert report.error_kind is None
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+# ---- gating (CI) ------------------------------------------------------------
+
+
+def test_flat_eval_enabled_harness_flag():
+ """The harness-level flat-eval flag enables flat eval."""
+ assert flat_eval.flat_eval_enabled(True, _task()) is True
+
+
+def test_flat_eval_enabled_task_metadata():
+ """Per-task ``flat_eval`` metadata enables flat eval."""
+ assert flat_eval.flat_eval_enabled(False, _task(metadata={"flat_eval": True})) is True
+
+
+def test_flat_eval_disabled_by_default():
+ """Flat eval is disabled when neither the harness flag nor task metadata enables it."""
+ assert flat_eval.flat_eval_enabled(False, _task()) is False
+
+
+def test_swebench_supports_provider_gating():
+ """The swe-bench harness is host-graded (flat), so it runs on any exec-capable provider."""
+ harness = SweBenchHarness("swe-bench")
+ assert harness.supports_provider("docker") is True
+ assert harness.supports_provider("apptainer") is True
+ assert harness.supports_provider("opensandbox") is True
+ assert harness.grade_strategy == "flat-host-grade"
+
+
+def test_r2egym_supports_provider_gating():
+ """The r2e-gym harness is host-graded (flat), so it runs on any exec-capable provider."""
+ harness = R2EGymHarness()
+ assert harness.supports_provider("docker") is True
+ assert harness.supports_provider("apptainer") is True
+ assert harness.supports_provider("opensandbox") is True
+ assert harness.grade_strategy == "flat-host-grade"
+
+
+# ---- flat run_eval end-to-end via FakeSandbox (CI) --------------------------
+
+
+class _FakeFlatProvider:
+ """Scripted provider: ``bash eval.sh ...`` streams a fixture log; ``cat`` echoes it."""
+
+ name = "fake-flat-eval"
+
+ def __init__(self, *, log_text="", run_rc=0, error_type=None, stream_empty=False, **_):
+ """Configure the scripted flat-eval provider's responses.
+
+ Args:
+ log_text: The eval-script log text returned by the run and ``cat``.
+ run_rc: Return code returned for the eval-script run.
+ error_type: Optional error type attached to the run result.
+ stream_empty: When True, the eval-script run streams empty stdout so
+ the harness falls back to reading the tee'd log file.
+ **_: Ignored extra keyword arguments.
+ """
+ self._log_text = log_text
+ self._run_rc = run_rc
+ self._error_type = error_type
+ self._stream_empty = stream_empty
+ self.commands: list[str] = []
+ self.uploaded: dict[str, str] = {}
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ self.commands.append(command)
+ if command.startswith("cat "):
+ return SandboxExecResult(stdout=self._log_text, stderr="", return_code=0)
+ # The eval script run.
+ stdout = "" if self._stream_empty else self._log_text
+ return SandboxExecResult(stdout=stdout, stderr="", return_code=self._run_rc, error_type=self._error_type)
+
+ async def upload_file(self, handle, local_path, remote_path):
+ try:
+ with open(local_path, encoding="utf-8") as fh:
+ self.uploaded[remote_path] = fh.read()
+ except OSError:
+ self.uploaded[remote_path] = ""
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-flat-eval", _FakeFlatProvider, override=True)
+
+
+def _drive_flat(harness, task, *, log_text, run_rc=0, error_type=None, stream_empty=False):
+ """Drive materialize -> run_eval -> grade for a flat harness via the scripted provider.
+
+ Args:
+ harness: The flat-capable harness under test.
+ task: The SweTask to evaluate.
+ log_text: The eval-script log text the provider returns.
+ run_rc: Return code returned for the eval-script run.
+ error_type: Optional error type attached to the run result.
+ stream_empty: When True, the run streams empty stdout so the harness falls
+ back to reading the tee'd log file.
+
+ Returns:
+ A tuple of the graded report, the EvalArtifacts, and the provider instance.
+ """
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def _go():
+ provider = {
+ "fake-flat-eval": {
+ "log_text": log_text,
+ "run_rc": run_rc,
+ "error_type": error_type,
+ "stream_empty": stream_empty,
+ }
+ }
+ env = await AsyncSweEnvironment.start(provider, harness.build_spec(task))
+ try:
+ await harness.materialize(env, task)
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts), artifacts, env.sandbox._provider
+ finally:
+ await env.cleanup()
+
+ return asyncio.run(_go())
+
+
+def test_swebench_flat_run_eval_resolved():
+ """The swe-bench flat path resolves a success run and uploads the eval script."""
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"eval_script": "echo running", "flat_eval": True})
+ report, artifacts, provider = _drive_flat(harness, task, log_text=_fixture("resolved_success.log"))
+ assert artifacts.raw["flat"] is True
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+ # The eval script was uploaded into the sandbox.
+ assert provider.uploaded.get(flat_eval.EVAL_SCRIPT_PATH, "").startswith("echo running")
+
+
+def test_swebench_flat_run_eval_unresolved():
+ """The swe-bench flat path leaves a failure run unresolved."""
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"eval_script": "echo running"})
+ report, _artifacts, _ = _drive_flat(harness, task, log_text=_fixture("unresolved_failure.log"))
+ assert report.resolved is False
+
+
+def test_swebench_flat_run_eval_stream_empty_uses_log_file():
+ """When streamed output is empty, run_eval reads back the tee'd log file."""
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"eval_script": "echo running"})
+ report, _artifacts, provider = _drive_flat(
+ harness, task, log_text=_fixture("resolved_success.log"), stream_empty=True
+ )
+ assert any(cmd.startswith("cat ") for cmd in provider.commands)
+ assert report.resolved is True
+
+
+def test_swebench_flat_run_eval_masks_sandbox_error():
+ """The swe-bench flat path masks a sandbox error reported by the run."""
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"eval_script": "echo running"})
+ report, artifacts, _ = _drive_flat(harness, task, log_text="", run_rc=1, error_type="sandbox")
+ assert artifacts.raw["error_type"] == "sandbox"
+ assert report.error_kind == "sandbox"
+
+
+def test_swebench_flat_run_eval_missing_script_is_unmasked_unresolved():
+ """A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error.
+
+ ``flat_run_eval`` still tags the artifact ``error_type == "eval_error"`` (so callers can log
+ it), but grading no longer masks on it: per main only genuine sandbox/timeout infra failures
+ are masked, and an empty spec simply produces no test markers and grades unresolved.
+ """
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={}) # no eval_script
+ report, artifacts, _ = _drive_flat(harness, task, log_text="")
+ assert artifacts.raw["error_type"] == "eval_error"
+ assert report.error_kind is None
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_r2egym_flat_run_eval_resolved_via_task_metadata():
+ """Per-task ``flat_eval`` metadata drives the r2e-gym flat path to a resolved run."""
+ harness = R2EGymHarness()
+ task = _task(benchmark="r2e-gym", instance_id="r2e__pkg-1", metadata={"eval_script": "echo run", "flat_eval": True})
+ report, artifacts, _ = _drive_flat(harness, task, log_text=_fixture("resolved_success.log"))
+ assert artifacts.raw["flat"] is True
+ assert report.resolved is True
diff --git a/resources_servers/swe_bench/tests/test_lifecycle.py b/resources_servers/swe_bench/tests/test_lifecycle.py
new file mode 100644
index 0000000000..dbde539ada
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_lifecycle.py
@@ -0,0 +1,164 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Sandbox lifecycle (``acquire_sandbox``) and ``verify_task`` happy/timeout/empty paths.
+
+These tests cover always-teardown on context exit and the fresh-sandbox verify
+sequence: the resolved case, the empty-patch fast path, and the eval-timeout case.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+import pytest
+
+import resources_servers.swe_bench.harnesses # noqa: F401 (register harnesses)
+from nemo_gym.sandbox import SandboxExecResult, SandboxHandle, SandboxStatus
+from resources_servers.swe_bench.harness import SweTask
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+from resources_servers.swe_bench.verify_task import verify_task
+
+
+class _CountingProvider:
+ """Provider instance passed directly so the test can count create/close/exec.
+
+ Args:
+ exec_sleep: Seconds to sleep inside each ``exec`` call, used to simulate a
+ slow evaluation that triggers the eval timeout.
+ test_output: Stdout returned for pytest commands. The trailing-status
+ pytest format is the shape the test parser recognizes, and the ``.py``
+ path normalizes to the F2P id in ``_task``.
+ """
+
+ name = "fake-life"
+
+ def __init__(self, *, exec_sleep=0.0, test_output="tests/test_x.py::a PASSED\n"):
+ self.create_count = 0
+ self.close_count = 0
+ self._exec_sleep = exec_sleep
+ self._test_output = test_output
+
+ async def create(self, spec):
+ self.create_count += 1
+ return SandboxHandle(
+ sandbox_id=f"sb-{self.create_count}", provider_name=self.name, raw={"workdir": spec.workdir}
+ )
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ if self._exec_sleep:
+ await asyncio.sleep(self._exec_sleep)
+ if "pytest" in command:
+ return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, *a, **k):
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ self.close_count += 1
+
+ async def aclose(self):
+ return None
+
+
+def _task(**kw) -> SweTask:
+ """Build a SweTask with sensible defaults, overridable per keyword.
+
+ Args:
+ **kw: Field overrides merged onto the default task fields.
+
+ Returns:
+ A SweTask configured for the swe-bench-ext benchmark.
+ """
+ base = dict(
+ instance_id="inst-1",
+ image="img:tag",
+ base_commit="HEAD",
+ test_command="python -m pytest -rA -q",
+ model_patch="diff --git a/x b/x\n",
+ test_framework="pytest",
+ fail_to_pass=["tests/test_x.py::a"],
+ benchmark="swe-bench-ext",
+ )
+ base.update(kw)
+ return SweTask(**base)
+
+
+# ---- acquire_sandbox: starts an env, ALWAYS stops it ------------------------
+
+
+def test_acquire_sandbox_starts_and_cleans_up():
+ """``acquire_sandbox`` creates one sandbox and tears it down on normal exit."""
+ provider = _CountingProvider()
+
+ async def run():
+ spec = SweBenchExtHarness().build_spec(_task())
+ async with acquire_sandbox(provider, spec, instance_id="inst-1") as env:
+ assert env.sandbox_id is not None
+ return provider.create_count, provider.close_count
+
+ created, closed = asyncio.run(run())
+ assert created == 1
+ assert closed == 1 # torn down on normal exit
+
+
+def test_acquire_sandbox_cleans_up_on_exception():
+ """``acquire_sandbox`` tears down the sandbox even when the body raises."""
+ provider = _CountingProvider()
+
+ async def run():
+ spec = SweBenchExtHarness().build_spec(_task())
+ with pytest.raises(RuntimeError):
+ async with acquire_sandbox(provider, spec) as env:
+ assert env.sandbox_id is not None
+ raise RuntimeError("boom")
+
+ asyncio.run(run())
+ assert provider.close_count == 1 # torn down even on exception
+
+
+# ---- verify_task: resolved / empty-patch fast path / eval-timeout mask -------
+
+
+def test_verify_task_resolved_in_fresh_sandbox():
+ """``verify_task`` resolves a passing task in a freshly created sandbox."""
+ provider = _CountingProvider()
+ report = asyncio.run(verify_task(provider, _task()))
+ assert report.resolved is True
+ assert provider.create_count == 1
+ assert provider.close_count == 1
+
+
+def test_verify_task_empty_patch_fast_path_no_create():
+ """An empty model patch short-circuits to unresolved without creating a sandbox."""
+ provider = _CountingProvider()
+ report = asyncio.run(verify_task(provider, _task(model_patch="")))
+ assert report.patch_exists is False
+ assert report.resolved is False
+ assert provider.create_count == 0 # no sandbox spun up for an empty patch
+
+
+def test_verify_task_eval_timeout_masks():
+ """An evaluation that exceeds the eval timeout is masked as an eval_timeout error."""
+ provider = _CountingProvider(exec_sleep=0.5)
+ report = asyncio.run(verify_task(provider, _task(), eval_timeout_s=0.05))
+ assert report.error_kind == "eval_timeout"
diff --git a/resources_servers/swe_bench/tests/test_model_endpoint.py b/resources_servers/swe_bench/tests/test_model_endpoint.py
new file mode 100644
index 0000000000..c55232d863
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_model_endpoint.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the model-server egress primitive that resolves a model endpoint per provider."""
+
+from __future__ import annotations
+
+import pytest
+
+from resources_servers.swe_bench.self_drive import ModelEgressUnavailable, ModelEndpoint, resolve
+
+
+def test_apptainer_uses_host_loopback_by_default():
+ """Apptainer resolves to the host loopback base URL when none is configured."""
+ ep = resolve("apptainer", {"model": "qwen"})
+ assert ep.base_url == "http://127.0.0.1:8000/v1"
+ assert ep.model == "qwen"
+
+
+def test_docker_uses_configured_base_when_present():
+ """Docker uses the explicitly configured base URL."""
+ ep = resolve("docker", {"base_url": "http://10.0.0.5:8000/v1"})
+ assert ep.base_url == "http://10.0.0.5:8000/v1"
+
+
+def test_opensandbox_requires_service_url():
+ """Opensandbox raises when only a host-loopback base URL is given and no service URL is supplied."""
+ with pytest.raises(ModelEgressUnavailable):
+ resolve("opensandbox", {"base_url": "http://127.0.0.1:8000/v1"})
+
+
+def test_opensandbox_with_service_url_ok():
+ """Opensandbox resolves to the provided service URL."""
+ ep = resolve("opensandbox", {"model": "m"}, opensandbox_service_url="http://gym-model.svc.cluster.local/v1")
+ assert ep.base_url == "http://gym-model.svc.cluster.local/v1"
+
+
+def test_to_sandbox_env_is_minimal():
+ """The sandbox env carries only the base URL, API key, and model name."""
+ ak_value = "abc-test"
+ env = ModelEndpoint(base_url="http://h/v1", api_key=ak_value, model="m").to_sandbox_env()
+ assert env["OPENAI_BASE_URL"] == "http://h/v1"
+ assert env["OPENAI_API_KEY"] == ak_value
+ assert env["NEMO_GYM_MODEL"] == "m"
+ # never leaks a full global-config dict
+ assert "NEMO_GYM_CONFIG_DICT" not in env
diff --git a/resources_servers/swe_bench/tests/test_nv_internal.py b/resources_servers/swe_bench/tests/test_nv_internal.py
new file mode 100644
index 0000000000..3b1f049f2d
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_nv_internal.py
@@ -0,0 +1,547 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the nv-internal-1 harness, driven by a FakeSandbox provider.
+
+nv-internal-1 is flat + host-graded, so it runs on any exec-capable provider.
+The scripted provider returns the parsing_script ``output.json`` report on the
+``cat /root/output.json`` hop; grading is a pure host-side parse.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.nv_internal import (
+ NV_DEFAULT_WORKDIR,
+ NVInternalHarness,
+ _coerce_test_list,
+ _format_test_files,
+ _nv_workdir,
+ _parse_dockerfile_env,
+ _resolve_required_tests,
+ parse_passed_tests,
+)
+from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class _FakeProvider:
+ """Scripted provider: ``cat /root/output.json`` returns a canned report."""
+
+ name = "fake-nv"
+
+ def __init__(self, *, report="", apply_rc=0, **_):
+ """Configure the scripted provider's responses.
+
+ Args:
+ report: JSON report stdout returned for ``cat /root/output.json``.
+ apply_rc: Return code returned for ``git apply`` commands.
+ **_: Ignored extra keyword arguments.
+ """
+ self._report = report
+ self._apply_rc = apply_rc
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ if "cat /root/output.json" in command:
+ return SandboxExecResult(stdout=self._report, stderr="", return_code=0)
+ if "git apply" in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, *a, **k):
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-nv", _FakeProvider, override=True)
+
+
+class _RecordingProvider:
+ """Provider that records exec ``cwd`` per command and captures uploads.
+
+ Uploads are captured as ``{target_path: content}`` by reading the temp file
+ that ``write_text`` hands to ``upload_file``; execs are captured as a list of
+ ``(command, cwd)`` so tests can assert which directory each hop ran in.
+ """
+
+ name = "fake-nv-rec"
+
+ def __init__(self, *, report="", **_):
+ """Configure the recording provider's canned report.
+
+ Args:
+ report: JSON report stdout returned for ``cat /root/output.json``.
+ **_: Ignored extra keyword arguments.
+ """
+ self._report = report
+ self.execs: list[tuple[str, str | None]] = []
+ self.uploads: dict[str, str] = {}
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ self.execs.append((command, cwd))
+ if "cat /root/output.json" in command:
+ return SandboxExecResult(stdout=self._report, stderr="", return_code=0)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, handle, source_path, target_path):
+ with open(source_path, encoding="utf-8") as fh:
+ self.uploads[target_path] = fh.read()
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-nv-rec", _RecordingProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+ """Build an nv-internal-1 SweTask with sensible defaults, overridable per keyword.
+
+ Args:
+ **overrides: Field overrides merged onto the default task fields.
+
+ Returns:
+ A SweTask configured for the nv-internal-1 benchmark.
+ """
+ base = dict(
+ instance_id="nv-inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/app",
+ model_patch="diff --git a/x b/x\n",
+ fail_to_pass=["pkg/test_x.py::a"],
+ pass_to_pass=["pkg/test_x.py::b"],
+ benchmark="nv-internal-1",
+ metadata={
+ "run_script": "echo run\n",
+ "parsing_script": "import sys\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+def _report(*passed, failed=()):
+ """Build a JSON test report with the given passed and failed test names.
+
+ Args:
+ *passed: Names of tests reported as PASSED.
+ failed: Names of tests reported as FAILED.
+
+ Returns:
+ The report serialized as a JSON string under a ``tests`` key.
+ """
+ tests = [{"name": name, "status": "PASSED"} for name in passed]
+ tests += [{"name": name, "status": "FAILED"} for name in failed]
+ return json.dumps({"tests": tests})
+
+
+async def _run(provider_cfg, task) -> SweEvalReport:
+ """Drive reset -> materialize -> run_eval -> grade against a scripted provider.
+
+ Args:
+ provider_cfg: Provider configuration mapping for the ``fake-nv`` provider.
+ task: The SweTask to evaluate.
+
+ Returns:
+ The graded SweEvalReport for the run.
+ """
+ harness = NVInternalHarness()
+ env = await AsyncSweEnvironment.start({"fake-nv": provider_cfg}, harness.build_spec(task))
+ try:
+ await harness.reset_repo(env, task)
+ await harness.materialize(env, task)
+ artifacts = await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+ return harness.grade(task, artifacts)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_parse_passed_tests():
+ """``parse_passed_tests`` returns only PASSED names and ignores malformed entries."""
+ report = {"tests": [{"name": "a", "status": "PASSED"}, {"name": "b", "status": "FAILED"}]}
+ assert parse_passed_tests(report) == ["a"]
+ assert parse_passed_tests({}) == []
+ # Malformed entries are ignored, not crashed on.
+ assert parse_passed_tests({"tests": ["junk", {"status": "PASSED"}]}) == []
+
+
+def test_format_test_files():
+ """``_format_test_files`` joins list/JSON/CSV inputs into a comma-separated string."""
+ assert _format_test_files(["a", "b"]) == "a,b"
+ assert _format_test_files('["a", "b"]') == "a,b"
+ assert _format_test_files("a,b") == "a,b"
+ assert _format_test_files(None) == ""
+
+
+def test_format_test_files_single_quoted_list():
+ """``_format_test_files`` parses repr-style single-quoted lists.
+
+ Single-quoted lists are not valid JSON, so they are parsed with
+ ``ast.literal_eval``; unparseable bracketed text falls back to the raw string.
+ """
+ assert _format_test_files("['pkg/test_x.py', 'pkg/test_y.py']") == "pkg/test_x.py,pkg/test_y.py"
+ # A single-element single-quoted list.
+ assert _format_test_files("['only.py']") == "only.py"
+ # Unparseable bracketed text falls back to the raw string, not a crash.
+ assert _format_test_files("[not a list") == "[not a list"
+
+
+def test_build_spec():
+ """The nv-internal-1 harness builds a sandbox spec from a task."""
+ harness = NVInternalHarness()
+ assert harness.name == "nv-internal-1"
+ assert harness.grade_strategy == "flat-host-grade"
+ spec = harness.build_spec(_task())
+ assert spec.image == "img:tag"
+ assert spec.workdir == "/app"
+ assert spec.metadata["instance_id"] == "nv-inst-1"
+
+
+def test_supports_any_provider():
+ """The nv-internal-1 harness supports any exec-capable provider."""
+ assert NVInternalHarness().supports_provider("docker") is True
+ assert NVInternalHarness().supports_provider("apptainer") is True
+
+
+def test_grade_masks_on_infra_error():
+ """Grading masks an infra timeout to reward 0.0 and records its error kind."""
+ harness = NVInternalHarness()
+ report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+ assert report.error_kind == "timeout"
+ assert reward_from_report(report) == 0.0
+
+
+def test_grade_masks_on_sandbox_error():
+ """Grading masks a sandbox error to reward 0.0 and records its error kind."""
+ harness = NVInternalHarness()
+ report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "sandbox"}))
+ assert report.error_kind == "sandbox"
+ assert reward_from_report(report) == 0.0
+
+
+def test_grade_empty_report_is_unresolved():
+ """An empty report grades as unresolved."""
+ harness = NVInternalHarness()
+ report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=0, patch_applied=True))
+ assert report.resolved is False
+
+
+def test_grade_malformed_report_is_unresolved():
+ """A malformed (non-JSON) report grades as unresolved."""
+ harness = NVInternalHarness()
+ report = harness.grade(_task(), EvalArtifacts(test_output="not json", return_code=0, patch_applied=True))
+ assert report.resolved is False
+
+
+# ---- full reset -> materialize -> run_eval -> grade -------------------------
+
+
+def test_resolved():
+ """A run with all required tests passing resolves with reward 1.0."""
+ report = _report("pkg/test_x.py::a", "pkg/test_x.py::b")
+ result = asyncio.run(_run({"report": report}, _task()))
+ assert result.patch_applied is True
+ assert result.resolved is True
+ assert reward_from_report(result) == 1.0
+
+
+def test_unresolved_failing_required_test():
+ """A failing fail-to-pass test leaves the run unresolved with reward 0.0."""
+ report = _report("pkg/test_x.py::b", failed=["pkg/test_x.py::a"])
+ result = asyncio.run(_run({"report": report}, _task()))
+ assert result.resolved is False
+ assert reward_from_report(result) == 0.0
+
+
+def test_unresolved_missing_required_test():
+ """A required test missing from the report leaves the run unresolved."""
+ report = _report("pkg/test_x.py::a")
+ result = asyncio.run(_run({"report": report}, _task()))
+ assert result.resolved is False
+
+
+def test_patch_apply_rc_does_not_gate_resolved():
+ """A non-zero patch-apply return code does not gate ``resolved``.
+
+ Grading derives ``resolved`` from the tests alone, so a rejected patch
+ (apply_rc != 0) with all required tests passing is still resolved.
+ """
+ report = _report("pkg/test_x.py::a", "pkg/test_x.py::b")
+ result = asyncio.run(_run({"report": report, "apply_rc": 1}, _task()))
+ assert result.patch_applied is False
+ assert result.resolved is True
+ assert reward_from_report(result) == 1.0
+
+
+# ---- *_select precedence ----------------------------------------------------
+
+
+def test_resolve_required_tests_prefers_select_keys():
+ """``fail_to_pass_select`` / ``pass_to_pass_select`` take precedence over the plain keys."""
+ task = _task(
+ fail_to_pass=["plain::f2p"],
+ pass_to_pass=["plain::p2p"],
+ metadata={
+ "fail_to_pass_select": ["sel::f2p"],
+ "pass_to_pass_select": ["sel::p2p"],
+ },
+ )
+ f2p, p2p = _resolve_required_tests(task)
+ assert f2p == ["sel::f2p"]
+ assert p2p == ["sel::p2p"]
+
+
+def test_resolve_required_tests_falls_back_to_plain_keys():
+ """Without ``*_select`` keys, the plain fail_to_pass / pass_to_pass keys are used."""
+ task = _task(fail_to_pass=["plain::f2p"], pass_to_pass=["plain::p2p"], metadata={})
+ f2p, p2p = _resolve_required_tests(task)
+ assert f2p == ["plain::f2p"]
+ assert p2p == ["plain::p2p"]
+
+
+def test_resolve_required_tests_parses_stringified_select():
+ """A ``*_select`` value given as a repr-style stringified list is parsed."""
+ task = _task(
+ metadata={
+ "fail_to_pass_select": "['sel::f2p']",
+ "pass_to_pass_select": "['sel::p2p']",
+ },
+ )
+ f2p, p2p = _resolve_required_tests(task)
+ assert f2p == ["sel::f2p"]
+ assert p2p == ["sel::p2p"]
+
+
+def test_coerce_test_list():
+ """``_coerce_test_list`` accepts lists and stringified lists, returning [] on bad input."""
+ assert _coerce_test_list(["a", "b"]) == ["a", "b"]
+ assert _coerce_test_list("['a', 'b']") == ["a", "b"]
+ assert _coerce_test_list('["a", "b"]') == ["a", "b"]
+ assert _coerce_test_list("not a list") == []
+ assert _coerce_test_list("[broken") == []
+
+
+def test_resolved_uses_select_tests_end_to_end():
+ """End to end, ``*_select`` precedence resolves a run whose report has only the select tests."""
+ # The report only contains the *_select tests; the plain keys would be unmet.
+ report = _report("sel::f2p", "sel::p2p")
+ task = _task(
+ fail_to_pass=["plain::f2p"],
+ pass_to_pass=["plain::p2p"],
+ metadata={
+ "run_script": "echo run\n",
+ "parsing_script": "import sys\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ "fail_to_pass_select": ["sel::f2p"],
+ "pass_to_pass_select": ["sel::p2p"],
+ },
+ )
+ result = asyncio.run(_run({"report": report}, task))
+ assert result.resolved is True
+
+
+# ---- dockerfile ENV replay --------------------------------------------------
+
+
+def test_parse_dockerfile_env_equals_and_space_forms():
+ """``_parse_dockerfile_env`` parses both ``ENV K=V`` and ``ENV K V`` forms, skipping non-ENV lines."""
+ task = _task(
+ metadata={
+ "base_dockerfile": "FROM ubuntu\nENV FOO=bar\nENV SPACED spaced_value\n",
+ "instance_dockerfile": "ENV BAZ = qux\nRUN echo hi\n",
+ },
+ )
+ env = _parse_dockerfile_env(task)
+ assert env["FOO"] == "bar"
+ assert env["SPACED"] == "spaced_value"
+ assert env["BAZ"] == "qux"
+ assert "RUN" not in env
+
+
+def test_parse_dockerfile_env_absent_is_noop():
+ """``_parse_dockerfile_env`` returns an empty mapping when no dockerfile is present."""
+ assert _parse_dockerfile_env(_task(metadata={})) == {}
+
+
+def test_build_spec_injects_dockerfile_env():
+ """``build_spec`` injects dockerfile ENV entries while preserving the existing git env."""
+ task = _task(metadata={"base_dockerfile": "ENV PATH=/custom/bin:$PATH\n"})
+ spec = NVInternalHarness().build_spec(task)
+ # Existing git env preserved; dockerfile ENV injected.
+ assert spec.env["GIT_CONFIG_GLOBAL"] == "/dev/null"
+ assert spec.env["PATH"] == "/custom/bin:$PATH"
+
+
+# ---- dotted script keys are uploaded ----------------------------------------
+
+
+async def _run_recording(task) -> _RecordingProvider:
+ """Drive reset -> materialize -> run_eval with a recording provider.
+
+ Args:
+ task: The SweTask to evaluate.
+
+ Returns:
+ The recording provider, so tests can inspect captured execs and uploads.
+ """
+ provider = _RecordingProvider(report=_report("pkg/test_x.py::a", "pkg/test_x.py::b"))
+ harness = NVInternalHarness()
+ env = await AsyncSweEnvironment.start(provider, harness.build_spec(task))
+ try:
+ await harness.reset_repo(env, task)
+ await harness.materialize(env, task)
+ await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+ return provider
+
+
+def test_materialize_reads_dotted_script_keys():
+ """``materialize`` uploads scripts stored under the dotted keys ``run_script.sh`` / ``parsing_script.py``."""
+ task = _task(
+ repo_workdir="/app",
+ metadata={
+ "run_script.sh": "echo DOTTED_RUN\n",
+ "parsing_script.py": "print('DOTTED_PARSE')\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ provider = asyncio.run(_run_recording(task))
+ assert provider.uploads["/root/run_script.sh"] == "echo DOTTED_RUN\n"
+ assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED_PARSE')\n"
+
+
+def test_materialize_dotted_keys_take_precedence_over_extensionless():
+ """When both dotted and extensionless script keys are present, the dotted keys win."""
+ task = _task(
+ repo_workdir="/app",
+ metadata={
+ "run_script.sh": "echo DOTTED\n",
+ "run_script": "echo EXTLESS\n",
+ "parsing_script.py": "print('DOTTED')\n",
+ "parsing_script": "print('EXTLESS')\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ provider = asyncio.run(_run_recording(task))
+ assert provider.uploads["/root/run_script.sh"] == "echo DOTTED\n"
+ assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED')\n"
+
+
+def test_materialize_falls_back_to_extensionless_keys():
+ """When only the extensionless script keys are present, they are used."""
+ task = _task(
+ repo_workdir="/app",
+ metadata={
+ "run_script": "echo EXTLESS_RUN\n",
+ "parsing_script": "print('EXTLESS_PARSE')\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ provider = asyncio.run(_run_recording(task))
+ assert provider.uploads["/root/run_script.sh"] == "echo EXTLESS_RUN\n"
+ assert provider.uploads["/root/parsing_script.py"] == "print('EXTLESS_PARSE')\n"
+
+
+# ---- hops run in /app -------------------------------------------------------
+
+
+def test_nv_workdir_defaults_to_app():
+ """``_nv_workdir`` maps the generic /testbed default (or empty) to /app, honoring pinned paths."""
+ assert _nv_workdir(_task(repo_workdir="/testbed")) == NV_DEFAULT_WORKDIR
+ assert _nv_workdir(_task(repo_workdir="")) == NV_DEFAULT_WORKDIR
+ # A row that pins a non-default workdir is honored.
+ assert _nv_workdir(_task(repo_workdir="/srv/repo")) == "/srv/repo"
+ assert _nv_workdir(_task(repo_workdir="/app")) == "/app"
+
+
+def test_build_spec_workdir_defaults_to_app_for_generic_default():
+ """``build_spec`` rewrites the generic /testbed default workdir to /app."""
+ spec = NVInternalHarness().build_spec(_task(repo_workdir="/testbed"))
+ assert spec.workdir == NV_DEFAULT_WORKDIR
+
+
+def test_all_hops_run_in_app_for_generic_default():
+ """With the generic /testbed default, every reset/apply/run/parse/cat hop runs in /app."""
+ task = _task(
+ repo_workdir="/testbed",
+ metadata={
+ "run_script.sh": "echo run\n",
+ "parsing_script.py": "import sys\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ provider = asyncio.run(_run_recording(task))
+ cwds = {cwd for _, cwd in provider.execs}
+ assert cwds == {NV_DEFAULT_WORKDIR}
+ # Spot-check that the key hops were exercised in /app.
+ cwd_by_cmd = {cmd: cwd for cmd, cwd in provider.execs}
+ assert any("git reset --hard" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+ assert any("git apply" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+ assert any("run_script.sh" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+ assert any("parsing_script.py" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+ assert cwd_by_cmd["cat /root/output.json"] == "/app"
+
+
+def test_all_hops_honor_explicit_non_default_workdir():
+ """A row that pins ``repo_workdir`` to a non-default path runs every hop there."""
+ task = _task(
+ repo_workdir="/srv/repo",
+ metadata={
+ "run_script.sh": "echo run\n",
+ "parsing_script.py": "import sys\n",
+ "selected_test_files_to_run": ["pkg/test_x.py"],
+ },
+ )
+ provider = asyncio.run(_run_recording(task))
+ assert {cwd for _, cwd in provider.execs} == {"/srv/repo"}
diff --git a/resources_servers/swe_bench/tests/test_r2egym.py b/resources_servers/swe_bench/tests/test_r2egym.py
new file mode 100644
index 0000000000..b8a671b42a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_r2egym.py
@@ -0,0 +1,185 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the r2e-gym flat (host-graded) harness.
+
+r2e-gym now grades host-side via the shared flat-eval path (the apptainer-only nested
+``run_local_evaluation`` grader was removed when PR #1694 took over the apptainer provider; the
+nested re-wiring is tracked for a follow-up PR). These tests cover provisioning, the agent-phase
+test-hiding command shape, ``reset_repo``, and the flat ``run_eval`` + ``grade`` path against a
+scripted ``_FakeProvider``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+
+
+_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n"
+
+
+class _FakeProvider:
+ """Scripted provider: returns a canned eval log for the eval-script run; records uploads."""
+
+ name = "fake-r2egym"
+
+ def __init__(self, *, log_text="", exec_rc=0, **_):
+ self._log_text = log_text
+ self._exec_rc = exec_rc
+ self.uploaded: dict[str, str] = {}
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ rc = 0 if command.startswith("cat ") else self._exec_rc
+ return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc)
+
+ async def upload_file(self, handle, local_path, remote_path):
+ try:
+ with open(local_path, encoding="utf-8") as fh:
+ self.uploaded[remote_path] = fh.read()
+ except OSError:
+ self.uploaded[remote_path] = ""
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-r2egym", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+ """Build an r2e-gym ``SweTask`` with sensible defaults."""
+ base = dict(
+ instance_id="repo__inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ model_patch="diff --git a/x b/x\n",
+ fail_to_pass=["t::a"],
+ pass_to_pass=["t::b"],
+ benchmark="r2e-gym",
+ split="test",
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+def test_harness_identity():
+ harness = R2EGymHarness()
+ assert harness.name == "r2e-gym"
+ assert harness.grade_strategy == "flat-host-grade"
+
+
+def test_build_spec_image_workdir_metadata():
+ spec = R2EGymHarness().build_spec(_task())
+ assert spec.image == "img:tag"
+ assert spec.workdir == "/testbed"
+ assert spec.metadata["harness"] == "r2e-gym"
+
+
+def test_build_spec_truncates_long_instance_id():
+ spec = R2EGymHarness().build_spec(_task(instance_id="x" * 100))
+ assert len(spec.metadata["instance_id"]) == 63
+
+
+def test_supports_provider_any_exec_capable():
+ harness = R2EGymHarness()
+ assert harness.supports_provider("docker") is True
+ assert harness.supports_provider("apptainer") is True
+
+
+def test_hide_eval_tests_commands_shape():
+ commands = R2EGymHarness().hide_eval_tests_commands()
+ assert len(commands) == 3
+ assert all("r2e_tests" in c for c in commands)
+
+
+def test_materialize_writes_patch_diff():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = R2EGymHarness()
+ task = _task()
+ env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+ await harness.materialize(env, task)
+ return env.sandbox._provider
+
+ provider = asyncio.run(run())
+ assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n"
+
+
+def test_run_eval_then_grade_flat_resolved():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = R2EGymHarness()
+ task = _task(metadata={"eval_script": "echo run"})
+ env = await AsyncSweEnvironment.start({"fake-r2egym": {"log_text": _PASSING_LOG}}, harness.build_spec(task))
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts)
+
+ report = asyncio.run(run())
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_run_eval_missing_eval_script_is_unmasked_unresolved():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ # A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error:
+ # only genuine sandbox/timeout infra failures are masked.
+ async def run():
+ harness = R2EGymHarness()
+ task = _task()
+ env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts)
+
+ report = asyncio.run(run())
+ assert report.error_kind is None
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_reset_repo_is_noop():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = R2EGymHarness()
+ task = _task()
+ env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+ await harness.reset_repo(env, task) # must not raise
+
+ asyncio.run(run())
diff --git a/resources_servers/swe_bench/tests/test_swe_bench_ext.py b/resources_servers/swe_bench/tests/test_swe_bench_ext.py
new file mode 100644
index 0000000000..673689ed6e
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_bench_ext.py
@@ -0,0 +1,491 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the swe-bench-ext harness grading.
+
+These cover two grading behaviors:
+
+* ``grade`` delegates to the vendored lighthouse parser
+ (``parse_and_check_tests``) — so junit-xml parsing, ``normalize_test_id`` plus
+ 4-stage fuzzy matching, the 20+ framework dispatch, and the
+ ``::build``/``::compile`` synthetic-PASS injection all drive ``resolved``.
+ Recorded fixture logs (one per parser path) anchor the expectation.
+* ``resolved`` is the parser's verdict only; a failed ``git apply`` is recorded
+ in ``patch_applied`` but never gates ``resolved``.
+
+The harness is flat / host-graded (no nested container), so ``run_eval`` runs
+against a scripted ``FakeSandbox`` rather than a real image.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from pathlib import Path
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+
+
+_FIXTURES = Path(__file__).parent / "fixtures" / "swe_bench_ext"
+
+
+def _fixture(name: str) -> str:
+ """Read a recorded fixture log by file name.
+
+ Args:
+ name: The fixture file name under the ``swe_bench_ext`` fixtures dir.
+
+ Returns:
+ str: The fixture file contents.
+ """
+ return (_FIXTURES / name).read_text()
+
+
+def _task(**overrides) -> SweTask:
+ """Build a swe-bench-ext ``SweTask`` with sensible defaults.
+
+ Args:
+ **overrides: Field values overriding the defaults.
+
+ Returns:
+ SweTask: A task populated from the defaults merged with overrides.
+ """
+ base = dict(
+ instance_id="repo__inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ test_command="python -m pytest -rA -q",
+ test_framework="pytest",
+ model_patch="diff --git a/x b/x\n",
+ fail_to_pass=["tests/test_core.py::test_fix_applied"],
+ pass_to_pass=["tests/test_core.py::test_regression_guard"],
+ benchmark="swe-bench-ext",
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+def _artifacts(test_output: str, *, patch_applied: bool = True, error_type=None) -> EvalArtifacts:
+ """Build ``EvalArtifacts`` for a graded run.
+
+ Args:
+ test_output: The captured test transcript handed to the parser.
+ patch_applied: Whether the model patch applied cleanly.
+ error_type: Infrastructure error kind, or None for a clean run.
+
+ Returns:
+ EvalArtifacts: The artifacts passed to ``grade``.
+ """
+ return EvalArtifacts(
+ test_output=test_output,
+ return_code=0,
+ patch_applied=patch_applied,
+ raw={"error_type": error_type},
+ )
+
+
+# --- vendored parser drives resolved ----------------------------------------
+
+
+def test_grade_junit_xml_resolved():
+ """junit-xml parsing + fuzzy id matching resolves a clean F2P/P2P pass."""
+ harness = SweBenchExtHarness()
+ report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml")))
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+ # The parser report is surfaced for inspection.
+ assert report.tests_status["framework"] == "pytest"
+ assert report.tests_status["f2p_passed"] == 1
+ assert report.tests_status["p2p_passed"] == 1
+
+
+def test_grade_junit_xml_unresolved_when_p2p_fails():
+ harness = SweBenchExtHarness()
+ task = _task(pass_to_pass=["tests/test_core.py::test_unrelated_broken"])
+ report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml")))
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_grade_pytest_text_fuzzy_id_match():
+ """Normalized/fuzzy id matching: a ``src/pkg/...py::test`` log id matches a
+ differently delimited expected id via ``normalize_test_id``."""
+ harness = SweBenchExtHarness()
+ task = _task(
+ fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha"],
+ pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+ )
+ report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_grade_pytest_text_unresolved_when_f2p_fails():
+ harness = SweBenchExtHarness()
+ task = _task(
+ fail_to_pass=["src/pkg/tests/test_widget.py::test_gamma"],
+ pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+ )
+ report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+ assert report.resolved is False
+
+
+def test_grade_build_synthetic_pass_injection():
+ """An F2P entry ending in ``::build`` that is absent from the parsed output is
+ injected as PASSED (synthetic build/compile handling)."""
+ harness = SweBenchExtHarness()
+ task = _task(
+ fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha", "mypkg::build"],
+ pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+ )
+ report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+ assert report.resolved is True
+ assert report.tests_status["fail_to_pass_results"]["mypkg::build"] == "PASSED"
+
+
+def test_grade_non_pytest_framework_go_json():
+ """A non-pytest framework (``go``) dispatches to the go-json parser."""
+ harness = SweBenchExtHarness()
+ task = _task(
+ test_framework="go",
+ fail_to_pass=["github.com/acme/widget::TestAlpha"],
+ pass_to_pass=["github.com/acme/widget::TestBeta"],
+ )
+ report = harness.grade(task, _artifacts(_fixture("go_json.txt")))
+ assert report.resolved is True
+ assert report.tests_status["framework"] == "go"
+
+
+def test_grade_non_pytest_framework_go_json_unresolved():
+ harness = SweBenchExtHarness()
+ task = _task(
+ test_framework="go",
+ fail_to_pass=["github.com/acme/widget::TestGamma"],
+ pass_to_pass=["github.com/acme/widget::TestBeta"],
+ )
+ report = harness.grade(task, _artifacts(_fixture("go_json.txt")))
+ assert report.resolved is False
+
+
+# --- empty framework is passed VERBATIM (NOT coerced to pytest) --------------
+
+
+def test_grade_empty_framework_passed_verbatim_not_coerced_to_pytest():
+ """``test_framework`` is passed through UNCHANGED — an empty framework reaches
+ ``parse_and_check_tests`` as ``""`` and hits the parser's auto-detect path, NOT
+ the pytest junit-xml parser.
+
+ Coercing ``""`` -> ``"pytest"`` would let junit-xml parse and report
+ ``resolved`` for an instance that should auto-detect. We assert the framework
+ reaches the parser verbatim (recorded in ``report.tests_status["framework"]``) and that
+ junit-xml is therefore NOT parsed under an empty framework.
+ """
+ harness = SweBenchExtHarness()
+ task = _task(test_framework="")
+ report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml")))
+ # Framework recorded verbatim — not silently rewritten to "pytest".
+ assert report.tests_status["framework"] == ""
+ # Auto-detect path does not understand junit-xml -> nothing parsed -> unresolved.
+ assert report.tests_status["parsed_count"] == 0
+ assert report.resolved is False
+
+
+def test_grade_empty_framework_uses_autodetect_path():
+ """An empty framework grades via parse_test_output's auto-detect path (TAP /
+ Mocha-Hardhat console) when the instance ships no framework. Here a TAP
+ transcript resolves without any framework hint."""
+ harness = SweBenchExtHarness()
+ tap_output = (
+ "<<>>\n"
+ "TAP version 13\n"
+ "1..2\n"
+ "ok 1 - test_fix_applied\n"
+ "ok 2 - test_regression_guard\n"
+ "<<>>\n"
+ )
+ task = _task(
+ test_framework="",
+ fail_to_pass=["test_fix_applied"],
+ pass_to_pass=["test_regression_guard"],
+ )
+ report = harness.grade(task, _artifacts(tap_output))
+ assert report.tests_status["framework"] == ""
+ assert report.tests_status["parsed_count"] >= 2
+ assert report.resolved is True
+
+
+def test_run_eval_and_grade_share_framework_value():
+ """run_eval (flag/result-file selection) and grade (parsing) use the SAME
+ framework. With an empty framework, run_eval must NOT inject pytest's
+ ``--junitxml`` flag and must wrap the bare command, and grade must parse under
+ ``""`` — proving the two share ``_resolve_framework`` rather than diverging on a
+ pytest default."""
+ task = _task(test_framework="", test_command="run-my-tests")
+ _, _, provider = _run_eval(task, test_output="", run_cmd="run-my-tests")
+ eval_cmds = [c for c in provider.commands if "run-my-tests" in c]
+ assert eval_cmds, "expected the bare framework command to be wrapped"
+ wrapped = eval_cmds[-1]
+ # Empty framework => default framework config => no output flag, no result file.
+ assert "--junitxml" not in wrapped
+ assert "<<>>" not in wrapped
+ # The mkdir parent-dir creation is present regardless.
+ assert "mkdir -p /workspace/test-results" in wrapped
+
+
+def test_run_eval_command_less_row_injects_no_default_runner():
+ """A command-less row runs NO test runner, matching main's SweBenchExtDatasetProcessor.
+
+ Main uses ``inst.get("test_command", "")`` verbatim (empty when absent), so a row that
+ ships no command runs nothing and grades unresolved. The harness must not fabricate a
+ ``python -m pytest`` default that would diverge from main by manufacturing results.
+ """
+ task = _task(test_command="", test_framework="")
+ _, _, provider = _run_eval(task, test_output="", run_cmd="__never__")
+ eval_cmds = [c for c in provider.commands if "git apply" not in c and "cat " not in c]
+ wrapped = eval_cmds[-1]
+ assert "pytest" not in wrapped # no default runner injected
+ # The command slot between the START/END marker echoes is empty (no runner line).
+ assert 'echo "<<>>"\n\necho "<<>>"' in wrapped
+
+
+# --- patch_applied does not gate resolved -----------------------------------
+
+
+def test_grade_resolved_even_when_patch_apply_failed():
+ """Grading is on tests ONLY; a failed apply is recorded but never flips a
+ tests-passing run to unresolved."""
+ harness = SweBenchExtHarness()
+ report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml"), patch_applied=False))
+ assert report.patch_applied is False
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+# --- infra masking ----------------------------------------------------------
+
+
+def test_grade_masks_on_infra_error():
+ harness = SweBenchExtHarness()
+ report = harness.grade(_task(), _artifacts("", error_type="timeout"))
+ assert report.error_kind == "timeout"
+ assert reward_from_report(report) == 0.0
+
+
+# --- run_eval against a scripted FakeSandbox --------------------------------
+
+
+class _FakeExtProvider:
+ """Scripted provider that records git-apply attempts and returns a transcript.
+
+ Args:
+ test_output: The transcript returned for the wrapped eval command.
+ apply_rc: Return code for ``git apply`` commands.
+ run_cmd: Substring identifying the wrapped eval command.
+ git_dir: Directory whose ``.git`` probe succeeds; ``None`` means every
+ probed dir reports a checkout (so the first ladder entry wins).
+ """
+
+ name = "fake-ext"
+
+ def __init__(self, *, test_output="", apply_rc=0, run_cmd="pytest", git_dir=None, **_):
+ self._test_output = test_output
+ self._apply_rc = apply_rc
+ # Marker that identifies the wrapped eval command (defaults to the pytest
+ # command); tests with a custom command pass run_cmd.
+ self._run_cmd = run_cmd
+ # Which directory holds the repo checkout: a ``test -d "<dir>/.git"`` probe
+ # succeeds only for this dir. None => every probed dir reports a checkout
+ # (so the first ladder entry, /testbed, wins).
+ self._git_dir = git_dir
+ self.commands: list[str] = []
+ self.exec_cwds: list[str | None] = []
+ self.uploaded: dict[str, str] = {}
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ self.commands.append(command)
+ self.exec_cwds.append(cwd)
+ if command.startswith("test -d "):
+ # The repo-workdir probe: succeed only for the configured git dir (or any
+ # dir when unconfigured).
+ if self._git_dir is None or f'"{self._git_dir}/.git"' in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+ return SandboxExecResult(stdout="", stderr="", return_code=1)
+ if "git apply" in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+ if self._run_cmd in command:
+ return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, handle, local_path, remote_path):
+ try:
+ with open(local_path, encoding="utf-8") as fh:
+ self.uploaded[remote_path] = fh.read()
+ except OSError:
+ self.uploaded[remote_path] = ""
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-ext", _FakeExtProvider, override=True)
+
+
+def _run_eval(task: SweTask, *, test_output: str, apply_rc: int = 0, run_cmd: str = "pytest", git_dir=None):
+ """Run the harness through a scripted provider and return the run outputs.
+
+ Args:
+ task: The task to evaluate.
+ test_output: The transcript the provider returns for the eval command.
+ apply_rc: Return code for ``git apply`` commands.
+ run_cmd: Substring identifying the wrapped eval command.
+ git_dir: Directory whose ``.git`` probe succeeds (None => any dir).
+
+ Returns:
+ tuple: The harness, the produced ``EvalArtifacts``, and the provider
+ instance (for command inspection).
+ """
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = SweBenchExtHarness()
+ env = await AsyncSweEnvironment.start(
+ {"fake-ext": {"test_output": test_output, "apply_rc": apply_rc, "run_cmd": run_cmd, "git_dir": git_dir}},
+ harness.build_spec(task),
+ )
+ await harness.materialize(env, task)
+ artifacts = await harness.run_eval(env, task)
+ return harness, artifacts, env.sandbox._provider
+
+ return asyncio.run(run())
+
+
+def test_run_eval_uses_legacy_apply_flags_and_grades_resolved():
+ task = _task()
+ harness, artifacts, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"))
+ apply_cmds = [c for c in provider.commands if "git apply" in c]
+ assert apply_cmds, "expected a git-apply attempt"
+ # The git-apply flag set, with no --3way fallback.
+ assert all("--reject --recount --ignore-space-change --ignore-whitespace" in c for c in apply_cmds)
+ assert all("--3way" not in c for c in apply_cmds)
+ assert artifacts.patch_applied is True
+ report = harness.grade(task, artifacts)
+ assert report.resolved is True
+
+
+def test_run_eval_apply_failure_still_resolves_on_tests():
+ # End-to-end through run_eval -> grade: a failed apply records
+ # patch_applied=False but a tests-passing run still resolves.
+ task = _task()
+ harness, artifacts, _ = _run_eval(task, test_output=_fixture("pytest_junit.xml"), apply_rc=1)
+ assert artifacts.patch_applied is False
+ report = harness.grade(task, artifacts)
+ assert report.patch_applied is False
+ assert report.resolved is True
+
+
+def test_run_eval_wraps_command_with_structured_output_and_markers():
+ # run_eval wraps the command: it adds the structured-output flag (--junitxml) via
+ # get_test_command_with_output and runs it between the SWE_BENCH_EXT markers (plus a
+ # result-file dump), so parse_and_check_tests receives junit-xml / marked
+ # output rather than raw "-rA" text it cannot parse.
+ task = _task()
+ _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"))
+ eval_cmds = [c for c in provider.commands if "pytest" in c and "git apply" not in c]
+ assert eval_cmds, "expected a wrapped pytest eval command"
+ wrapped = eval_cmds[-1]
+ assert "<<>>" in wrapped
+ assert "<<>>" in wrapped
+ assert "--junitxml=" in wrapped # structured-output flag from get_test_command_with_output
+ assert "<<>>" in wrapped # junit result-file dumped for the parser
+ # The result-file parent dir is created first.
+ assert "mkdir -p /workspace/test-results" in wrapped
+
+
+# --- repo-workdir fallback ladder (matches main's cd /testbed||/workspace/repo||/app) ----
+
+
+def _eval_cwd(provider) -> str | None:
+ """Return the cwd of the wrapped eval command (the command holding the markers)."""
+ for command, cwd in zip(provider.commands, provider.exec_cwds):
+ if "<<>>" in command:
+ return cwd
+ return None
+
+
+def test_run_eval_resolves_workdir_from_ladder_when_repo_not_at_testbed():
+ """A repo at /workspace/repo (not /testbed) is found via main's fallback ladder.
+
+ Main's eval script runs ``cd /testbed || cd /workspace/repo || cd /app``; the harness
+ must reproduce that so the patches and tests run in the real checkout rather than the
+ hardcoded /testbed default.
+ """
+ task = _task() # default repo_workdir == /testbed
+ _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/workspace/repo")
+ # The patch-apply and the wrapped eval command run in the located checkout.
+ apply_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if "git apply" in cmd]
+ assert apply_cwds and all(cwd == "/workspace/repo" for cwd in apply_cwds)
+ assert _eval_cwd(provider) == "/workspace/repo"
+
+
+def test_run_eval_prefers_explicit_non_default_row_workdir():
+ """An explicit, non-default ``repo_workdir`` holding a checkout wins over the ladder."""
+ task = _task(repo_workdir="/srv/project")
+ _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/srv/project")
+ assert _eval_cwd(provider) == "/srv/project"
+
+
+def test_run_eval_defaults_to_testbed_when_present():
+ """When /testbed holds the checkout it wins (first ladder entry), preserving prior behavior."""
+ task = _task()
+ _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/testbed")
+ assert _eval_cwd(provider) == "/testbed"
+
+
+def test_reset_repo_resolves_workdir_from_ladder():
+ """reset_repo runs ``git reset --hard`` in the located checkout, not a hardcoded /testbed."""
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = SweBenchExtHarness()
+ task = _task()
+ env = await AsyncSweEnvironment.start(
+ {"fake-ext": {"git_dir": "/app"}},
+ harness.build_spec(task),
+ )
+ await harness.reset_repo(env, task)
+ return env.sandbox._provider
+
+ provider = asyncio.run(run())
+ reset_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if cmd.startswith("git reset --hard")]
+ assert reset_cwds == ["/app"]
diff --git a/resources_servers/swe_bench/tests/test_swe_env.py b/resources_servers/swe_bench/tests/test_swe_env.py
new file mode 100644
index 0000000000..e6be92588a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_env.py
@@ -0,0 +1,414 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe_env library, driven by a FakeSandbox provider."""
+
+from __future__ import annotations
+
+import ast
+import asyncio
+from pathlib import Path
+
+import resources_servers.swe_bench.harnesses # noqa: F401 (registers harnesses)
+from nemo_gym.sandbox import (
+ SandboxCreateError,
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench import (
+ compute_resolved,
+ get_harness,
+ list_harnesses,
+ reward_from_report,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.verify_task import ProviderCapabilityError, verify_task
+
+
+# Trailing-status pytest text (``<node id> PASSED``) is the format the test
+# parser recognizes; node ids carry a ``.py`` path so they normalize to the
+# F2P/P2P ids below.
+_PASS_OUTPUT = "tests/test_x.py::a PASSED\ntests/test_x.py::b PASSED\n"
+_F2P_FAIL_OUTPUT = "tests/test_x.py::a FAILED\ntests/test_x.py::b PASSED\n"
+
+
+class _FakeProvider:
+ """Scripted provider: pytest commands return a canned transcript."""
+
+ name = "fake-swe"
+
+ def __init__(self, *, test_output="", test_rc=0, apply_rc=0, create_error=False, sink=None, **_):
+ """Configure the scripted provider's responses.
+
+ Args:
+ test_output: Stdout returned for pytest commands.
+ test_rc: Return code returned for pytest commands.
+ apply_rc: Return code returned for ``git apply`` commands.
+ create_error: When True, ``create`` raises a SandboxCreateError.
+ sink: Optional list each created spec is appended to, for asserting on what
+ ``verify_task`` passed the provider (e.g. the stamped ``ttl_s``).
+ **_: Ignored extra keyword arguments.
+ """
+ self._test_output = test_output
+ self._test_rc = test_rc
+ self._apply_rc = apply_rc
+ self._create_error = create_error
+ self._sink = sink
+
+ async def create(self, spec):
+ if self._sink is not None:
+ self._sink.append(spec)
+ if self._create_error:
+ raise SandboxCreateError("simulated create failure")
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ if "pytest" in command:
+ return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+ if "git apply" in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, *a, **k):
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-swe", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+ """Build a SweTask with sensible defaults, overridable per keyword.
+
+ Args:
+ **overrides: Field overrides merged onto the default task fields.
+
+ Returns:
+ A SweTask configured for the swe-bench-ext benchmark.
+ """
+ base = dict(
+ instance_id="inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ test_command="python -m pytest -rA -q",
+ model_patch="diff --git a/x b/x\n",
+ test_framework="pytest",
+ fail_to_pass=["tests/test_x.py::a"],
+ pass_to_pass=["tests/test_x.py::b"],
+ benchmark="swe-bench-ext",
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_compute_resolved():
+ """``compute_resolved`` is True only when the required set is non-empty and every required test passed."""
+ assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a", "b"]) is True
+ assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a"]) is False
+ assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=["a"]) is False
+
+
+def test_compute_resolved_fail_only():
+ """The ``fail_only`` eval type mirrors swebench's ``check_fail_only``.
+
+ A required test counts as a success unless it is present in the status map with status
+ FAILED, so an absent test (silent success) still resolves; a present-and-FAILED test does not.
+ """
+ # Required test absent from the status map -> success (silent) -> resolved.
+ assert (
+ compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=[], eval_type="fail_only", status_map={})
+ is True
+ )
+ # A present-and-FAILED required test -> failure -> unresolved.
+ assert (
+ compute_resolved(
+ fail_to_pass=["a"],
+ pass_to_pass=["b"],
+ passed=["b"],
+ eval_type="fail_only",
+ status_map={"a": "FAILED", "b": "PASSED"},
+ )
+ is False
+ )
+ # Present but not FAILED (e.g. SKIPPED/ERROR) -> success under fail_only -> resolved.
+ assert (
+ compute_resolved(
+ fail_to_pass=["a"],
+ pass_to_pass=["b"],
+ passed=[],
+ eval_type="fail_only",
+ status_map={"a": "SKIPPED", "b": "ERROR"},
+ )
+ is True
+ )
+ # Empty required set is still unresolved under fail_only (the validated edge).
+ assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=[], eval_type="fail_only") is False
+
+
+def test_compute_resolved_pass_and_fail_status_map():
+ """The default ``pass_and_fail`` rule with a populated status_map mirrors swebench.
+
+ This is the path that runs for SWE-bench Verified: a required test is a failure only when it
+ is absent or its status is FAILED/ERROR; PASSED/XFAIL pass and any other status (SKIPPED/XPASS)
+ is neutral (excluded, not a failure). Locking it in guards the swebench-equivalence this PR
+ depends on.
+ """
+ f2p, p2p = ["a"], ["b"]
+ # All required tests PASSED -> resolved.
+ assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED", "b": "PASSED"})
+ # A required test FAILED -> unresolved.
+ assert not compute_resolved(
+ fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "FAILED", "b": "PASSED"}
+ )
+ # A required test ERROR -> unresolved.
+ assert not compute_resolved(
+ fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "ERROR", "b": "PASSED"}
+ )
+ # A required test absent from the status_map -> unresolved.
+ assert not compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED"})
+ # XFAIL passes; SKIPPED/XPASS are neutral (not failures) -> resolved.
+ assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "XFAIL", "b": "SKIPPED"})
+
+
+def test_agent_adapters_do_not_call_grading_methods():
+ """Agent-facing swe_env modules never call the grader-only harness methods.
+
+ ``harness.py`` documents a trust boundary: ``reset_repo`` / ``run_eval`` / ``grade`` are used
+ ONLY by the grader (``verify_task``). This AST guard enforces it — the agent adapters
+ (``self_drive``, ``sandbox``) must reach grading through ``verify_task``, never by calling
+ those methods directly — so the boundary the docstring promises cannot silently regress.
+ """
+ grading_only = {"reset_repo", "run_eval", "grade"}
+ adapter_dir = Path(__file__).resolve().parent.parent
+ for module in ("self_drive.py", "sandbox.py"):
+ tree = ast.parse((adapter_dir / module).read_text())
+ referenced = sorted(
+ node.attr for node in ast.walk(tree) if isinstance(node, ast.Attribute) and node.attr in grading_only
+ )
+ assert not referenced, f"{module} calls grader-only methods {referenced}; route grading via verify_task"
+
+
+def test_reward_from_report():
+ """``reward_from_report`` is 1.0 for a resolved, unmasked report and 0.0 otherwise."""
+ assert reward_from_report(SweEvalReport(instance_id="i", resolved=True)) == 1.0
+ assert reward_from_report(SweEvalReport(instance_id="i", resolved=False)) == 0.0
+ assert reward_from_report(SweEvalReport(instance_id="i", resolved=True, error_kind="sandbox")) == 0.0
+
+
+def test_registry_and_build_spec():
+ """The swe-bench-ext harness is registered and builds the expected sandbox spec."""
+ assert "swe-bench-ext" in list_harnesses()
+ harness = get_harness("swe-bench-ext")
+ assert isinstance(harness, SweBenchExtHarness)
+ spec = harness.build_spec(_task())
+ assert spec.image == "img:tag"
+ assert spec.workdir == "/testbed"
+ assert spec.metadata["instance_id"] == "inst-1"
+
+
+def test_grade_masks_on_infra_error():
+ """Grading masks an infra error to reward 0.0 and records its error kind."""
+ harness = get_harness("swe-bench-ext")
+ report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+ assert report.error_kind == "timeout"
+ assert reward_from_report(report) == 0.0
+
+
+# ---- verify_task orchestrator (fresh-sandbox, FakeProvider) -----------------
+
+
+def test_verify_task_resolved():
+ """``verify_task`` resolves a task whose required tests all pass."""
+ provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}}
+ report = asyncio.run(verify_task(provider, _task()))
+ assert report.resolved is True
+ assert report.patch_applied is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_verify_task_unresolved():
+ """``verify_task`` leaves a task unresolved when a required test fails."""
+ provider = {"fake-swe": {"test_output": _F2P_FAIL_OUTPUT, "test_rc": 1}}
+ report = asyncio.run(verify_task(provider, _task()))
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_verify_task_empty_patch_fast_path():
+ """An empty model patch short-circuits to an unresolved report."""
+ report = asyncio.run(verify_task({"fake-swe": {}}, _task(model_patch="")))
+ assert report.patch_exists is False
+ assert report.resolved is False
+
+
+def test_verify_task_non_timeout_eval_failure_unmasked():
+ """A non-timeout eval-stage failure is unmasked: resolved=False, reward 0.0.
+
+ Mirrors main's app.py, which catches any eval exception, returns no report file
+ (resolved=False) and leaves eval_timed_out False (so mask_sample stays False).
+ Only a genuine wall-clock eval timeout is masked.
+ """
+ report = asyncio.run(verify_task({"fake-swe": {"create_error": True}}, _task()))
+ assert report.error_kind is None
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_verify_task_golden():
+ """Running with ``run_golden`` applies the golden patch and resolves the task."""
+ provider = {"fake-swe": {"test_output": _PASS_OUTPUT}}
+ task = _task(model_patch="", metadata={"golden_patch": "diff --git a/x b/x\n"})
+ report = asyncio.run(verify_task(provider, task, run_golden=True))
+ assert report.resolved is True
+
+
+def test_verify_task_patch_apply_failure_does_not_gate_resolved():
+ """A failed patch apply is recorded but does not gate ``resolved``.
+
+ The patch is applied best-effort and grading is based on the tests only, so a
+ failed apply (patch_applied=False) does not flip a tests-passing run to
+ unresolved.
+ """
+ provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "apply_rc": 1}}
+ report = asyncio.run(verify_task(provider, _task()))
+ assert report.patch_applied is False
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_unsupported_provider_raises():
+ """``verify_task`` raises when the harness does not support the given provider."""
+
+ class _NestedOnly(SweBenchExtHarness):
+ name = "nested-only-test"
+
+ def supports_provider(self, provider_name: str) -> bool:
+ """Report support for every provider except ``fake-swe``.
+
+ Args:
+ provider_name: The provider name being checked.
+
+ Returns:
+ True for any provider other than ``fake-swe``.
+ """
+ return provider_name != "fake-swe"
+
+ from resources_servers.swe_bench.harness import register_harness
+
+ register_harness(_NestedOnly(), override=True)
+ task = _task(benchmark="nested-only-test")
+ try:
+ asyncio.run(verify_task({"fake-swe": {}}, task))
+ except ProviderCapabilityError:
+ return
+ raise AssertionError("expected ProviderCapabilityError")
+
+
+def test_verify_task_propagates_grader_dependency_error():
+ """``verify_task`` propagates ``GraderDependencyError`` instead of swallowing it to reward-0.
+
+ A missing grading dependency (e.g. swebench for a SWE-bench instance) must fail loud rather
+ than silently degrade the resolve rate, so it is re-raised, not caught by the unmasked
+ eval-stage handler.
+ """
+ from resources_servers.swe_bench.harness import GraderDependencyError, register_harness
+
+ class _MissingGrader(SweBenchExtHarness):
+ name = "missing-grader-test"
+
+ def grade(self, task, artifacts):
+ """Simulate a harness whose required grading dependency is unavailable.
+
+ Args:
+ task: The task being graded.
+ artifacts: The eval artifacts (unused).
+
+ Raises:
+ GraderDependencyError: Always, to exercise the propagation path.
+ """
+ raise GraderDependencyError("grading dependency missing")
+
+ register_harness(_MissingGrader(), override=True)
+ try:
+ asyncio.run(verify_task({"fake-swe": {"test_output": _PASS_OUTPUT}}, _task(benchmark="missing-grader-test")))
+ except GraderDependencyError:
+ return
+ raise AssertionError("expected GraderDependencyError to propagate")
+
+
+def test_verify_task_flat_eval_metadata():
+ """``metadata['flat_eval']`` routes grading through the harness's flat variant."""
+ provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}}
+ report = asyncio.run(verify_task(provider, _task(metadata={"flat_eval": True})))
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_verify_task_stamps_ttl_when_unset():
+ """``verify_task`` stamps ``ttl_s = eval_timeout_s + slack`` when the harness leaves it unset.
+
+ The stamp lets TTL-honoring backends (opensandbox) self-expire an eval sandbox orphaned by a
+ hard crash; harnesses that already set ``ttl_s`` (e.g. swe-bench-ext) keep their own value.
+ """
+ import dataclasses
+
+ from resources_servers.swe_bench.harness import register_harness
+ from resources_servers.swe_bench.verify_task import _TTL_SLACK_S
+
+ class _NoTtl(SweBenchExtHarness):
+ name = "no-ttl-test"
+
+ def build_spec(self, task):
+ """Build the swe-bench-ext spec but clear ``ttl_s`` so verify_task must stamp it.
+
+ Args:
+ task: The task to build a spec for.
+
+ Returns:
+ The base spec with ``ttl_s`` reset to None.
+ """
+ return dataclasses.replace(super().build_spec(task), ttl_s=None)
+
+ register_harness(_NoTtl(), override=True)
+ captured: list = []
+ provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "sink": captured}}
+ asyncio.run(verify_task(provider, _task(benchmark="no-ttl-test"), eval_timeout_s=120))
+ assert captured, "expected create() to be called with a stamped spec"
+ assert captured[-1].ttl_s == 120 + _TTL_SLACK_S
+
+
+def test_report_to_reward_wrapper():
+ """``report_to_reward`` is a thin wrapper that scores a report like ``reward_from_report``."""
+ from resources_servers.swe_bench.verify_task import report_to_reward
+
+ assert report_to_reward(SweEvalReport(instance_id="i", resolved=True)) == 1.0
+ assert report_to_reward(SweEvalReport(instance_id="i", resolved=False)) == 0.0
diff --git a/resources_servers/swe_bench/tests/test_swe_rebench.py b/resources_servers/swe_bench/tests/test_swe_rebench.py
new file mode 100644
index 0000000000..9d6faa1cbf
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_rebench.py
@@ -0,0 +1,483 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe-rebench harness (FakeSandbox provider).
+
+A tiny fake ``agent/log_parsers.py`` is written to a tmp dir so the real
+``_load_rebench_log_parsers`` import and ``NAME_TO_PARSER`` resolution path is
+exercised end to end, then the resolved / unresolved / masked grade paths are
+driven.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import textwrap
+from pathlib import Path
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask
+from resources_servers.swe_bench.harnesses.swe_rebench import (
+ SweRebenchHarness,
+ _normalize_test_name,
+)
+
+
+class _FakeProvider:
+ """Scripted provider: test command returns a canned transcript."""
+
+ name = "fake-rebench"
+
+ def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_):
+ """Initialize the scripted provider.
+
+ Args:
+ test_output: Transcript returned for the test command.
+ test_rc: Return code for the test command.
+ apply_rc: Return code for ``git apply`` commands.
+ """
+ self._test_output = test_output
+ self._test_rc = test_rc
+ self._apply_rc = apply_rc
+
+ async def create(self, spec):
+ raw = {"workdir": spec.workdir, "env": spec.env}
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw=raw)
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ if "git apply" in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+ if "pytest" in command or "test" in command:
+ return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, *a, **k):
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-rebench", _FakeProvider, override=True)
+
+
+class _RecordingProvider:
+ """Scripted provider that records every exec command, in order."""
+
+ name = "recording-rebench"
+ commands: list[str] = []
+ # (command, timeout_s) for every exec, so tests can assert the eval timeout
+ # is threaded into the test exec.
+ exec_calls: list[tuple[str, object]] = []
+
+ def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_):
+ """Initialize the recording provider.
+
+ Args:
+ test_output: Transcript returned for the test command.
+ test_rc: Return code for the test command.
+ apply_rc: Return code for ``git apply`` commands.
+ """
+ self._test_output = test_output
+ self._test_rc = test_rc
+ self._apply_rc = apply_rc
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="rec", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ type(self).commands.append(command)
+ type(self).exec_calls.append((command, timeout_s))
+ if "git apply" in command:
+ return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+ if "pytest" in command or "test" in command:
+ return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+ return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+ async def upload_file(self, *a, **k):
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("recording-rebench", _RecordingProvider, override=True)
+
+
+# A standalone log_parsers module the harness imports dynamically. The parser
+# splits "<node> <STATUS>" lines into {node: STATUS} and exposes a
+# NAME_TO_PARSER registry of callables, matching the shape the harness expects.
+_FAKE_LOG_PARSERS = textwrap.dedent(
+ """
+ def parse_simple(log):
+ results = {}
+ for line in log.splitlines():
+ line = line.strip()
+ if not line:
+ continue
+ node, _, status = line.rpartition(" ")
+ if node and status:
+ results[node] = status
+ return results
+
+ NAME_TO_PARSER = {"simple": parse_simple}
+ """
+)
+
+
+def _write_fake_parsers(tmp_path: Path) -> Path:
+ """Write the fake ``agent/log_parsers.py`` module under a tmp repo dir.
+
+ Args:
+ tmp_path: The pytest tmp dir to create the repo under.
+
+ Returns:
+ Path: The created ``SWE-rebench-V2`` repo directory.
+ """
+ repo_dir = tmp_path / "SWE-rebench-V2"
+ (repo_dir / "agent").mkdir(parents=True)
+ (repo_dir / "agent" / "log_parsers.py").write_text(_FAKE_LOG_PARSERS)
+ return repo_dir
+
+
+def _task(**overrides) -> SweTask:
+ """Build a swe-rebench ``SweTask`` with sensible defaults.
+
+ Args:
+ **overrides: Field values overriding the defaults.
+
+ Returns:
+ SweTask: A task populated from the defaults merged with overrides.
+ """
+ base = dict(
+ instance_id="rebench-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ test_command="python -m pytest -rA -q",
+ model_patch="diff --git a/x b/x\n",
+ test_patch="diff --git a/t b/t\n",
+ fail_to_pass=["t::a"],
+ pass_to_pass=["t::b"],
+ benchmark="swe-rebench",
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_normalize_test_name_strips_timing():
+ assert _normalize_test_name("t::a [ 12 ms ]") == "t::a"
+ assert _normalize_test_name("t::a [0.3s]") == "t::a"
+ assert _normalize_test_name("t::a in 1.2 sec") == "t::a"
+ assert _normalize_test_name("t::a (5 ms)") == "t::a"
+ assert _normalize_test_name(" t::a ") == "t::a"
+ # No timing suffix -> unchanged.
+ assert _normalize_test_name("pkg::mod::test_x") == "pkg::mod::test_x"
+
+
+def test_build_spec_sets_java_env():
+ harness = SweRebenchHarness()
+ spec = harness.build_spec(_task())
+ assert spec.env["_JAVA_OPTIONS"] == "-Djava.net.preferIPv6Addresses=false"
+ assert spec.metadata["harness"] == "swe-rebench"
+ assert spec.image == "img:tag"
+
+
+# ---- grade paths (real dynamic-import of the fake parser) --------------------
+
+
+def test_grade_resolved(tmp_path):
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+ )
+ # Both required tests pass; timing suffix on one exercises normalization.
+ artifacts = EvalArtifacts(test_output="t::a [ 12 ms ] PASSED\nt::b PASSED\n", patch_applied=True)
+ report = harness.grade(task, artifacts)
+ assert report.resolved is True
+ assert report.error_kind is None
+ assert set(report.tests_status["passed"]) == {"t::a", "t::b"}
+
+
+def test_grade_unresolved_missing_pass_to_pass(tmp_path):
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+ )
+ artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b FAILED\n", patch_applied=True)
+ report = harness.grade(task, artifacts)
+ assert report.resolved is False
+ assert report.error_kind is None
+
+
+def test_grade_no_patch_applied_gate(tmp_path):
+ """``resolved`` is the test verdict ONLY and does not gate on patch_applied.
+ So even when the model patch failed to apply (``patch_applied=False``), a run
+ where every F2P/P2P test passes scores resolved=True."""
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+ )
+ artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b PASSED\n", patch_applied=False)
+ report = harness.grade(task, artifacts)
+ assert report.resolved is True
+ assert report.error_kind is None
+
+
+def test_grade_masks_missing_clone():
+ harness = SweRebenchHarness()
+ # No rebench_repo_dir in metadata -> the clone is not provisioned.
+ report = harness.grade(_task(), EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True))
+ assert report.error_kind == "eval_error"
+ assert report.resolved is False
+
+
+def test_grade_masks_unknown_parser(tmp_path):
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "does_not_exist"}},
+ )
+ report = harness.grade(task, EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True))
+ assert report.error_kind == "eval_error"
+
+
+def test_grade_masks_on_infra_error():
+ harness = SweRebenchHarness()
+ report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+ assert report.error_kind == "timeout"
+
+
+# ---- run_eval (FakeSandbox) -------------------------------------------------
+
+
+def test_run_eval_then_grade_resolved(tmp_path):
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={
+ "rebench_repo_dir": str(repo_dir),
+ "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+ },
+ )
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "test_rc": 0}}
+
+ async def _run():
+ spec = harness.build_spec(task)
+ env = await AsyncSweEnvironment.start(provider, spec)
+ try:
+ await harness.reset_repo(env, task)
+ await harness.materialize(env, task)
+ artifacts = await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+ return artifacts
+
+ artifacts = asyncio.run(_run())
+ assert artifacts.patch_applied is True
+ report = harness.grade(task, artifacts)
+ assert report.resolved is True
+
+
+def test_run_eval_patch_not_applied_still_grades_on_tests(tmp_path):
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}})
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ # apply_rc=1 -> model patch fails to apply -> patch_applied False, but grading
+ # is on the tests only (no patch_applied gate), so a run where every F2P/P2P
+ # test passes is still resolved=True.
+ provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "apply_rc": 1}}
+
+ async def _run():
+ spec = harness.build_spec(task)
+ env = await AsyncSweEnvironment.start(provider, spec)
+ try:
+ await harness.run_eval(env, task)
+ return await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+
+ artifacts = asyncio.run(_run())
+ assert artifacts.patch_applied is False
+ assert harness.grade(task, artifacts).resolved is True
+
+
+# ---- apply order ------------------------------------------------------------
+
+
+def test_run_eval_applies_model_patch_before_test_patch(tmp_path):
+ """The model patch (/root/patch.diff) is applied BEFORE the test patch
+ (/root/test_patch.diff)."""
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}})
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ _RecordingProvider.commands = []
+ _RecordingProvider.exec_calls = []
+ provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+ async def _run():
+ spec = harness.build_spec(task)
+ env = await AsyncSweEnvironment.start(provider, spec)
+ try:
+ await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+
+ asyncio.run(_run())
+ applies = [c for c in _RecordingProvider.commands if "git apply" in c]
+ assert len(applies) == 2
+ assert "/root/patch.diff" in applies[0], applies
+ assert "/root/test_patch.diff" in applies[1], applies
+
+
+# ---- eval timeout threaded into the test exec -------------------------------
+
+
+def _rebench_test_exec_timeout(commands_and_timeouts):
+ """Return the timeout_s passed to the test exec (the one running the tests).
+
+ The test block is the only exec that is neither a ``git apply`` nor an
+ install command; in these tests the test command always contains ``pytest``.
+
+ Args:
+ commands_and_timeouts: An iterable of ``(command, timeout_s)`` pairs.
+
+ Returns:
+ The ``timeout_s`` value recorded for the test exec.
+
+ Raises:
+ AssertionError: If no test exec is found in the recorded calls.
+ """
+ for command, timeout_s in commands_and_timeouts:
+ if "git apply" not in command and ("pytest" in command or "test" in command):
+ return timeout_s
+ raise AssertionError(f"no test exec found in {commands_and_timeouts!r}")
+
+
+def test_run_eval_threads_tests_timeout_into_test_exec(tmp_path):
+ """The test exec receives timeout_s = task.metadata['tests_timeout'] when
+ present so a stuck run is bounded instead of hanging the verifier. Uses a
+ non-default value (600) so this distinguishes an explicit override from the
+ 1800 default."""
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={
+ "rebench_repo_dir": str(repo_dir),
+ "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+ "tests_timeout": 600,
+ },
+ )
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ _RecordingProvider.commands = []
+ _RecordingProvider.exec_calls = []
+ provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+ async def _run():
+ spec = harness.build_spec(task)
+ env = await AsyncSweEnvironment.start(provider, spec)
+ try:
+ await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+
+ asyncio.run(_run())
+ assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 600
+
+
+def test_run_eval_tests_timeout_absent_defaults_to_1800(tmp_path):
+ """The timeout (default 30*60) is applied to every swe-rebench run. Rows that
+ carry no tests_timeout (including SWE-bench-Verified) still get the 1800s
+ bound rather than an unbounded (None) run."""
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ metadata={
+ "rebench_repo_dir": str(repo_dir),
+ "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+ },
+ )
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ _RecordingProvider.commands = []
+ _RecordingProvider.exec_calls = []
+ provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+ async def _run():
+ spec = harness.build_spec(task)
+ env = await AsyncSweEnvironment.start(provider, spec)
+ try:
+ await harness.run_eval(env, task)
+ finally:
+ await env.cleanup()
+
+ asyncio.run(_run())
+ assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 1800
+
+
+# ---- grading parity / empty-required ----------------------------------------
+
+
+def test_grade_empty_required_resolves_true(tmp_path):
+ """``resolved`` is purely (fail_to_pass_set <= passed) and
+ (pass_to_pass_set <= passed). With no required tests, both empty sets are
+ subsets of any passed set, so resolved=True — there is no bool(required)
+ requirement."""
+ repo_dir = _write_fake_parsers(tmp_path)
+ harness = SweRebenchHarness()
+ task = _task(
+ fail_to_pass=[],
+ pass_to_pass=[],
+ metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+ )
+ artifacts = EvalArtifacts(test_output="something PASSED\n", patch_applied=True)
+ report = harness.grade(task, artifacts)
+ assert report.resolved is True
+ assert report.error_kind is None
diff --git a/resources_servers/swe_bench/tests/test_swebench.py b/resources_servers/swe_bench/tests/test_swebench.py
new file mode 100644
index 0000000000..282049630a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swebench.py
@@ -0,0 +1,234 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe-bench / swe-bench-multilingual flat (host-graded) harness.
+
+The harness runs the instance's eval script in the sandbox and grades the produced log
+host-side (swebench's per-repo parser, falling back to the generic flat parser), so it runs on
+any exec-capable provider. These tests validate provisioning (``build_spec`` / ``materialize``),
+the flat ``run_eval`` + ``grade`` path, and family validation, against a scripted ``_FakeProvider``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+import pytest
+
+from nemo_gym.sandbox import (
+ SandboxExecResult,
+ SandboxHandle,
+ SandboxStatus,
+ register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+# Canned eval-script log with the SWE-bench sentinels + pytest-style passing lines.
+_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n"
+
+
+class _FakeProvider:
+ """Scripted provider: returns a canned eval log for the eval-script run; records uploads.
+
+ Args:
+ log_text: Text returned by the eval-script (``bash``) and ``cat`` commands.
+ exec_rc: Return code for the eval-script command.
+ """
+
+ name = "fake-swebench"
+
+ def __init__(self, *, log_text="", exec_rc=0, **_):
+ self._log_text = log_text
+ self._exec_rc = exec_rc
+ self.uploaded: dict[str, str] = {}
+ self.commands: list[str] = []
+
+ async def create(self, spec):
+ return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+ async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+ self.commands.append(command)
+ rc = 0 if command.startswith("cat ") else self._exec_rc
+ return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc)
+
+ async def upload_file(self, handle, local_path, remote_path):
+ try:
+ with open(local_path, encoding="utf-8") as fh:
+ self.uploaded[remote_path] = fh.read()
+ except OSError:
+ self.uploaded[remote_path] = ""
+ return None
+
+ async def download_file(self, *a, **k):
+ return None
+
+ async def status(self, handle):
+ return SandboxStatus.RUNNING
+
+ async def close(self, handle):
+ return None
+
+ async def aclose(self):
+ return None
+
+
+register_provider("fake-swebench", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+ """Build a swe-bench ``SweTask`` with sensible defaults."""
+ base = dict(
+ instance_id="repo__inst-1",
+ image="img:tag",
+ base_commit="abc123",
+ repo_workdir="/testbed",
+ model_patch="diff --git a/x b/x\n",
+ fail_to_pass=["t::a"],
+ pass_to_pass=["t::b"],
+ benchmark="swe-bench",
+ split="test",
+ )
+ base.update(overrides)
+ return SweTask(**base)
+
+
+def test_grade_strategy_is_flat():
+ assert SweBenchHarness("swe-bench").grade_strategy == "flat-host-grade"
+ assert SweBenchHarness("swe-bench-multilingual").grade_strategy == "flat-host-grade"
+
+
+def test_unknown_family_rejected():
+ with pytest.raises(ValueError):
+ SweBenchHarness("not-a-family")
+
+
+def test_build_spec_image_workdir_metadata():
+ spec = SweBenchHarness("swe-bench").build_spec(_task())
+ assert spec.image == "img:tag"
+ assert spec.workdir == "/testbed"
+ assert spec.metadata["instance_id"] == "repo__inst-1"
+ assert spec.metadata["harness"] == "swe-bench"
+
+
+def test_build_spec_preserves_task_provider_options():
+ spec = SweBenchHarness("swe-bench").build_spec(_task(metadata={"provider_options": {"network": "host"}}))
+ assert spec.provider_options.get("network") == "host"
+
+
+def test_supports_provider_any_exec_capable():
+ harness = SweBenchHarness("swe-bench")
+ assert harness.supports_provider("docker") is True
+ assert harness.supports_provider("apptainer") is True
+ assert harness.supports_provider("opensandbox") is True
+
+
+def test_with_flat_eval_is_self():
+ harness = SweBenchHarness("swe-bench")
+ assert harness.with_flat_eval() is harness
+
+
+def test_materialize_writes_patch_diff():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = SweBenchHarness("swe-bench")
+ task = _task()
+ env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+ await harness.materialize(env, task)
+ return env.sandbox._provider
+
+ provider = asyncio.run(run())
+ assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n"
+
+
+def test_materialize_empty_patch_writes_nothing():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ async def run():
+ harness = SweBenchHarness("swe-bench")
+ task = _task(model_patch="")
+ env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+ await harness.materialize(env, task)
+ return env.sandbox._provider
+
+ provider = asyncio.run(run())
+ assert "/root/patch.diff" not in provider.uploaded
+
+
+def test_run_eval_then_grade_flat_resolved():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ # eval_script preset so flat_run_eval executes it; no instance_dict -> grade falls back to
+ # the generic flat parser over the canned passing log.
+ async def run():
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"eval_script": "echo run"})
+ env = await AsyncSweEnvironment.start({"fake-swebench": {"log_text": _PASSING_LOG}}, harness.build_spec(task))
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts)
+
+ report = asyncio.run(run())
+ assert report.resolved is True
+ assert reward_from_report(report) == 1.0
+
+
+def test_run_eval_missing_eval_script_is_unmasked_unresolved():
+ from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+ # No instance_dict and no preset eval_script -> _flat_eval_script returns "" -> the run tags an
+ # eval_error, but grading no longer masks it: matching main, an unbuildable/empty spec grades
+ # as a legitimate, unmasked unresolved result (reward 0) rather than an eval_error mask.
+ async def run():
+ harness = SweBenchHarness("swe-bench")
+ task = _task()
+ env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts)
+
+ report = asyncio.run(run())
+ assert report.error_kind is None
+ assert report.resolved is False
+ assert reward_from_report(report) == 0.0
+
+
+def test_grade_masks_on_infra_error():
+ report = SweBenchHarness("swe-bench").grade(_task(), EvalArtifacts(raw={"error_type": "timeout"}))
+ assert report.error_kind == "timeout"
+ assert reward_from_report(report) == 0.0
+
+
+def test_flat_eval_script_empty_without_instance_dict():
+ assert SweBenchHarness("swe-bench")._flat_eval_script(_task()) == ""
+
+
+def test_grade_fails_loud_when_swebench_unavailable(monkeypatch):
+ """Grading a SWE-bench instance without a working ``swebench`` install fails loudly, not silently.
+
+ Degrading to the generic pytest-only parser would mis-score non-pytest repos (e.g. django) as
+ unresolved, silently skewing the resolve rate. Instead grading raises ``GraderDependencyError``
+ so the misconfiguration surfaces.
+ """
+ import sys
+
+ from resources_servers.swe_bench.harness import GraderDependencyError
+
+ # Simulate a missing / broken swebench install for the import inside _swebench_flat_grade.
+ monkeypatch.setitem(sys.modules, "swebench.harness.constants", None)
+ harness = SweBenchHarness("swe-bench")
+ task = _task(metadata={"instance_dict": {"instance_id": "repo__inst-1", "repo": "x/y"}})
+ artifacts = EvalArtifacts(test_output=_PASSING_LOG, return_code=0, raw={})
+ with pytest.raises(GraderDependencyError):
+ harness.grade(task, artifacts)
diff --git a/resources_servers/swe_bench/tests/test_task.py b/resources_servers/swe_bench/tests/test_task.py
new file mode 100644
index 0000000000..4888746d86
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_task.py
@@ -0,0 +1,81 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+
+import pytest
+
+from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming
+from resources_servers.swe_bench.task import (
+ ENVIRONMENT_NAME,
+ SweTask,
+ TaskSubmission,
+ build_task,
+ harness_family_key,
+ parse_submission,
+ parse_task_from_request,
+)
+
+
+def _sample_row() -> dict:
+ inst = {
+ "instance_id": "astropy__astropy-12907",
+ "base_commit": "abc123",
+ "test_patch": "",
+ "FAIL_TO_PASS": '["tests/test_x.py::a"]',
+ "PASS_TO_PASS": '["tests/test_x.py::b"]',
+ }
+ return {
+ "instance_id": "astropy__astropy-12907",
+ "dataset_name": "princeton-nlp/SWE-bench_Verified",
+ "split": "test",
+ "problem_statement": "Fix the bug.",
+ "instance_dict": json.dumps(inst),
+ "responses_create_params": NeMoGymResponseCreateParamsNonStreaming(
+ input=[{"role": "user", "content": "Fix the bug."}],
+ ),
+ }
+
+
+def test_harness_family_key_from_dataset_name() -> None:
+ assert harness_family_key("princeton-nlp/SWE-bench_Verified") == "swe-bench"
+ assert harness_family_key("something/R2E-Gym/foo") == "r2e-gym"
+
+
+def test_build_task_sets_benchmark_fields() -> None:
+ task = build_task(_sample_row(), container_formatter="swebench/sweb.eval.x86_64.{instance_id}")
+ assert task.task_id == "astropy__astropy-12907"
+ assert task.harness_family == "swe-bench"
+ assert task.dataset_name == "princeton-nlp/SWE-bench_Verified"
+ assert task.problem_statement == "Fix the bug."
+ assert task.metadata["instance_dict"]["base_commit"] == "abc123"
+
+
+def test_public_view_excludes_privileged_metadata() -> None:
+ task = build_task(_sample_row(), container_formatter="x.{instance_id}")
+ public = task.public_view()
+ assert public.task_id == task.task_id
+ assert public.environment == ENVIRONMENT_NAME
+ assert public.harness_family == "swe-bench"
+ assert not hasattr(public, "instance_dict")
+
+
+def test_parse_task_from_request_requires_instance_id() -> None:
+ class Body:
+ responses_create_params = None
+ verifier_metadata = {}
+
+ with pytest.raises(ValueError, match="instance_id"):
+ parse_task_from_request(Body(), container_formatter="x.{instance_id}")
+
+
+def test_with_submission() -> None:
+ task = SweTask(instance_id="x", benchmark="swe-bench")
+ updated = task.with_submission(TaskSubmission(model_patch="diff"))
+ assert updated.model_patch == "diff"
+
+
+def test_parse_submission_accepts_git_patch_alias() -> None:
+ assert parse_submission({"git_patch": "p"}).model_patch == "p"
diff --git a/resources_servers/swe_bench/verify_task.py b/resources_servers/swe_bench/verify_task.py
new file mode 100644
index 0000000000..3c08c5a3cf
--- /dev/null
+++ b/resources_servers/swe_bench/verify_task.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Verification orchestrator for the SWE environment.
+
+Grades an agent patch via the ``swe_bench`` resources server ``/verify`` endpoint.
+Each call runs in a fresh sandbox obtained via ``acquire_sandbox`` (always torn down
+afterwards), bounded by a per-call eval timeout.
+
+Every eval spec is stamped with a ``ttl_s`` so TTL-honoring backends (such as
+opensandbox) self-expire orphaned sandboxes.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import dataclasses
+from collections.abc import Mapping
+from typing import Any
+
+# Importing this package registers the swe_bench harnesses; the docker/apptainer
+# providers are built into nemo_gym.sandbox and resolve lazily (no import needed).
+import resources_servers.swe_bench.harnesses # noqa: F401
+from nemo_gym.sandbox import SandboxProvider
+from resources_servers.swe_bench.harness import (
+ GraderDependencyError,
+ SweEvalReport,
+ SweTask,
+ get_harness,
+ reward_from_report,
+)
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+
+
+#: Slack added to the eval timeout when stamping a sandbox TTL (covers spin-up +
+#: teardown so a TTL-honoring backend does not expire a still-running eval).
+_TTL_SLACK_S = 600.0
+
+
+class ProviderCapabilityError(RuntimeError):
+ """Raised when a task's harness does not support the configured provider."""
+
+
+def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str:
+ """Return the provider's name.
+
+ Args:
+ provider: Either a single-key provider mapping or a ``SandboxProvider``
+ instance.
+
+ Returns:
+ str: The provider name, or ``"?"`` if it cannot be determined.
+ """
+ if isinstance(provider, Mapping):
+ return next(iter(provider), "?")
+ return getattr(provider, "name", "?")
+
+
+async def verify_task(
+ provider: Mapping[str, Any] | SandboxProvider,
+ task: SweTask,
+ *,
+ run_golden: bool = False,
+ eval_timeout_s: float | None = None,
+) -> SweEvalReport:
+ """Grade a task's patch in a fresh sandbox and return a report.
+
+ Selects the harness for the task's benchmark, optionally substitutes the
+ golden patch, then resets the repo, materializes the patch, runs the eval,
+ and grades the artifacts. An empty patch short-circuits without spinning up
+ a sandbox. A genuine wall-clock eval timeout is returned as a report carrying
+ ``error_kind="eval_timeout"``; any other eval-stage failure except
+ ``GraderDependencyError`` is returned unmasked (``resolved=False``,
+ ``error_kind=None``) rather than raised, to mirror main.
+
+ Args:
+ provider: Single-key provider mapping or ``SandboxProvider`` selecting
+ the sandbox backend.
+ task: The task whose patch is graded.
+ run_golden: When True, grade the task's golden patch instead of the
+ model patch.
+ eval_timeout_s: Optional override for the per-call eval timeout in
+ seconds; falls back to the task metadata or a default.
+
+ Returns:
+ SweEvalReport: The grading outcome, with ``error_kind="eval_timeout"`` set
+ only on a genuine wall-clock eval timeout; non-timeout eval-stage
+ failures are reported unmasked (``resolved=False``, ``error_kind=None``).
+
+ Raises:
+ ProviderCapabilityError: If the task's harness does not support the provider.
+ GraderDependencyError: If a required grading dependency is unavailable for an
+ instance the harness must grade exactly (propagated, not swallowed).
+ """
+ harness = get_harness(task.benchmark)
+ if task.metadata.get("flat_eval"):
+ # Grade host-side (flat) so nested families (swe-bench / r2e-gym) can be graded on
+ # exec-only providers like docker; a no-op for already-flat families.
+ harness = harness.with_flat_eval()
+
+ if run_golden:
+ task = dataclasses.replace(task, model_patch=task.metadata.get("golden_patch", ""))
+
+ # Empty/falsy-patch fast path: skip eval spin-up entirely.
+ if not (task.model_patch or "").strip():
+ return SweEvalReport(instance_id=task.instance_id, patch_exists=False, resolved=False)
+
+ provider_name = _provider_name(provider)
+ if not harness.supports_provider(provider_name):
+ raise ProviderCapabilityError(
+ f"Harness {harness.name!r} does not support provider {provider_name!r} "
+ f"(grade_strategy={harness.grade_strategy})"
+ )
+
+ spec = harness.build_spec(task)
+ timeout = eval_timeout_s if eval_timeout_s is not None else float(task.metadata.get("eval_timeout_s", 1800))
+ # Stamp a TTL so backends that honor it (opensandbox) self-expire an eval sandbox
+ # orphaned by a hard crash. docker ignores ttl_s; its finally-teardown covers it.
+ if spec.ttl_s is None:
+ spec = dataclasses.replace(spec, ttl_s=timeout + _TTL_SLACK_S)
+
+ try:
+ async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env:
+
+ async def _sequence() -> SweEvalReport:
+ await harness.reset_repo(env, task)
+ await harness.materialize(env, task)
+ artifacts = await harness.run_eval(env, task)
+ return harness.grade(task, artifacts)
+
+ return await asyncio.wait_for(_sequence(), timeout=timeout)
+ except GraderDependencyError:
+ # A required grader dependency is missing (e.g. swebench for a SWE-bench instance).
+ # Propagate rather than degrading to an unmasked reward-0 so the misconfiguration is
+ # loud (a crash in the standalone path; every sample masked in the anyswe path) instead
+ # of silently skewing the resolve rate.
+ raise
+ except (asyncio.TimeoutError, TimeoutError):
+ # Genuine wall-clock eval timeout: mask via error_kind. This mirrors main's
+ # app.py, which sets eval_timed_out (-> mask_sample) only when the final eval
+ # elapsed time reaches the configured tests timeout.
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ error_kind="eval_timeout",
+ tests_status={"timeout_s": timeout},
+ )
+ except Exception as exc: # non-timeout eval-stage failure -> unmasked reward 0
+ # A non-timeout eval-stage crash is NOT masked: main's app.py catches any eval
+ # exception, returns no report file (resolved=False) and leaves eval_timed_out
+ # False, so the sample stays in the gradient at reward 0. Returning
+ # error_kind=None here keeps mask_sample aligned with main rather than masking
+ # the infra crash (which main does not do).
+ return SweEvalReport(
+ instance_id=task.instance_id,
+ patch_exists=bool(task.model_patch),
+ resolved=False,
+ error_kind=None,
+ tests_status={"exception": repr(exc)},
+ )
+
+
+def report_to_reward(report: SweEvalReport) -> float:
+ """Convert an eval report into a scalar reward.
+
+ Args:
+ report: The grading outcome to score.
+
+ Returns:
+ float: The reward derived from the report.
+ """
+ return reward_from_report(report)
diff --git a/responses_api_agents/claude_code_agent/app.py b/responses_api_agents/claude_code_agent/app.py
index 6970d92ce1..9a399a5094 100644
--- a/responses_api_agents/claude_code_agent/app.py
+++ b/responses_api_agents/claude_code_agent/app.py
@@ -15,9 +15,11 @@
import asyncio
import copy
+import dataclasses
import json
import logging
import os
+import shlex
import shutil
import subprocess
import tempfile
@@ -50,6 +52,7 @@
NeMoGymResponseOutputTokensDetails,
NeMoGymResponseUsage,
)
+from nemo_gym.sandbox import AsyncSandbox, SandboxResources, SandboxSpec
from nemo_gym.server_utils import get_response_json, raise_for_status
from nemo_gym.skills import stage_skills
from responses_api_agents.claude_code_agent.setup_claude_code import ensure_claude_code
@@ -237,10 +240,13 @@ class ClaudeCodeAgentConfig(BaseResponsesAPIAgentConfig):
bare: bool = True
mcp_config: Optional[str] = None
settings: Optional[str] = None
+ sandbox_provider: Optional[dict[str, Any]] = None
+ in_box_timeout_s: int = 1800
class ClaudeCodeAgentRunRequest(BaseRunRequest):
model_config = ConfigDict(extra="allow")
+ verifier_metadata: Optional[dict[str, Any]] = None
class ClaudeCodeAgentVerifyResponse(BaseVerifyResponse):
@@ -490,6 +496,131 @@ def _write_rollout_mcp_config(self, seed_response_json: dict[str, Any], output_d
config_path.write_text(json.dumps(config, indent=2, sort_keys=True))
return str(config_path)
+ @staticmethod
+ def _sandbox_spec_from_descriptor(spec_dict: dict[str, Any]) -> SandboxSpec:
+ payload = dict(spec_dict)
+ resources = payload.pop("resources", None)
+ if resources is None:
+ resources = SandboxResources()
+ elif not isinstance(resources, SandboxResources):
+ resources = SandboxResources.from_mapping(resources)
+ return SandboxSpec(**payload, resources=resources)
+
+ def _anthropic_env(self) -> tuple[dict[str, str], str]:
+ base_url = self._resolve_base_url()
+ model = self.config.model if base_url else self.config.model.split("/")[-1]
+ api_key = self.config.anthropic_api_key
+ env = {
+ "ANTHROPIC_API_KEY": api_key, # pragma: allowlist secret
+ "ANTHROPIC_MODEL": model,
+ "ANTHROPIC_DEFAULT_HAIKU_MODEL": model,
+ "ANTHROPIC_DEFAULT_SONNET_MODEL": model,
+ "ANTHROPIC_DEFAULT_OPUS_MODEL": model,
+ "CLAUDE_CODE_SUBAGENT_MODEL": model,
+ "IS_SANDBOX": "1",
+ }
+ if base_url:
+ env["ANTHROPIC_BASE_URL"] = base_url
+ env["ANTHROPIC_AUTH_TOKEN"] = api_key or "local"
+ return env, model
+
+ async def _run_in_box(
+ self,
+ body: ClaudeCodeAgentRunRequest,
+ seed_resp_json: dict[str, Any],
+ *,
+ skills_path: Optional[str] = None,
+ ) -> tuple[dict[str, Any], str]:
+ spec_dict = (seed_resp_json.get("sandbox") or {}).get("spec") or {}
+ workdir = spec_dict.get("workdir") or "/testbed"
+ spec = self._sandbox_spec_from_descriptor(spec_dict)
+ egress_env = (seed_resp_json.get("egress") or {}).get("env") or {}
+ anthropic_env, model = self._anthropic_env()
+ spec = dataclasses.replace(spec, env={**spec.env, **egress_env, **anthropic_env})
+
+ provider = self.config.sandbox_provider or {"docker": {}}
+ sandbox = AsyncSandbox(provider, spec)
+ await sandbox.start()
+ claude_config_dir: Path | None = None
+ try:
+ claude_config_dir = self._setup_config_dir(skills_path=skills_path)
+ remote_cfg = "/tmp/nemo_gym_claude"
+ await sandbox.exec(f"mkdir -p {shlex.quote(remote_cfg)}", cwd=workdir, timeout_s=60)
+ await sandbox.upload(str(claude_config_dir / "settings.json"), f"{remote_cfg}/settings.json")
+
+ params = body.responses_create_params.model_copy(deep=True)
+ if isinstance(params.input, str):
+ params.input = [NeMoGymEasyInputMessage(role="user", content=params.input)]
+ user_message, input_system = _extract_instruction(params.input)
+ system_parts = [p for p in [self.config.system_prompt, input_system] if p]
+ system_prompt = "\n\n".join(system_parts) if system_parts else None
+
+ cmd_parts = self._build_command(
+ model,
+ user_message,
+ system_prompt=system_prompt,
+ skills_active=bool(skills_path),
+ )
+ env_prefix = " ".join(f"{shlex.quote(k)}={shlex.quote(str(v))}" for k, v in spec.env.items() if v is not None)
+ remote_cmd = f"{env_prefix} CLAUDE_CONFIG_DIR={shlex.quote(remote_cfg)} {shlex.join(cmd_parts)}"
+ result = await sandbox.exec(remote_cmd, cwd=workdir, timeout_s=self.config.in_box_timeout_s)
+ stdout = result.stdout or ""
+ if result.error_type == "timeout":
+ LOG.warning("claude-code in-box timed out after %ss", self.config.in_box_timeout_s)
+ elif result.return_code not in (0, None) and stdout.strip() == "":
+ LOG.warning(
+ "claude-code in-box exited %s: %s",
+ result.return_code,
+ (result.stderr or "")[:500],
+ )
+
+ output_items, usage = parse_stream_json(stdout)
+ if not any(
+ getattr(item, "type", None) == "message" and getattr(item, "role", None) == "assistant"
+ for item in output_items
+ ):
+ output_items.append(
+ NeMoGymResponseOutputMessage(
+ id=f"msg_{uuid4().hex}",
+ content=[NeMoGymResponseOutputText(text="", annotations=[])],
+ role="assistant",
+ status="completed",
+ type="message",
+ )
+ )
+
+ input_tokens = usage.get("input_tokens", 0)
+ output_tokens = usage.get("output_tokens", 0)
+ agent_resp = NeMoGymResponse(
+ id=f"resp_{uuid4().hex}",
+ created_at=int(time()),
+ model=model,
+ object="response",
+ output=output_items,
+ tool_choice=params.tool_choice,
+ tools=params.tools,
+ parallel_tool_calls=params.parallel_tool_calls,
+ usage=NeMoGymResponseUsage(
+ input_tokens=input_tokens,
+ input_tokens_details=NeMoGymResponseInputTokensDetails(cached_tokens=0),
+ output_tokens=output_tokens,
+ output_tokens_details=NeMoGymResponseOutputTokensDetails(reasoning_tokens=0),
+ total_tokens=input_tokens + output_tokens,
+ ),
+ )
+
+ patch_result = await sandbox.exec(
+ f"cd {shlex.quote(workdir)} && git add -A && git diff --cached",
+ cwd=workdir,
+ timeout_s=120,
+ )
+ patch = patch_result.stdout or ""
+ return agent_resp.model_dump(mode="json"), patch
+ finally:
+ if claude_config_dir is not None:
+ shutil.rmtree(claude_config_dir, ignore_errors=True)
+ await sandbox.stop()
+
async def _create_response(
self,
body: NeMoGymResponseCreateParamsNonStreaming,
@@ -569,23 +700,32 @@ async def run(self, request: Request, body: ClaudeCodeAgentRunRequest) -> Claude
cookies = seed_resp.cookies
seed_resp_json = await get_response_json(seed_resp)
- # The run-level skills_ref (stamped by rollout collection) rides on the request body
- # (extra="allow"). Pass its path straight into _create_response so the CLI invocation
- # can stage the skills into its per-request CLAUDE_CONFIG_DIR. run() calls _create_response
- # in-process, so no metadata side-channel is needed (unlike the schema-forbidden HTTP path).
skills_path = ((body.model_extra or {}).get(SKILLS_REF_KEY_NAME) or {}).get("path")
-
- with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
- mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
- agent_resp = await self._create_response(
- body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
- )
- agent_resp_json = agent_resp.model_dump(mode="json")
+ topology = (seed_resp_json.get("placement") or {}).get("topology") or "none"
+
+ if topology == "agent_in_env":
+ agent_resp_json, model_patch = await self._run_in_box(body, seed_resp_json, skills_path=skills_path)
+ verifier_metadata = {
+ **(body.verifier_metadata or {}),
+ **(seed_resp_json.get("verifier_metadata") or {}),
+ "model_patch": model_patch,
+ }
+ else:
+ with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
+ mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
+ agent_resp = await self._create_response(
+ body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
+ )
+ agent_resp_json = agent_resp.model_dump(mode="json")
+ verifier_metadata = {
+ **(body.verifier_metadata or {}),
+ **(seed_resp_json.get("verifier_metadata") or {}),
+ }
verify_resp = await self.server_client.post(
server_name=self.config.resources_server.name,
url_path="/verify",
- json=body.model_dump() | {"response": agent_resp_json},
+ json=body.model_dump() | {"response": agent_resp_json, "verifier_metadata": verifier_metadata},
cookies=cookies,
)
await raise_for_status(verify_resp)
diff --git a/tests/unit_tests/test_docker_provider.py b/tests/unit_tests/test_docker_provider.py
new file mode 100644
index 0000000000..46062a60d6
--- /dev/null
+++ b/tests/unit_tests/test_docker_provider.py
@@ -0,0 +1,213 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the local Docker ``SandboxProvider`` (CLI mocked, no docker required)."""
+
+import asyncio
+from pathlib import Path
+from typing import Any, Callable
+
+import pytest
+
+from nemo_gym.sandbox.providers.base import (
+ SandboxCreateError,
+ SandboxResources,
+ SandboxSpec,
+ SandboxStatus,
+)
+from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider
+
+
+class RunRecorder:
+ """Stand-in for ``DockerSandboxProvider._run`` that records argv and returns canned output.
+
+ The responder maps the captured ``docker`` args to a ``(rc, stdout, stderr)`` tuple, and may
+ raise (e.g. ``TimeoutError``) to simulate a CLI failure.
+ """
+
+ def __init__(self, responder: Callable[[list[str]], tuple[int, str, str]]) -> None:
+ self.calls: list[dict[str, Any]] = []
+ self._responder = responder
+
+ async def __call__(self, *args: str, timeout_s: float | None = None) -> tuple[int, str, str]:
+ self.calls.append({"args": list(args), "timeout_s": timeout_s})
+ return self._responder(list(args))
+
+
+def _make_provider(
+ monkeypatch: pytest.MonkeyPatch, responder: Callable[[list[str]], tuple[int, str, str]], **kwargs: Any
+) -> tuple[DockerSandboxProvider, RunRecorder]:
+ provider = DockerSandboxProvider(**kwargs)
+ rec = RunRecorder(responder)
+ monkeypatch.setattr(provider, "_run", rec)
+ return provider, rec
+
+
+def _ran(rec: RunRecorder, *prefix: str) -> bool:
+ """True if any recorded call's args start with ``prefix`` (e.g. ``"rm", "-f"``)."""
+ return any(call["args"][: len(prefix)] == list(prefix) for call in rec.calls)
+
+
+# --------------------------------------------------------------------------- #
+# Construction
+# --------------------------------------------------------------------------- #
+def test_concurrency_must_be_positive() -> None:
+ """A non-positive concurrency is rejected up front."""
+ with pytest.raises(ValueError):
+ DockerSandboxProvider(concurrency=0)
+
+
+def test_concurrency_bounds_the_semaphore() -> None:
+ """The provider's shared semaphore is sized to the configured concurrency."""
+ assert DockerSandboxProvider(concurrency=4)._semaphore._value == 4
+
+
+# --------------------------------------------------------------------------- #
+# create()
+# --------------------------------------------------------------------------- #
+def test_create_returns_handle_with_last_line_id(monkeypatch: pytest.MonkeyPatch) -> None:
+ """create() uses the LAST stdout line as the container id and pre-assigns a unique name."""
+ provider, rec = _make_provider(
+ monkeypatch, lambda args: (0, "WARNING: noise\ncontainer-abc\n", ""), network="host"
+ )
+ handle = asyncio.run(provider.create(SandboxSpec(image="img:tag", workdir="/testbed", env={"A": "1"})))
+ assert handle.sandbox_id == "container-abc"
+ run_args = rec.calls[0]["args"]
+ assert run_args[:3] == ["run", "-d", "--init"]
+ assert "--name" in run_args and run_args[run_args.index("--name") + 1].startswith("nemo-gym-")
+ assert ["--network", "host"] == run_args[run_args.index("--network") : run_args.index("--network") + 2]
+ assert "img:tag" in run_args
+
+
+def test_create_requires_image() -> None:
+ """A spec without an image is rejected before any docker call."""
+ with pytest.raises(SandboxCreateError):
+ asyncio.run(DockerSandboxProvider().create(SandboxSpec(image=None)))
+
+
+def test_create_empty_stdout_guard_and_reap(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 0 with blank stdout raises SandboxCreateError (not IndexError) and reaps the pre-assigned name."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, " \n", "") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="did not return a container id"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_nonzero_rc_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A non-zero ``docker run`` reaps the orphan and raises with the stderr."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (125, "", "boom") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="boom"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_timeout_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out ``docker run`` reaps the (possibly daemon-started) orphan by name."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        if args[0] == "run":
+            raise asyncio.TimeoutError
+        return (0, "", "")
+
+    provider, rec = _make_provider(monkeypatch, responder)
+    with pytest.raises(SandboxCreateError, match="timed out"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_applies_resource_limits(monkeypatch: pytest.MonkeyPatch) -> None:
+    """Resource requests become ``--memory``/``--cpus``/``--gpus`` run args."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "cid\n", ""))
+    spec = SandboxSpec(image="img:tag", resources=SandboxResources(cpu=2, memory_mib=512, gpu=1))
+    asyncio.run(provider.create(spec))
+    run_args = rec.calls[0]["args"]
+    assert "--memory=512m" in run_args
+    assert "--cpus=2" in run_args
+    assert "--gpus=all" in run_args
+
+
+# --------------------------------------------------------------------------- #
+# exec()
+# --------------------------------------------------------------------------- #
+def test_exec_classifies_docker_level_failure(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 125/126/127 with no stdout is a docker-level (``sandbox``) failure."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (125, "", "no such container"))
+    res = asyncio.run(provider.exec(_handle(), "echo hi"))
+    assert res.return_code == 125
+    assert res.error_type == "sandbox"
+
+
+def test_exec_success_has_no_error_type(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A successful exec carries stdout and no error type."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "ok", ""))
+    res = asyncio.run(provider.exec(_handle(), "true"))
+    assert res.return_code == 0 and res.stdout == "ok" and res.error_type is None
+
+
+def test_exec_timeout_returns_124(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out exec returns rc 124 with a ``timeout`` error type rather than raising."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise asyncio.TimeoutError
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    res = asyncio.run(provider.exec(_handle(), "sleep 1", timeout_s=0.01))
+    assert res.return_code == 124 and res.error_type == "timeout"
+
+
+# --------------------------------------------------------------------------- #
+# status / close / file transfer
+# --------------------------------------------------------------------------- #
+def test_status_running_and_stopped(monkeypatch: pytest.MonkeyPatch) -> None:
+    """status() maps docker inspect output to RUNNING/STOPPED/UNKNOWN."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "true\n", ""))
+    assert asyncio.run(provider.status(_handle())) is SandboxStatus.RUNNING
+    provider2, _ = _make_provider(monkeypatch, lambda args: (0, "false\n", ""))
+    assert asyncio.run(provider2.status(_handle())) is SandboxStatus.STOPPED
+    provider3, _ = _make_provider(monkeypatch, lambda args: (1, "", "gone"))
+    assert asyncio.run(provider3.status(_handle())) is SandboxStatus.UNKNOWN
+
+
+def test_close_force_removes(monkeypatch: pytest.MonkeyPatch) -> None:
+    """close() force-removes the container by id."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "", ""))
+    asyncio.run(provider.close(_handle()))
+    assert _ran(rec, "rm", "-f", "cid")
+
+
+def test_upload_failure_raises(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    """A failed ``docker cp`` upload raises a clear RuntimeError."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "", "") if args[0] == "exec" else (1, "", "nope"))
+    src = tmp_path / "f.txt"
+    src.write_text("x")
+    with pytest.raises(RuntimeError, match="upload failed"):
+        asyncio.run(provider.upload_file(_handle(), src, "/dst/f.txt"))
+
+
+# --------------------------------------------------------------------------- #
+# _reap_orphan() / helpers
+# --------------------------------------------------------------------------- #
+def test_reap_orphan_swallows_errors(monkeypatch: pytest.MonkeyPatch) -> None:
+    """_reap_orphan never raises, even when the ``docker rm`` itself fails/raises."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise RuntimeError("docker daemon down")
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    asyncio.run(provider._reap_orphan("nemo-gym-x"))  # must not raise
+
+
+def _handle():
+    """A minimal docker SandboxHandle for exec/status/close tests."""
+    from nemo_gym.sandbox.providers.base import SandboxHandle
+
+    return SandboxHandle(sandbox_id="cid", provider_name="docker", raw={"workdir": "/testbed"})