diff --git a/fern/versions/latest/pages/about/concepts/key-terminology.mdx b/fern/versions/latest/pages/about/concepts/key-terminology.mdx
index 0a90334638..c69943494d 100644
--- a/fern/versions/latest/pages/about/concepts/key-terminology.mdx
+++ b/fern/versions/latest/pages/about/concepts/key-terminology.mdx
@@ -63,7 +63,11 @@ The FastAPI service (under `resources_servers/`) that holds per-task state, expo
 
 **Agent Server (Responses API Agent)**
 
-The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the harness that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own.
+The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the **agent harness** that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own.
+
+**Harness (disambiguation)**
+
+“Harness” is overloaded in the agent-eval community. In Gym docs it usually means **agent harness** (orchestration in an agent server). In [SWE-bench](https://www.swebench.com/SWE-bench/reference/harness/), it often means the **grading pipeline** (`swebench.harness`). See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology).
 
 **Model Server (Responses API Model)**
 
diff --git a/fern/versions/latest/pages/evaluation/index.mdx b/fern/versions/latest/pages/evaluation/index.mdx
index 850ddd0cff..412c359025 100644
--- a/fern/versions/latest/pages/evaluation/index.mdx
+++ b/fern/versions/latest/pages/evaluation/index.mdx
@@ -40,6 +40,10 @@ Harness changes often affect metrics, and some models are better tuned to use sp
 
 NeMo Gym treats an agent as model plus harness. The model server stays stateless; the agent server owns the loop that calls the model, routes tool calls, manages conversation state, and asks the resources server to verify the final attempt.
 
+<Note>
+Outside Gym, “harness” often means the SWE-bench **grading pipeline** (`swebench.harness`) rather than agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology).
+</Note>
+
 <Cards>
 
 <Card title="Agent Server" href="/agent-server">
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx
new file mode 100644
index 0000000000..c99929d2f6
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx
@@ -0,0 +1,250 @@
+---
+title: "Claude Code Agent — Protocol Stack & Data Contracts"
+description: "How claude_code_agent, model servers, PR #1627 /v1/messages, and NeMoGymResponse fit together."
+---
+This engineering note summarizes the protocol stack behind the `claude_code_agent` harness: which Gym entities participate in a rollout, which data contracts apply at each hop, what [PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added, and where RL metadata lives.
+
+## The four Gym concepts and their server types
+
+An environment decomposes into four concepts. Three of them map to a FastAPI server type; the dataset is plain JSONL that supplies the request for each task:
+
+| Concept | Component | Key endpoints |
+| --- | --- | --- |
+| Dataset | JSONL rows | `responses_create_params` per task |
+| Agent harness | `responses_api_agents/` | `POST /run`, `POST /v1/responses` |
+| Verifier + state | `resources_servers/` | `POST /seed_session`, `POST /verify` |
+| Model | `responses_api_models/` | `POST /v1/responses`, `/v1/chat/completions`, `/v1/messages` |
+
+The **Claude Code CLI** (`claude -p`) is *not* a Gym server. It is a black-box subprocess spawned by `claude_code_agent` that only speaks the **Anthropic Messages API**.
+
+## Gym's canonical data contracts
+
+Gym standardizes on the **OpenAI Responses API shape** (with NeMo-specific extensions). Two types form the request/response pair:
+
+| Type | Role |
+| --- | --- |
+| `NeMoGymResponseCreateParamsNonStreaming` | **Request** — input messages, tools, sampling params (from dataset JSONL) |
+| `NeMoGymResponse` | **Response** — accumulated trajectory in `output[]`, plus `usage` |
+
+`NeMoGymResponse` is **not** owned exclusively by models or agents. It is Gym's **shared trajectory contract**:
+
+- **Model servers** produce it on `POST /v1/responses`
+- **Agents** produce it on `POST /v1/responses`
+- **Resources servers** consume it in `POST /verify` (`BaseVerifyRequest.response`)
+- **Rollout collection** (`ng_collect_rollouts`) reads it from agent `/run` results
+
+The trajectory building block is **`NeMoGymResponse.output[]`** — a sequence of messages, tool calls, tool results, and reasoning items.
+
+### Training variants (`*ForTraining`)
+
+RL metadata lives on **individual output items**, not on the top-level response envelope:
+
+```python
+class TokenIDLogProbMixin(BaseModel):
+    prompt_token_ids: List[int]
+    generation_token_ids: List[int]
+    generation_log_probs: List[float]
+```
+
+Training subclasses (`NeMoGymResponseOutputMessageForTraining`, etc.) mix this in. A rollout JSONL row with RL data looks like:
+
+```json
+{
+  "output": [
+    {
+      "type": "message",
+      "role": "assistant",
+      "content": [...],
+      "prompt_token_ids": [1, 2, 3],
+      "generation_token_ids": [4, 5, 6],
+      "generation_log_probs": [-0.1, -0.2, -0.3]
+    }
+  ]
+}
+```
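+
+A minimal sketch of how a consumer might sanity-check those fields before training, assuming the row layout shown above. `check_rl_fields` is a hypothetical helper, not part of Gym, and the rule that `generation_token_ids` and `generation_log_probs` have equal length is an assumption about how the two lists pair up:
+
+```python
+import json
+from typing import Any, Dict, List
+
+# Field names taken from TokenIDLogProbMixin above.
+RL_FIELDS = ("prompt_token_ids", "generation_token_ids", "generation_log_probs")
+
+
+def check_rl_fields(row: Dict[str, Any]) -> List[int]:
+    """Return indices of output items that carry a complete, aligned set of RL fields."""
+    complete = []
+    for index, item in enumerate(row.get("output", [])):
+        if not all(field in item for field in RL_FIELDS):
+            continue  # plain (non-training) item, nothing to check
+        # Assumption: one log-prob per generated token.
+        if len(item["generation_token_ids"]) != len(item["generation_log_probs"]):
+            raise ValueError(f"output[{index}]: token/log-prob length mismatch")
+        complete.append(index)
+    return complete
+
+
+# One rollout JSONL line: the first item has RL data, the second is a plain message.
+line = json.dumps({
+    "output": [
+        {"type": "message", "role": "assistant", "content": [],
+         "prompt_token_ids": [1, 2, 3],
+         "generation_token_ids": [4, 5, 6],
+         "generation_log_probs": [-0.1, -0.2, -0.3]},
+        {"type": "message", "role": "assistant", "content": []},
+    ]
+})
+print(check_rl_fields(json.loads(line)))
+```
+
+Because the fields sit on individual items, a row can mix training and plain items, which is why the check walks `output[]` instead of looking at the envelope.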
+
+## Alternate wire formats (not the Gym trajectory contract)
+
+Model servers expose **three** HTTP endpoints. Only one returns `NeMoGymResponse` on the wire:
+
+| Endpoint | Wire format | Gym trajectory? |
+| --- | --- | --- |
+| `POST /v1/responses` | `NeMoGymResponse` JSON | **Yes** |
+| `POST /v1/chat/completions` | `NeMoGymChatCompletion` JSON | No — one chat turn; converted internally or by caller |
+| `POST /v1/messages` | Anthropic Message JSON or SSE | No — foreign protocol adapter ([PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627)) |
+
+`NeMoGymChatCompletion` is a **backend/wire format** (one assistant turn in `choices[0].message`). `vllm_model` uses it internally: `responses()` converts to chat params, calls `chat_completions()`, then converts back to `NeMoGymResponse.output[]`.
+
+Agents like `simple_agent` call the model server's `POST /v1/responses` directly and never see chat completions. `harbor_agent` calls chat completions directly and converts its trajectory to `NeMoGymResponse` output items at the end.
+
+## What PR #1627 added (and did not add)
+
+[PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added a **third spoke** on model servers — not changes to the agent:
+
+**Before:** `SimpleResponsesAPIModel` exposed `/v1/chat/completions` and `/v1/responses` only.
+
+**After:** Every model server also exposes `POST /v1/messages` with a default handler that:
+
+1. Converts Anthropic request → `NeMoGymResponseCreateParams`
+2. Delegates to the server's own `responses()` → internal `NeMoGymResponse`
+3. Converts `NeMoGymResponse` → Anthropic response (JSON or synthesized SSE)
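+
+The three steps above can be sketched with plain dictionaries. This is an illustration only: `anthropic_to_responses_params`, `responses_to_anthropic`, `handle_messages`, and `fake_responses` are hypothetical names, the field mappings are assumptions, and the real converter in Gym may handle tools, streaming, and many more fields:
+
+```python
+from typing import Any, Callable, Dict
+
+
+def anthropic_to_responses_params(request: Dict[str, Any]) -> Dict[str, Any]:
+    """Step 1 (assumed mapping): Anthropic Messages request -> Responses-style params."""
+    params: Dict[str, Any] = {
+        "input": [{"role": m["role"], "content": m["content"]} for m in request["messages"]],
+        "max_output_tokens": request["max_tokens"],
+    }
+    if "system" in request:
+        params["instructions"] = request["system"]
+    return params
+
+
+def responses_to_anthropic(response: Dict[str, Any]) -> Dict[str, Any]:
+    """Step 3 (assumed mapping): keep assistant text only; extra item fields are dropped."""
+    blocks = []
+    for item in response["output"]:
+        if item.get("type") != "message":
+            continue
+        for part in item["content"]:
+            blocks.append({"type": "text", "text": part["text"]})
+    return {"type": "message", "role": "assistant", "content": blocks}
+
+
+def handle_messages(
+    request: Dict[str, Any],
+    responses_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
+) -> Dict[str, Any]:
+    """Steps 1-3 chained; responses_fn stands in for the server's own responses()."""
+    return responses_to_anthropic(responses_fn(anthropic_to_responses_params(request)))
+
+
+def fake_responses(params: Dict[str, Any]) -> Dict[str, Any]:
+    # Stand-in backend that echoes the last user message and attaches token IDs.
+    text = params["input"][-1]["content"]
+    return {"output": [{"type": "message", "role": "assistant",
+                        "content": [{"type": "output_text", "text": f"echo: {text}"}],
+                        "generation_token_ids": [7, 8]}]}
+
+
+reply = handle_messages(
+    {"model": "any", "max_tokens": 64, "messages": [{"role": "user", "content": "hi"}]},
+    fake_responses,
+)
+print(reply)
+```
+
+Note that the token IDs attached by the stand-in backend do not survive step 3 in this sketch, which mirrors the behavior the note describes for the Anthropic conversion.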
+
+**Not in PR #1627:**
+
+- `claude_code_agent` itself (from #1336) — already had `model_server` ref and `_resolve_base_url()`
+- Default `reasoning_gym_claude_code_agent.yaml` — still points at real Anthropic API
+- RL side-channel plumbing — converter explicitly drops token IDs before Anthropic conversion
+
+## End-to-end rollout flow (linear)
+
+This is the full stack when using `reasoning_gym_claude_code_agent_model_server.yaml` + a model server (e.g. `vllm_model`):
+
+```mermaid
+flowchart TD
+    RC["ng_collect_rollouts"]
+    RUN["agent POST /run"]
+    SEED["resources_server POST /seed_session"]
+    RESP["agent POST /v1/responses"]
+    CLI["claude -p subprocess"]
+    MSG["model_server POST /v1/messages"]
+    MSRESP["model_server responses()"]
+    BACKEND["inference backend POST /v1/chat/completions"]
+    NOTE["↻ repeat for each Claude LLM turn"]
+    PARSE["agent builds NeMoGymResponse from stream-json"]
+    VERIFY["resources_server POST /verify"]
+    OUT["rollout JSONL"]
+
+    RC -->|"NeMoGymResponseCreateParams"| RUN
+    RUN --> SEED --> RESP
+    RESP -->|"shell + ANTHROPIC_BASE_URL"| CLI
+    CLI -->|"Anthropic Messages SSE"| MSG
+    MSG --> MSRESP --> BACKEND
+    BACKEND --> MSG --> CLI
+    CLI --> NOTE
+    NOTE --> PARSE
+    PARSE -->|"NeMoGymResponse"| VERIFY
+    VERIFY --> OUT
+```
+
+### Message types at each hop
+
+| Step | From → To | Format |
+| --- | --- | --- |
+| 1 | Rollout → agent `/run` | Task row with `responses_create_params` |
+| 2 | Agent `/run` → resources `/seed_session` | Same task row |
+| 3 | Agent `/run` → agent `/v1/responses` | `NeMoGymResponseCreateParamsNonStreaming` |
+| 4 | Agent → Claude subprocess | Shell env (`ANTHROPIC_BASE_URL`, etc.) |
+| 5 | Claude ↔ model `/v1/messages` | **Anthropic Messages** (many turns) |
+| 6 | Inside model server | Anthropic → `NeMoGymResponse` → Anthropic (internal) |
+| 7 | Model server ↔ vLLM | OpenAI Chat Completions (internal) |
+| 8 | Claude → agent | **stream-json stdout** (full session) |
+| 9 | Agent `/v1/responses` return | **`NeMoGymResponse`** (episode-level, for scoring) |
+| 10 | Agent → resources `/verify` | Task row + `NeMoGymResponse` |
+
+## Two NeMoGymResponse lifetimes (model-server path)
+
+On the model-server path there are **two separate** `NeMoGymResponse` objects with different lifetimes, and conflating them is a common source of confusion:
+
+### Per-turn (internal, inside model server)
+
+Each Claude LLM call triggers:
+
+```
+Anthropic request → NeMoGymResponseCreateParams → responses() → NeMoGymResponse
+  → stripped → Anthropic response back to Claude
+```
+
+This object can carry RL fields when `return_token_id_information=True`, but Claude never sees them and Gym rollouts do not receive them today.
+
+### Episode-level (what Gym scoring uses)
+
+After Claude finishes the full session, the agent parses stream-json and **constructs one** `NeMoGymResponse` in `claude_code_agent.responses()`. That is what `/verify` reads.
+
+Today this episode-level response uses plain `NeMoGymResponseOutputMessage` items — **no RL fields**, even if the model server produced them per turn.
+
+## Direct Anthropic path (shorter)
+
+With the default `reasoning_gym_claude_code_agent.yaml` (`anthropic_base_url: null`), steps involving the Gym model server drop out:
+
+```mermaid
+flowchart TD
+    RC["ng_collect_rollouts"]
+    RUN["agent POST /run"]
+    SEED["resources_server POST /seed_session"]
+    RESP["agent POST /v1/responses"]
+    CLI["claude -p subprocess"]
+    ANTH["api.anthropic.com POST /v1/messages"]
+    VERIFY["resources_server POST /verify"]
+    OUT["rollout JSONL"]
+
+    RC --> RUN --> SEED --> RESP --> CLI
+    CLI <-->|"Anthropic Messages"| ANTH
+    CLI -->|"stream-json"| RESP
+    RESP -->|"NeMoGymResponse"| VERIFY --> OUT
+```
+
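+PR #1627 plays **no role** on this path: Claude talks to `api.anthropic.com` directly, so the model-server `/v1/messages` adapter is never invoked.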
+
+## Why Anthropic format for Claude?
+
+Claude Code CLI is hard-wired to `POST /v1/messages`. It cannot call Gym's `/v1/responses` or OpenAI chat completions. When you point Claude at a Gym model server, the server must **speak Anthropic on the wire** even though it implements `responses()` internally.
+
+Think of `/v1/messages` as a **protocol adapter**:
+
+```
+Claude (USB-C / Anthropic)  ↔  Gym model server adapter  ↔  vLLM (HDMI / Chat Completions)
+```
+
+Gym's rollout pipeline only cares about the final **`NeMoGymResponse`** the agent builds from stream-json — not the per-turn Anthropic exchanges.
+
+## RL metadata: where it exists and where it is lost
+
+| Location | RL fields present? |
+| --- | --- |
+| `vllm_model` internal `NeMoGymResponse` (`return_token_id_information=True`) | Yes — on `*ForTraining` output items |
+| Model server `POST /v1/responses` wire response | Yes (when configured) |
+| Model server `POST /v1/messages` wire response | **No** — stripped in `responses_to_anthropic_response()` |
+| `claude_code_agent` episode `NeMoGymResponse` | **No** — `parse_stream_json()` builds plain messages |
+| Resources server `/verify` | Reads text from `NeMoGymResponse.output[]`; RL fields unused for scoring |
+
+The planned RL path (not yet wired for Claude Code) would **side-channel** per-turn token IDs from the model server's internal `NeMoGymResponse` and merge them into the agent's episode-level `NeMoGymResponse` using the existing `*ForTraining` types, rather than inventing a new schema.
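+
+Since that path is not wired yet, the following is only a speculative sketch of what such a merge could look like. `merge_rl_fields` is a hypothetical helper, and the pairing rule (the n-th recorded model turn belongs to the n-th assistant message in the episode) is an assumption that a real implementation would need to verify:
+
+```python
+from typing import Any, Dict, List
+
+RL_FIELDS = ("prompt_token_ids", "generation_token_ids", "generation_log_probs")
+
+
+def merge_rl_fields(
+    episode_output: List[Dict[str, Any]],
+    per_turn_rl: List[Dict[str, Any]],
+) -> List[Dict[str, Any]]:
+    """Copy side-channeled RL fields onto assistant messages, in order."""
+    turns = iter(per_turn_rl)
+    merged = []
+    for item in episode_output:
+        item = dict(item)  # shallow copy so the input list is left untouched
+        if item.get("type") == "message" and item.get("role") == "assistant":
+            rl = next(turns, None)  # None once the recorded turns run out
+            if rl is not None:
+                item.update({field: rl[field] for field in RL_FIELDS})
+        merged.append(item)
+    return merged
+
+
+episode = [
+    {"type": "message", "role": "assistant", "content": []},
+    {"type": "function_call", "name": "bash"},
+    {"type": "message", "role": "assistant", "content": []},
+]
+side_channel = [
+    {"prompt_token_ids": [1], "generation_token_ids": [2], "generation_log_probs": [-0.5]},
+]
+merged = merge_rl_fields(episode, side_channel)
+print(merged)
+```
+
+The merged items keep the same shape as the `*ForTraining` rows shown earlier, so no new schema is needed in this sketch.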
+
+## Protocol layers on a model server
+
+```mermaid
+flowchart TD
+    subgraph gym["Gym trajectory layer"]
+        NR["NeMoGymResponse\n(building block for rollouts / verify)"]
+    end
+
+    subgraph convert["Conversion (inside model server or agent)"]
+        C["AnthropicConverter / VLLMConverter"]
+    end
+
+    subgraph wire["Backend wire formats"]
+        CC["NeMoGymChatCompletion\n/v1/chat/completions"]
+        AM["Anthropic Message\n/v1/messages"]
+        OR["OpenAI native Response\nopenai_model upstream"]
+    end
+
+    NR --> C
+    C --> CC
+    C --> AM
+    C --> OR
+```
+
+## Config cheat sheet
+
+| Config | Model path | PR #1627 involved? |
+| --- | --- | --- |
+| `claude_code_agent/configs/claude_code_agent.yaml` (template) | Direct Anthropic | No |
+| `reasoning_gym_claude_code_agent.yaml` | Direct Anthropic via env vars | No |
+| `reasoning_gym_claude_code_agent_model_server.yaml` + `vllm_model.yaml` | Claude → Gym model `/v1/messages` → vLLM | **Yes** |
+
+## Key takeaways
+
+1. **`NeMoGymResponse.output[]` is the trajectory building block** — shared across models, agents, verifiers, and rollouts.
+2. **`POST /v1/responses` is the Gym contract boundary** — not `/v1/messages` or `/v1/chat/completions`.
+3. **PR #1627 adds an Anthropic adapter on model servers** so Claude Code can target any Gym backend without a separate proxy process.
+4. **The agent wraps Claude as a black box** — Gym HTTP stops at `/v1/responses`; Claude's multi-turn loop uses Anthropic internally.
+5. **RL metadata is schema-ready** (`*ForTraining` types) but **not yet plumbed** through the Claude Code + `/v1/messages` path.
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx
new file mode 100644
index 0000000000..2c43260968
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx
@@ -0,0 +1,103 @@
+---
+title: "Harness Terminology"
+description: "How the agent-eval and SWE-bench communities use “harness,” and how NeMo Gym disambiguates agent orchestration from benchmark grading."
+---
+**“Harness” is overloaded.** In agentic coding and SWE-bench discussions, the same word often means either (a) the agent-side orchestration that runs a model on tasks, or (b) the benchmark-side pipeline that grades patches. There is no single community-wide definition — context disambiguates. This note maps the dominant usages and the vocabulary NeMo Gym uses to keep them separate.
+
+## Four meanings in the wild
+
+| Context | “Harness” usually means | Agent included? | Example |
+| --- | --- | --- | --- |
+| **SWE-bench docs / `swebench.harness`** | Grading pipeline: Docker, apply patch, run tests, report resolved | No | `python -m swebench.harness.run_evaluation` |
+| **SWE-bench leaderboard / blogs** | Agent orchestration: prompt, tool loop, patch extraction | Yes | “mini-SWE-agent + Claude 4.5 Opus” |
+| **NeMo Gym docs** | Agent server orchestration around the model | Yes | `claude_code_agent`, OpenHands, `simple_agent` |
+| **Generic ML eval** | Benchmark runner infrastructure | Maybe | “evaluation harnesses” in NeMo Evaluator |
+
+The collision is sharpest in SWE-bench: **official docs** call the grader “the harness,” while **leaderboard rows** read like “harness + model” where harness is the agent stack.
+
+## SWE-bench: two halves of one eval
+
+The [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) documents **`swebench.harness`** — the module that:
+
+1. Prepares per-instance Docker images
+2. Applies a `model_patch` from a predictions file
+3. Runs the repository test suite
+4. Grades pass/fail and aggregates `% resolved`
+
+The submission contract is intentionally narrow: produce JSONL with `instance_id` and `model_patch`; the harness scores it. Community tutorials often describe this as a unit-test-shaped split:
+
+- **Arrange** — harness prepares the instance environment (images, checkout, deps)
+- **Act** — *your* agent edits the repo and emits a patch (not part of `swebench.harness`)
+- **Assert** — harness applies the patch, runs hidden tests, reports resolved/unresolved
+
+So in SWE-bench **technical** vocabulary, harness ≈ **Environment + grading authority**. The agent is external.
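+
+A minimal sketch of the agent side of that contract: writing a predictions JSONL with the two keys named above. `write_predictions` and the sample patch text are hypothetical, and upstream tooling may expect additional keys (for example a model identifier), so check the upstream reference before submitting:
+
+```python
+import json
+import tempfile
+from pathlib import Path
+from typing import Dict
+
+
+def write_predictions(patches: Dict[str, str], path: Path) -> int:
+    """Write one JSON object per line with `instance_id` and `model_patch`."""
+    with path.open("w", encoding="utf-8") as handle:
+        for instance_id, model_patch in patches.items():
+            record = {"instance_id": instance_id, "model_patch": model_patch}
+            handle.write(json.dumps(record) + "\n")
+    return len(patches)
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    out = Path(tmp) / "predictions.jsonl"
+    # Placeholder patch text, not a real fix for this instance.
+    write_predictions({"django__django-13741": "diff --git a/x.py b/x.py\n"}, out)
+    rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
+
+print(rows)
+```
+
+Everything about how the patch was produced stays outside this file, which is why the grading pipeline can remain agnostic to the agent.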
+
+Upstream bundles **task world + grading** into one pipeline, without always naming the two parts separately. It does **not** model **how** the agent produced the patch, yet leaderboard prose often treats “harness + model” as the evaluated product anyway.
+
+## Leaderboard and product language
+
+The [SWE-bench leaderboard](https://www.swebench.com/) reports results as combinations like:
+
+- **“bash only”** — a specific agent-side setup across models
+- **“mini-SWE-agent + Claude 4.5 Opus”** — harness (orchestration) + model
+
+That usage matches NeMo Gym’s [SWE RL case study](/infrastructure/engineering-notes/swe-rl-case-study): *a harness is a system prompt plus orchestration to execute one attempt at the task.* Here harness ≈ **agent harness**, not `run_evaluation`.
+
+When reading papers, blog posts, or vendor announcements, assume **harness = agent-side** unless the text explicitly points at Docker grading or `swebench.harness`.
+
+## NeMo Gym vocabulary (intentional split)
+
+Gym separates roles that colloquial “harness” often merges:
+
+| Gym term | Role | SWE-bench analogue |
+| --- | --- | --- |
+| **Task** | One dataset row / instance (`SweTask`, `TaskPublic`) | `django__django-13741` |
+| **Benchmark** | Fixed eval product: split, metric, protocol, baselines | SWE-bench Verified |
+| **Environment** | Resources server: `seed_session`, `verify`, state, tools | `swe_bench` RS |
+| **Agent harness** | Agent server: multi-step loop, tools, when to stop | Claude Code, OpenHands, mini-SWE-agent |
+| **Model** | Stateless inference | vLLM / OpenAI endpoint |
+
+See [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks) and the [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) note for how `SessionDescriptor`, topology C, and hermetic `verify` fit in.
+
+### The awkward `harnesses/` directory
+
+Under [`resources_servers/swe_bench/harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses), **`harness` means something else again**: benchmark-**family eval plugins** owned by the Environment — provision recipes and grading adapters keyed by dataset family (e.g. SWE-bench vs multilingual). They wrap pieces of upstream `swebench.harness`; they are **not** agent orchestration.
+
+| Name in repo | Meaning | Prefer saying |
+| --- | --- | --- |
+| Agent server / `responses_api_agents/` | Agent harness | **agent harness** or **agent server** |
+| `swebench.harness` (upstream) | Official grading pipeline | **SWE-bench eval harness** or **grading pipeline** |
+| `swe_bench/harnesses/` | Environment eval plugins | **benchmark-family plugin** or **eval plugin** (when disambiguation matters) |
+| `swe_bench` RS overall | MDP authority for SWE tasks | **Environment** or **resources server** |
+
+We keep `harness.py` / `harnesses/` in `swe_bench` to align with upstream module naming (`swebench.harness`) and prior `swe_env` convention — not because Gym equates “harness” with agent orchestration.
+
+## Practical guidance
+
+**When writing docs or PRs**
+
+- Say **agent harness** (or **agent server**) for orchestration in `responses_api_agents/`.
+- Say **SWE-bench eval harness** or **grading pipeline** for `swebench.harness.run_evaluation`.
+- Say **Environment** / **`swe_bench` resources server** for `seed_session` + `verify` + hermetic grading.
+- Say **benchmark** for published eval products (Verified, Lite), not for the server binary.
+- Say **task** / **instance** for one problem row — not “environment” and not “harness.”
+
+**When comparing to leaderboard numbers**
+
+- Identify both **model** and **agent harness** (leaderboard row).
+- Confirm **benchmark split** and **grading protocol** (Environment config, `verified:` marker).
+- Do not conflate upstream `swebench.harness` with the agent named on the leaderboard.
+
+**When designing new environments**
+
+- Keep **grading authority** on the resources server (Environment).
+- Keep **orchestration** on the agent server (agent harness).
+- Use **benchmark-family plugins** only for dataset-specific provision/grade logic — not for agent loops.
+
+## Related reading
+
+- [Evaluation — agent harness](/evaluation#agent-harness) — Gym’s primary “harness” definition
+- [Key Terminology — Agent Server](/about/concepts/key-terminology#architecture-terms) — architecture glossary
+- [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) — Task / Benchmark / Environment split for SWE
+- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — harness + model on the leaderboard
+- [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) — upstream grading pipeline
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
index ce48f79b15..eb92e8bf31 100644
--- a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx
@@ -25,6 +25,24 @@ Infrastructure challenges and deployment topology for SWE RL training.
 <Badge minimal outlined>swe-rl</Badge> <Badge minimal outlined>case-study</Badge>
 </Card>
 
+<Card title="Harness Terminology" href="/infrastructure/engineering-notes/harness-terminology">
+How “harness” is used in SWE-bench vs agent eval vs NeMo Gym — and recommended naming.
+
+<Badge minimal outlined>terminology</Badge> <Badge minimal outlined>swe-bench</Badge>
+</Card>
+
+<Card title="SWE-bench Environment Server" href="/infrastructure/engineering-notes/swe-bench-environment-server">
+Environment resources server for SWE-bench: session descriptors, topology C, and hermetic verify.
+
+<Badge minimal outlined>swe-bench</Badge> <Badge minimal outlined>architecture</Badge>
+</Card>
+
+<Card title="Claude Code Agent — Protocol Stack" href="/infrastructure/engineering-notes/claude-code-agent-protocol-stack">
+Responses API, `/v1/messages`, and rollout data contracts for `claude_code_agent`.
+
+<Badge minimal outlined>claude-code</Badge> <Badge minimal outlined>api-design</Badge>
+</Card>
+
 <Card title="aiohttp vs httpx" href="/infrastructure/engineering-notes/aiohttp-vs-httpx">
 Why NeMo Gym uses aiohttp instead of httpx for async HTTP.
 
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx
new file mode 100644
index 0000000000..b3a97cc340
--- /dev/null
+++ b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx
@@ -0,0 +1,267 @@
+---
+title: "SWE-bench Environment Server"
+description: "Restoring the Environment as a resources server — seed_session descriptors, topology C, and hermetic verify for SWE-bench."
+---
+This engineering note documents the **`swe_bench` resources server**: what problem it solves, how it differs from earlier SWE integrations in Gym, and how to run evaluation with a black-box agent server such as `claude_code_agent`.
+
+<Note>
+This server ships with `verified: false` — it is a working prototype, not yet baselined on gold patches. See [Adding a Benchmark](/contribute/environments/adding-a-benchmark) for the path to `verified: true`.
+</Note>
+
+## Background: why a separate Environment server?
+
+Earlier SWE convergence work ([PR #1738](https://github.com/NVIDIA-NeMo/Gym/pull/1738)) moved grading and sandbox spec **into the agent server** (`responses_api_agents/swe_env/`, inline `verify_task`). That pattern works for a single bundled agent, but it breaks composability:
+
+- **Black-box agent servers** (Claude Code, OpenHands, Harbor, …) should not import SWE grading code or choose docker vs OpenSandbox themselves.
+- **The Environment** should own task authority: sandbox spec, benchmark grading, and the `verified:` marker on the resources server.
+- **Agents** should connect through a small HTTP contract (`seed_session` → run → `verify`), not an `anyswe`-style wrapper per agent.
+
+The `swe_bench` resources server restores that boundary. Grading harnesses, parsing, and `verify_task` live as **private modules** under [`resources_servers/swe_bench/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench) — not under `responses_api_agents/`.
+
+For cluster-scale SWE RL training topology (Apptainer, CPU sizing), see the older [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study). This note focuses on the **Environment server + agent-server wiring** pattern.
+
+## Three roles (orthogonal)
+
+| Role | Gym component | SWE-bench example |
+| --- | --- | --- |
+| **Environment** | Resources server | `swe_bench` — `seed_session`, `verify`, benchmark harnesses |
+| **Agent server** | `responses_api_agents/` | `claude_code_agent` — runs Claude in the instance sandbox |
+| **Sandbox runtime** | `nemo_gym/sandbox/` | Docker provider (OpenSandbox / Apptainer as needed) |
+
+<Warning>
+**“Harness” overload.** In Gym docs, *agent harness* means orchestration inside an agent server. In SWE-bench, `swebench.harness` is the upstream eval stack. Under `swe_bench`, **`harness.py` / `harnesses/`** are **benchmark-family plugins** (provision + grade recipes keyed by `task.benchmark`). They are Environment-owned, not agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology) for the full map.
+</Warning>
+
+## Environment vs Benchmark vs Task
+
+These names refer to different layers (see also [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks)):
+
+| Layer | SWE-bench example | What it is |
+| --- | --- | --- |
+| **Benchmark** | *SWE-bench Verified* | Fixed eval product: 500-task test split, `% resolved` metric, comparison protocol, leaderboard baselines |
+| **Environment** | `swe_bench` resources server | Executable engine: `seed_session`, `verify`, harness registry, hermetic grading |
+| **Task** | `django__django-13741` | One problem instance in the benchmark (prompt + privileged grading metadata) |
+
+- **`swe_bench` is the Environment** — you can train, dev, or eval with it; it is not the same as the published benchmark.
+- **SWE-bench Verified is a Benchmark** built on that Environment (frozen JSONL + eval config + reporting).
+- **`verified: true`** on the RS means this Environment configuration is **benchmark-grade** (gold-patch baseline, protocol locked) — not merely that the server exists.
+
+One Environment supports multiple benchmarks (Verified, Lite, Multilingual) by swapping **tasks** (dataset) and harness keys — no new resources server per publication.
+
+## What `swe_bench` exposes
+
+The HTTP surface is intentionally thin ([`app.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/app.py)). Heavy logic stays in private modules.
+
+| Endpoint | Responsibility |
+| --- | --- |
+| `POST /seed_session` | Build a **`SessionDescriptor`**: placement topology, per-instance `SandboxSpec`, merged `verifier_metadata` |
+| `POST /verify` | Grade `verifier_metadata.model_patch` in a **fresh** eval sandbox (hermetic twin) |
+
+### SessionDescriptor (response shape)
+
+`seed_session` returns:
+
+```json
+{
+  "placement": { "topology": "agent_in_env" },
+  "sandbox": { "spec": { "image": "swebench/sweb.eval.x86_64....", "workdir": "/testbed", ... } },
+  "egress": { "env": {} },
+  "verifier_metadata": {
+    "instance_id": "django__django-13741",
+    "benchmark": "swe-bench",
+    "dataset_name": "princeton-nlp/SWE-bench_Verified",
+    "flat_eval": true
+  }
+}
+```
+
+The agent server reads **`placement.topology`** and **`sandbox.spec`** — it never imports `swe_bench.harness` or picks a provider on its own (beyond what its config already declares).
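+
+A rough sketch of the agent-side read, assuming the JSON shape above. `plan_from_descriptor` is a hypothetical helper, the image name in the example is a placeholder, and the real agent server may validate more fields:
+
+```python
+from typing import Any, Dict, Optional, Tuple
+
+
+def plan_from_descriptor(descriptor: Dict[str, Any]) -> Tuple[str, Optional[Dict[str, Any]]]:
+    """Return (topology, sandbox spec to start), using only the descriptor's own fields."""
+    topology = descriptor["placement"]["topology"]
+    spec = descriptor.get("sandbox", {}).get("spec")
+    if topology == "agent_in_env":
+        if not spec:
+            raise ValueError("agent_in_env requires sandbox.spec in the descriptor")
+        return topology, spec
+    # Other topologies: the agent does not start a sandbox from the descriptor.
+    return topology, None
+
+
+descriptor = {
+    "placement": {"topology": "agent_in_env"},
+    # Placeholder image; real values come from the Environment.
+    "sandbox": {"spec": {"image": "example/image:tag", "workdir": "/testbed"}},
+    "egress": {"env": {}},
+    "verifier_metadata": {"instance_id": "django__django-13741", "benchmark": "swe-bench"},
+}
+topology, spec = plan_from_descriptor(descriptor)
+print(topology, spec["workdir"])
+```
+
+The helper never inspects `verifier_metadata` or benchmark-specific details, which keeps the agent independent of the grading code.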
+
+### Topology C (`agent_in_env`)
+
+| Topology | Who owns the working sandbox | Typical agent |
+| --- | --- | --- |
+| `none` | No in-box work; MCP / host-side tools | Default Claude Code + MCP resources |
+| `agent_in_env` | Agent starts the descriptor's sandbox and runs inside it | **`claude_code_swe_bench`** |
+| `env_sandboxed` | Environment brokers box lifecycle (future broker RS) | Planned |
+| `whole_interaction` | Single box for agent + eval (legacy) | `swe_agents` style |
+
+**Topology C** is the target for SWE-bench Verified with Claude Code:
+
+1. Environment returns image + workdir from the benchmark harness.
+2. Agent server starts that sandbox, runs `claude -p` **inside** the instance image.
+3. Agent harvests `git diff --cached` as `model_patch`.
+4. Environment grades the patch in a **separate fresh container** (no agent pollution).
+
+```mermaid
+flowchart TD
+    RC["gym eval run"]
+    RUN["agent POST /run"]
+    SEED["swe_bench POST /seed_session"]
+    BOX["Agent: AsyncSandbox from descriptor"]
+    CLAUDE["claude -p in /testbed"]
+    PATCH["git diff --cached → model_patch"]
+    VERIFY["swe_bench POST /verify"]
+    FRESH["verify_task: fresh sandbox"]
+    OUT["rollout JSONL reward 0/1"]
+
+    RC --> RUN --> SEED
+    SEED -->|"topology=agent_in_env, sandbox.spec"| BOX
+    BOX --> CLAUDE --> PATCH
+    PATCH --> VERIFY
+    VERIFY --> FRESH --> OUT
+```
+
+## Benchmark harness layer (private)
+
+Each SWE dataset family registers a harness under [`harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses):
+
+| Registry key | Class | Notes |
+| --- | --- | --- |
+| `swe-bench` | `SweBenchHarness("swe-bench")` | Uses upstream `swebench` `make_test_spec` + `get_logs_eval` |
+| `swe-bench-multilingual` | `SweBenchHarness("swe-bench-multilingual")` | Same class, different family name |
+| `swe-bench-ext` | `SweBenchExtHarness` | Extended / fuzzy parsers |
+| `swe-rebench` | `SweRebenchHarness` | SWE-rebench family |
+| `r2e-gym` | `R2EGymHarness` | R2E-Gym |
+| `nv-internal-1` | `NVInternalHarness` | Internal NV format |
+
+The harness contract ([`harness.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harness.py)) splits provisioning from grading:
+
+- **Agent-visible:** `build_spec`, `supports_provider`, `materialize`
+- **Verifier-only:** `reset_repo`, `run_eval`, `grade` (called only from `verify_task`)
+
+For official SWE-bench instances, grading delegates to the external [`swebench`](https://github.com/SWE-bench/SWE-bench) package — Gym runs the official per-instance `eval_script` in the sandbox and parses logs with `swebench.harness.grading.get_logs_eval`.
+
+## Dataset format
+
+Each JSONL row needs SWE instance metadata in **`verifier_metadata`** (and typically mirrored in `responses_create_params.metadata`):
+
+| Field | Purpose |
+| --- | --- |
+| `instance_id` | SWE-bench instance key (e.g. `django__django-13741`) |
+| `dataset_name` | HuggingFace dataset id (selects harness family) |
+| `split` | Usually `test` |
+| `problem_statement` | User message / issue text for the agent |
+| `instance_dict` | Full SWE-bench instance record (JSON string or object) — required for faithful grading |
+
+An optional per-row `container_formatter` overrides the server's default image template.
+
+### Prepare SWE-bench Verified rows
+
+```bash
+python resources_servers/swe_bench/prepare.py --limit 5 --no-images
+```
+
+This writes `resources_servers/swe_bench/data/swebench_verified.jsonl`. Use `--no-images` for dataset-only smoke tests; full eval needs Docker images `swebench/sweb.eval.x86_64.{tag}` (see `prepare.py` for tag normalization: `__` → `_1776_`, lowercased).
+
+## Configuration
+
+Server config: [`resources_servers/swe_bench/configs/swe_bench.yaml`](https://github.com/NVIDIA-NeMo/Gym/blob/main/resources_servers/swe_bench/configs/swe_bench.yaml)
+
+```yaml
+swe_bench:
+  resources_servers:
+    swe_bench:
+      sandbox_provider:
+        docker: {}
+      container_formatter: swebench/sweb.eval.x86_64.{instance_id}
+      eval_timeout_s: 1800
+      flat_eval: true
+      default_topology: agent_in_env
+
+claude_code_swe_bench:
+  responses_api_agents:
+    claude_code_agent:
+      resources_server:
+        type: resources_servers
+        name: swe_bench
+      sandbox_provider:
+        docker: {}
+      in_box_timeout_s: 1800
+      bare: true
+```
+
+Key knobs:
+
+| Config field | Effect |
+| --- | --- |
+| `sandbox_provider` | Sandbox provider config: the resources server passes it to `verify_task`; the agent uses its own copy to bind the in-box sandbox |
+| `container_formatter` | Docker image template for instance sandboxes |
+| `flat_eval` | Host-side grading (runs on any exec-capable provider) |
+| `default_topology` | Returned from `seed_session` (`agent_in_env` for topology C) |
+| `in_box_timeout_s` | Agent-side Claude run timeout inside the sandbox |
+
+## Quickstart: evaluation rollouts
+
+**1. Install and test the server** (unit tests use a fake sandbox — no Docker required):
+
+```bash
+gym env test --resources-server swe_bench
+```
+
+**2. Start servers** (Claude Code requires an Anthropic API key):
+
+```bash
+gym env start \
+  --resources-server swe_bench \
+  --agent claude_code_swe_bench \
+  --model-type openai_model
+```
+
+**3. Run rollouts** on prepared JSONL:
+
+```bash
+gym eval run --no-serve --agent claude_code_swe_bench \
+  --input resources_servers/swe_bench/data/swebench_verified.jsonl \
+  --output results/swe_bench_rollouts.jsonl
+```
+
+The agent passes **`verifier_metadata.model_patch`** (unified diff) on `POST /verify`. The server returns `reward` ∈ `{0.0, 1.0}`, plus `resolved`, `patch_exists`, and optional `error_kind` / `mask_sample` for infra failures.
+
+## Hermetic verify
+
+`verify` **never** reuses the agent's working sandbox. `verify_task`:
+
+1. Selects the harness for `task.benchmark`
+2. Acquires a **fresh** sandbox via `acquire_sandbox` (always torn down afterwards)
+3. Runs `reset_repo` → `materialize(model_patch)` → `run_eval` → `grade`
+4. Maps the report to reward (`1.0` if resolved and no `error_kind`)
+
+This mirrors SWE-bench's separation between “agent edits” and “official eval script in a clean tree,” and prevents agent artifacts from affecting the score.
+
+<Accordion title="What if the patch is empty?">
+`verify` short-circuits: `patch_exists=false`, `resolved=false`, `reward=0.0` — no eval sandbox spin-up.
+</Accordion>
+
+<Accordion title="Relationship to swe_agents / anyswe">
+[`responses_api_agents/swe_agents`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/swe_agents) still shells out to `swebench.harness.run_local_evaluation` inside Apptainer-oriented rollouts, a path that bundles agent and grading together. **`swe_bench` + `claude_code_agent`** is the composable replacement: one Environment resources server wired to many agent servers via the descriptor contract, without per-agent SWE wrappers.
+</Accordion>
+
+## Module map
+
+```text
+resources_servers/swe_bench/
+├── app.py              # HTTP: seed_session → SessionDescriptor, verify
+├── task.py             # First-class Task (SweTask, TaskPublic, parse helpers)
+├── session.py          # SessionDescriptor wire models
+├── harness.py          # SweTaskHarness ABC, registry, compute_resolved
+├── harnesses/          # Per-family grading plugins
+├── verify_task.py      # Fresh-sandbox grading orchestrator
+├── sandbox.py          # AsyncSweEnvironment + acquire_sandbox
+├── prepare.py          # HF dataset → Gym JSONL
+└── configs/swe_bench.yaml
+```
+
+## Key takeaways
+
+1. **`swe_bench` is the Environment** — it owns benchmark authority, not the agent server.
+2. **`seed_session` returns a descriptor**, not opaque session state — agents bind sandboxes from `placement` + `sandbox.spec`.
+3. **Topology C** runs Claude inside the instance image; **verify** always uses a hermetic twin sandbox.
+4. **`harnesses/`** are benchmark eval plugins aligned with upstream `swebench.harness` — distinct from Gym “agent harness” orchestration.
+5. **Any agent server** that implements `/run` → `seed_session` → work → `verify` with `model_patch` can plug in; no SWE-specific wrapper required.
+
+## Related docs
+
+- [Claude Code Agent — Protocol Stack](/infrastructure/engineering-notes/claude-code-agent-protocol-stack) — Responses API, `/v1/messages`, and rollout data contracts
+- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — training-scale Apptainer topology
+- [Real-World Environment tutorial](/environment-tutorials/real-world-environment/resources-server-implementation) — `seed_session` / `verify` patterns for resources servers
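As a rough illustration of the dataset format and image-tag normalization described in the SWE-bench doc above, the sketch below assembles one `verifier_metadata` record and derives the Docker image name. The helper names (`image_tag`, `build_verifier_metadata`), the sample instance, and the dataset id are assumptions for illustration; `prepare.py` remains the authoritative implementation and may differ.

```python
import json

# Image template as quoted in the doc; {tag} is the normalized instance id.
IMAGE_TEMPLATE = "swebench/sweb.eval.x86_64.{tag}"


def image_tag(instance_id: str) -> str:
    """Normalize an instance id as the doc describes: `__` -> `_1776_`, lowercased."""
    return instance_id.replace("__", "_1776_").lower()


def build_verifier_metadata(instance: dict, dataset_name: str, split: str = "test") -> dict:
    """Assemble the documented per-row fields (hypothetical helper, not a Gym API)."""
    return {
        "instance_id": instance["instance_id"],
        "dataset_name": dataset_name,
        "split": split,
        "problem_statement": instance["problem_statement"],
        # The full record travels with the row; the doc calls it required for faithful grading.
        "instance_dict": json.dumps(instance),
    }


# Made-up minimal instance, for illustration only.
instance = {"instance_id": "django__django-13741", "problem_statement": "Example issue text"}
row = {"verifier_metadata": build_verifier_metadata(instance, "princeton-nlp/SWE-bench_Verified")}
image = IMAGE_TEMPLATE.format(tag=image_tag(instance["instance_id"]))
```

A real row would typically mirror the same metadata under `responses_create_params.metadata`, which this sketch leaves out.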
diff --git a/nemo_gym/sandbox/providers/apptainer/provider.py b/nemo_gym/sandbox/providers/apptainer/provider.py
index 0605b1a805..bc2b70c499 100644
--- a/nemo_gym/sandbox/providers/apptainer/provider.py
+++ b/nemo_gym/sandbox/providers/apptainer/provider.py
@@ -148,6 +148,7 @@ class _ApptainerInstance:
     mount_point: str  # where the folder shows up inside
     image: str  # what it was built from
     env: dict[str, str] = field(default_factory=dict)
+    overlay_dir: Path | None = None  # per-instance disk overlay (cleaned on close)
 
 
 def _resource_flags(resources: SandboxResources) -> list[str]:
@@ -386,6 +387,14 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
         for key, value in spec.env.items():
             argv += ["--env", f"{key}={value}"]
         start_args = list(self._create_config.extra_start_args)
+        # --writable-tmpfs caps the writable layer at apptainer's `sessiondir max size`
+        # (default 64 MiB), which ENOSPCs for repos that rebuild on apply/eval (e.g. astropy's
+        # C extensions). Swap it for a per-instance DISK-backed overlay (bounded by host disk).
+        overlay_dir: Path | None = None
+        if "--writable-tmpfs" in start_args:
+            start_args = [a for a in start_args if a != "--writable-tmpfs"]
+            overlay_dir = Path(tempfile.mkdtemp(prefix="nemo-gym-apptainer-ovl-"))
+            start_args += ["--overlay", str(overlay_dir)]
         resource_limit_flags = _resource_limit_flags(spec.resources)
         if resource_limit_flags and self._create_config.apply_resource_limits:
             if "--fakeroot" in start_args:
@@ -403,9 +412,13 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
             code, _out, err = await self._run(argv, timeout_s=self._create_config.start_timeout_s, daemonize=True)
         except TimeoutError as e:
             shutil.rmtree(staging_dir, ignore_errors=True)
+            if overlay_dir:
+                shutil.rmtree(overlay_dir, ignore_errors=True)
             raise ApptainerCreateError(f"apptainer instance start timed out for image={image!r}: {e}") from e
         if code != 0:
             shutil.rmtree(staging_dir, ignore_errors=True)
+            if overlay_dir:
+                shutil.rmtree(overlay_dir, ignore_errors=True)
             raise ApptainerCreateError(
                 f"apptainer instance start failed (code={code}) for image={image!r}: {err.strip()}"
             )
@@ -420,6 +433,7 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle:
                 mount_point=mount_point,
                 image=image,
                 env=dict(spec.env),
+                overlay_dir=overlay_dir,
             ),
         )
 
@@ -485,6 +499,8 @@ async def _cleanup_failed_create_handle(self, handle: SandboxHandle) -> None:
                 timeout_s=self._exec_config.default_timeout_s,
             )
         shutil.rmtree(inst.staging_dir, ignore_errors=True)
+        if inst.overlay_dir:
+            shutil.rmtree(inst.overlay_dir, ignore_errors=True)
 
     async def exec(
         self,
@@ -667,6 +683,8 @@ async def close(self, handle: SandboxHandle) -> None:
             shutil.rmtree(inst.staging_dir, ignore_errors=False)
         except OSError as e:
             LOGGER.warning("failed to remove staging dir %s: %s", inst.staging_dir, e)
+        if inst.overlay_dir:
+            shutil.rmtree(inst.overlay_dir, ignore_errors=True)
 
         if stop_error is not None:
             raise stop_error
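The overlay swap in the hunks above can be read as a small pure transformation on the start arguments plus an ownership rule for the temporary directory. The sketch below isolates that idea; `swap_tmpfs_for_overlay` is a hypothetical helper name, not part of the provider, and the real code inlines the logic in `create`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def swap_tmpfs_for_overlay(start_args: list[str]) -> tuple[list[str], Optional[Path]]:
    """Replace --writable-tmpfs with a disk-backed --overlay directory.

    Returns the new argument list and the overlay directory (None if nothing
    was swapped). The caller owns the directory and must remove it on close.
    """
    if "--writable-tmpfs" not in start_args:
        return list(start_args), None
    overlay_dir = Path(tempfile.mkdtemp(prefix="nemo-gym-apptainer-ovl-"))
    kept = [arg for arg in start_args if arg != "--writable-tmpfs"]
    return kept + ["--overlay", str(overlay_dir)], overlay_dir


args, overlay = swap_tmpfs_for_overlay(["--fakeroot", "--writable-tmpfs"])
try:
    # At this point the instance would be started with `args`.
    pass
finally:
    # Mirror the diff: every exit path removes the overlay directory.
    if overlay is not None:
        shutil.rmtree(overlay, ignore_errors=True)
```

Returning the directory alongside the arguments makes the cleanup obligation explicit, which is the same reason the diff stores `overlay_dir` on `_ApptainerInstance`.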
diff --git a/nemo_gym/sandbox/providers/docker/__init__.py b/nemo_gym/sandbox/providers/docker/__init__.py
new file mode 100644
index 0000000000..a339158b99
--- /dev/null
+++ b/nemo_gym/sandbox/providers/docker/__init__.py
@@ -0,0 +1,20 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Docker sandbox provider package."""
+
+from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider
+
+
+__all__ = ["DockerSandboxProvider"]
diff --git a/nemo_gym/sandbox/providers/docker/provider.py b/nemo_gym/sandbox/providers/docker/provider.py
new file mode 100644
index 0000000000..7af8fe6ffa
--- /dev/null
+++ b/nemo_gym/sandbox/providers/docker/provider.py
@@ -0,0 +1,324 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Local Docker-backed ``SandboxProvider`` implementation.
+
+Implements the ``nemo_gym.sandbox`` provider Protocol via the ``docker`` CLI so
+SWE environments can be provisioned and graded on any machine with Docker
+installed, making end-to-end SWE-bench verification runnable on a single
+workstation.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import posixpath
+import shlex
+import uuid
+from collections.abc import Mapping
+from pathlib import Path
+from typing import Any
+
+from nemo_gym.sandbox import (
+    SandboxCreateError,
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxResources,
+    SandboxSpec,
+    SandboxStatus,
+)
+
+
+class DockerSandboxProvider:
+    """Run sandboxes as long-lived Docker containers via the ``docker`` CLI."""
+
+    name = "docker"
+
+    def __init__(
+        self,
+        *,
+        docker_bin: str = "docker",
+        default_user: str | int | None = None,
+        network: str | None = None,
+        run_args: list[str] | None = None,
+        keep_alive_command: str = "sleep infinity",
+        concurrency: int = 32,
+        **_: Any,
+    ) -> None:
+        """Configure the Docker sandbox provider.
+
+        Args:
+            docker_bin: Name or path of the ``docker`` executable to invoke.
+            default_user: Default user (name or UID) to run ``exec`` commands as
+                when no per-call user is given; None leaves the image default.
+            network: Docker network to attach containers to; None uses the
+                Docker default.
+            run_args: Extra arguments appended to every ``docker run``
+                invocation.
+            keep_alive_command: Shell command run via ``bash -c`` as the container's
+                main process to keep it alive for subsequent ``exec`` calls.
+            concurrency: Maximum number of concurrent ``docker`` CLI subprocesses,
+                bounded by a shared semaphore (matches the apptainer provider).
+            **_: Additional keyword arguments are accepted and ignored.
+
+        Raises:
+            ValueError: If ``concurrency`` is less than 1.
+        """
+        if concurrency < 1:
+            raise ValueError("concurrency must be >= 1")
+        self._bin = docker_bin
+        self._default_user = default_user
+        self._network = network
+        self._run_args = list(run_args or [])
+        self._keep_alive = keep_alive_command
+        self._semaphore = asyncio.Semaphore(concurrency)
+
+    async def _run(self, *args: str, timeout_s: int | float | None = None) -> tuple[int, str, str]:
+        """Run the ``docker`` CLI with the given arguments and capture output.
+
+        Concurrency is bounded by the provider's shared semaphore so a busy SWE hot
+        path (one sandbox per rollout, many ``exec`` each) cannot spawn unbounded
+        ``docker`` subprocesses.
+
+        Args:
+            *args: Arguments passed to the ``docker`` executable.
+            timeout_s: Optional timeout in seconds; the process is killed and the
+                timeout error re-raised if it is exceeded.
+
+        Returns:
+            A tuple of ``(return_code, stdout, stderr)`` with output decoded as
+            text using ``errors="replace"``.
+        """
+        async with self._semaphore:
+            proc = await asyncio.create_subprocess_exec(
+                self._bin,
+                *args,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+            try:
+                out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
+            except (asyncio.TimeoutError, TimeoutError):
+                proc.kill()
+                await proc.wait()
+                raise
+            return (
+                proc.returncode if proc.returncode is not None else -1,
+                out.decode(errors="replace"),
+                err.decode(errors="replace"),
+            )
+
+    @staticmethod
+    def _resources(spec: SandboxSpec) -> SandboxResources:
+        """Coerce a spec's resource request into a ``SandboxResources``.
+
+        Args:
+            spec: Sandbox spec whose ``resources`` field is a
+                ``SandboxResources`` or a mapping.
+
+        Returns:
+            The spec's ``SandboxResources`` if already one, otherwise a
+            ``SandboxResources`` built from the mapping (or empty defaults).
+        """
+        if isinstance(spec.resources, SandboxResources):
+            return spec.resources
+        return SandboxResources.from_mapping(spec.resources if isinstance(spec.resources, Mapping) else {})
+
+    async def create(self, spec: SandboxSpec) -> SandboxHandle:
+        """Start a detached container and return a handle to it.
+
+        Applies resource limits, network, working directory, environment, and
+        extra run args from the spec, then launches the image running the
+        keep-alive command so the container persists for later ``exec`` calls.
+
+        Args:
+            spec: Sandbox spec describing the image, resources, workdir, env, and
+                readiness timeout.
+
+        Returns:
+            A ``SandboxHandle`` whose ``sandbox_id`` is the container id.
+
+        Raises:
+            SandboxCreateError: If no image is given, ``docker run`` times out or
+                fails, or no container id is returned.
+        """
+        if not spec.image:
+            raise SandboxCreateError("DockerSandboxProvider requires spec.image")
+        # Pre-assign a unique name so a container the daemon may have started can still be reaped
+        # if the CLI client dies (e.g. on timeout) before we capture its id (mirrors apptainer's
+        # uuid-named instances).
+        name = f"nemo-gym-{uuid.uuid4().hex}"
+        args = ["run", "-d", "--init", "--name", name]
+        if self._network:
+            args += ["--network", self._network]
+        res = self._resources(spec)
+        if res.memory_mib:
+            args.append(f"--memory={int(res.memory_mib)}m")
+        if res.cpu:
+            args.append(f"--cpus={res.cpu}")
+        if res.gpu:
+            args.append("--gpus=all")
+        if spec.workdir:
+            args += ["-w", spec.workdir]
+        for key, value in (spec.env or {}).items():
+            args += ["-e", f"{key}={value}"]
+        args += self._run_args
+        args += [spec.image, "bash", "-c", self._keep_alive]
+        try:
+            rc, out, err = await self._run(*args, timeout_s=spec.ready_timeout_s or 600)
+        except (asyncio.TimeoutError, TimeoutError) as exc:
+            await self._reap_orphan(name)
+            raise SandboxCreateError(f"docker run timed out for image {spec.image!r}") from exc
+        if rc != 0:
+            await self._reap_orphan(name)
+            raise SandboxCreateError(f"docker run failed (rc={rc}) for {spec.image!r}: {err.strip() or out.strip()}")
+        lines = out.strip().splitlines()
+        container_id = lines[-1].strip() if lines else ""
+        if not container_id:
+            await self._reap_orphan(name)
+            raise SandboxCreateError("docker run did not return a container id")
+        return SandboxHandle(
+            sandbox_id=container_id,
+            provider_name=self.name,
+            raw={"image": spec.image, "workdir": spec.workdir},
+        )
+
+    async def _reap_orphan(self, name: str) -> None:
+        """Best-effort force-remove a container by its pre-assigned name.
+
+        Used to clean up a ``docker run`` that may have started a container on the daemon even
+        though the CLI client failed (timeout / non-zero rc / no id returned) before a handle was
+        captured. Swallows all errors and bounds itself with a short timeout — a missing or
+        already-gone container is fine.
+
+        Args:
+            name: The pre-assigned ``--name`` of the container to remove.
+        """
+        try:
+            await self._run("rm", "-f", name, timeout_s=30)
+        except Exception:
+            pass
+
+    async def exec(
+        self,
+        handle: SandboxHandle,
+        command: str,
+        *,
+        cwd: str | None = None,
+        env: dict[str, str] | None = None,
+        timeout_s: int | float | None = None,
+        user: str | int | None = None,
+    ) -> SandboxExecResult:
+        """Run a shell command inside the container.
+
+        Args:
+            handle: Handle identifying the target container.
+            command: Shell command executed via ``bash -c``.
+            cwd: Working directory for the command; falls back to the workdir
+                recorded at create time.
+            env: Extra environment variables for the command.
+            timeout_s: Optional timeout in seconds; on expiry a result with
+                return code 124 and ``error_type="timeout"`` is returned.
+            user: User (name or UID) to run as; falls back to the provider's
+                default user.
+
+        Returns:
+            A ``SandboxExecResult`` with stdout, stderr, return code, and an
+            ``error_type`` of ``"sandbox"`` for docker-level failures (125/126/
+            127 with no stdout), ``"timeout"`` on timeout, or None otherwise.
+        """
+        args = ["exec"]
+        workdir = cwd or handle.raw.get("workdir")
+        if workdir:
+            args += ["-w", workdir]
+        eff_user = user if user is not None else self._default_user
+        if eff_user is not None:
+            args += ["-u", str(eff_user)]
+        for key, value in (env or {}).items():
+            args += ["-e", f"{key}={value}"]
+        args += [handle.sandbox_id, "bash", "-c", command]
+        try:
+            rc, out, err = await self._run(*args, timeout_s=timeout_s)
+        except (asyncio.TimeoutError, TimeoutError):
+            return SandboxExecResult(
+                stdout=None,
+                stderr=f"command timed out after {timeout_s}s",
+                return_code=124,
+                error_type="timeout",
+            )
+        # 125/126/127 with empty stdout are treated as docker-level failures (container gone, command not runnable).
+        error_type = "sandbox" if rc in (125, 126, 127) and not out else None
+        return SandboxExecResult(stdout=out, stderr=err, return_code=rc, error_type=error_type)
+
+    async def upload_file(self, handle: SandboxHandle, source_path: Path, target_path: str) -> None:
+        """Copy a host file into the container, creating parent dirs as needed.
+
+        Args:
+            handle: Handle identifying the target container.
+            source_path: Path to the file on the host.
+            target_path: Destination path inside the container.
+
+        Raises:
+            RuntimeError: If the ``docker cp`` upload fails.
+        """
+        parent = posixpath.dirname(target_path)
+        if parent:
+            await self.exec(handle, f"mkdir -p {shlex.quote(parent)}")
+        rc, out, err = await self._run("cp", str(source_path), f"{handle.sandbox_id}:{target_path}")
+        if rc != 0:
+            raise RuntimeError(f"docker cp upload failed: {err.strip() or out.strip()}")
+
+    async def download_file(self, handle: SandboxHandle, source_path: str, target_path: Path) -> None:
+        """Copy a file out of the container to the host.
+
+        Args:
+            handle: Handle identifying the source container.
+            source_path: Path to the file inside the container.
+            target_path: Destination path on the host; parent dirs are created.
+
+        Raises:
+            RuntimeError: If the ``docker cp`` download fails.
+        """
+        target = Path(target_path)
+        target.parent.mkdir(parents=True, exist_ok=True)
+        rc, out, err = await self._run("cp", f"{handle.sandbox_id}:{source_path}", str(target))
+        if rc != 0:
+            raise RuntimeError(f"docker cp download failed: {err.strip() or out.strip()}")
+
+    async def status(self, handle: SandboxHandle) -> SandboxStatus:
+        """Report whether the container is running.
+
+        Args:
+            handle: Handle identifying the container to inspect.
+
+        Returns:
+            ``RUNNING`` or ``STOPPED`` based on the container's running state,
+            or ``UNKNOWN`` if the inspect command fails.
+        """
+        rc, out, _ = await self._run("inspect", "-f", "{{.State.Running}}", handle.sandbox_id)
+        if rc != 0:
+            return SandboxStatus.UNKNOWN
+        return SandboxStatus.RUNNING if out.strip() == "true" else SandboxStatus.STOPPED
+
+    async def close(self, handle: SandboxHandle) -> None:
+        """Force-remove the container.
+
+        Args:
+            handle: Handle identifying the container to remove.
+        """
+        await self._run("rm", "-f", handle.sandbox_id)
+
+    async def aclose(self) -> None:
+        """Release provider-level resources; this provider holds none."""
+        return None
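To make the flag mapping in `DockerSandboxProvider.create` easier to review, here is a minimal standalone sketch that builds the same kind of `docker run` argument list from plain values. `build_run_args` and its parameters are hypothetical and exist only for illustration; the GPU flag is omitted, and the real provider reads these values from `SandboxSpec` and its own config.

```python
from typing import Mapping, Optional, Sequence


def build_run_args(
    image: str,
    *,
    name: str,
    memory_mib: Optional[int] = None,
    cpu: Optional[float] = None,
    workdir: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    network: Optional[str] = None,
    extra: Sequence[str] = (),
    keep_alive: str = "sleep infinity",
) -> list[str]:
    """Assemble arguments for a detached, named, long-lived container."""
    args = ["run", "-d", "--init", "--name", name]
    if network:
        args += ["--network", network]
    if memory_mib:
        args.append(f"--memory={int(memory_mib)}m")
    if cpu:
        args.append(f"--cpus={cpu}")
    if workdir:
        args += ["-w", workdir]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]
    args += list(extra)
    # The keep-alive command is passed as the container command, not as an entrypoint.
    args += [image, "bash", "-c", keep_alive]
    return args


argv = build_run_args(
    "example/image:latest",
    name="nemo-gym-demo",
    memory_mib=2048,
    cpu=2,
    workdir="/testbed",
    env={"CI": "1"},
)
```

Because the container name is chosen before `docker run` is invoked, a failed or timed-out start can still be cleaned up with `docker rm -f <name>`, which is what `_reap_orphan` relies on.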
diff --git a/nemo_gym/sandbox/providers/registry.py b/nemo_gym/sandbox/providers/registry.py
index 8c4e39e577..451056d470 100644
--- a/nemo_gym/sandbox/providers/registry.py
+++ b/nemo_gym/sandbox/providers/registry.py
@@ -75,11 +75,18 @@ def _load_opensandbox_provider() -> ProviderClass:
     return OpenSandboxProvider
 
 
+def _load_docker_provider() -> ProviderClass:
+    from nemo_gym.sandbox.providers.docker import DockerSandboxProvider
+
+    return DockerSandboxProvider
+
+
 def _load_apptainer_provider() -> ProviderClass:
     from nemo_gym.sandbox.providers.apptainer import ApptainerProvider
 
     return ApptainerProvider
 
 
-_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider
 _BUILTIN_PROVIDER_LOADERS["opensandbox"] = _load_opensandbox_provider
+_BUILTIN_PROVIDER_LOADERS["docker"] = _load_docker_provider
+_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider
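The registry change follows a lazy-loader pattern: each provider name maps to a function that imports its module only when the provider is requested, so optional dependencies are not imported at registry import time. The sketch below shows the pattern in isolation; `_LOADERS`, `register_loader`, and `resolve` are hypothetical names, and a standard-library class stands in for a provider.

```python
import importlib
from typing import Callable

# Name -> zero-argument loader; nothing is imported until a name is resolved.
_LOADERS: dict[str, Callable[[], type]] = {}


def register_loader(name: str, loader: Callable[[], type]) -> None:
    """Associate a provider name with a function that imports and returns its class."""
    _LOADERS[name] = loader


def resolve(name: str) -> type:
    """Import and return the class registered under `name`."""
    try:
        loader = _LOADERS[name]
    except KeyError:
        raise ValueError(f"unknown provider {name!r}; known: {sorted(_LOADERS)}") from None
    return loader()


def _load_demo() -> type:
    # Stand-in for a provider module: the import runs only on first request.
    return importlib.import_module("json").JSONDecoder


register_loader("demo", _load_demo)
provider_cls = resolve("demo")
```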
diff --git a/resources_servers/swe_bench/README.md b/resources_servers/swe_bench/README.md
new file mode 100644
index 0000000000..b60972b80a
--- /dev/null
+++ b/resources_servers/swe_bench/README.md
@@ -0,0 +1,51 @@
+# swe_bench resources server
+
+SWE-bench **Environment** resources server: `seed_session` returns a `SessionDescriptor` (topology **C**, per-instance sandbox spec); `verify` grades a model patch in a **fresh** eval sandbox (hermetic twin).
+
+Grading harnesses, parsing, and `verify_task` live as **private modules** under this directory (relocated from `responses_api_agents/swe_env/`).
+
+Key modules:
+
+- `task.py` — first-class **Task** (`SweTask`, `TaskPublic`, parse helpers)
+- `session.py` — **SessionDescriptor** returned from `seed_session`
+- `app.py` — thin HTTP surface (`seed_session`, `verify`)
+- `harness.py` / `harnesses/` — benchmark-family grading plugins
+
+## Wiring
+
+```yaml
+responses_api_agents:
+  claude_code_agent:
+    resources_server:
+      type: resources_servers
+      name: swe_bench
+```
+
+## Tests
+
+```bash
+gym env test --resources-server swe_bench
+```
+
+Unit tests use a fake sandbox provider (no Docker required).
+
+## Dataset
+
+Prepare SWE-bench Verified rows with `verifier_metadata` (see `prepare.py`):
+
+```bash
+python resources_servers/swe_bench/prepare.py --limit 5 --no-images
+```
+
+Each JSONL row includes `verifier_metadata.instance_id`, `instance_dict`, `dataset_name`, and optional `container_formatter`.
+
+## Rollouts
+
+```bash
+gym env start --resources-server swe_bench --agent claude_code_swe_bench --model-type openai_model
+gym eval run --no-serve --agent claude_code_swe_bench \
+  --input resources_servers/swe_bench/data/swebench_verified.jsonl \
+  --output results/swe_bench_rollouts.jsonl
+```
+
+Agent servers pass `verifier_metadata.model_patch` (git unified diff) on `POST /verify`.
diff --git a/resources_servers/swe_bench/__init__.py b/resources_servers/swe_bench/__init__.py
new file mode 100644
index 0000000000..ffd5d25501
--- /dev/null
+++ b/resources_servers/swe_bench/__init__.py
@@ -0,0 +1,58 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE-bench Environment resources server modules.
+
+Grading harnesses, parsing, and verify_task implement the Environment MDP
+authority. Agent servers connect via HTTP ``seed_session`` / ``verify`` only.
+"""
+
+from resources_servers.swe_bench.harness import (
+    EvalArtifacts,
+    SweEvalReport,
+    SweTaskHarness,
+    compute_resolved,
+    get_harness,
+    list_harnesses,
+    register_harness,
+    reward_from_report,
+)
+from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+from resources_servers.swe_bench.session import SessionDescriptor
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    TaskPublic,
+    TaskSubmission,
+    parse_task_from_request,
+)
+
+
+__all__ = [
+    "AsyncSweEnvironment",
+    "ENVIRONMENT_NAME",
+    "EvalArtifacts",
+    "SessionDescriptor",
+    "SweEvalReport",
+    "SweTask",
+    "SweTaskHarness",
+    "TaskPublic",
+    "TaskSubmission",
+    "compute_resolved",
+    "get_harness",
+    "list_harnesses",
+    "parse_task_from_request",
+    "register_harness",
+    "reward_from_report",
+]
diff --git a/resources_servers/swe_bench/app.py b/resources_servers/swe_bench/app.py
new file mode 100644
index 0000000000..214189c99e
--- /dev/null
+++ b/resources_servers/swe_bench/app.py
@@ -0,0 +1,113 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""SWE-bench Environment resources server."""
+
+from __future__ import annotations
+
+import dataclasses
+from typing import Any, Literal
+
+from pydantic import Field
+
+import resources_servers.swe_bench.harnesses  # noqa: F401
+from nemo_gym.base_resources_server import (
+    BaseResourcesServerConfig,
+    SimpleResourcesServer,
+)
+from nemo_gym.sandbox import SandboxSpec
+from resources_servers.swe_bench.harness import get_harness
+from resources_servers.swe_bench.session import (
+    EgressDescriptor,
+    PlacementDescriptor,
+    SandboxDescriptor,
+    SessionDescriptor,
+    SweBenchSeedSessionRequest,
+    SweBenchVerifyRequest,
+    SweBenchVerifyResponse,
+)
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    parse_submission,
+    parse_task_from_request,
+)
+from resources_servers.swe_bench.verify_task import report_to_reward, verify_task
+
+
+Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"]
+
+
+class SweBenchResourcesServerConfig(BaseResourcesServerConfig):
+    sandbox_provider: dict[str, Any] = Field(default_factory=lambda: {"docker": {}})
+    container_formatter: str = "swebench/sweb.eval.x86_64.{instance_id}"
+    eval_timeout_s: float = 1800.0
+    flat_eval: bool = True
+    default_topology: Topology = "agent_in_env"
+
+
+def _spec_to_dict(spec: SandboxSpec) -> dict[str, Any]:
+    payload = dataclasses.asdict(spec)
+    resources = payload.get("resources")
+    if resources is not None and hasattr(resources, "__dataclass_fields__"):
+        payload["resources"] = dataclasses.asdict(resources)
+    return payload
+
+
+class SweBenchResourcesServer(SimpleResourcesServer):
+    config: SweBenchResourcesServerConfig
+
+    def _parse_task(self, body: SweBenchSeedSessionRequest | SweBenchVerifyRequest) -> SweTask:
+        return parse_task_from_request(
+            body,
+            container_formatter=self.config.container_formatter,
+            flat_eval=self.config.flat_eval,
+            environment=ENVIRONMENT_NAME,
+        )
+
+    async def seed_session(self, body: SweBenchSeedSessionRequest) -> SessionDescriptor:
+        task = self._parse_task(body)
+        harness = get_harness(task.harness_family)
+        if self.config.flat_eval and hasattr(harness, "with_flat_eval"):
+            harness = harness.with_flat_eval()
+        spec = harness.build_spec(task)
+
+        verifier_metadata = task.privileged_verifier_metadata(flat_eval=self.config.flat_eval)
+        if body.verifier_metadata:
+            verifier_metadata = {**body.verifier_metadata, **verifier_metadata}
+
+        return SessionDescriptor(
+            environment=ENVIRONMENT_NAME,
+            task=task.public_view(environment=ENVIRONMENT_NAME),
+            placement=PlacementDescriptor(topology=self.config.default_topology),
+            sandbox=SandboxDescriptor(spec=_spec_to_dict(spec)),
+            egress=EgressDescriptor(env={}),
+            verifier_metadata=verifier_metadata,
+        )
+
+    async def verify(self, body: SweBenchVerifyRequest) -> SweBenchVerifyResponse:
+        task = self._parse_task(body)
+        task = task.with_submission(parse_submission(body.verifier_metadata))
+
+        report = await verify_task(
+            self.config.sandbox_provider,
+            task,
+            eval_timeout_s=self.config.eval_timeout_s,
+        )
+        reward = report_to_reward(report)
+        masked = report.error_kind is not None
+
+        return SweBenchVerifyResponse(
+            **body.model_dump(),
+            task_id=task.task_id,
+            environment=ENVIRONMENT_NAME,
+            reward=reward,
+            resolved=report.resolved,
+            patch_exists=report.patch_exists,
+            mask_sample=masked,
+            error_kind=report.error_kind,
+        )
+
+
+if __name__ == "__main__":
+    SweBenchResourcesServer.run_webserver()
diff --git a/resources_servers/swe_bench/configs/swe_bench.yaml b/resources_servers/swe_bench/configs/swe_bench.yaml
new file mode 100644
index 0000000000..f3a60d11c2
--- /dev/null
+++ b/resources_servers/swe_bench/configs/swe_bench.yaml
@@ -0,0 +1,33 @@
+swe_bench:
+  resources_servers:
+    swe_bench:
+      entrypoint: app.py
+      domain: coding
+      verified: false
+      description: SWE-bench Environment (seed_session + hermetic verify)
+      sandbox_provider:
+        docker: {}
+      container_formatter: swebench/sweb.eval.x86_64.{instance_id}
+      eval_timeout_s: 1800
+      flat_eval: true
+      default_topology: agent_in_env
+
+claude_code_swe_bench:
+  responses_api_agents:
+    claude_code_agent:
+      entrypoint: app.py
+      resources_server:
+        type: resources_servers
+        name: swe_bench
+      concurrency: 16
+      model: claude-sonnet-4-6
+      anthropic_api_key: ${anthropic_api_key}
+      anthropic_base_url: null
+      max_turns: 30
+      timeout: 1800
+      in_box_timeout_s: 1800
+      sandbox_provider:
+        docker: {}
+      bare: true
+      mcp_config: null
+      settings: null
diff --git a/resources_servers/swe_bench/data/.gitignore b/resources_servers/swe_bench/data/.gitignore
new file mode 100644
index 0000000000..ac481ac55b
--- /dev/null
+++ b/resources_servers/swe_bench/data/.gitignore
@@ -0,0 +1,3 @@
+data/
+__pycache__/
+*.pyc
diff --git a/resources_servers/swe_bench/harness.py b/resources_servers/swe_bench/harness.py
new file mode 100644
index 0000000000..59467432bb
--- /dev/null
+++ b/resources_servers/swe_bench/harness.py
@@ -0,0 +1,365 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Task model and harness contract for the SWE environment library.
+
+The first-class **Task** value lives in ``task.py`` (``SweTask``). This module holds
+the harness registry and grading helpers.
+
+The harness contract is intentionally split across a trust boundary:
+
+* ``build_spec`` / ``supports_provider`` / ``materialize`` are **provisioning**
+  methods imported and called by *agents* (and the verifier).
+* ``reset_repo`` / ``run_eval`` / ``grade`` are **grading** methods used
+  **only** by the grader (``verify_task``). A test asserts agent adapters never
+  reference them.
+
+This module also holds the name->harness registry
+(``register_harness``/``get_harness``/``list_harnesses``) and the pure grading
+helpers (``compute_resolved``/``reward_from_report``), merged here so the harness
+contract, its dispatch, and its scoring live in one place.
+"""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from collections.abc import Iterable
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+from nemo_gym.sandbox import SandboxSpec
+from resources_servers.swe_bench.task import SweTask
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class GraderDependencyError(RuntimeError):
+    """A required grading dependency is unavailable for a task the harness must grade exactly.
+
+    Raised by a harness when it cannot grade an instance faithfully (e.g. ``swebench`` is
+    missing for a SWE-bench instance) and degrading to a generic parser would silently skew
+    the result. ``verify_task`` propagates this rather than swallowing it into an unmasked
+    reward-0, so a misconfigured grader fails loudly instead of quietly degrading scores.
+    """
+
+
+@dataclass
+class EvalArtifacts:
+    """Raw evaluation output retrieved from the sandbox, before grading."""
+
+    test_output: str = ""
+    return_code: int = 0
+    patch_applied: bool = False
+    raw: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class SweEvalReport:
+    """Graded result of a single task. ``error_kind`` masks a sample.
+
+    ``error_kind`` is ``None`` for a clean grade. A non-``None`` value (e.g.
+    ``"sandbox"`` / ``"eval_error"``) marks an infra failure: the sample is
+    masked via this flag and ``reward_from_report`` returns ``0.0`` — **never**
+    ``None`` (the wire ``reward`` field is a non-nullable ``float``).
+    """
+
+    instance_id: str
+    resolved: bool = False
+    patch_applied: bool = False
+    patch_exists: bool = False
+    error_kind: str | None = None
+    tests_status: dict[str, Any] = field(default_factory=dict)
+
+
+class SweTaskHarness(ABC):
+    """Per-family provisioning + (server-private) grading recipe."""
+
+    #: registry key, e.g. ``"swe-bench-ext"``.
+    name: str = ""
+    #: ``"flat-host-grade"`` (parse host-side) or ``"nested-harness"`` (in-container grader).
+    grade_strategy: str = "flat-host-grade"
+
+    # --- provisioning (agent-facing + verifier) ------------------------------
+
+    @abstractmethod
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for a task.
+
+        Args:
+            task (SweTask): The task to provision a sandbox for.
+
+        Returns:
+            SandboxSpec: The spec describing image, workdir, env, ttl, and
+                provider options for the task.
+        """
+
+    def supports_provider(self, provider_name: str) -> bool:
+        """Report whether this harness can run on the named provider.
+
+        The base harness accepts every provider; flat host-graded families work on any
+        exec-capable provider.
+
+        Args:
+            provider_name (str): The name of the sandbox provider.
+
+        Returns:
+            bool: ``True`` if the provider is supported.
+        """
+        return True
+
+    def with_flat_eval(self) -> "SweTaskHarness":
+        """Return a variant that grades host-side (flat) on any exec-capable provider.
+
+        All families already grade host-side, so the base implementation returns ``self``.
+
+        Returns:
+            SweTaskHarness: A harness whose grading runs host-side.
+        """
+        return self
+
+    async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Upload the model patch and test patch into the started sandbox.
+
+        Args:
+            env (AsyncSweEnvironment): The started environment to write into.
+            task (SweTask): The task whose patches are uploaded.
+        """
+        if task.model_patch:
+            await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+        if task.test_patch:
+            await env.write_text("/root/test_patch.diff", _ensure_trailing_newline(task.test_patch))
+
+    # --- server-private grading (verifier only) ------------------------------
+
+    async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Reset the in-sandbox checkout to ``base_commit`` for hermetic grading.
+
+        Uses only ``git reset --hard``, never ``git clean -fdx``: verification
+        runs in a fresh sandbox (no agent edits to scrub), and a clean would
+        delete the image's prebuilt artifacts (compiled C extensions, installed
+        environment) and break the tests.
+
+        Args:
+            env (AsyncSweEnvironment): The started environment to reset.
+            task (SweTask): The task whose ``base_commit`` and ``repo_workdir``
+                are used.
+        """
+        if task.base_commit:
+            await env.execute(f"git reset --hard {task.base_commit}", cwd=task.repo_workdir)
+
+    @abstractmethod
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Apply the patches and run the evaluation, returning raw artifacts.
+
+        Args:
+            env (AsyncSweEnvironment): The started environment to evaluate in.
+            task (SweTask): The task being evaluated.
+
+        Returns:
+            EvalArtifacts: The raw evaluation output retrieved from the sandbox.
+        """
+
+    @abstractmethod
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Parse raw artifacts host-side into a graded report.
+
+        Args:
+            task (SweTask): The task that was evaluated.
+            artifacts (EvalArtifacts): The raw evaluation output to parse.
+
+        Returns:
+            SweEvalReport: The graded result for the task.
+        """
+
+
+def _ensure_trailing_newline(text: str) -> str:
+    """Return the text with a single trailing newline.
+
+    Args:
+        text (str): The input text.
+
+    Returns:
+        str: The text unchanged if it already ends in a newline, otherwise the
+            text with a newline appended.
+    """
+    return text if text.endswith("\n") else text + "\n"
+
+
+# --- name->harness registry ----------------------------
+
+_HARNESSES: dict[str, SweTaskHarness] = {}
+
+
+def register_harness(harness: SweTaskHarness, *, override: bool = False) -> None:
+    """Register a harness under its ``name``.
+
+    Args:
+        harness (SweTaskHarness): The harness to register. Its ``name`` must be
+            non-empty.
+        override (bool): If ``True``, replace an existing harness with the same
+            name instead of raising.
+
+    Raises:
+        ValueError: If the harness name is empty, or a harness with the same name
+            is already registered and ``override`` is ``False``.
+    """
+    if not harness.name:
+        raise ValueError("Harness must define a non-empty 'name'")
+    if not override and harness.name in _HARNESSES:
+        raise ValueError(f"Harness {harness.name!r} is already registered")
+    _HARNESSES[harness.name] = harness
+
+
+# HuggingFace dataset names don't match registry keys; map by substring (most-specific first)
+# so callers can pass a raw ``dataset_name`` (e.g. "princeton-nlp/SWE-bench_Verified").
+_HF_NAME_ALIASES: list[tuple[str, str]] = [
+    ("SWE-bench_Multilingual", "swe-bench-multilingual"),
+    ("R2E-Gym", "r2e-gym"),
+    ("SWE-rebench", "swe-rebench"),
+    ("SWE-bench", "swe-bench"),
+]
+
+
+def _ensure_registered() -> None:
+    """Lazily register the built-in harnesses if the registry is empty.
+
+    Importing ``resources_servers.swe_bench.harnesses`` registers all families, but a fresh
+    process (e.g. a Ray worker running the decoupled agent) may call ``get_harness`` before that
+    import has run. Registering on demand keeps lookups robust regardless of import order.
+    """
+    if _HARNESSES:
+        return
+    from resources_servers.swe_bench.harnesses import register_builtin_harnesses
+
+    register_builtin_harnesses()
+
+
+def get_harness(name: str) -> SweTaskHarness:
+    """Look up a harness by registry key, or by HuggingFace dataset-name substring.
+
+    Built-in harnesses are registered on first use (robust to import order). An exact key
+    match wins; otherwise a HuggingFace ``dataset_name`` substring is resolved to its key (e.g.
+    ``"princeton-nlp/SWE-bench_Verified"`` -> ``"swe-bench"``).
+
+    Args:
+        name (str): The registry key, or a HuggingFace dataset name.
+
+    Returns:
+        SweTaskHarness: The registered harness.
+
+    Raises:
+        KeyError: If no harness matches ``name``.
+    """
+    _ensure_registered()
+    if name in _HARNESSES:
+        return _HARNESSES[name]
+    for needle, key in _HF_NAME_ALIASES:
+        if needle in name and key in _HARNESSES:
+            return _HARNESSES[key]
+    available = ", ".join(sorted(_HARNESSES)) or "(none)"
+    raise KeyError(f"Unknown SWE harness {name!r}. Registered: {available}")
+
+
+def list_harnesses() -> list[str]:
+    """List the names of all registered harnesses.
+
+    Returns:
+        list[str]: The registered harness names, sorted alphabetically.
+    """
+    return sorted(_HARNESSES)
+
+
+# --- pure grading helpers -------------------------------
+
+
+def compute_resolved(
+    *,
+    fail_to_pass: Iterable[str],
+    pass_to_pass: Iterable[str],
+    passed: Iterable[str],
+    eval_type: str = "pass_and_fail",
+    status_map: dict[str, str] | None = None,
+) -> bool:
+    """Apply the SWE-bench resolution rule.
+
+    Two eval types are supported, mirroring swebench's per-repo selection
+    (``swebench.harness.grading.get_eval_report`` /
+    ``get_eval_tests_report`` + ``get_resolution_status``):
+
+    * ``"pass_and_fail"`` (default): mirrors swebench's ``check_pass_and_fail``
+      classification combined with the ratio-based ``get_resolution_status``. When a
+      ``status_map`` is supplied, each required test is a **success** when present and
+      PASSED/XFAIL (``test_passed``), a **failure** when absent or FAILED/ERROR
+      (``test_failed``), and **neutral** (excluded from both counts) for any other
+      status (e.g. SKIPPED/XPASS). A task is resolved only when there are zero
+      failures across FAIL_TO_PASS and PASS_TO_PASS (each ratio ``== 1``; an
+      all-neutral category with total ``0`` counts as ``1``). Without a
+      ``status_map`` it falls back to plain ``passed``-set membership.
+    * ``"fail_only"``: used for the JS multilingual repos in swebench's
+      ``FAIL_ONLY_REPOS`` (chartjs/Chart.js, processing/p5.js, markedjs/marked). A
+      required test counts as success **unless** it is present in ``status_map``
+      **and** its status is ``FAILED``. This mirrors swebench's ``check_fail_only``.
+
+    Args:
+        fail_to_pass (Iterable[str]): Tests that must transition from failing to
+            passing.
+        pass_to_pass (Iterable[str]): Tests that must remain passing.
+        passed (Iterable[str]): The tests that actually passed.
+        eval_type (str): ``"pass_and_fail"`` or ``"fail_only"`` (selected by the
+            caller from ``test_spec.repo``).
+        status_map (dict[str, str] | None): Full per-test status map. Required for
+            the ``"fail_only"`` rule (to detect a present-and-FAILED required test)
+            and used by ``"pass_and_fail"`` to exclude neutral-status required tests
+            exactly as swebench does.
+
+    Returns:
+        bool: ``True`` if no required test counts as a failure under the selected
+            rule, ``False`` if there are no required tests or any required test
+            counts as a failure.
+    """
+    required = list(fail_to_pass) + list(pass_to_pass)
+    if not required:
+        return False
+    if eval_type == "fail_only":
+        sm = status_map or {}
+        # Mirror swebench's check_fail_only: a required test is a failure only when
+        # present in the status map AND explicitly FAILED; anything else is success.
+        return all(not (test in sm and sm[test] == "FAILED") for test in required)
+    if status_map is not None:
+        # Mirror swebench's check_pass_and_fail + get_resolution_status: a required
+        # test is a failure only when it is absent or its status is FAILED/ERROR;
+        # PASSED/XFAIL are successes and any other status (SKIPPED/XPASS) is neutral
+        # (excluded). Resolution requires zero failures in BOTH categories.
+        return all(test in status_map and status_map[test] not in ("FAILED", "ERROR") for test in required)
+    passed_set = set(passed)
+    return all(test in passed_set for test in required)
+
+
+def reward_from_report(report: SweEvalReport) -> float:
+    """Map a graded report to a reward.
+
+    An infra or eval failure (``error_kind`` set) yields ``0.0`` and is masked
+    via the flag downstream; the result is always a ``float`` and never ``None``.
+
+    Args:
+        report (SweEvalReport): The graded result to convert.
+
+    Returns:
+        float: ``1.0`` if the task resolved with no error, otherwise ``0.0``.
+    """
+    if report.error_kind is not None:
+        return 0.0
+    return 1.0 if report.resolved else 0.0
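Review note: the two resolution rules documented in `compute_resolved` are easiest to compare on a tiny status map. The sketch below is a standalone restatement of the rule as described in the docstring above, not an import of the module. The name `resolved_sketch` and the sample test ids are hypothetical, and it assumes a status map is always supplied (the `passed`-set fallback is left out).

```python
from collections.abc import Iterable


def resolved_sketch(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    status_map: dict[str, str],
    eval_type: str = "pass_and_fail",
) -> bool:
    """Hypothetical restatement of the resolution rule; assumes a status map is given."""
    required = list(fail_to_pass) + list(pass_to_pass)
    if not required:
        # No required tests: never resolved.
        return False
    if eval_type == "fail_only":
        # A failure only when the test is present AND explicitly FAILED.
        return not any(status_map.get(test) == "FAILED" for test in required)
    # pass_and_fail: a failure when absent or FAILED/ERROR; other statuses are neutral.
    return not any(
        test not in status_map or status_map[test] in ("FAILED", "ERROR")
        for test in required
    )


# Hypothetical statuses for three tests; "t::missing" is deliberately absent.
status = {"t::a": "PASSED", "t::b": "SKIPPED", "t::c": "ERROR"}

skipped_is_neutral = resolved_sketch(["t::a"], ["t::b"], status)
error_is_failure = resolved_sketch(["t::a"], ["t::c"], status)
absent_is_failure = resolved_sketch(["t::a"], ["t::missing"], status)
fail_only_lenient = resolved_sketch(["t::c"], ["t::missing"], status, eval_type="fail_only")
print(skipped_is_neutral, error_is_failure, absent_is_failure, fail_only_lenient)
```

Under `"fail_only"` an absent or `ERROR` test still counts as a success, which is why the last call resolves while the same inputs would not under `"pass_and_fail"`.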
diff --git a/resources_servers/swe_bench/harnesses/__init__.py b/resources_servers/swe_bench/harnesses/__init__.py
new file mode 100644
index 0000000000..e55c86e926
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/__init__.py
@@ -0,0 +1,64 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE dataset-family harnesses. Importing this package registers all families.
+
+Every built-in family is flat and host-graded: it runs the instance's evaluation
+inside a single sandbox, parses the output host-side, and works on any
+exec-capable provider (including docker). The registered families are
+``swe-bench-ext``, ``nv-internal-1``, ``swe-rebench``, ``swe-bench``,
+``swe-bench-multilingual``, and ``r2e-gym``. (The previously apptainer-only nested
+grading for ``swe-bench``/``swe-bench-multilingual``/``r2e-gym`` was removed when
+PR #1694 took ownership of the apptainer provider.)
+"""
+
+from resources_servers.swe_bench.harness import list_harnesses, register_harness
+from resources_servers.swe_bench.harnesses.nv_internal import NVInternalHarness
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.harnesses.swe_rebench import SweRebenchHarness
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+def register_builtin_harnesses() -> None:
+    """Register every built-in SWE dataset-family harness.
+
+    Constructs each built-in harness and registers it under its name, skipping
+    any name that is already registered so the call is safe to run more than once.
+    """
+    builtins = [
+        SweBenchExtHarness(),
+        NVInternalHarness(),
+        SweRebenchHarness(),
+        SweBenchHarness("swe-bench"),
+        SweBenchHarness("swe-bench-multilingual"),
+        R2EGymHarness(),
+    ]
+    existing = set(list_harnesses())
+    for harness in builtins:
+        if harness.name not in existing:
+            register_harness(harness)
+
+
+register_builtin_harnesses()
+
+
+__all__ = [
+    "NVInternalHarness",
+    "R2EGymHarness",
+    "SweBenchExtHarness",
+    "SweBenchHarness",
+    "SweRebenchHarness",
+    "register_builtin_harnesses",
+]
diff --git a/resources_servers/swe_bench/harnesses/flat_eval.py b/resources_servers/swe_bench/harnesses/flat_eval.py
new file mode 100644
index 0000000000..37a6a1727d
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/flat_eval.py
@@ -0,0 +1,280 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Flat (host-graded) eval-script mode for SWE dataset families.
+
+Flat mode runs an instance's eval script directly in the sandbox and parses the
+produced log host-side, computing ``resolved`` from ``FAIL_TO_PASS`` /
+``PASS_TO_PASS`` via :func:`compute_resolved`. Because there is no nested
+container, this runs on any exec-capable provider (docker / opensandbox).
+
+The eval script resets the repo, applies the gold/model patch plus the test
+patch, runs the repo's test command, and wraps the test output between two
+sentinel markers::
+
+    >>>>> Start Test Output
+    ... per-test "PASSED <id>" / "FAILED <id>" lines ...
+    >>>>> End Test Output
+
+It also emits patch-apply / reset / timeout status codes
+(``>>>>> Applied Patch`` etc.). The host-side parser in this module recognises
+these markers and per-test status tokens without importing ``swebench``, so
+grading can run in environments where that package (and its Docker
+dependencies) is absent.
+
+``flat_eval_enabled`` reports whether flat mode applies to a task, which is the case when
+the harness selects it or the task opts in via ``SweTask.metadata["flat_eval"]``. The
+verifier honors that per-task key by calling ``SweTaskHarness.with_flat_eval()``, a no-op
+for the built-in families, which already grade host-side. (A previously apptainer-only
+nested grading path for swe-bench / r2e-gym was removed in PR #1694.)
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, compute_resolved
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# SWE-bench eval-log sentinels, kept here so we never import swebench at grade
+# time.
+APPLY_PATCH_FAIL = ">>>>> Patch Apply Failed"
+APPLY_PATCH_PASS = ">>>>> Applied Patch"
+RESET_FAILED = ">>>>> Reset Failed"
+TESTS_ERROR = ">>>>> Tests Errored"
+TESTS_TIMEOUT = ">>>>> Tests Timed Out"
+START_TEST_OUTPUT = ">>>>> Start Test Output"
+END_TEST_OUTPUT = ">>>>> End Test Output"
+
+# Codes that mean the harness/patch/test setup failed before tests could be
+# trusted; their presence forces an empty status map + patch_applied=False.
+_BAD_CODES = (APPLY_PATCH_FAIL, RESET_FAILED, TESTS_ERROR, TESTS_TIMEOUT)
+
+# Per-test status tokens a pytest-style test runner emits at the start of a line
+# ("PASSED tests/test_x.py::test_a"). XFAIL counts as a pass.
+_PASS_TOKENS = ("PASSED", "XFAIL")
+_FAIL_TOKENS = ("FAILED", "ERROR")
+_STATUS_TOKENS = _PASS_TOKENS + _FAIL_TOKENS + ("SKIPPED",)
+
+# Where the flat path writes the eval script and its captured log inside the
+# sandbox.
+EVAL_SCRIPT_PATH = "/root/eval.sh"
+EVAL_LOG_PATH = "/root/eval_output.log"
+
+
+def parse_eval_log(log: str) -> tuple[dict[str, str], bool]:
+    """Parse a SWE-bench eval-script log host-side.
+
+    For the common pytest-style runner:
+
+    1. If any "bad code" (patch-apply / reset / tests-error / timeout) is
+       present, the run is untrustworthy -> return ``({}, False)``.
+    2. If the ``Start``/``End`` test-output markers are missing, the test patch
+       never applied -> return ``({}, False)``.
+    3. Otherwise extract the slice between the markers and parse per-test
+       ``"<STATUS> <node_id>"`` lines into a ``{node_id: STATUS}`` map. As a
+       fallback (output sometimes escapes the markers, e.g. to stderr) the whole
+       log is scanned when the slice yields nothing.
+
+    Args:
+        log: The combined stdout/stderr captured from running the eval script.
+
+    Returns:
+        A tuple ``(status_map, patch_applied)``. ``status_map`` maps each test
+        node id to its status token. ``patch_applied`` is ``True`` only when the
+        markers were found and no bad code fired.
+    """
+    if any(code in log for code in _BAD_CODES):
+        return {}, False
+    if START_TEST_OUTPUT not in log or END_TEST_OUTPUT not in log:
+        return {}, False
+
+    between = log.split(START_TEST_OUTPUT, 1)[1].split(END_TEST_OUTPUT, 1)[0]
+    status_map = _parse_pytest_status_lines(between)
+    if not status_map:
+        # Fallback: some runners emit per-test lines outside the markers.
+        status_map = _parse_pytest_status_lines(log)
+    return status_map, True
+
+
+def _parse_pytest_status_lines(text: str) -> dict[str, str]:
+    """Parse ``"<STATUS> <node_id>"`` pytest-style lines into a status map.
+
+    A status line starts with one of the recognised status tokens, and the node
+    id is the second whitespace field. FAILED lines may read
+    ``"FAILED <id> - <reason>"``; the trailing reason is stripped by rewriting
+    ``" - "`` to ``" "``.
+
+    Args:
+        text: Text containing zero or more per-test status lines.
+
+    Returns:
+        A mapping from each test node id to its status token. When a node id
+        appears more than once, the last occurrence wins.
+    """
+    status_map: dict[str, str] = {}
+    for raw_line in text.split("\n"):
+        line = raw_line.strip()
+        token = next((t for t in _STATUS_TOKENS if line.startswith(t)), None)
+        if token is None:
+            continue
+        if token == "FAILED":
+            line = line.replace(" - ", " ")
+        fields = line.split()
+        if len(fields) <= 1:
+            continue
+        node_id = fields[1]
+        # Last status wins for a duplicated node id: a later line overwrites an
+        # earlier one, so a runner that re-reports a node (e.g. a rerun plugin)
+        # ends up with its final status.
+        status_map[node_id] = fields[0]
+    return status_map
+
+
+def passed_tests(status_map: dict[str, str]) -> list[str]:
+    """Return node ids whose status counts as a pass (PASSED or XFAIL).
+
+    Args:
+        status_map: A mapping from test node id to its status token.
+
+    Returns:
+        The list of node ids whose status is a passing token.
+    """
+    return [node for node, status in status_map.items() if status in _PASS_TOKENS]
+
+
+async def flat_run_eval(env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+    """Run the instance's eval script in the sandbox and capture its log.
+
+    The eval script must be supplied on the task via
+    ``task.metadata["eval_script"]``. It is written into the sandbox and run,
+    teeing its combined output to :data:`EVAL_LOG_PATH`; the captured
+    stdout/stderr already contain the ``>>>>>`` markers, so ``test_output`` is
+    graded directly. The log file is read back as a fallback when the streamed
+    output is empty.
+
+    Args:
+        env: The SWE environment used to write files and execute commands in the
+            sandbox.
+        task: The task whose ``metadata["eval_script"]`` is run.
+
+    Returns:
+        An :class:`EvalArtifacts` holding the captured test output, the script's
+        return code, whether a model patch existed, and raw metadata. When no
+        eval script is present the artifacts carry an ``eval_error``.
+    """
+    eval_script = task.metadata.get("eval_script", "")
+    if not eval_script:
+        # No script to run -> tag as "eval_error"; flat_grade() grades it unmasked as unresolved.
+        return EvalArtifacts(
+            test_output="",
+            return_code=1,
+            patch_applied=False,
+            raw={"error_type": "eval_error", "flat": True},
+        )
+
+    await env.write_text(EVAL_SCRIPT_PATH, eval_script if eval_script.endswith("\n") else eval_script + "\n")
+    # The script is self-contained (it resets + applies patches + runs tests).
+    # Combined output is tee'd to a log file, and `exit ${PIPESTATUS[0]}` reports the
+    # script's own exit status rather than tee's; grade() parses the captured log either way.
+    result = await env.execute(
+        f"bash {EVAL_SCRIPT_PATH} 2>&1 | tee {EVAL_LOG_PATH}; exit ${{PIPESTATUS[0]}}",
+        cwd=task.repo_workdir,
+        is_eval=True,
+        timeout_s=task.metadata.get("tests_timeout"),
+    )
+    log_text = result["output"]
+    if not log_text.strip() and result.get("error_type") not in {"sandbox", "timeout"}:
+        # Streamed output was empty; fall back to the tee'd log file.
+        cat = await env.execute(f"cat {EVAL_LOG_PATH}", cwd=task.repo_workdir)
+        if cat["returncode"] == 0:
+            log_text = cat["output"]
+
+    return EvalArtifacts(
+        test_output=log_text,
+        return_code=result["returncode"],
+        patch_applied=bool(task.model_patch),
+        raw={"error_type": result.get("error_type"), "flat": True},
+    )
+
+
+def flat_grade(task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+    """Grade a flat eval-script log host-side.
+
+    Only genuine infra failures (sandbox/timeout) are masked via ``error_kind``.
+    An unbuildable / missing / empty eval spec (``error_type == "eval_error"``) is
+    NOT masked: it falls through to the parser, which finds no markers and grades
+    unmasked ``resolved=False`` (reward 0), matching main's behavior. A log with a
+    bad code or missing markers likewise grades as unresolved with
+    ``patch_applied`` set from the parse, since a failed setup is a legitimate
+    unresolved rather than an infra mask.
+
+    Args:
+        task: The task being graded, supplying the instance id, expected
+            ``fail_to_pass`` / ``pass_to_pass`` tests, and model patch.
+        artifacts: The eval artifacts produced by :func:`flat_run_eval`.
+
+    Returns:
+        A :class:`SweEvalReport` describing whether the task was resolved,
+        whether the patch applied and existed, any masking ``error_kind``, and
+        the per-test status breakdown.
+    """
+    if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            patch_applied=artifacts.patch_applied,
+            error_kind=artifacts.raw["error_type"],
+        )
+
+    status_map, log_patch_applied = parse_eval_log(artifacts.test_output)
+    passed = passed_tests(status_map)
+    # Thread the full status_map so compute_resolved mirrors swebench's
+    # get_eval_tests_report semantics: a required test counts as a failure only when
+    # absent or FAILED/ERROR, while neutral statuses (SKIPPED/XPASS) are excluded
+    # rather than treated as failures (which a bare passed-set membership check would).
+    resolved = log_patch_applied and compute_resolved(
+        fail_to_pass=task.fail_to_pass,
+        pass_to_pass=task.pass_to_pass,
+        passed=passed,
+        status_map=status_map,
+    )
+    return SweEvalReport(
+        instance_id=task.instance_id,
+        resolved=resolved,
+        patch_applied=log_patch_applied,
+        patch_exists=bool(task.model_patch),
+        tests_status={"passed": passed, "all": status_map},
+    )
+
+
+def flat_eval_enabled(harness_flag: bool, task: SweTask) -> bool:
+    """Return whether flat (host-side) mode should be used for this task.
+
+    Flat mode applies when the harness flag selects it or the task opts in via
+    ``metadata["flat_eval"]``. This is a pure predicate; it neither swaps the
+    harness nor changes provider support.
+
+    Args:
+        harness_flag: Whether the harness itself selects flat grading.
+        task: The task whose ``metadata["flat_eval"]`` is consulted.
+
+    Returns:
+        ``True`` when either source selects flat mode, otherwise ``False``.
+    """
+    return bool(harness_flag) or bool(task.metadata.get("flat_eval", False))
diff --git a/resources_servers/swe_bench/harnesses/nv_internal.py b/resources_servers/swe_bench/harnesses/nv_internal.py
new file mode 100644
index 0000000000..3eb7fac4f0
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/nv_internal.py
@@ -0,0 +1,426 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""nv-internal-1 harness: flat, host-graded NVIDIA-internal family.
+
+This family does not run any in-container grading harness: it ships a per-instance
+``run_script.sh`` + ``parsing_script.py`` that emit a structured ``output.json``
+test report. The recipe is a 3-hop sequence:
+
+    1. ``bash run_script.sh <test_files> > stdout.log 2> stderr.log``  (keep streams separate)
+    2. ``python parsing_script.py stdout.log stderr.log output.json``  (parse to JSON report)
+    3. read ``output.json`` back host-side
+
+Grading is then a pure host-side parse of that report's ``{tests: [{name, status}]}``
+shape. Because the family is flat and host-graded, it runs on any exec-capable
+provider (e.g. docker). The run script, parsing script, and model patch are
+uploaded by ``materialize``.
+"""
+
+from __future__ import annotations
+
+import ast
+import json
+import re
+from typing import TYPE_CHECKING, Any
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+    EvalArtifacts,
+    SweEvalReport,
+    SweTask,
+    SweTaskHarness,
+    _ensure_trailing_newline,
+    compute_resolved,
+)
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+#: nv-internal default working directory.
+NV_DEFAULT_WORKDIR = "/app"
+#: The generic ``build_task`` default workdir; means "the row didn't set one".
+_GENERIC_DEFAULT_WORKDIR = "/testbed"
+
+
+def _nv_workdir(task: SweTask) -> str:
+    """Resolve the working directory for nv-internal hops.
+
+    The generic ``build_task`` defaults ``repo_workdir`` to ``/testbed``, which is
+    not the nv-internal convention. A row that explicitly sets a non-default
+    ``repo_workdir`` is honored; otherwise the nv-internal default ``/app`` is used.
+
+    Args:
+        task: The task whose ``repo_workdir`` is consulted.
+
+    Returns:
+        The working directory path (str) to run every nv-internal hop in.
+    """
+    workdir = task.repo_workdir
+    if not workdir or workdir == _GENERIC_DEFAULT_WORKDIR:
+        return NV_DEFAULT_WORKDIR
+    return workdir
+
+
+def parse_passed_tests(report: dict[str, Any]) -> list[str]:
+    """Extract PASSED test names from a parsing_script ``output.json`` report.
+
+    The report shape is ``{"tests": [{"name": ..., "status": "PASSED"|...}, ...]}``.
+
+    Args:
+        report: The parsed ``output.json`` report mapping.
+
+    Returns:
+        The list of test names (list[str]) whose status is ``"PASSED"``.
+    """
+    return [
+        test["name"]
+        for test in report.get("tests", [])
+        if isinstance(test, dict) and test.get("status") == "PASSED" and "name" in test
+    ]
+
+
+class NVInternalHarness(SweTaskHarness):
+    """Flat, host-graded harness for the NVIDIA-internal task family.
+
+    Tasks ship their own ``run_script.sh`` and ``parsing_script.py`` that produce
+    a structured ``output.json`` report, which is graded entirely host-side. The
+    harness runs on any exec-capable provider.
+    """
+
+    name = "nv-internal-1"
+    grade_strategy = "flat-host-grade"
+
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for an nv-internal task.
+
+        Environment variables parsed from the task's dockerfiles are injected into
+        ``spec.env`` so the provider applies them to every exec hop. This is a
+        no-op when the dataset does not carry the dockerfiles.
+
+        Args:
+            task: The task to build a sandbox spec for.
+
+        Returns:
+            A :class:`SandboxSpec` describing the image, workdir, timeouts,
+            environment, metadata, resources, and provider options.
+        """
+        env = {"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"}
+        env.update(_parse_dockerfile_env(task))
+        return SandboxSpec(
+            image=task.image,
+            workdir=_nv_workdir(task),
+            ttl_s=task.metadata.get("ttl_s", 1800),
+            ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+            env=env,
+            metadata={
+                "instance_id": task.instance_id[:63],
+                "benchmark": task.benchmark,
+                "harness": self.name,
+            },
+            resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+            provider_options=task.metadata.get("provider_options", {}),
+        )
+
+    def supports_provider(self, provider_name: str) -> bool:
+        """Report whether this harness supports the named provider.
+
+        The family is flat and host-graded, so every exec-capable provider is
+        supported.
+
+        Args:
+            provider_name: The provider name being checked.
+
+        Returns:
+            ``True`` for every provider.
+        """
+        return True  # flat, host-graded: works on any exec-capable provider
+
+    async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Upload run_script.sh, parsing_script.py, and the model patch.
+
+        The scripts live in ``task.metadata``. The dataset stores them under
+        dotted keys (``"run_script.sh"`` / ``"parsing_script.py"``), which are read
+        first, falling back to the extensionless keys only if the dotted ones are
+        absent.
+
+        Args:
+            env: The environment used to write files into the sandbox.
+            task: The task carrying the patch and scripts to upload.
+        """
+        if task.model_patch:
+            await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+        run_script = task.metadata.get("run_script.sh") or task.metadata.get("run_script", "")
+        parsing_script = task.metadata.get("parsing_script.py") or task.metadata.get("parsing_script", "")
+        if run_script:
+            await env.write_text("/root/run_script.sh", _ensure_trailing_newline(run_script))
+        if parsing_script:
+            await env.write_text("/root/parsing_script.py", _ensure_trailing_newline(parsing_script))
+
+    async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Reset the checkout to ``base_commit``.
+
+        Runs ``git reset --hard`` followed by ``git checkout`` of the base commit
+        (not ``git clean``) in the nv-internal working directory.
+
+        Args:
+            env: The environment used to execute commands in the sandbox.
+            task: The task carrying the ``base_commit`` to reset to.
+        """
+        if task.base_commit:
+            await env.execute(
+                f"git reset --hard {task.base_commit} && git checkout {task.base_commit}",
+                cwd=_nv_workdir(task),
+            )
+
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Run the 3-hop evaluation recipe and collect its artifacts.
+
+        Applies the model patch, runs the optional per-instance repo setup hook,
+        then executes the run/parse/read sequence. Sandbox or timeout failures in
+        any hop short-circuit and are surfaced via ``raw["error_type"]``.
+
+        Args:
+            env: The environment used to execute commands in the sandbox.
+            task: The task being evaluated.
+
+        Returns:
+            An :class:`EvalArtifacts` holding the report output, return code,
+            whether the patch applied cleanly, and any infra error type.
+        """
+        workdir = _nv_workdir(task)
+        # Apply the model patch best-effort: `--reject` applies the hunks that fit and
+        # writes .rej files for the rest; a non-zero rc is recorded, not treated as fatal.
+        patch_applied = True
+        if task.model_patch:
+            applied = await env.execute(
+                "git apply --ignore-space-change --ignore-whitespace --reject -v /root/patch.diff",
+                cwd=workdir,
+            )
+            patch_applied = applied["returncode"] == 0
+
+        # Optional per-instance repo setup hook; only the last line of the command is run.
+        repo_cmd = task.metadata.get("before_repo_set_cmd", "").strip()
+        if repo_cmd:
+            repo_cmd = repo_cmd.split("\n")[-1]
+            setup = await env.execute(repo_cmd, cwd=workdir, is_eval=True)
+            if setup.get("error_type") in {"sandbox", "timeout"}:
+                return EvalArtifacts(
+                    test_output=setup["output"],
+                    return_code=setup["returncode"],
+                    patch_applied=patch_applied,
+                    raw={"error_type": setup.get("error_type")},
+                )
+
+        # Hop 1: run the per-instance script, keeping stdout/stderr separate.
+        # The selected test files are passed positionally.
+        test_files = _format_test_files(task.metadata.get("selected_test_files_to_run", []))
+        run = await env.execute(
+            f"bash /root/run_script.sh {test_files} > /root/stdout.log 2> /root/stderr.log || true",
+            cwd=workdir,
+            is_eval=True,
+        )
+        if run.get("error_type") in {"sandbox", "timeout"}:
+            return EvalArtifacts(
+                test_output=run["output"],
+                return_code=run["returncode"],
+                patch_applied=patch_applied,
+                raw={"error_type": run.get("error_type")},
+            )
+
+        # Hop 2: parse the logs into a JSON report.
+        parse = await env.execute(
+            "python /root/parsing_script.py /root/stdout.log /root/stderr.log /root/output.json",
+            cwd=workdir,
+            is_eval=True,
+        )
+        if parse.get("error_type") in {"sandbox", "timeout"}:
+            return EvalArtifacts(
+                test_output=parse["output"],
+                return_code=parse["returncode"],
+                patch_applied=patch_applied,
+                raw={"error_type": parse.get("error_type")},
+            )
+
+        # Hop 3: read the report back host-side.
+        report = await env.execute("cat /root/output.json", cwd=workdir, is_eval=True)
+        return EvalArtifacts(
+            test_output=report["output"],
+            return_code=report["returncode"],
+            patch_applied=patch_applied,
+            raw={"error_type": report.get("error_type")},
+        )
+
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Grade the evaluation artifacts into a report.
+
+        Parses the host-side ``output.json`` report, extracts PASSED tests, and
+        derives resolution from the required FAIL_TO_PASS / PASS_TO_PASS sets. An
+        infra failure (sandbox or timeout) is masked via ``error_kind`` rather than
+        scored as unresolved.
+
+        Args:
+            task: The task being graded.
+            artifacts: The artifacts produced by ``run_eval``.
+
+        Returns:
+            A :class:`SweEvalReport` with resolution status, patch flags, and the
+            parsed test report.
+        """
+        # Infra failure → mask via error_kind (never scored as "unresolved").
+        if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+            return SweEvalReport(
+                instance_id=task.instance_id,
+                patch_exists=bool(task.model_patch),
+                patch_applied=artifacts.patch_applied,
+                error_kind=artifacts.raw["error_type"],
+            )
+        try:
+            report = json.loads(artifacts.test_output) if artifacts.test_output.strip() else {}
+        except (ValueError, TypeError):
+            report = {}
+        passed = parse_passed_tests(report)
+        f2p, p2p = _resolve_required_tests(task)
+        # Resolution is derived from tests alone and never gated on patch-apply rc.
+        # An empty report or no required tests → unresolved (compute_resolved
+        # returns False).
+        resolved = compute_resolved(
+            fail_to_pass=f2p,
+            pass_to_pass=p2p,
+            passed=passed,
+        )
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            resolved=resolved,
+            patch_applied=artifacts.patch_applied,
+            patch_exists=bool(task.model_patch),
+            tests_status={"passed": passed, "report": report},
+        )
+
+
+def _format_test_files(test_files: Any) -> str:
+    """Build the comma-joined test-files argument.
+
+    Accepts a list, or a string that is either a comma-joined value or a
+    ``repr``-style list. A stringified list may use single quotes
+    (``['a', 'b']``) which ``json.loads`` rejects, so ``ast.literal_eval`` is used
+    (handling single-quoted and native lists) with a safe fallback to the raw
+    string.
+
+    Args:
+        test_files: A list/tuple of names, or a string holding a comma-joined
+            value or a stringified list.
+
+    Returns:
+        The comma-joined test-files argument (str); empty for unsupported inputs.
+    """
+    if isinstance(test_files, (list, tuple)):
+        return ",".join(str(item) for item in test_files)
+    if isinstance(test_files, str):
+        stripped = test_files.strip()
+        if stripped.startswith("[") and stripped.endswith("]"):
+            try:
+                parsed = ast.literal_eval(stripped)
+                if isinstance(parsed, (list, tuple)):
+                    return ",".join(str(item) for item in parsed)
+            except (ValueError, SyntaxError):
+                pass
+        return stripped
+    return ""
+
+
+def _resolve_required_tests(task: SweTask) -> tuple[list[str], list[str]]:
+    """Resolve the FAIL_TO_PASS / PASS_TO_PASS required-test sets.
+
+    The ``fail_to_pass_select`` / ``pass_to_pass_select`` keys on ``task.metadata``
+    take precedence when present; otherwise the plain ``task.fail_to_pass`` /
+    ``task.pass_to_pass`` are used. Values may be lists or stringified lists.
+
+    Args:
+        task: The task whose required-test sets are resolved.
+
+    Returns:
+        A ``(fail_to_pass, pass_to_pass)`` tuple of test-name lists.
+    """
+    f2p = task.metadata.get("fail_to_pass_select")
+    f2p = _coerce_test_list(f2p) if f2p is not None else list(task.fail_to_pass)
+    p2p = task.metadata.get("pass_to_pass_select")
+    p2p = _coerce_test_list(p2p) if p2p is not None else list(task.pass_to_pass)
+    return f2p, p2p
+
+
+def _coerce_test_list(value: Any) -> list[str]:
+    """Coerce a test-list value (list or stringified list) into a list of names.
+
+    Args:
+        value: A list/tuple of names, or a string holding a stringified list.
+
+    Returns:
+        The list of test names (list[str]); empty for unsupported inputs.
+    """
+    if isinstance(value, (list, tuple)):
+        return [str(item) for item in value]
+    if isinstance(value, str):
+        stripped = value.strip()
+        if stripped.startswith("[") and stripped.endswith("]"):
+            try:
+                parsed = ast.literal_eval(stripped)
+                if isinstance(parsed, (list, tuple)):
+                    return [str(item) for item in parsed]
+            except (ValueError, SyntaxError):
+                pass
+    return []
+
+
+def _parse_dockerfile_env(task: SweTask) -> dict[str, str]:
+    """Parse ``ENV`` lines from the task's dockerfiles into a name->value mapping.
+
+    Scans ``base_dockerfile + instance_dockerfile`` for ``ENV`` directives and
+    converts them to environment variables. Handles both Docker forms:
+
+        ENV KEY=VALUE   (equals)
+        ENV KEY VALUE   (space-separated)
+
+    Returns ``{}`` when the dockerfiles are absent from metadata.
+
+    Args:
+        task: The task whose dockerfile metadata is scanned.
+
+    Returns:
+        A mapping (dict[str, str]) of environment variable names to values.
+    """
+    base_dockerfile = str(task.metadata.get("base_dockerfile", "") or "")
+    instance_dockerfile = str(task.metadata.get("instance_dockerfile", "") or "")
+    env: dict[str, str] = {}
+    for raw_line in (base_dockerfile + "\n" + instance_dockerfile).split("\n"):
+        line = raw_line.strip()
+        if not line.startswith("ENV "):
+            continue
+        body = line[len("ENV ") :].strip()
+        if "=" in body:
+            # Format: ENV KEY=VALUE -> normalize spaces around the first `=`.
+            key, _, value = body.partition("=")
+            key = re.sub(r"\s+", "", key)
+            value = value.strip()
+        else:
+            # Format: ENV KEY VALUE -> split into key + remainder value.
+            parts = body.split(None, 1)
+            if len(parts) < 2:
+                continue
+            key, value = parts[0], parts[1]
+        if key:
+            env[key] = value
+    return env
diff --git a/resources_servers/swe_bench/harnesses/r2egym.py b/resources_servers/swe_bench/harnesses/r2egym.py
new file mode 100644
index 0000000000..9b4f42f24c
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/r2egym.py
@@ -0,0 +1,174 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""r2e-gym harness — host-side (flat) graded.
+
+Runs the instance's eval script in the sandbox and parses the log host-side via the shared
+flat-eval path, so it runs on any exec-capable provider.
+
+NOTE: the apptainer-only nested ``run_local_evaluation`` path (which produced r2e-gym's own
+``report.json`` in-container) was removed when PR #1694 took ownership of the apptainer
+provider. Re-wiring r2e-gym's nested grading + ``.sif``/mounts onto #1694's provider is tracked
+for a follow-up PR (see APPTAINER_PR3_TRACKER.md); until then r2e-gym grades flat (it needs an
+``eval_script`` in task metadata, else the flat grader masks the sample as an eval error).
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+    EvalArtifacts,
+    SweEvalReport,
+    SweTask,
+    SweTaskHarness,
+    _ensure_trailing_newline,
+    compute_resolved,
+)
+from resources_servers.swe_bench.harnesses import flat_eval
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class R2EGymHarness(SweTaskHarness):
+    """Harness for the r2e-gym family of SWE tasks (host-side / flat graded)."""
+
+    name = "r2e-gym"
+    grade_strategy = "flat-host-grade"
+
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for an r2e-gym task.
+
+        Args:
+            task: The SWE task whose metadata, image, and workdir describe the sandbox.
+
+        Returns:
+            SandboxSpec: The populated sandbox spec (image, workdir, TTL, env, metadata,
+            resources, and any provider options carried on the task).
+        """
+        return SandboxSpec(
+            image=task.image,
+            workdir=task.repo_workdir,
+            ttl_s=task.metadata.get("ttl_s", 1800),
+            ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+            env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+            metadata={
+                "instance_id": task.instance_id[:63],
+                "benchmark": task.benchmark,
+                "harness": self.name,
+            },
+            resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+            provider_options=dict(task.metadata.get("provider_options", {})),
+        )
+
+    async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Write the bare ``/root/patch.diff`` the eval script applies.
+
+        Args:
+            env: The active SWE environment used to write files into the sandbox.
+            task: The SWE task supplying the model patch (newline-normalized).
+        """
+        if task.model_patch:
+            await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+
+    async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Reset the repository checkout (no-op for r2e-gym).
+
+        Args:
+            env: The active SWE environment (unused).
+            task: The SWE task (unused).
+        """
+        return None
+
+    def hide_eval_tests_commands(self) -> list[str]:
+        """Build shell commands that strip the held-out eval tests from the agent's checkout.
+
+        ``/r2e_tests`` holds the evaluation tests the agent must not see; ``run_tests.sh``
+        launches them. ``run_tests.sh`` is deleted only when it references ``r2e_tests``
+        (substring guard). The agent adapter runs these after ``materialize``.
+
+        Returns:
+            list[str]: One shell command per checkout root (``""``, ``/root``, ``/testbed``).
+        """
+        commands: list[str] = []
+        for root_dir in ["", "/root", "/testbed"]:
+            commands.append(
+                f"rm -rf {root_dir}/r2e_tests && "
+                f"if grep -qs r2e_tests {root_dir}/run_tests.sh; then rm -rf {root_dir}/run_tests.sh; fi"
+            )
+        return commands
+
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Run the instance's eval script in-sandbox and capture the log for host-side grading.
+
+        Args:
+            env: The active SWE environment used to execute commands in the sandbox.
+            task: The SWE task whose ``metadata['eval_script']`` is run.
+
+        Returns:
+            EvalArtifacts: The captured test output, return code, patch existence, and flat
+            markers (masked as ``eval_error`` when no eval script is present).
+        """
+        return await flat_eval.flat_run_eval(env, task)
+
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Grade an r2e-gym task from its evaluation artifacts (host-side, flat).
+
+        Unlike the SWE-bench flat grader, this path does NOT gate ``resolved`` on the
+        SWE-bench ``>>>>> Start/End Test Output`` marker pair: r2e-gym's ``run_tests.sh``
+        does not emit those swebench sentinels, so requiring them would mask every r2e-gym
+        sample as unresolved. Per-test status lines are parsed from the whole log and the
+        node-ids are matched directly against the required ``fail_to_pass`` / ``pass_to_pass``
+        sets (R2E-Gym uses pytest node-ids verbatim). Only genuine infra failures
+        (sandbox/timeout) are masked.
+
+        Args:
+            task: The SWE task being graded.
+            artifacts: The evaluation artifacts produced by ``run_eval``.
+
+        Returns:
+            SweEvalReport: The resolved/unresolved verdict with patch state and any error kind.
+        """
+        if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+            return SweEvalReport(
+                instance_id=task.instance_id,
+                patch_exists=bool(task.model_patch),
+                patch_applied=artifacts.patch_applied,
+                error_kind=artifacts.raw["error_type"],
+            )
+        # Parse per-test status lines from the whole log (no swebench-marker gate). An
+        # unbuildable or empty log yields an empty status map, so no required test passes
+        # and the sample is scored unresolved (not masked); compute_resolved also returns
+        # False for an empty required set (the edge validated by main).
+        status_map = flat_eval._parse_pytest_status_lines(artifacts.test_output)
+        passed = flat_eval.passed_tests(status_map)
+        # Thread the full status_map so compute_resolved mirrors swebench's
+        # get_eval_tests_report semantics: neutral-status required tests (SKIPPED/XPASS)
+        # are excluded rather than treated as failures.
+        resolved = compute_resolved(
+            fail_to_pass=task.fail_to_pass,
+            pass_to_pass=task.pass_to_pass,
+            passed=passed,
+            status_map=status_map,
+        )
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            resolved=resolved,
+            patch_applied=bool(status_map),
+            patch_exists=bool(task.model_patch),
+            tests_status={"passed": passed, "all": status_map},
+        )
diff --git a/resources_servers/swe_bench/harnesses/swe_bench_ext.py b/resources_servers/swe_bench/harnesses/swe_bench_ext.py
new file mode 100644
index 0000000000..7c925c1264
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swe_bench_ext.py
@@ -0,0 +1,311 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-bench-ext harness: flat, host-graded reference family.
+
+Applies the model patch (and test patch) against the repository checkout, runs
+the framework test command, and grades host-side with the parser
+(:func:`resources_servers.swe_bench.parsing.parse_and_check_tests`).
+
+Grading delegates the full per-framework logic to ``parse_and_check_tests``:
+junit-xml parsing, test-id normalization, the fuzzy matcher, the framework
+dispatch, the ``::build``/``::compile`` synthetic-PASS injection, and
+build-failed-package propagation.
+
+``resolved`` is taken from the parser's verdict (all FAIL_TO_PASS passed AND all
+PASS_TO_PASS passed). It does not depend on ``patch_applied``: the model and test
+patches are applied best-effort and grading is on the tests only.
+``patch_applied`` is still recorded for information.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness
+from resources_servers.swe_bench.parsing import (
+    get_framework_config,
+    get_test_command_with_output,
+    parse_and_check_tests,
+)
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# Default checkout locations probed (in order) when locating the repo. The first three mirror
+# main's ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null``
+# ladder in SweBenchExtDatasetProcessor's eval script; ``/root/repo`` is probed last.
+_REPO_WORKDIR_LADDER = ("/testbed", "/workspace/repo", "/app", "/root/repo")
+
+
+# Output markers the parser (parse_and_check_tests) extracts content between.
+_TEST_OUTPUT_START = "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>"
+_TEST_OUTPUT_END = "<<<SWE_BENCH_EXT_TEST_OUTPUT_END>>>"
+_RESULT_FILE_START = "<<<SWE_BENCH_EXT_RESULT_FILE_START>>>"
+_RESULT_FILE_END = "<<<SWE_BENCH_EXT_RESULT_FILE_END>>>"
+
+
+class SweBenchExtHarness(SweTaskHarness):
+    """Flat, host-graded harness for the swe-bench-ext task family.
+
+    Runs the task's framework test command inside a single sandbox and grades the
+    captured output on the host. Works on any exec-capable sandbox provider.
+    """
+
+    name = "swe-bench-ext"
+    grade_strategy = "flat-host-grade"
+
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox specification for a task.
+
+        Args:
+            task: The SWE task describing the image, working directory, and
+                per-task metadata (timeouts, resources, provider options).
+
+        Returns:
+            SandboxSpec: The sandbox spec used to launch the task's container.
+        """
+        return SandboxSpec(
+            image=task.image,
+            workdir=task.repo_workdir,
+            ttl_s=task.metadata.get("ttl_s", 1800),
+            ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+            env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+            metadata={
+                "instance_id": task.instance_id[:63],
+                "benchmark": task.benchmark,
+                "harness": self.name,
+            },
+            resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+            provider_options=task.metadata.get("provider_options", {}),
+        )
+
+    def supports_provider(self, provider_name: str) -> bool:
+        """Report whether this harness supports a sandbox provider.
+
+        Being flat and host-graded, it works on any exec-capable provider.
+
+        Args:
+            provider_name: The name of the sandbox provider.
+
+        Returns:
+            bool: Always ``True``.
+        """
+        return True
+
+    async def _resolve_repo_workdir(self, env: "AsyncSweEnvironment", task: SweTask) -> str:
+        """Locate the repository checkout, mirroring main's ``cd`` fallback ladder.
+
+        Main's ``SweBenchExtDatasetProcessor`` eval script runs
+        ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null``
+        so a repo that is not at ``/testbed`` is still found. This reproduces that
+        host-side: a row-provided ``repo_workdir`` that differs from the default and holds a
+        ``.git`` checkout wins; otherwise the ladder (``/testbed``, ``/workspace/repo``,
+        ``/app``, ``/root/repo``) is probed for a ``.git`` directory. If nothing matches, the
+        task's ``repo_workdir`` is returned unchanged (preserving prior behavior).
+
+        Args:
+            env: The async environment used to probe the sandbox.
+            task: The SWE task whose ``repo_workdir`` is the preferred/default location.
+
+        Returns:
+            str: The resolved repository working directory inside the sandbox.
+        """
+        # Prefer an explicit, non-default row workdir holding a checkout.
+        candidates: list[str] = []
+        if task.repo_workdir and task.repo_workdir != "/testbed":
+            candidates.append(task.repo_workdir)
+        candidates.extend(d for d in _REPO_WORKDIR_LADDER if d not in candidates)
+        for candidate in candidates:
+            probe = await env.execute(f'test -d "{candidate}/.git"', cwd="/")
+            if probe["returncode"] == 0:
+                return candidate
+        return task.repo_workdir
+
+    async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Reset the located checkout to ``base_commit`` for hermetic grading.
+
+        Resolves the repo workdir via the same ladder main uses (so a non-``/testbed``
+        checkout is found), then defers to the base ``git reset --hard`` behavior.
+
+        Args:
+            env: The started environment to reset.
+            task: The task whose ``base_commit`` is restored.
+        """
+        if task.base_commit:
+            workdir = await self._resolve_repo_workdir(env, task)
+            await env.execute(f"git reset --hard {task.base_commit}", cwd=workdir)
+
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Apply patches, run the test command, and capture the evaluation output.
+
+        Applies the model patch (and test patch) best-effort, then runs the
+        framework test command wrapped between output markers so the parser can
+        extract the structured result file or marked stdout.
+
+        Args:
+            env: The async environment used to execute commands in the sandbox.
+            task: The SWE task providing the patches, test command, and framework.
+
+        Returns:
+            EvalArtifacts: The captured test output, return code, whether the
+                model patch applied, and the execution error type if any.
+        """
+        # Resolve the checkout via main's cd ladder so a non-/testbed repo is found.
+        workdir = await self._resolve_repo_workdir(env, task)
+        patch_applied = True
+        # Best-effort apply: a bad apply never fails the run (grading is on the
+        # tests only); we still record whether the model patch applied for info.
+        apply_flags = "--reject --recount --ignore-space-change --ignore-whitespace"
+        if task.model_patch:
+            applied = await env.execute(
+                f"git apply {apply_flags} /root/patch.diff",
+                cwd=workdir,
+            )
+            patch_applied = applied["returncode"] == 0
+        if task.test_patch:
+            await env.execute(
+                f"git apply {apply_flags} /root/test_patch.diff",
+                cwd=workdir,
+            )
+        # Wrap the command's output: add structured-output flags (--junitxml/--json)
+        # via get_test_command_with_output, run it between the markers, and dump the
+        # framework result file so parse_and_check_tests receives junit-xml (preferred)
+        # or the marked stdout.
+        #
+        # The framework is passed through verbatim. An empty framework must NOT be
+        # coerced to "pytest": for a non-pytest instance whose framework is absent, the
+        # parser's auto-detect path is what grades correctly, and the default framework
+        # config adds no flags and no result file. grade() reuses this SAME value via
+        # _resolve_framework so the two stay in lockstep.
+        framework = self._resolve_framework(task)
+        # Use the row's test command verbatim, with NO default runner. Main's
+        # SweBenchExtDatasetProcessor uses ``inst.get("test_command", "")`` (empty when
+        # absent): a command-less row runs no runner and grades unresolved. Injecting a
+        # default ``python -m pytest`` here would diverge from main by fabricating results.
+        base_command = task.test_command
+        test_cmd = get_test_command_with_output(base_command, framework)
+        result_file = (get_framework_config(framework, base_command) or {}).get("result_file")
+        result = await env.execute(self._wrap_eval_command(test_cmd, result_file), cwd=workdir, is_eval=True)
+        return EvalArtifacts(
+            test_output=result["output"],
+            return_code=result["returncode"],
+            patch_applied=patch_applied,
+            raw={"error_type": result.get("error_type")},
+        )
+
+    @staticmethod
+    def _resolve_framework(task: SweTask) -> str:
+        """Return the framework value used by both ``run_eval`` and ``grade``.
+
+        Returns the task's framework verbatim. An empty or unknown value is
+        intentionally passed through unchanged: coercing it to ``"pytest"`` would
+        mis-dispatch the parser for non-pytest instances that ship no framework.
+        Centralizing this guarantees ``run_eval`` (which selects the
+        structured-output flag and result file) and ``grade`` (which parses the
+        output) agree on the framework.
+
+        Args:
+            task: The SWE task whose framework value is returned.
+
+        Returns:
+            str: The task's test framework name (possibly empty).
+        """
+        return task.test_framework
+
+    @staticmethod
+    def _wrap_eval_command(test_cmd: str, result_file: str | None) -> str:
+        """Wrap the eval command in the output markers and a result-file dump.
+
+        The parser prefers the junit/json result file (emitted between the
+        RESULT_FILE markers) and falls back to the marked stdout. The ``mkdir -p``
+        ensures ``/workspace/test-results`` exists first, since some frameworks
+        (e.g. junit/gradle, xctest) write their result file there.
+
+        Args:
+            test_cmd: The test command to run inside the markers.
+            result_file: Path or glob of the framework result file to dump, or
+                ``None`` when the framework produces no result file.
+
+        Returns:
+            str: A shell script that runs the test command and emits the marked
+                output and result-file blocks.
+        """
+        mkdir_block = "mkdir -p /workspace/test-results\n"
+        if result_file and "*" in result_file:
+            result_block = (
+                f'echo "{_RESULT_FILE_START}"\n'
+                f"for f in {result_file}; do\n"
+                f'    if [ -f "$f" ]; then echo "=== FILE: $f ==="; cat "$f"; echo ""; fi\n'
+                f"done 2>/dev/null || true\n"
+                f'echo "{_RESULT_FILE_END}"\n'
+            )
+        elif result_file:
+            result_block = (
+                f'echo "{_RESULT_FILE_START}"\n'
+                f'if [ -f "{result_file}" ]; then cat "{result_file}"; fi\n'
+                f'echo "{_RESULT_FILE_END}"\n'
+            )
+        else:
+            result_block = ""
+        return f'{mkdir_block}echo "{_TEST_OUTPUT_START}"\n{test_cmd}\n{result_block}echo "{_TEST_OUTPUT_END}"\n'
+
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Grade captured evaluation artifacts into a report.
+
+        Infrastructure failures are masked via ``error_kind`` and never scored as
+        unresolved. Otherwise the test output is handed to ``parse_and_check_tests``
+        and ``resolved`` is taken from the parser's verdict.
+
+        Args:
+            task: The SWE task providing the expected test sets and framework.
+            artifacts: The captured test output, return code, and error type.
+
+        Returns:
+            SweEvalReport: The grading report, including ``resolved``,
+                ``patch_applied``, ``patch_exists``, and the parsed test status (or
+                ``error_kind`` on infrastructure failure).
+        """
+        # Infra failure: mask via error_kind (never scored as "unresolved").
+        if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+            return SweEvalReport(
+                instance_id=task.instance_id,
+                patch_exists=bool(task.model_patch),
+                patch_applied=artifacts.patch_applied,
+                error_kind=artifacts.raw["error_type"],
+            )
+        # Delegate to the parser, passing the framework verbatim via the SAME
+        # _resolve_framework value run_eval used. An empty/unknown framework falls
+        # through to the parser's auto-detect path; coercing it to "pytest" here would
+        # mis-grade non-pytest instances.
+        test_framework = self._resolve_framework(task)
+        result = parse_and_check_tests(
+            test_output=artifacts.test_output,
+            test_framework=test_framework,
+            fail_to_pass=task.fail_to_pass,
+            pass_to_pass=task.pass_to_pass,
+            instance_id=task.instance_id,
+        )
+        # resolved is the parser's verdict (all F2P passed AND all P2P passed); it
+        # does NOT gate on patch_applied (grading is on tests only).
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            resolved=bool(result["resolved"]),
+            patch_applied=artifacts.patch_applied,
+            patch_exists=bool(task.model_patch),
+            tests_status=result,
+        )
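
Reviewer note: the candidate ordering in `_resolve_repo_workdir` above can be checked in isolation. The following is a minimal sketch, not the harness code itself. It assumes the ladder values listed in the docstring (`/testbed`, `/workspace/repo`, `/app`, `/root/repo`), and it replaces the sandbox probe (`test -d "<dir>/.git"`) with a hypothetical `has_git` set of directories that hold a checkout. The names `LADDER`, `candidate_order`, and `resolve` are illustrative only.

```python
from typing import Optional

# Ladder values as listed in the docstring; the real constant is _REPO_WORKDIR_LADDER.
LADDER = ("/testbed", "/workspace/repo", "/app", "/root/repo")


def candidate_order(repo_workdir: Optional[str]) -> list:
    """Return probe order: a non-default row workdir first, then the ladder."""
    candidates = []
    if repo_workdir and repo_workdir != "/testbed":
        candidates.append(repo_workdir)
    # Skip ladder entries already present so no directory is probed twice.
    candidates.extend(d for d in LADDER if d not in candidates)
    return candidates


def resolve(repo_workdir: Optional[str], has_git: set) -> Optional[str]:
    """Pick the first candidate with a checkout; fall back to the row value."""
    for candidate in candidate_order(repo_workdir):
        if candidate in has_git:  # stands in for the in-sandbox `test -d` probe
            return candidate
    return repo_workdir


if __name__ == "__main__":
    print(candidate_order("/app"))
    print(resolve("/srv/code", {"/workspace/repo"}))
```

Under these assumptions, a row workdir that holds a checkout wins over `/testbed`, and a row workdir with no checkout anywhere is returned unchanged, which matches the fallback described in the docstring.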
diff --git a/resources_servers/swe_bench/harnesses/swe_rebench.py b/resources_servers/swe_bench/harnesses/swe_rebench.py
new file mode 100644
index 0000000000..68b863182b
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swe_rebench.py
@@ -0,0 +1,375 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-rebench harness: a flat, host-graded family with a dynamically loaded log parser.
+
+The flow is: reset to base, apply the model patch and test patch, run the
+install/test commands, then parse the test log host-side.
+
+Two things distinguish swe-rebench:
+
+* **Java env** — SWE-rebench tasks need
+  ``_JAVA_OPTIONS=-Djava.net.preferIPv6Addresses=false``, surfaced via
+  ``build_spec.env`` so it is set for the whole sandbox session.
+* **Dynamic log parser** — swe-rebench has no single uniform pytest summary; the
+  correct per-test PASSED/FAILED status comes from a repo-specific parser keyed
+  by ``log_parser`` and shipped in the cloned ``SWE-rebench-V2`` repo
+  (``lib/agent/log_parsers.py`` or ``agent/log_parsers.py``). It is imported
+  dynamically, guarded by try/except.
+
+The cloned ``SWE-rebench-V2`` directory must be provisioned out-of-band. When it
+is absent or the named parser cannot be resolved, ``grade`` masks the sample via
+``error_kind`` rather than scoring a misleading ``unresolved``.
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import json
+import re
+import sys
+from pathlib import Path
+from typing import TYPE_CHECKING, Any, Callable
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# JVM flag required for every SWE-rebench task.
+_JAVA_OPTIONS = "-Djava.net.preferIPv6Addresses=false"
+
+# Patch-apply flags shared by the model and test patch; non-fatal
+# ``git apply --reject`` style so a failed apply still runs the tests.
+_APPLY_FLAGS = "--reject --recount --ignore-space-change --whitespace=nowarn"
+
+# Timing/duration suffixes some test runners append to node names; stripped so
+# the parser output lines up with the (already-normalized) expected node ids.
+_REBENCH_TIMING_NORMALIZE_RES = [
+    re.compile(r"\s*\[\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\]\s*$", re.IGNORECASE),
+    re.compile(r"\s+in\s+\d+(?:\.\d+)?\s+(?:msec|sec)\b", re.IGNORECASE),
+    re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\)\s*$", re.IGNORECASE),
+]
+
+
+def _normalize_test_name(name: str) -> str:
+    """Strip trailing timing annotations from a test node name.
+
+    Args:
+        name (str): The raw test node name, possibly carrying a trailing timing
+            or duration annotation.
+
+    Returns:
+        str: The node name with any timing suffix removed and surrounding
+            whitespace stripped.
+    """
+    for pattern in _REBENCH_TIMING_NORMALIZE_RES:
+        name = pattern.sub("", name)
+    return name.strip()
+
+
+def _load_rebench_log_parsers(rebench_repo_dir: Path):
+    """Dynamically import the cloned SWE-rebench-V2 ``log_parsers`` module.
+
+    Prefers ``lib/agent/log_parsers.py`` and falls back to
+    ``agent/log_parsers.py``, temporarily prepending the repo (and its ``lib``
+    directory) to ``sys.path`` so the module's intra-repo imports resolve.
+
+    Args:
+        rebench_repo_dir (Path): Path to the cloned SWE-rebench-V2 repository.
+
+    Returns:
+        ModuleType: The imported ``log_parsers`` module.
+
+    Raises:
+        FileNotFoundError: If the cloned directory has not been provisioned and
+            no ``log_parsers.py`` can be located.
+    """
+    lp_path = rebench_repo_dir / "lib" / "agent" / "log_parsers.py"
+    if not lp_path.exists():
+        lp_path = rebench_repo_dir / "agent" / "log_parsers.py"
+    if not lp_path.exists():
+        raise FileNotFoundError(
+            f"SWE-rebench-V2 log_parsers not found under {rebench_repo_dir}; "
+            "provision the clone via setup_scripts/swe_rebench.sh"
+        )
+
+    extra_paths = [str(rebench_repo_dir), str(rebench_repo_dir / "lib")]
+    added: list[str] = []
+    for p in extra_paths:
+        if p not in sys.path:
+            sys.path.insert(0, p)
+            added.append(p)
+    try:
+        spec = importlib.util.spec_from_file_location("_rebench_log_parsers", str(lp_path))
+        mod = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(mod)
+        return mod
+    finally:
+        for p in added:
+            try:
+                sys.path.remove(p)
+            except ValueError:
+                pass
+
+
+def _resolve_parser(log_parsers, log_parser_name: str) -> Callable[[str], dict[str, str]] | None:
+    """Resolve a parser callable from the loaded module.
+
+    Looks up the name in the module's ``NAME_TO_PARSER`` mapping first, then
+    falls back to a module-level attribute of the same name.
+
+    Args:
+        log_parsers: The imported ``log_parsers`` module.
+        log_parser_name (str): The name of the parser to resolve.
+
+    Returns:
+        Callable[[str], dict[str, str]] | None: The resolved parser callable, or
+            ``None`` if no parser matches the name.
+    """
+    name_to_parser = getattr(log_parsers, "NAME_TO_PARSER", {}) or {}
+    return name_to_parser.get(log_parser_name) or getattr(log_parsers, log_parser_name, None)
+
+
+def _as_list(value: Any) -> list[str]:
+    """Coerce a test-command/install/list field to a list of strings.
+
+    Accepts the value as a JSON-encoded string, a bare string, or a list. A string
+    that starts with ``[`` or ``{`` is parsed as JSON and coerced recursively; any
+    other string, or one that fails to parse, is wrapped in a single-element list.
+
+    Args:
+        value (Any): The field value to coerce. May be ``None``, a string, a
+            list, a tuple, or any other type.
+
+    Returns:
+        list[str]: The value normalized to a list of strings. An empty list is
+            returned for ``None`` or an empty string.
+    """
+    if value is None:
+        return []
+    if isinstance(value, str):
+        text = value.strip()
+        if not text:
+            return []
+        if text[0] in "[{":
+            try:
+                parsed = json.loads(text)
+            except (ValueError, TypeError):
+                return [value]
+            return _as_list(parsed)
+        return [value]
+    if isinstance(value, (list, tuple)):
+        return [str(v) for v in value]
+    return [str(value)]
+
+
+class SweRebenchHarness(SweTaskHarness):
+    """Flat, host-graded harness for the swe-rebench benchmark family.
+
+    Applies the model and test patches, runs the install/test commands, then
+    parses the test log host-side using a repo-specific parser loaded
+    dynamically from the cloned SWE-rebench-V2 repository.
+    """
+
+    name = "swe-rebench"
+    grade_strategy = "flat-host-grade"
+
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for a swe-rebench task.
+
+        Sets the git and ``_JAVA_OPTIONS`` environment variables, merges any
+        task-provided env, and forwards TTL, readiness timeout, resources, and
+        provider options from the task metadata.
+
+        Args:
+            task (SweTask): The task to build a sandbox specification for.
+
+        Returns:
+            SandboxSpec: The sandbox specification for running the task.
+        """
+        # _JAVA_OPTIONS makes the JVM prefer IPv4 addresses for SWE-rebench tasks.
+        env = {
+            "GIT_CONFIG_GLOBAL": "/dev/null",
+            "GIT_PAGER": "cat",
+            "_JAVA_OPTIONS": _JAVA_OPTIONS,
+        }
+        env.update(task.metadata.get("env") or {})
+        return SandboxSpec(
+            image=task.image,
+            workdir=task.repo_workdir,
+            ttl_s=task.metadata.get("ttl_s", 1800),
+            ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+            env=env,
+            metadata={
+                "instance_id": task.instance_id[:63],
+                "benchmark": task.benchmark,
+                "harness": self.name,
+            },
+            resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+            provider_options=task.metadata.get("provider_options", {}),
+        )
+
+    def supports_provider(self, provider_name: str) -> bool:
+        """Report whether the harness supports a given sandbox provider.
+
+        Being flat and host-graded, it works on any exec-capable provider.
+
+        Args:
+            provider_name (str): The name of the sandbox provider.
+
+        Returns:
+            bool: Always ``True``.
+        """
+        return True  # flat, host-graded: works on any exec-capable provider
+
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Apply patches, run install and test commands, and collect artifacts.
+
+        Applies the model patch then the test patch (both best-effort), runs the
+        non-fatal install commands, then runs the test block with the eval
+        timeout. Records whether the model patch applied for informational
+        purposes only; grading does not gate on it.
+
+        Args:
+            env (AsyncSweEnvironment): The environment used to execute commands
+                inside the sandbox.
+            task (SweTask): The task being evaluated.
+
+        Returns:
+            EvalArtifacts: The captured test output, return code, model-patch
+                application status, and raw error metadata.
+        """
+        workdir = task.repo_workdir
+        install_config = task.metadata.get("install_config", {}) or {}
+        install_cmds = _as_list(install_config.get("install"))
+        test_cmds = _as_list(install_config.get("test_cmd")) or ([task.test_command] if task.test_command else [])
+
+        # Apply the model patch first, then the test patch. Both are best-effort:
+        # a failed apply still runs the tests; model-patch application is recorded
+        # for info only (grading does not gate on it).
+        patch_applied = True
+        if task.model_patch:
+            applied = await env.execute(
+                f"git apply {_APPLY_FLAGS} /root/patch.diff",
+                cwd=workdir,
+            )
+            patch_applied = applied["returncode"] == 0
+        if task.test_patch:
+            await env.execute(f"git apply {_APPLY_FLAGS} /root/test_patch.diff", cwd=workdir)
+
+        # Install commands are non-fatal; failures there should not abort the
+        # test run.
+        for cmd in install_cmds:
+            await env.execute(cmd, cwd=workdir)
+
+        test_block = "\n".join(test_cmds) if test_cmds else "python -m pytest -rA -q"
+        # Thread the eval timeout into the test exec, defaulting to 1800s so a
+        # stuck swe-rebench run is bounded. A row that explicitly carries a
+        # ``tests_timeout`` overrides the default.
+        result = await env.execute(
+            test_block,
+            cwd=workdir,
+            is_eval=True,
+            timeout_s=task.metadata.get("tests_timeout", 1800),
+        )
+        return EvalArtifacts(
+            test_output=result["output"],
+            return_code=result["returncode"],
+            patch_applied=patch_applied,
+            raw={"error_type": result.get("error_type")},
+        )
+
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Grade a swe-rebench task from its evaluation artifacts.
+
+        Masks infra failures (sandbox/timeout) and grading errors (missing clone,
+        unknown parser, parser crash) via ``error_kind`` rather than scoring them.
+        Otherwise parses the test output with the resolved repo-specific parser
+        and marks the task resolved when every FAIL_TO_PASS and PASS_TO_PASS test
+        is in the passed set.
+
+        Args:
+            task (SweTask): The task being graded.
+            artifacts (EvalArtifacts): The artifacts captured during evaluation.
+
+        Returns:
+            SweEvalReport: The grading report, with ``resolved`` set on success
+                or ``error_kind`` set when the sample is masked.
+        """
+        # Infra failure -> mask via error_kind (never scored as "unresolved").
+        if artifacts.raw.get("error_type") in {"sandbox", "timeout"}:
+            return SweEvalReport(
+                instance_id=task.instance_id,
+                patch_exists=bool(task.model_patch),
+                patch_applied=artifacts.patch_applied,
+                error_kind=artifacts.raw["error_type"],
+            )
+
+        install_config = task.metadata.get("install_config", {}) or {}
+        log_parser_name = install_config.get("log_parser", "")
+        # The cloned SWE-rebench-V2 dir is provisioned out-of-band; its absence,
+        # an unknown parser name, or a parser crash all mask the sample via
+        # ``error_kind`` rather than mis-scoring it.
+        rebench_repo_dir = task.metadata.get("rebench_repo_dir")
+        if not rebench_repo_dir:
+            return self._masked(task, artifacts, "eval_error")
+        try:
+            log_parsers = _load_rebench_log_parsers(Path(rebench_repo_dir))
+            parser = _resolve_parser(log_parsers, log_parser_name)
+            if parser is None:
+                return self._masked(task, artifacts, "eval_error")
+            results = parser(artifacts.test_output)
+        except Exception:
+            return self._masked(task, artifacts, "eval_error")
+
+        results = {_normalize_test_name(k): v for k, v in (results or {}).items()}
+        passed_set = {k for k, v in results.items() if v == "PASSED"}
+        fail_to_pass_set = {_normalize_test_name(n) for n in task.fail_to_pass}
+        pass_to_pass_set = {_normalize_test_name(n) for n in task.pass_to_pass}
+
+        # Resolution rule: every FAIL_TO_PASS and PASS_TO_PASS test must be in the
+        # passed set. Resolution is not gated on patch application, and the
+        # F2P/P2P sets are not required to be non-empty (an empty set is a subset
+        # of any set).
+        resolved = (fail_to_pass_set <= passed_set) and (pass_to_pass_set <= passed_set)
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            resolved=resolved,
+            patch_applied=artifacts.patch_applied,
+            patch_exists=bool(task.model_patch),
+            tests_status={"passed": sorted(passed_set), "all": results},
+        )
+
+    @staticmethod
+    def _masked(task: SweTask, artifacts: EvalArtifacts, kind: str) -> SweEvalReport:
+        """Build a masked report that records a grading error instead of a score.
+
+        Args:
+            task (SweTask): The task being graded.
+            artifacts (EvalArtifacts): The artifacts captured during evaluation.
+            kind (str): The error kind to record on the report.
+
+        Returns:
+            SweEvalReport: A report with ``error_kind`` set and no resolution.
+        """
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            patch_applied=artifacts.patch_applied,
+            error_kind=kind,
+        )
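
Reviewer note: the resolution rule in `SweRebenchHarness.grade` above (normalize names, collect the passed set, then require both expected sets to be subsets of it) can be exercised without a sandbox. This is a minimal sketch under stated assumptions: the three regexes are copied from `_REBENCH_TIMING_NORMALIZE_RES` in this diff, while `normalize`, `is_resolved`, and the sample `parsed` mapping are hypothetical stand-ins for the harness method and for real parser output.

```python
import re

# Timing-suffix patterns copied from _REBENCH_TIMING_NORMALIZE_RES in the diff.
_TIMING_RES = [
    re.compile(r"\s*\[\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\]\s*$", re.IGNORECASE),
    re.compile(r"\s+in\s+\d+(?:\.\d+)?\s+(?:msec|sec)\b", re.IGNORECASE),
    re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\)\s*$", re.IGNORECASE),
]


def normalize(name: str) -> str:
    """Strip timing annotations so parser names match the expected node ids."""
    for pattern in _TIMING_RES:
        name = pattern.sub("", name)
    return name.strip()


def is_resolved(parsed: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Apply the subset rule: every expected test must be in the passed set."""
    results = {normalize(k): v for k, v in parsed.items()}
    passed = {k for k, v in results.items() if v == "PASSED"}
    f2p = {normalize(n) for n in fail_to_pass}
    p2p = {normalize(n) for n in pass_to_pass}
    # An empty expected set is a subset of any set, so it never blocks resolution.
    return f2p <= passed and p2p <= passed


# Hypothetical parser output; real output comes from the SWE-rebench-V2 parsers.
parsed = {
    "tests/test_io.py::test_read [12 ms]": "PASSED",
    "tests/test_io.py::test_write (0.3s)": "PASSED",
    "tests/test_io.py::test_seek": "FAILED",
}

if __name__ == "__main__":
    print(is_resolved(parsed, ["tests/test_io.py::test_read"], ["tests/test_io.py::test_write"]))
    print(is_resolved(parsed, ["tests/test_io.py::test_seek"], []))
```

One consequence worth noting in review: because an empty set is a subset of any set, a task with empty FAIL_TO_PASS and PASS_TO_PASS lists is reported as resolved whatever the test log contains.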
diff --git a/resources_servers/swe_bench/harnesses/swebench.py b/resources_servers/swe_bench/harnesses/swebench.py
new file mode 100644
index 0000000000..563c6ae614
--- /dev/null
+++ b/resources_servers/swe_bench/harnesses/swebench.py
@@ -0,0 +1,274 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""swe-bench / swe-bench-multilingual harness — host-side (flat) grading.
+
+A single parametrized class serves both families. It runs the instance's official SWE-bench
+eval script (``swebench.make_test_spec(...).eval_script``) inside the sandbox and grades the
+produced log host-side with swebench's per-repo log parser, so it runs on any exec-capable
+provider (docker / opensandbox).
+
+NOTE: the apptainer-only nested ``run_local_evaluation`` path was removed when PR #1694 took
+ownership of the apptainer provider. The swe_env-specific nested-apptainer grading (mounts/.sif
+wiring + run_local_evaluation) is tracked for a follow-up PR (see APPTAINER_PR3_TRACKER.md).
+"""
+
+from __future__ import annotations
+
+import dataclasses
+import os
+import tempfile
+from typing import TYPE_CHECKING
+
+from nemo_gym.sandbox import SandboxResources, SandboxSpec
+from resources_servers.swe_bench.harness import (
+    EvalArtifacts,
+    GraderDependencyError,
+    SweEvalReport,
+    SweTask,
+    SweTaskHarness,
+    _ensure_trailing_newline,
+    compute_resolved,
+)
+from resources_servers.swe_bench.harnesses import flat_eval
+
+
+if TYPE_CHECKING:
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+# Per-test status tokens swebench's repo parsers emit that count as a pass.
+_SWEBENCH_PASS_STATUSES = frozenset({"PASSED", "XFAIL"})
+
+# swe-bench families this harness serves.
+_VALID_NAMES = frozenset({"swe-bench", "swe-bench-multilingual"})
+
+
+class SweBenchHarness(SweTaskHarness):
+    """SWE-bench (and multilingual) harness, host-side (flat) graded.
+
+    Runs the instance's official eval script in the sandbox and parses the log host-side with
+    swebench's per-repo parser. Construct one instance per family
+    (``SweBenchHarness("swe-bench")`` / ``SweBenchHarness("swe-bench-multilingual")``).
+    """
+
+    grade_strategy = "flat-host-grade"
+
+    def __init__(self, name: str = "swe-bench") -> None:
+        """Initialize the harness for a given swe-bench family.
+
+        Args:
+            name: The swe-bench family to serve (``"swe-bench"`` or ``"swe-bench-multilingual"``).
+
+        Raises:
+            ValueError: If ``name`` is not a known swe-bench family.
+        """
+        if name not in _VALID_NAMES:
+            raise ValueError(f"Unknown swe-bench family: {name!r} (expected one of {sorted(_VALID_NAMES)})")
+        self.name = name
+
+    # --- provisioning --------------------------------------------------------
+
+    def build_spec(self, task: SweTask) -> SandboxSpec:
+        """Build the sandbox spec for a task.
+
+        Args:
+            task: The task to provision a sandbox for.
+
+        Returns:
+            A ``SandboxSpec`` describing the image, workdir, environment, and any provider
+            options carried on the task. Flat grading runs the eval script directly in the
+            instance image, so no host harness/venv mounts are needed.
+        """
+        return SandboxSpec(
+            image=task.image,
+            workdir=task.repo_workdir,
+            ttl_s=task.metadata.get("ttl_s", 1800),
+            ready_timeout_s=task.metadata.get("ready_timeout_s", 600),
+            env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"},
+            metadata={
+                "instance_id": task.instance_id[:63],
+                "benchmark": task.benchmark,
+                "harness": self.name,
+            },
+            resources=SandboxResources.from_mapping(task.metadata.get("resources", {})),
+            provider_options=dict(task.metadata.get("provider_options", {})),
+        )
+
+    async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None:
+        """Write the bare ``/root/patch.diff`` the eval script applies.
+
+        Args:
+            env: The environment used to write files into the sandbox.
+            task: The task whose model patch is staged for the eval script (newline-normalized
+                so the upstream ``git apply`` succeeds).
+        """
+        if task.model_patch:
+            await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch))
+
+    def _flat_eval_script(self, task: SweTask) -> str:
+        """Build the official SWE-bench eval script for host-side (flat) grading.
+
+        Uses the ``swebench`` library's ``make_test_spec(...).eval_script`` (the per-repo recipe),
+        prefixed with a step that applies the model patch from ``/root/patch.diff``. Returns an
+        empty string if the instance dict is unavailable or the spec cannot be built, in which
+        case the flat grader masks the sample as an eval error rather than scoring 0.
+
+        Args:
+            task: The task whose ``metadata['instance_dict']`` describes the SWE-bench instance.
+
+        Returns:
+            The eval-script text, or ``""`` when it cannot be constructed.
+        """
+        instance = task.metadata.get("instance_dict")
+        if not instance:
+            return ""
+        try:
+            from swebench.harness.test_spec.test_spec import make_test_spec
+
+            spec = make_test_spec(instance, namespace="swebench")
+        except Exception:
+            return ""
+        # Mirror main's GIT_APPLY ladder (swebench/harness/run_evaluation.py GIT_APPLY_CMDS):
+        # try each apply command in order, breaking on the first rc==0, and never write
+        # conflict markers into the tree (no --3way). The trailing `echo` only fires when
+        # every command failed.
+        apply_model = (
+            "cd /testbed && "
+            "(git apply --verbose /root/patch.diff || "
+            "git apply --verbose --reject /root/patch.diff || "
+            "patch --batch --fuzz=5 -p1 -i /root/patch.diff || "
+            "echo 'NEMO_GYM_PATCH_APPLY_FAILED')\n"
+        )
+        return apply_model + spec.eval_script
+
+    # --- server-private grading ----------------------------------------------
+
+    async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts:
+        """Run the instance's eval script in-sandbox and collect its log.
+
+        Args:
+            env: The environment used to execute commands in the sandbox.
+            task: The task to evaluate.
+
+        Returns:
+            An ``EvalArtifacts`` carrying the captured test output, return code, whether a patch
+            existed, and the flat-eval markers.
+        """
+        if not task.metadata.get("eval_script"):
+            task = dataclasses.replace(task, metadata={**task.metadata, "eval_script": self._flat_eval_script(task)})
+        return await flat_eval.flat_run_eval(env, task)
+
+    def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport:
+        """Grade a task from its evaluation artifacts (host-side, flat).
+
+        The SWE-bench family spans repos with different test runners (pytest, django's unittest
+        runner, etc.). The generic flat parser is pytest-only and silently scores non-pytest
+        repos (e.g. django) as unresolved, even for the gold patch. Grade with swebench's official
+        per-repo log parser; if ``swebench`` cannot be imported for a real SWE-bench instance
+        this raises ``GraderDependencyError`` (fail loud) rather than silently mis-scoring. The
+        generic parser is used only for the legitimate cases where there is no instance dict or
+        the eval spec cannot be built (matching main's behavior for unbuildable instances).
+
+        Args:
+            task: The task being graded.
+            artifacts: The evaluation artifacts produced by ``run_eval``.
+
+        Returns:
+            A ``SweEvalReport`` recording resolution, patch state, and any error kind.
+
+        Raises:
+            GraderDependencyError: If ``swebench`` is unavailable for a real SWE-bench instance.
+        """
+        report = self._swebench_flat_grade(task, artifacts)
+        return report if report is not None else flat_eval.flat_grade(task, artifacts)
+
+    def _swebench_flat_grade(self, task: SweTask, artifacts: EvalArtifacts) -> "SweEvalReport | None":
+        """Grade a flat eval log with swebench's official per-repo log parser.
+
+        The generic :func:`flat_eval.flat_grade` parser only recognises pytest-style
+        ``PASSED <node_id>`` lines, so repos with other test runners (e.g. django's unittest
+        runner) parse as zero passing tests and grade as unresolved, even for the gold patch.
+        This path uses ``swebench.harness.grading.get_logs_eval`` (the same per-repo parser the
+        nested harness uses), keeping docker flat grading faithful to the official result.
+
+        Args:
+            task: The task being graded (supplies the instance dict + fail/pass test ids).
+            artifacts: The artifacts produced by :func:`flat_eval.flat_run_eval`.
+
+        Returns:
+            A ``SweEvalReport`` with the official verdict, or ``None`` when there is no instance
+            dict or the eval spec cannot be built (caller falls back to the generic parser).
+
+        Raises:
+            GraderDependencyError: If ``swebench`` cannot be imported for a real SWE-bench
+                instance (fail loud rather than silently degrading to the generic parser).
+        """
+        # Mirror flat_grade's infra masks so a genuine sandbox/timeout never scores 0. An
+        # unbuildable/empty eval spec is NOT masked here (it grades unmasked unresolved via
+        # the generic parser fallback below), matching main's behavior.
+        error_type = artifacts.raw.get("error_type")
+        if error_type in {"sandbox", "timeout"}:
+            return SweEvalReport(
+                instance_id=task.instance_id,
+                patch_exists=bool(task.model_patch),
+                patch_applied=artifacts.patch_applied,
+                error_kind=error_type,
+            )
+        instance = task.metadata.get("instance_dict")
+        if not instance:
+            return None
+        try:
+            from swebench.harness.constants import FAIL_ONLY_REPOS
+            from swebench.harness.grading import get_logs_eval
+            from swebench.harness.test_spec.test_spec import make_test_spec
+        except Exception as exc:
+            # Fail loud instead of degrading to the generic pytest-only parser, which mis-scores
+            # non-pytest repos (e.g. django) as unresolved even for a correct patch. swebench is a
+            # pinned hard dependency (requirements.txt: swebench==4.1.0); a missing/broken install
+            # is a misconfiguration that must surface, not silently skew the SWE-bench resolve rate.
+            raise GraderDependencyError(
+                "swebench is required to grade SWE-bench instances faithfully (per-repo log "
+                "parsers) but could not be imported; install the pinned 'swebench==4.1.0'."
+            ) from exc
+        log_fp = None
+        try:
+            spec = make_test_spec(instance, namespace="swebench")
+            with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as handle:
+                handle.write(artifacts.test_output or "")
+                log_fp = handle.name
+            status_map, markers_found = get_logs_eval(spec, log_fp)
+        except Exception:
+            return None
+        finally:
+            if log_fp is not None and os.path.exists(log_fp):
+                os.unlink(log_fp)
+        passed = [node for node, status in status_map.items() if status in _SWEBENCH_PASS_STATUSES]
+        # Select the eval type per-repo exactly as swebench.harness.grading.get_eval_report:
+        # FAIL_ONLY_REPOS (the JS multilingual repos) use the fail-only resolution rule.
+        eval_type = "fail_only" if spec.repo in FAIL_ONLY_REPOS else "pass_and_fail"
+        resolved = bool(markers_found) and compute_resolved(
+            fail_to_pass=task.fail_to_pass,
+            pass_to_pass=task.pass_to_pass,
+            passed=passed,
+            eval_type=eval_type,
+            status_map=status_map,
+        )
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            resolved=resolved,
+            patch_applied=bool(markers_found),
+            patch_exists=bool(task.model_patch),
+            tests_status={"passed": passed, "all": status_map},
+        )
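
The resolution step at the end of `_swebench_flat_grade` can be sketched as follows. This is a minimal illustration under stated assumptions, not the `compute_resolved` implementation, which this diff does not show. The name `sketch_resolved`, the `PASS_STATUSES` set, and the reading of `fail_only` as "only an explicit FAILED status blocks resolution" are all hypothetical. The `markers_found` gate from the patch is omitted.

```python
from typing import Dict, Iterable

# Assumption: statuses treated as passing. The real _SWEBENCH_PASS_STATUSES is not shown.
PASS_STATUSES = {"PASSED", "XFAIL"}


def sketch_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    status_map: Dict[str, str],
    eval_type: str = "pass_and_fail",
) -> bool:
    """Hypothetical resolution rule: every required test must count as passing."""

    def counts_as_pass(test_id: str) -> bool:
        status = status_map.get(test_id)
        if eval_type == "fail_only":
            # Assumed fail-only rule: only an explicit FAILED status blocks resolution.
            return status != "FAILED"
        # Assumed pass-and-fail rule: the test must appear with a passing status.
        return status in PASS_STATUSES

    required = [*fail_to_pass, *pass_to_pass]
    return all(counts_as_pass(test_id) for test_id in required)


status_map = {"test_a": "PASSED", "test_b": "FAILED"}
print(sketch_resolved(["test_a"], [], status_map))  # True
print(sketch_resolved(["test_a"], ["test_b"], status_map))  # False
print(sketch_resolved(["test_a"], ["test_missing"], status_map, eval_type="fail_only"))  # True
```

Under these assumptions, a test missing from the log blocks resolution in `pass_and_fail` mode but not in `fail_only` mode, which is why the per-repo choice of `eval_type` matters.
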
diff --git a/resources_servers/swe_bench/parsing/__init__.py b/resources_servers/swe_bench/parsing/__init__.py
new file mode 100644
index 0000000000..a9de18198d
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/__init__.py
@@ -0,0 +1,52 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""SWE-Bench-Ext test-output parser.
+
+Provides the per-framework parsers, framework output config, and the
+resolution helper used by SWE harnesses for host-side grading. This
+``__init__`` re-exports the public symbols so callers can import them from a
+single location, e.g.::
+
+    from resources_servers.swe_bench.parsing import (
+        parse_and_check_tests,
+        get_framework_config,
+        get_test_command_with_output,
+    )
+"""
+
+from resources_servers.swe_bench.parsing.frameworks import (
+    FRAMEWORK_CONFIGS,
+    get_framework_config,
+    get_test_command_with_output,
+)
+from resources_servers.swe_bench.parsing.parsing import (
+    normalize_test_id,
+    parse_test_output,
+)
+from resources_servers.swe_bench.parsing.utils import parse_and_check_tests
+
+
+__all__ = [
+    # High-level grading entry point (F2P/P2P resolution).
+    "parse_and_check_tests",
+    # Framework output config + command augmentation.
+    "FRAMEWORK_CONFIGS",
+    "get_framework_config",
+    "get_test_command_with_output",
+    # Framework dispatcher + test-id normalization.
+    "parse_test_output",
+    "normalize_test_id",
+]
diff --git a/resources_servers/swe_bench/parsing/frameworks.py b/resources_servers/swe_bench/parsing/frameworks.py
new file mode 100644
index 0000000000..7de570c491
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/frameworks.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Test framework output configuration mapping."""
+
+from typing import Dict
+
+
+FRAMEWORK_CONFIGS: Dict[str, Dict] = {
+    "pytest": {
+        "output_flag": "--junitxml=/workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "unittest": {
+        "output_flag": "--junitxml=/workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "go": {
+        "output_flag": "-json",
+        "result_file": None,
+    },
+    "jest": {
+        "output_flag": "--json --outputFile=/workspace/test-results/output.json",
+        "result_file": "/workspace/test-results/output.json",
+    },
+    "vitest": {
+        "output_flag": "--reporter=json --outputFile=/workspace/test-results/output.json",
+        "result_file": "/workspace/test-results/output.json",
+    },
+    "mocha": {
+        "output_flag": "--reporter json --reporter-options output=/workspace/test-results/output.json",
+        "result_file": "/workspace/test-results/output.json",
+    },
+    "bun": {
+        "output_flag": None,  # Bun doesn't have structured JSON output flag by default
+        "result_file": None,  # Parse from stdout
+    },
+    "junit": {
+        "output_flag": None,
+        "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml",
+    },
+    "maven": {
+        "output_flag": None,
+        "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml",
+    },
+    "gtest": {
+        "output_flag": "--gtest_output=json:/workspace/test-results/output.json",
+        "result_file": "/workspace/test-results/output.json",
+    },
+    "cargo-nextest": {
+        "output_flag": None,  # Profile is already in test_command
+        "result_file": None,  # JUnit XML is output to repo/junit.xml by profile config
+    },
+    "ctest": {
+        "output_flag": "--output-on-failure --output-junit /workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "xctest": {
+        # For SwiftPM with XCTest framework
+        "output_flag": "--parallel --num-workers=1 --xunit-output /workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "testing": {
+        # For SwiftPM with new Swift Testing framework (Swift 6+)
+        "output_flag": "--disable-xctest --parallel --xunit-output /workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "cppunit": {
+        "output_flag": None,
+        "result_file": None,
+    },
+    # Lua test frameworks - Tier 1 (Standard XML output)
+    "busted": {
+        "output_flag": "--output=junit",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    "luaunit": {
+        "output_flag": "-o junit -n /workspace/test-results/output.xml",
+        "result_file": "/workspace/test-results/output.xml",
+    },
+    # Lua test frameworks - Tier 2 (Custom parsers)
+    "telescope": {
+        "output_flag": None,
+        "result_file": None,
+    },
+    "lust": {
+        "output_flag": None,
+        "result_file": None,
+    },
+    "minitest": {
+        "output_flag": None,
+        "result_file": None,
+    },
+    "bespoke_libgeos": {
+        "output_flag": None,
+        "result_file": None,
+    },
+    # TAP (Test Anything Protocol) - used by tape, node-tap
+    "tap": {
+        "output_flag": None,  # TAP outputs to stdout
+        "result_file": None,  # Parse from stdout
+    },
+    "tape": {
+        "output_flag": None,  # tape outputs TAP to stdout
+        "result_file": None,  # Parse from stdout
+    },
+    # Hardhat (Solidity) - uses Mocha under the hood
+    "hardhat": {
+        "output_flag": None,  # Uses Mocha console reporter by default
+        "result_file": None,  # Parse from stdout
+    },
+}
+
+
+def get_framework_config(framework: str, test_command: str = "") -> Dict:
+    """Get configuration for a test framework.
+
+    Args:
+        framework: Test framework name
+        test_command: The test command (optional; detects Gradle vs Maven and Swift Testing vs XCTest)
+    """
+    config = FRAMEWORK_CONFIGS.get(
+        framework,
+        {
+            "output_flag": None,
+            "result_file": None,
+        },
+    )
+
+    # Special handling for JUnit: detect Gradle vs Maven from command
+    if framework == "junit" and test_command:
+        if "gradlew" in test_command or "gradle " in test_command:
+            # Gradle uses different output location than Maven
+            # TEST-*.xml under build/test-results* covers standard Gradle (test/) and Android (testDebugUnitTest/)
+            config = {
+                "output_flag": None,
+                "result_file": "find:/workspace/repo:*/build/test-results*:TEST-*.xml",
+            }
+
+    # Special handling for xctest: detect Swift Testing vs XCTest from command
+    # When --disable-xctest is used, the task is using Swift Testing, not XCTest
+    # Use the 'testing' framework config to avoid adding XCTest-only flags like --num-workers
+    if framework == "xctest" and test_command:
+        if "--disable-xctest" in test_command:
+            config = FRAMEWORK_CONFIGS.get("testing", config)
+
+    return config
+
+
+def get_test_command_with_output(base_command: str, framework: str) -> str:
+    """Append the framework's structured-output flag to a test command.
+
+    Returns:
+        The command with the flag appended, or ``base_command`` unchanged if the framework has none.
+    """
+    config = get_framework_config(framework, base_command)
+    output_flag = config.get("output_flag")
+
+    enhanced = f"{base_command} {output_flag}" if output_flag else base_command
+
+    return enhanced
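
The command augmentation in `get_test_command_with_output` can be illustrated with a self-contained sketch. It copies three entries from `FRAMEWORK_CONFIGS` into a hypothetical `SKETCH_CONFIGS` table and uses a hypothetical `sketch_command_with_output` helper. The Gradle and Swift Testing special cases are left out, so this is a simplified view rather than the module's full behavior.

```python
from typing import Dict, Optional

# Trimmed copy of three FRAMEWORK_CONFIGS entries, for illustration only.
SKETCH_CONFIGS: Dict[str, Dict[str, Optional[str]]] = {
    "pytest": {
        "output_flag": "--junitxml=/workspace/test-results/output.xml",
        "result_file": "/workspace/test-results/output.xml",
    },
    "go": {"output_flag": "-json", "result_file": None},
    "bun": {"output_flag": None, "result_file": None},
}

# Fallback for frameworks that are not in the table: no flag, parse stdout.
NO_STRUCTURED_OUTPUT: Dict[str, Optional[str]] = {"output_flag": None, "result_file": None}


def sketch_command_with_output(base_command: str, framework: str) -> str:
    """Append the framework's output flag, or return the command unchanged."""
    config = SKETCH_CONFIGS.get(framework, NO_STRUCTURED_OUTPUT)
    output_flag = config["output_flag"]
    return f"{base_command} {output_flag}" if output_flag else base_command


print(sketch_command_with_output("go test ./...", "go"))  # go test ./... -json
print(sketch_command_with_output("bun test", "bun"))  # bun test
```

Frameworks with a `None` flag, and unknown frameworks, keep their command as-is, so their results have to be parsed from stdout or from a file the test command already writes.
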
diff --git a/resources_servers/swe_bench/parsing/parsing.py b/resources_servers/swe_bench/parsing/parsing.py
new file mode 100644
index 0000000000..800586adad
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/parsing.py
@@ -0,0 +1,1606 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Test parsing utilities for build.py.
+
+Helper functions for:
+- Separating test and gold patches
+- Parsing JUnit XML and JSON test outputs
+"""
+
+import json
+import re
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Dict, Optional, Tuple
+
+
+def read_patch(path: Path, skip_binary: bool = False) -> str:
+    """
+    Read the text content of a patch file, optionally skipping binary files.
+    """
+    parts = split_patch(path, skip_binary=skip_binary)
+    return "".join([diff for _, diff in parts])
+
+
+def split_patch(patch_path: Path, skip_binary: bool = False) -> list[Tuple[str, str]]:
+    """
+    Read a patch and partition by file.
+
+    Args:
+        patch_path (Path) - The patch file to split
+        skip_binary (bool) - Whether to exclude binary files
+
+    Returns: List of (filename, patch content) tuples
+    """
+    content = patch_path.read_text()
+    parts = []
+
+    # Split by file changes (each starts with "diff --git")
+    file_diffs = re.split(r"(diff --git.*?)(?=diff --git|\Z)", content, flags=re.DOTALL)
+
+    for i in range(0, len(file_diffs), 2):
+        if i + 1 >= len(file_diffs):
+            continue
+
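+        leading_text = file_diffs[i]  # text before this "diff --git" (usually empty)
+        diff_body = file_diffs[i + 1]  # the per-file diff captured by the split group
+        full_diff = leading_text + diff_body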
+
+        # Extract filename from diff header
+        file_match = re.search(r"diff --git a/(.*?) b/", full_diff)
+        if not file_match:
+            continue
+
+        filepath = file_match.group(1)
+
+        if skip_binary:
+            binary_match = re.search(r"^GIT binary patch$", full_diff, flags=re.MULTILINE)
+            if binary_match:
+                continue
+
+        parts.append((filepath, full_diff))
+
+    return parts
+
+
+def _parse_embedded_test_results(text_output: str, test_prefix: str = "") -> Dict[str, str]:
+    """Parse embedded test results from system-out text.
+
+    This handles cases like wolfssl where a single ctest testcase runs many individual tests
+    and outputs them in a specific format within <system-out>.
+
+    Expected formats:
+    - "     1: test_name                                    : passed (  0.00016)"
+    - "     2: test_name                                    : failed (  0.00016)"
+    - "     3: test_name                                    : skipped"
+    - "HMAC-MD5 test passed!"
+    - "RSA      test failed!"
+
+    Args:
+        text_output: The text content from <system-out>
+        test_prefix: Prefix to add to test names (usually the testcase name)
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+    """
+    results = {}
+
+    # Pattern 1: Numbered test format (wolfssl API tests)
+    # Format: "     1: test_name                                    : passed (  0.00016)"
+    numbered_pattern = re.compile(
+        r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*?)\s*:\s*(passed|failed|skipped)", re.MULTILINE | re.IGNORECASE
+    )
+
+    for match in numbered_pattern.finditer(text_output):
+        test_name = match.group(1).strip()
+        status = match.group(2).lower()
+
+        # Build test ID with prefix
+        if test_prefix:
+            test_id = f"{test_prefix}::{test_name}"
+        else:
+            test_id = test_name
+
+        if status == "passed":
+            results[test_id] = "PASSED"
+        elif status == "failed":
+            results[test_id] = "FAILED"
+        elif status == "skipped":
+            results[test_id] = "SKIPPED"
+
+    # Pattern 2: Unit test format (wolfssl unit tests)
+    # Format: "HMAC-MD5 test passed!"
+    # Only match lines that don't contain '---' (separator lines)
+    # Use [ \t] instead of \s to avoid matching newlines
+    unit_pattern = re.compile(
+        r"^([A-Za-z0-9_\-/]+(?:[ \t]+[A-Za-z0-9_\-/]+){0,5}?)[ \t]+test[ \t]+(passed|failed)!",
+        re.MULTILINE | re.IGNORECASE,
+    )
+
+    for match in unit_pattern.finditer(text_output):
+        test_name = match.group(1).strip()
+        status = match.group(2).lower()
+
+        # Skip if the test name contains special characters indicating it's not a real test
+        if "---" in test_name or len(test_name) > 50:
+            continue
+
+        # Build test ID with prefix
+        if test_prefix:
+            test_id = f"{test_prefix}::{test_name}"
+        else:
+            test_id = test_name
+
+        if status == "passed":
+            results[test_id] = "PASSED"
+        elif status == "failed":
+            results[test_id] = "FAILED"
+
+    # Pattern 3: FAILURES section (wolfssl API tests)
+    # Format: "FAILURES:\n   892: test_wolfSSL_CTX_load_verify_locations"
+    failures_section = re.search(r"FAILURES:\s*\n(.*?)(?:\n\s*End|$)", text_output, re.DOTALL)
+    if failures_section:
+        failure_pattern = re.compile(r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*)", re.MULTILINE)
+        for match in failure_pattern.finditer(failures_section.group(1)):
+            test_name = match.group(1).strip()
+            if test_prefix:
+                test_id = f"{test_prefix}::{test_name}"
+            else:
+                test_id = test_name
+            # Mark as failed (this overrides any previous 'passed' if it exists)
+            results[test_id] = "FAILED"
+
+    return results
+
+
+def parse_junit_xml(xml_content: str) -> Optional[Dict[str, str]]:
+    """Parse JUnit XML to extract test results.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (import errors, syntax errors, etc.), if we find valid
+    XML test results, we parse and return them. We only return None if we're certain
+    the framework didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If test framework failed to run (not the same as tests failing)
+    """
+    results = {}
+    found_any_xml = False
+
+    # PRIORITY 1 & 2: Try to parse XML documents (pure or mixed with other output)
+    # Handle multiple concatenated XML documents (from multiple test result files)
+    # Split by <?xml declaration to handle each document separately
+    xml_docs = xml_content.split("<?xml")
+
+    for i, doc in enumerate(xml_docs):
+        if i == 0 and not doc.strip():
+            continue  # Skip empty first split
+
+        # Re-add the <?xml declaration (except for first empty split)
+        if i > 0:
+            doc = "<?xml" + doc
+
+        if not doc.strip():
+            continue
+
+        # Try parsing this document
+        try:
+            tree = ET.fromstring(doc.strip())
+            found_any_xml = True
+        except ET.ParseError:
+            # If parsing fails, try extracting just the testsuite/testsuites portion
+            xml_start = doc.find("<testsuite")
+            if xml_start == -1:
+                xml_start = doc.find("<testsuites")
+            if xml_start == -1:
+                continue
+
+            xml_end = doc.find("</testsuites>", xml_start)
+            if xml_end > xml_start:
+                xml_extracted = doc[xml_start : xml_end + len("</testsuites>")]
+            else:
+                xml_end = doc.find("</testsuite>", xml_start)
+                if xml_end > xml_start:
+                    xml_extracted = doc[xml_start : xml_end + len("</testsuite>")]
+                else:
+                    continue
+
+            try:
+                tree = ET.fromstring(xml_extracted)
+                found_any_xml = True
+            except ET.ParseError:
+                continue
+
+        # Parse all testcases from this document
+        for testcase in tree.iter("testcase"):
+            classname = testcase.get("classname", "")
+            name = testcase.get("name", "")
+            test_id = f"{classname}::{name}" if classname else name
+
+            # Check if this testcase has system-out with embedded test results
+            # This handles cases like wolfssl where a single ctest executable runs many tests
+            system_out = testcase.find("system-out")
+            embedded_results = {}
+            if system_out is not None and system_out.text:
+                embedded_results = _parse_embedded_test_results(system_out.text, classname or name)
+
+            if embedded_results:
+                # If we found embedded test results, use those instead of the testcase status
+                results.update(embedded_results)
+            elif testcase.find("failure") is not None or testcase.find("error") is not None:
+                results[test_id] = "FAILED"
+            elif testcase.find("skipped") is not None:
+                results[test_id] = "SKIPPED"
+            else:
+                results[test_id] = "PASSED"
+
+    # PRIORITY 3: If we found NO valid XML and NO results, check for error indicators
+    # Only return None if we're certain the framework failed to run
+    if not found_any_xml and not results:
+        error_indicators = [
+            "ERROR: ",  # Generic error marker
+            "ImportError:",  # Python import errors
+            "ModuleNotFoundError:",  # Python module errors
+            "SyntaxError:",  # Python syntax errors
+            "FAILED ",  # Framework failure markers
+            "INTERNALERROR",  # pytest internal errors
+            "collection errors",  # pytest collection errors
+            "error: ",  # Generic error (C++, Swift, etc.)
+            "fatal error:",  # Fatal compilation errors
+            "cannot find symbol",  # Java compilation errors
+            "error: build had",  # Swift build errors (xctest)
+            "error: terminated",  # Swift process crashes (xctest)
+        ]
+        has_errors = any(indicator in xml_content for indicator in error_indicators)
+        # Return None ONLY if: no XML found AND errors present
+        # Return empty dict if: no XML found AND no errors (rare but valid)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_go_json(json_output: str) -> Optional[Dict[str, str]]:
+    """Parse Go test -json output (newline-delimited JSON).
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (module errors, build errors, etc.), if we find valid
+    test results JSON, we parse and return it. We only return None if we're certain
+    the tests didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If Go tests failed to run (not the same as tests failing)
+    """
+    results = {}
+    has_valid_json = False
+
+    # PRIORITY 1: Try to parse newline-delimited JSON (valid test output)
+    for line in json_output.strip().split("\n"):
+        if not line.strip():
+            continue
+        try:
+            event = json.loads(line)
+            has_valid_json = True  # Found at least one valid JSON line
+            action = event.get("Action")
+
+            # Handle test-level events
+            if "Test" in event and action in ["pass", "fail", "skip"]:
+                test_name = event.get("Test", "")
+                if test_name:
+                    package = event.get("Package", "")
+                    test_id = f"{package}::{test_name}" if package else test_name
+
+                    if action == "pass":
+                        results[test_id] = "PASSED"
+                    elif action == "fail":
+                        results[test_id] = "FAILED"
+                    elif action == "skip":
+                        results[test_id] = "SKIPPED"
+
+            # Handle package-level failures (no Test field)
+            elif "Package" in event and "Test" not in event and action == "fail":
+                package = event.get("Package", "")
+                test_id = f"{package}::package"
+                results[test_id] = "FAILED"
+
+        except json.JSONDecodeError:
+            # PRIORITY 2: Handle plaintext build failures (legitimate failures)
+            # When tests can't compile/build, Go outputs plaintext "FAIL package [build failed]"
+            # This is a legitimate test failure, not a parsing error
+            build_fail_match = re.match(r"^FAIL\s+(\S+)\s+\[build failed\]", line)
+            if build_fail_match:
+                package_name = build_fail_match.group(1)
+                results[package_name] = "FAILED"
+                has_valid_json = True  # Count build failures as valid results
+
+    # PRIORITY 3: If we found NO valid JSON and NO build failures, check for error indicators
+    if not has_valid_json and not results:
+        error_indicators = [
+            "go: cannot find main module",  # Module not found
+            "can't load package",  # Package loading errors
+            "pattern matches no packages",  # No matching packages
+            "build constraints exclude all Go files",  # Build constraints error
+        ]
+        has_errors = any(indicator in json_output for indicator in error_indicators)
+        # Return None ONLY if: no JSON found AND errors present
+        # Return empty dict if: no JSON found AND no errors (rare but valid)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_jest_vitest_json(json_output: str) -> Optional[Dict[str, str]]:
+    """Parse Jest/Vitest JSON output.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (TypeScript, npm, etc.), if we find valid
+    test results JSON, we parse and return it. We only return None if we're
+    certain the framework didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If Jest or Vitest itself failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+    try:
+        data = json.loads(json_output.strip())
+        # If we got JSON, check if it has test results (even if errors exist elsewhere in output)
+    except json.JSONDecodeError:
+        # PRIORITY 2: Search for JSON markers in mixed output
+        # Even with errors in output, tests might have run and produced JSON
+        json_start = json_output.find('{"numFailed')  # Jest format
+        if json_start == -1:
+            json_start = json_output.find('{"numTotalTest')  # Vitest format
+        if json_start == -1:
+            json_start = json_output.find('{"test')  # Alternative format
+        if json_start == -1:
+            # PRIORITY 3: No JSON found - NOW check if there are error indicators
+            # Only return None if we're sure tests didn't run (no results + errors present)
+            # NOTE: error_indicators are a LAST RESORT - we prefer finding test results
+            error_indicators = [
+                "error TS",  # TypeScript compilation errors (e.g., error TS2307:)
+                "ELIFECYCLE",  # npm script failures
+                "npm ERR!",  # npm errors
+                "Error: Cannot find module",  # Module loading errors (like Mocha)
+                "SyntaxError:",  # JavaScript/TypeScript syntax errors
+                "Test suite failed to run",  # Jest-specific: tests couldn't be loaded
+                "FAIL ",  # Jest failure marker without JSON
+            ]
+            has_errors = any(indicator in json_output for indicator in error_indicators)
+            # Return None ONLY if: no JSON found AND errors present
+            # Return empty dict if: no JSON found AND no errors (rare but valid)
+            return None if has_errors else results
+
+        # Try to extract JSON from mixed output
+        decoder = json.JSONDecoder()
+        try:
+            data, _ = decoder.raw_decode(json_output[json_start:])
+        except json.JSONDecodeError:
+            # Could not parse JSON even after finding marker
+            return None
+
+    # At this point, we have successfully parsed JSON
+    # Check if this is Jest's error response format (Jest itself failed, not the tests)
+    # Format: {"error": {"code": 2, "summary": "", "detail": ""}}
+    # This is a structured error response, NOT test results
+    if "error" in data and "code" in data.get("error", {}):
+        # This is an error response from Jest itself, not test results
+        return None
+
+    # Parse testResults when present, even if tests failed: failures are legitimate results
+    if "testResults" in data:
+        for test_result in data.get("testResults", []):
+            file_path = test_result.get("name", "")
+            suite_status = test_result.get("status", "")
+            assertions = test_result.get("assertionResults", [])
+
+            # Handle suite-level failures (no assertions ran)
+            if suite_status == "failed" and len(assertions) == 0:
+                test_id = f"{file_path}::suite"
+                results[test_id] = "FAILED"
+                continue
+
+            # Handle individual test assertions
+            for assertion in assertions:
+                full_name = assertion.get("fullName", "")
+                title = assertion.get("title", "")
+                status = assertion.get("status", "")
+                test_id = f"{file_path}::{full_name}" if full_name else f"{file_path}::{title}"
+
+                if status == "passed":
+                    results[test_id] = "PASSED"
+                elif status == "failed":
+                    results[test_id] = "FAILED"
+                elif status in ["pending", "skipped"]:
+                    results[test_id] = "SKIPPED"
+
+    # If we successfully parsed JSON but found no testResults, that's unexpected
+    # Return None to indicate this isn't valid test output
+    # (Valid Jest output should have testResults array, even if empty)
+    if "testResults" not in data:
+        return None
+
+    return results
+
+
+def parse_mocha_json(json_output: str) -> Optional[Dict[str, str]]:
+    """Parse Mocha JSON output.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (module errors, syntax errors, etc.), if we find valid
+    test results JSON, we parse and return it. We only return None if we're certain
+    the framework didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If Mocha itself failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+    try:
+        data = json.loads(json_output.strip())
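+        # Validate this is Mocha JSON: it must be an object with a 'stats' key
+        # (the isinstance check avoids a TypeError on scalar JSON such as "null")
+        if not isinstance(data, dict) or "stats" not in data:
+            data = None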
+    except json.JSONDecodeError:
+        data = None
+
+    # PRIORITY 2: If direct parse failed, search for JSON in mixed output
+    if data is None:
+        # Look for stats key in JSON
+        stats_pos = json_output.find('"stats"')
+        if stats_pos == -1:
+            # PRIORITY 3: No JSON found - NOW check if there are error indicators
+            error_indicators = [
+                "Error: Cannot find module",  # Module loading errors
+                "SyntaxError:",  # JavaScript syntax errors
+                "TypeError:",  # Type errors
+                "ReferenceError:",  # Reference errors
+                "No test files found",  # Mocha-specific: no tests found
+            ]
+            has_errors = any(indicator in json_output for indicator in error_indicators)
+            # Return None ONLY if: no JSON found AND errors present
+            # Return empty dict if: no JSON found AND no errors (rare but valid)
+            return None if has_errors else results
+
+        # Find the opening brace before "stats"
+        json_start = json_output.rfind("{", 0, stats_pos)
+        if json_start == -1:
+            return None
+
+        # Try parsing from this position
+        json_portion = json_output[json_start:]
+
+        # Use json.JSONDecoder to find where the object ends
+        decoder = json.JSONDecoder()
+        try:
+            data, _ = decoder.raw_decode(json_portion)
+        except json.JSONDecodeError:
+            return None
+
+        # Validate extracted JSON has 'stats'
+        if "stats" not in data:
+            return None
+
+    # At this point, we have valid Mocha JSON with 'stats'
+    # Parse test results even if some tests failed - those are legitimate results
+
+    # Process passed tests
+    for test in data.get("passes", []):
+        file_path = test.get("file", "")
+        full_title = test.get("fullTitle", "")
+        test_id = f"{file_path}::{full_title}" if full_title else file_path
+        results[test_id] = "PASSED"
+
+    # Process failed tests
+    for test in data.get("failures", []):
+        file_path = test.get("file", "")
+        full_title = test.get("fullTitle", "")
+        test_id = f"{file_path}::{full_title}" if full_title else file_path
+        results[test_id] = "FAILED"
+
+    # Process pending/skipped tests
+    for test in data.get("pending", []):
+        file_path = test.get("file", "")
+        full_title = test.get("fullTitle", "")
+        test_id = f"{file_path}::{full_title}" if full_title else file_path
+        results[test_id] = "SKIPPED"
+
+    return results
+
+
+def parse_gtest_json(json_output: str) -> Optional[Dict[str, str]]:
+    """Parse Google Test JSON output.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (compilation errors, linking errors, etc.), if we find valid
+    test results JSON, we parse and return it. We only return None if we're certain
+    the tests didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If GTest itself failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Try to parse as pure JSON (test results take precedence)
+    try:
+        data = json.loads(json_output.strip())
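+        # Validate this is GTest JSON: it must be an object with a 'testsuites' key
+        # (the isinstance check avoids a TypeError on scalar JSON such as "null")
+        if not isinstance(data, dict) or "testsuites" not in data:
+            data = None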
+    except json.JSONDecodeError:
+        data = None
+
+    # PRIORITY 2: If direct parse failed, search for JSON in mixed output
+    if data is None:
+        # Try to find JSON in mixed output
+        json_start = json_output.find('{"testsuites"')
+        if json_start == -1:
+            json_start = json_output.find('{\n  "testsuites"')
+        if json_start == -1:
+            # PRIORITY 3: No JSON found - NOW check if there are error indicators
+            error_indicators = [
+                "error:",  # C++ compilation errors
+                "undefined reference to",  # Linking errors
+                "fatal error:",  # Fatal compilation errors
+                "cannot find -l",  # Linking library errors (e.g., "cannot find -lgtest")
+                ": No such file or directory",  # File not found errors
+            ]
+            has_errors = any(indicator in json_output for indicator in error_indicators)
+            # Return None ONLY if: no JSON found AND errors present
+            # Return empty dict if: no JSON found AND no errors (rare but valid)
+            return None if has_errors else results
+
+        # Extract JSON object
+        json_portion = json_output[json_start:]
+        decoder = json.JSONDecoder()
+        try:
+            data, _ = decoder.raw_decode(json_portion)
+        except json.JSONDecodeError:
+            return None
+
+        # Validate extracted JSON has 'testsuites'
+        if "testsuites" not in data:
+            return None
+
+    # At this point, we have valid GTest JSON with 'testsuites'
+    # Parse test results even if some tests failed - those are legitimate results
+
+    # Parse test results from testsuites
+    testsuites = data.get("testsuites", [])
+    if not isinstance(testsuites, list):
+        testsuites = [testsuites] if isinstance(testsuites, dict) else []
+
+    for testsuite in testsuites:
+        suite_name = testsuite.get("name", "")
+
+        # Handle both 'testsuite' (array) and direct test cases
+        test_cases = testsuite.get("testsuite", [])
+        if not test_cases:
+            test_cases = testsuite.get("tests", [])
+
+        for test_case in test_cases:
+            test_name = test_case.get("name", "")
+            classname = test_case.get("classname", suite_name)
+
+            # Build test ID in format: SuiteName::TestName
+            test_id = f"{classname}::{test_name}" if classname else test_name
+
+            # Determine test status
+            status = test_case.get("status", "RUN")
+            result = test_case.get("result", "COMPLETED")
+
+            # Check for failures
+            failures = test_case.get("failures", [])
+            if failures:
+                results[test_id] = "FAILED"
+            elif status == "NOTRUN" or result == "SKIPPED":
+                results[test_id] = "SKIPPED"
+            elif result == "COMPLETED" or status == "RUN":
+                results[test_id] = "PASSED"
+            else:
+                results[test_id] = "FAILED"
+
+    return results
+
+
+def parse_maven_text_output(text_output: str) -> Dict[str, str]:
+    """Parse Maven text output for test results."""
+    results = {}
+
+    # Look for test summary lines like:
+    # Tests run: 5, Failures: 1, Errors: 0, Skipped: 0
+    summary_pattern = r"Tests run: (\d+),\s*Failures: (\d+),\s*Errors: (\d+),\s*Skipped: (\d+)"
+
+    # Check for compilation errors - if tests can't compile, mark them as failed
+    compilation_error_pattern = r"\[ERROR\].*?testCompile.*?Compilation failure"
+    if re.search(compilation_error_pattern, text_output, re.DOTALL | re.IGNORECASE):
+        # Find test files mentioned in compilation errors
+        test_file_pattern = r"/workspace/repo/[^/]+/src/test/java/([\w/]+)\.java"
+        for match in re.finditer(test_file_pattern, text_output):
+            test_class = match.group(1).replace("/", ".")
+            # Mark as failed due to compilation
+            results[f"{test_class}::compile"] = "FAILED"
+        # If we found compilation errors, return early
+        if results:
+            return results
+
+    # Check for BUILD FAILURE
+    if "BUILD FAILURE" in text_output:
+        # results is always empty here (the compilation branch above returns early
+        # whenever it records anything), so report a single generic build failure
+        results["maven::build"] = "FAILED"
+        return results
+
+    # Parse test run summaries per module
+    lines = text_output.split("\n")
+    current_module = None
+
+    for line in lines:
+        # Track which module we're in
+        if "Building" in line and "[" in line and "]" in line:
+            # Take the first word of the module name from lines like
+            # "[INFO] Building Docs Web 1.12-SNAPSHOT [4/4]" (here: "Docs")
+            parts = line.split("Building")
+            if len(parts) > 1:
+                module_parts = parts[1].strip().split()
+                if len(module_parts) > 0:
+                    current_module = module_parts[0]
+
+        # Look for test summary
+        summary_match = re.search(summary_pattern, line)
+        if summary_match:
+            total = int(summary_match.group(1))
+            failures = int(summary_match.group(2))
+            errors = int(summary_match.group(3))
+            skipped = int(summary_match.group(4))
+
+            if total > 0:
+                # We have test counts but might not have individual test names
+                # Generate generic test IDs based on the current module
+                module_name = current_module or "unknown"
+                passed = total - failures - errors - skipped
+
+                for j in range(passed):
+                    results[f"{module_name}::test_{j + 1}"] = "PASSED"
+                for j in range(failures + errors):
+                    results[f"{module_name}::test_failed_{j + 1}"] = "FAILED"
+                for j in range(skipped):
+                    results[f"{module_name}::test_skipped_{j + 1}"] = "SKIPPED"
+    return results
+
+
+def parse_cargo_nextest(output: str) -> Optional[Dict[str, str]]:
+    """Parse cargo-nextest text output.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (warnings, etc.), if we find valid test results,
+    we parse and return them. We only return None if we're certain the tests didn't
+    run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If cargo-nextest failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Parse individual test result lines
+    # Format: PASS [   1.588s] rusty::tests integration::linking::test_name
+    #         FAIL [   5.845s] rusty codegen::tests::parameters_tests::test_name
+    test_line_pattern = re.compile(r"^\s*(PASS|FAIL|SIGKILL|SKIP)\s+\[.*?\]\s+(.+)$", re.MULTILINE)
+
+    for match in test_line_pattern.finditer(output):
+        status = match.group(1)
+        test_name = match.group(2).strip()
+
+        if status == "PASS":
+            results[test_name] = "PASSED"
+        elif status in ("FAIL", "SIGKILL"):
+            results[test_name] = "FAILED"
+        elif status == "SKIP":
+            results[test_name] = "SKIPPED"
+
+    # PRIORITY 2: If we found NO test results, check for error indicators
+    # Only return None if we're certain tests didn't run (compilation/linking errors)
+    if not results:
+        error_indicators = [
+            "error[E",  # Rust compiler errors (e.g., error[E0425])
+            "error: could not compile",  # Cargo compilation errors
+            "error: linking with",  # Linking errors
+            "error: aborting due to",  # Compilation aborted
+        ]
+        has_errors = any(indicator in output for indicator in error_indicators)
+        # Return None ONLY if: no results found AND errors present
+        # Return empty dict if: no results found AND no errors (rare but valid - no tests in project)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_bun_text(text_output: str) -> Optional[Dict[str, str]]:
+    """
+    Parse Bun test framework output.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (TypeScript, compilation, etc.), if we find valid
+    test results, we parse and return them. We only return None if we're
+    certain Bun didn't run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If Bun itself failed to run (not the same as tests failing)
+    """
+    results = {}
+    current_file = None
+    current_describe = None
+
+    # PRIORITY 1: Try to parse test results (✓ and ✗ symbols)
+    for line in text_output.split("\n"):
+        # Remove ANSI color codes first so every pattern below sees plain text
+        clean_line = re.sub(r"\x1b\[[0-9;]*m", "", line)
+
+        # Track current file (lines ending with .ts: or .js:)
+        if re.match(r"^[^\s].*\.(ts|js|tsx|jsx):?\s*$", clean_line.strip()):
+            current_file = clean_line.strip().rstrip(":")
+            current_describe = None
+            continue
+
+        # Track describe blocks (indented text followed by colon, but not test results)
+        describe_match = re.match(r"^\s+([^✓✗\n]+):\s*$", clean_line)
+        if describe_match:
+            current_describe = describe_match.group(1).strip()
+            continue
+
+        # Match passed tests: ✓ test_name [time]
+        pass_match = re.match(r"^\s*✓\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line)
+        if pass_match:
+            test_name = pass_match.group(1).strip()
+            # Build test ID from whichever of file / describe block are known
+            # (avoids a literal "None::" prefix when no file line was seen)
+            test_id = test_name
+            if current_describe:
+                test_id = f"{current_describe} > {test_id}"
+            if current_file:
+                test_id = f"{current_file}::{test_id}"
+            results[test_id] = "PASSED"
+            continue
+
+        # Match failed tests: ✗ test_name [time]
+        fail_match = re.match(r"^\s*✗\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line)
+        if fail_match:
+            test_name = fail_match.group(1).strip()
+            # Build test ID from whichever of file / describe block are known
+            # (avoids a literal "None::" prefix when no file line was seen)
+            test_id = test_name
+            if current_describe:
+                test_id = f"{current_describe} > {test_id}"
+            if current_file:
+                test_id = f"{current_file}::{test_id}"
+            results[test_id] = "FAILED"
+            continue
+
+        # Alternative format: FAIL  filepath > describe > test_name
+        alt_fail_match = re.match(r"^\s*FAIL\s+(.+?)\s+>\s+(.+?)\s*$", clean_line)
+        if alt_fail_match:
+            file_path = alt_fail_match.group(1).strip()
+            test_path = alt_fail_match.group(2).strip()
+            test_id = f"{file_path}::{test_path}"
+            results[test_id] = "FAILED"
+            continue
+
+    # PRIORITY 2: If no individual test results found, try parsing summary
+    if not results:
+        # Look for summary like "5 pass, 2 fail" or "X passing (Yms)"
+        summary_match = re.search(r"(\d+)\s+pass(?:ing|ed)?.*?(\d+)\s+fail(?:ing|ed)?", text_output.lower())
+        if summary_match:
+            passed = int(summary_match.group(1))
+            failed = int(summary_match.group(2))
+
+            # Generate generic test IDs
+            for i in range(passed):
+                results[f"test_{i + 1}"] = "PASSED"
+            for i in range(failed):
+                results[f"test_failed_{i + 1}"] = "FAILED"
+
+    # PRIORITY 3: No test results found - NOW check if there are error indicators
+    # Only return None if we're sure tests didn't run (no results + errors present)
+    # NOTE: error_indicators are a LAST RESORT - we prefer finding test results
+    if not results:
+        error_indicators = [
+            "error TS",  # TypeScript compilation errors (e.g., error TS2307:)
+            "Error: Cannot find module",  # Module loading errors
+            "SyntaxError:",  # JavaScript/TypeScript syntax errors
+            "error: ",  # Generic Bun errors (lowercase 'error:')
+            "Error:",  # Generic errors
+            "ModuleNotFoundError",  # Module not found
+            "bun: command not found",  # Bun not installed
+            "panicked at",  # Bun runtime panics
+            "Segmentation fault",  # Critical runtime errors
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        # Return None ONLY if: no test results found AND errors present
+        # Return empty dict if: no test results found AND no errors (rare but valid)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_cppunit_text(text_output: str) -> Optional[Dict[str, str]]:
+    """Parse CppUnit text output for test results.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    Even if output contains errors (warnings, etc.), if we find valid test results,
+    we parse and return them. We only return None if we're certain the tests didn't
+    run (no test results + error indicators).
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If CppUnit failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Parse individual test result lines
+    # Format: TestClassName::testMethodName : OK
+    #         TestClassName::testMethodName : FAIL
+    test_line_pattern = re.compile(
+        r"^([A-Za-z_][A-Za-z0-9_]*::[A-Za-z_][A-Za-z0-9_]*)\s*:\s*(OK|FAIL|ERROR)$", re.MULTILINE
+    )
+
+    for match in test_line_pattern.finditer(text_output):
+        test_name = match.group(1).strip()
+        status = match.group(2).strip()
+
+        if status == "OK":
+            results[test_name] = "PASSED"
+        elif status in ["FAIL", "ERROR"]:
+            results[test_name] = "FAILED"
+
+    # PRIORITY 2: If we found NO test results, check for error indicators
+    # Only return None if we're certain tests didn't run (compilation/linking errors)
+    if not results:
+        error_indicators = [
+            "error:",  # C++ compilation errors
+            "undefined reference to",  # Linking errors
+            "fatal error:",  # Fatal compilation errors
+            "ld returned",  # Linker errors
+            "cannot find -l",  # Library linking errors
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        # Return None ONLY if: no results found AND errors present
+        # Return empty dict if: no results found AND no errors (rare but valid - no tests)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_minitest_text(text_output: str, test_metadata_path: Optional[str] = None) -> Dict[str, str]:
+    """
+    Parse mini.nvim (MiniTest) test framework output.
+
+    MiniTest is used by Neovim plugins for testing.
+    Example output:
+      Total number of cases: 5
+      tests/test_treesitter.lua: ooooo
+
+      Fails (0) and Notes (0)
+
+      Or with failures:
+      FAIL in tests/test_treesitter.lua | wrap_cursor | normal: error message
+      FAIL in tests/test_treesitter.lua | enumerate: error message
+
+      Fails (2) and Notes (0)
+
+    IMPORTANT: MiniTest only outputs individual test names when they FAIL.
+    When all tests pass, only summary is shown - no individual test names.
+
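+    Intended solution: when all tests pass, read test_metadata.json to get expected test
+    names and return them as PASSED, so real test names are used consistently.
+    NOTE: this function does not read test_metadata_path itself; it only returns the
+    FAILED entries parsed from FAIL lines.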
+    """
+    results = {}
+
+    # Parse individual test results from FAIL lines (NOTE lines are not parsed)
+    # Format: FAIL in file.lua | group | test_name: error message
+    # Use [^|:]+ to stop at pipe OR colon (prevents capturing error message)
+    fail_pattern = re.compile(
+        r"^(?:\x1b\[\d+(?:;\d+)?m)?FAIL(?:\x1b\[0m)?\s+in\s+([^|]+)\s*\|\s*([^|:]+)(?:\s*\|\s*([^:]+))?:", re.MULTILINE
+    )
+
+    for match in fail_pattern.finditer(text_output):
+        file_path = match.group(1).strip()
+        group = match.group(2).strip()
+        test_name = match.group(3).strip() if match.group(3) else ""
+
+        # Create test ID: file | group | test_name or file | group
+        if test_name:
+            test_id = f"{file_path} | {group} | {test_name}"
+        else:
+            test_id = f"{file_path} | {group}"
+
+        results[test_id] = "FAILED"
+
+    return results
+
+
+def parse_telescope_text(text_output: str) -> Optional[Dict[str, str]]:
+    """
+    Parse telescope test framework output.
+
+    Telescope outputs lines like:
+      ✓ test_name
+      ✗ test_name
+      - test_name (skipped)
+
+    Also handles PlenaryBusted output for Neovim plugins.
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    We only return None if we're certain the framework didn't run.
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If telescope failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    for line in text_output.split("\n"):
+        line = line.strip()
+
+        # Match passed tests: ✓ test_name or "Success: test_name"
+        if "✓" in line:
+            test_name = line.split("✓", 1)[1].strip()
+            if test_name:  # Avoid empty test names
+                results[test_name] = "PASSED"
+        elif line.lower().startswith("success:"):
+            test_name = line.split(":", 1)[1].strip()
+            if test_name:
+                results[test_name] = "PASSED"
+
+        # Match failed tests: ✗ test_name or "Failed: test_name"
+        elif "✗" in line:
+            test_name = line.split("✗", 1)[1].strip()
+            if test_name:
+                results[test_name] = "FAILED"
+        elif line.lower().startswith("failed:"):
+            test_name = line.split(":", 1)[1].strip()
+            if test_name:
+                results[test_name] = "FAILED"
+
+        # Match skipped tests: - test_name or "Skipped: test_name"
+        elif line.startswith("- ") and "skip" in line.lower():
+            test_name = line[2:].strip()
+            # Remove "(skipped)" suffix if present
+            test_name = re.sub(r"\s*\(skipped\)\s*$", "", test_name, flags=re.IGNORECASE)
+            if test_name:
+                results[test_name] = "SKIPPED"
+        elif line.lower().startswith("skipped:"):
+            test_name = line.split(":", 1)[1].strip()
+            if test_name:
+                results[test_name] = "SKIPPED"
+
+    # If no results found, try parsing summary line
+    if not results:
+        # Look for summary like "5 passed, 2 failed, 1 skipped"
+        summary_pattern = r"(\d+)\s+passed.*?(\d+)\s+failed"
+        match = re.search(summary_pattern, text_output.lower())
+        if match:
+            passed = int(match.group(1))
+            failed = int(match.group(2))
+
+            # Generate generic test IDs
+            for i in range(passed):
+                results[f"test_{i + 1}"] = "PASSED"
+            for i in range(failed):
+                results[f"test_failed_{i + 1}"] = "FAILED"
+
+    # LAST RESORT: If we still found NO test results, check for error indicators
+    # Only return None if we're certain tests didn't run (Lua/Neovim errors)
+    if not results:
+        error_indicators = [
+            "Error:",  # Generic Lua errors
+            "error loading module",  # Lua module loading errors
+            "attempt to call",  # Lua runtime errors
+            "bad argument",  # Lua runtime errors
+            "stack traceback:",  # Lua errors with traceback
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        # Return None ONLY if: no results found AND errors present
+        # Return empty dict if: no results found AND no errors (rare but valid - no tests)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_lust_text(text_output: str) -> Dict[str, str]:
+    """
+    Parse lust test framework output.
+
+    Lust outputs test results with dots (.) for pass, F for fail.
+    Example output:
+      ..F.
+      4 tests, 1 failure
+      test/my_test.lua:15: Expected true but got false
+
+    We parse individual test results when available, or fall back to summary.
+    """
+    results = {}
+
+    # Try to parse individual test results from verbose output
+    # Pattern: "  test_name ... ok" or "  test_name ... FAILED"
+    test_pattern = re.compile(r"^\s*(.+?)\s+\.\.\.\s+(ok|FAILED|ERROR)", re.MULTILINE)
+    matches = test_pattern.findall(text_output)
+
+    if matches:
+        # Found individual test results
+        for test_name, status in matches:
+            test_name = test_name.strip()
+            if status == "ok":
+                results[test_name] = "PASSED"
+            else:
+                results[test_name] = "FAILED"
+        return results
+
+    # Try to extract test descriptions from failure messages
+    # Pattern: "test_file.lua:line_number: test description"
+    failure_pattern = re.compile(r"^([^\s:]+\.lua):(\d+):\s*(.+)$", re.MULTILINE)
+    failures = failure_pattern.findall(text_output)
+
+    if failures:
+        for filepath, _, description in failures:
+            test_id = f"{filepath}::{description.strip()}"
+            results[test_id] = "FAILED"
+
+    # Parse summary line to get total count: "X tests, Y failures"
+    summary_match = re.search(r"(\d+)\s+tests?,\s+(\d+)\s+failures?", text_output.lower())
+    if summary_match:
+        total_tests = int(summary_match.group(1))
+        # Named failed_count so it does not shadow the failure-message list above
+        failed_count = int(summary_match.group(2))
+
+        # If we haven't parsed individual tests yet, generate generic ones
+        if not results:
+            passed = total_tests - failed_count
+            for i in range(passed):
+                results[f"test_{i + 1}"] = "PASSED"
+            for i in range(failed_count):
+                results[f"test_failed_{i + 1}"] = "FAILED"
+            return results
+
+    # Fallback: if no detailed info, check for overall success/failure
+    if not results:
+        # \b keeps counts such as "10 failures" from matching as "0 failures"
+        if re.search(r"\b0 (?:failures|errors)\b", text_output.lower()):
+            results["test_suite"] = "PASSED"
+        else:
+            results["test_suite"] = "FAILED"
+
+    return results
+
+
+def parse_bespoke_libgeos(text_output: str) -> Optional[Dict[str, str]]:
+    """Parse libgeos/GEOS test output format.
+
+    Format:
+        capi::GEOSBoundary: .
+        capi::GEOSBuffer: .....................
+        geos::operation::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F][5=F][6=F]
+        geos::operation::buffer::BufferOp: ..........................[27=X]
+
+    Where:
+        - dots (.) = passing tests
+        - [N=F] = explicit failure markers
+        - [N=X] = exception markers (also failures)
+        - standalone F or X = failure/exception
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    We only return None if we're certain the framework didn't run.
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If libgeos tests failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Parse individual test result lines
+    # Pattern: TestSuite::TestName: followed by dots, Fs, Xs, or [N=F]/[N=X] markers
+    # Example: capi::GEOSBoundary: .
+    # Example: geos::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F]
+    # Example: geos::operation::buffer::BufferOp: ..........................[27=X]
+    test_line_pattern = re.compile(
+        r"^([a-zA-Z_][a-zA-Z0-9_:]*::[a-zA-Z_][a-zA-Z0-9_]*)\s*:\s*(.+?)(?:\n|$)", re.MULTILINE
+    )
+
+    for match in test_line_pattern.finditer(text_output):
+        test_id = match.group(1)  # Full name like "capi::GEOSBoundary"
+        test_output_line = match.group(2)  # Everything after the colon
+
+        # Check for failure markers:
+        # 1. [N=F] pattern (explicit failure notation)
+        # 2. [N=X] pattern (exception notation)
+        # 3. Standalone F or X characters
+        has_failure = bool(re.search(r"\[.*=[FX]\]|(?<!\w)[FX](?!\w)", test_output_line))
+
+        if has_failure:
+            results[test_id] = "FAILED"
+        else:
+            # All dots or successful - mark as passed
+            results[test_id] = "PASSED"
+
+    # PRIORITY 2: If we found NO test results, check for error indicators
+    # Only return None if we're certain tests didn't run (compilation/linking errors)
+    if not results:
+        error_indicators = [
+            "error:",  # C++ compilation errors
+            "undefined reference",  # Linking errors
+            "fatal error:",  # Fatal compilation errors
+            "ld returned",  # Linker errors
+            "cannot find -l",  # Library linking errors
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        # Return None ONLY if: no results found AND errors present
+        # Return empty dict if: no results found AND no errors (rare but valid - no tests)
+        return None if has_errors else results
+
+    return results
+
+
+def _normalize_swift_test_name(raw_name: str) -> str:
+    """Normalize XCTest case identifiers from swift test console output."""
+    name = raw_name.strip()
+
+    # Typical format: -[Module.Class testMethod]
+    if name.startswith("-[") and name.endswith("]"):
+        inner = name[2:-1].strip()
+        if " " in inner:
+            class_name, method = inner.split(" ", 1)
+            return f"{class_name}::{method}"
+        return inner
+
+    # Alternate format: Module.Class.testMethod
+    if "." in name:
+        parts = name.split(".", 1)
+        return f"{parts[0]}::{parts[1]}"
+
+    return name
+
+
+def parse_swift_test_text(text_output: str) -> Dict[str, str]:
+    """Parse vanilla `swift test` console output (without --xunit-output)."""
+    results = {}
+    test_case_pattern = re.compile(r"Test Case '([^']+)' (passed|failed|skipped)", re.IGNORECASE)
+
+    for match in test_case_pattern.finditer(text_output):
+        raw_name, status = match.groups()
+        test_id = _normalize_swift_test_name(raw_name)
+        results[test_id] = status.upper()
+
+    if results:
+        return results
+
+    # Fallback: parse summary line to infer aggregate results if per-test lines missing
+    summary_match = re.search(
+        r"Executed\s+(\d+)\s+tests?,\s+with\s+(\d+)\s+failures?",
+        text_output,
+        re.IGNORECASE,
+    )
+    if summary_match:
+        total_tests = int(summary_match.group(1))
+        failures = int(summary_match.group(2))
+        passes = max(total_tests - failures, 0)
+
+        for i in range(passes):
+            results[f"swift_test_pass_{i + 1}"] = "PASSED"
+        for i in range(failures):
+            results[f"swift_test_fail_{i + 1}"] = "FAILED"
+
+    return results
+
+
+def parse_xctest_output(output: str) -> Dict[str, str]:
+    """Parse XCTest results, preferring XML when available."""
+    xml_results = parse_junit_xml(output)
+    if xml_results:
+        return xml_results
+    return parse_swift_test_text(output)
+
+
+def normalize_test_id(test_id: str, framework: str = "") -> str:
+    """Normalize test IDs for stable matching across different formats.
+
+    This function performs several normalizations:
+
+    1. Removes unstable runtime prefixes that change between runs:
+       - (N/M) - Test execution order (e.g., "(2/5) test_name")
+       - [N/M] - Alternative bracket format
+       - #N - Test number prefix (e.g., "#42 test_name")
+       - N. - Numbered list format (e.g., "1. test_name")
+
+    2. Removes common file extensions (.py, .js, .ts, .go, etc.) from test paths
+       to allow matching between "test_file.py::test" and "test_file::test"
+
+    3. Normalizes delimiters (`.`, `::`, `/`) to a canonical form (`::`)
+       when they appear between alphanumeric characters, allowing matching
+       between "testa.testb::testc" and "testa/testb.testc"
+
+    Examples:
+        "(2/5) test_name" -> "test_name"
+        "test_file.py::test_name" -> "test_file::test_name"
+        "testa.testb::testc" -> "testa::testb::testc"
+        "testa/testb.testc" -> "testa::testb::testc"
+        "tests/module.js::describe::it" -> "tests::module::describe::it"
+
+    Args:
+        test_id: Original test ID from parser
+        framework: Test framework name (for future framework-specific rules if needed)
+
+    Returns:
+        Normalized test ID
+    """
+    # Step 1: Remove unstable runtime prefixes
+
+    # Universal pattern: Remove (N/M) or [N/M] prefixes (test execution order)
+    # Matches: "(2/5) test", "[2/5] test", "(123/456) test", "( 1/75) test" (with internal space)
+    normalized = re.sub(r"^[\(\[]?\s*\d+/\d+[\)\]]?\s+", "", test_id)
+
+    # Universal pattern: Remove #N prefix (test numbering)
+    # Matches: "#42 test", "# 42 test"
+    normalized = re.sub(r"^#\s*\d+\s+", "", normalized)
+
+    # Universal pattern: Remove "N. " prefix (numbered list)
+    # Matches: "1. test", "42. test"
+    normalized = re.sub(r"^\d+\.\s+", "", normalized)
+
+    # Step 2: Remove common file extensions before delimiters
+    # This prevents .py from becoming ::py after delimiter normalization
+    # Match extensions like .py, .js, .ts, etc. that appear before :: / . or end of string
+    extensions_pattern = (
+        r"\.(py|pyw|js|mjs|cjs|ts|mts|cts|jsx|tsx|"
+        r"go|java|rb|rs|c|cpp|cc|cxx|h|hpp|hxx|"
+        r"swift|kt|kts|scala|php|cs|fs|"
+        r"ex|exs|erl|hrl|clj|cljs|cljc|"
+        r"lua|pl|pm|t|r|R|m|mm|"
+        r"f|f90|f95|for|vb|pas|pp|"
+        r"d|nim|zig|v|sv|vhd|vhdl|"
+        r"tcl|sh|bash|zsh|fish|ps1|psm1|psd1)"
+        r"(?=::|/|\.|$)"
+    )
+    normalized = re.sub(extensions_pattern, "", normalized, flags=re.IGNORECASE)
+
+    # Step 3: Normalize delimiters (., ::, /) to :: when between word characters
+    # This allows matching "testa.testb::testc" with "testa/testb.testc"
+    delimiter_pattern = r"(?<=\w)(::|\.|/)(?=\w)"
+    normalized = re.sub(delimiter_pattern, "::", normalized)
+
+    return normalized
+
+
+def parse_tap_text(text_output: str) -> Dict[str, str]:
+    """
+    Parse TAP (Test Anything Protocol) output.
+
+    TAP is used by tape, node-tap, and other JavaScript test frameworks.
+
+    Format:
+        TAP version 13
+        # Subtest: Test name
+            1..N
+            ok 1 - assertion name
+            not ok 2 - assertion name
+        ok 1 - Test name # time=123ms
+        not ok 2 - Test name
+        1..N
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+    We only return None if we're certain the tests didn't run.
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If TAP tests failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # PRIORITY 1: Parse top-level test results (not indented subtests)
+    # Format: "ok N - Test name" or "not ok N - Test name"
+    # Skip lines starting with whitespace (subtests)
+    tap_test_pattern = re.compile(
+        r"^(not )?ok\s+(\d+)\s*(?:-\s*)?(.+?)(?:\s*#\s*((?:skip|todo|time=).*))?$", re.MULTILINE | re.IGNORECASE
+    )
+
+    for match in tap_test_pattern.finditer(text_output):
+        is_failure = match.group(1) is not None  # "not ok" prefix
+        test_num = match.group(2)
+        test_name = match.group(3).strip() if match.group(3) else f"test_{test_num}"
+        directive = match.group(4)
+
+        # Clean up test name (remove timing info like "# time=123ms")
+        test_name = re.sub(r"\s*#\s*time=[\d.]+m?s\s*$", "", test_name, flags=re.IGNORECASE)
+
+        test_id = test_name if test_name else f"test_{test_num}"
+
+        # Check for skip directive
+        if directive and directive.lower().startswith("skip"):
+            results[test_id] = "SKIPPED"
+        elif is_failure:
+            results[test_id] = "FAILED"
+        else:
+            results[test_id] = "PASSED"
+
+    # PRIORITY 2: If no results found, try parsing summary line
+    # Format: "# tests N", "# pass N", "# fail N"
+    if not results:
+        pass_match = re.search(r"#\s*pass\s+(\d+)", text_output, re.IGNORECASE)
+        fail_match = re.search(r"#\s*fail\s+(\d+)", text_output, re.IGNORECASE)
+
+        if pass_match or fail_match:
+            passed = int(pass_match.group(1)) if pass_match else 0
+            failed = int(fail_match.group(1)) if fail_match else 0
+
+            for i in range(passed):
+                results[f"tap_test_passed_{i + 1}"] = "PASSED"
+            for i in range(failed):
+                results[f"tap_test_failed_{i + 1}"] = "FAILED"
+
+    # PRIORITY 3: If no results, check for error indicators
+    if not results:
+        error_indicators = [
+            "npm ERR!",  # npm errors
+            "Error: Cannot find module",  # Module loading errors
+            "SyntaxError:",  # JavaScript syntax errors
+            "TypeError:",  # Type errors
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        # Return None ONLY if: no results found AND errors present
+        return None if has_errors else results
+
+    return results
+
+
+def parse_hardhat_mocha_text(text_output: str) -> Dict[str, str]:
+    """
+    Parse Hardhat/Mocha console text output (non-JSON reporter).
+
+    Hardhat uses Mocha under the hood and outputs text like:
+        Contract: FeeSharingProxy:
+            withdrawFees
+                ✓ Shouldn't be able to use zero token address
+                ✓ Shouldn't be able to withdraw second time in period
+                1) Should fail with specific error
+
+        5 passing (1s)
+        1 failing
+
+    IMPORTANT: We prioritize finding valid test results over detecting errors.
+
+    Returns:
+        Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED)
+        None: If tests failed to run (not the same as tests failing)
+    """
+    results = {}
+
+    # Track current context (Contract/describe blocks)
+    current_context = []
+
+    # PRIORITY 1: Parse individual test results
+    for line in text_output.split("\n"):
+        stripped = line.strip()
+
+        # Track Contract: or describe blocks
+        contract_match = re.match(r"^Contract:\s*(.+?):\s*$", stripped)
+        if contract_match:
+            current_context = [contract_match.group(1)]
+            continue
+
+        # Track describe blocks (indented without checkmark/number)
+        if stripped and not stripped.startswith(("✓", "✗", "-")) and not re.match(r"^\d+\)", stripped):
+            # Check if this looks like a describe block (usually followed by test cases)
+            if ":" not in stripped and len(stripped) < 100:
+                # This might be a describe block, but we'll handle it dynamically
+                pass
+
+        # Match passed tests: ✓ test_name or ✔ test_name
+        pass_match = re.match(r"^[✓✔]\s+(.+?)(?:\s+\(\d+m?s\))?$", stripped)
+        if pass_match:
+            test_name = pass_match.group(1).strip()
+            test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+            results[test_id] = "PASSED"
+            continue
+
+        # Match failed tests: N) test_name or ✗ test_name
+        fail_match = re.match(r"^(?:\d+\)|[✗✘])\s*(.+?)$", stripped)
+        if fail_match:
+            test_name = fail_match.group(1).strip()
+            test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+            results[test_id] = "FAILED"
+            continue
+
+        # Match skipped tests: - test_name
+        skip_match = re.match(r"^-\s+(.+?)$", stripped)
+        if skip_match:
+            test_name = skip_match.group(1).strip()
+            test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name
+            results[test_id] = "SKIPPED"
+            continue
+
+    # PRIORITY 2: Parse summary if no individual results found
+    if not results:
+        # Look for "N passing" and "N failing"
+        pass_match = re.search(r"(\d+)\s+passing", text_output, re.IGNORECASE)
+        fail_match = re.search(r"(\d+)\s+failing", text_output, re.IGNORECASE)
+
+        if pass_match or fail_match:
+            passed = int(pass_match.group(1)) if pass_match else 0
+            failed = int(fail_match.group(1)) if fail_match else 0
+
+            for i in range(passed):
+                results[f"mocha_test_passed_{i + 1}"] = "PASSED"
+            for i in range(failed):
+                results[f"mocha_test_failed_{i + 1}"] = "FAILED"
+
+    # PRIORITY 3: If no results, check for error indicators
+    if not results:
+        error_indicators = [
+            "Error: Cannot find module",
+            "SyntaxError:",
+            "CompilerError:",  # Solidity compilation errors
+            "Error: HH",  # Hardhat errors
+        ]
+        has_errors = any(indicator in text_output for indicator in error_indicators)
+        return None if has_errors else results
+
+    return results
+
+
+def parse_pytest_text(text_output: str) -> Dict[str, str]:
+    """
+    Parse pytest plain text output (-v flag).
+
+    Pytest outputs lines like:
+      tests/test_foo.py::test_one PASSED
+      tests/test_foo.py::test_two FAILED
+      tests/test_foo.py::test_three SKIPPED
+
+    Or in short form:
+      tests/test_foo.py .F.s
+
+    Also handles summary lines like:
+      ===== 3 passed, 1 failed, 1 skipped in 0.5s =====
+    """
+    results = {}
+
+    # Pattern 1: Verbose output with test names
+    # e.g., "tests/test_foo.py::test_one PASSED"
+    verbose_pattern = re.compile(
+        r"^([\w./]+::\w+(?:::\w+)*)\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)", re.MULTILINE
+    )
+
+    for match in verbose_pattern.finditer(text_output):
+        test_id = match.group(1).strip()
+        status = match.group(2).upper()
+
+        if status in ("PASSED", "XPASS"):
+            results[test_id] = "PASSED"
+        elif status in ("FAILED", "ERROR", "XFAIL"):
+            results[test_id] = "FAILED"
+        elif status == "SKIPPED":
+            results[test_id] = "SKIPPED"
+
+    if results:
+        return results
+
+    # Pattern 2: Short form with dots (. = pass, F = fail, s = skip)
+    # e.g., "tests/test_foo.py .F.s"
+    short_pattern = re.compile(r"^([\w./]+\.py)\s+([.FsExX]+)", re.MULTILINE)
+
+    for match in short_pattern.finditer(text_output):
+        file_path = match.group(1)
+        outcomes = match.group(2)
+
+        for i, char in enumerate(outcomes):
+            test_id = f"{file_path}::test_{i + 1}"
+            if char in (".", "X"):  # pass, xpass (same mapping as verbose mode)
+                results[test_id] = "PASSED"
+            elif char in ("F", "E", "x"):  # fail, error, xfail
+                results[test_id] = "FAILED"
+            elif char == "s":
+                results[test_id] = "SKIPPED"
+
+    if results:
+        return results
+
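+    # Pattern 3: Summary line fallback. pytest lists failures first, e.g.
+    # "===== 1 failed, 3 passed, 1 skipped in 0.5s =====", so each count is searched separately.
+    labels = {"passed": "passed", "failed": "failed", "skipped": "skipped|deselected"}
+    found = {k: re.search(rf"(\d+)\s+(?:{v})\b", text_output) for k, v in labels.items()}
+    if any(found.values()):
+        passed = int(found["passed"].group(1)) if found["passed"] else 0
+        failed = int(found["failed"].group(1)) if found["failed"] else 0
+        skipped = int(found["skipped"].group(1)) if found["skipped"] else 0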
+
+        for i in range(passed):
+            results[f"pytest_test_passed_{i + 1}"] = "PASSED"
+        for i in range(failed):
+            results[f"pytest_test_failed_{i + 1}"] = "FAILED"
+        for i in range(skipped):
+            results[f"pytest_test_skipped_{i + 1}"] = "SKIPPED"
+
+    return results
+
+
+def parse_test_output(output: str, framework: str) -> Dict[str, str]:
+    """
+    Parse test output to extract individual test results.
+
+    Returns: {'test_id': 'PASSED'|'FAILED'|'SKIPPED'}
+    """
+    # Direct framework → parser mapping
+    parsers = {
+        "pytest": parse_junit_xml,
+        "unittest": parse_junit_xml,
+        "junit": parse_junit_xml,
+        "maven": parse_maven_text_output,
+        "gtest": parse_gtest_json,
+        "cargo-nextest": parse_cargo_nextest,
+        "go": parse_go_json,
+        "jest": parse_jest_vitest_json,
+        "vitest": parse_jest_vitest_json,
+        "mocha": parse_mocha_json,
+        "bun": parse_bun_text,
+        "ctest": parse_junit_xml,
+        "cppunit": parse_cppunit_text,
+        "bespoke_libgeos": parse_bespoke_libgeos,
+        # XCTest using hybrid approach
+        "xctest": parse_xctest_output,
+        "testing": parse_xctest_output,  # New Swift Testing framework (Swift 6+)
+        # Lua frameworks
+        "busted": parse_junit_xml,  # Uses JUnit XML output
+        "luaunit": parse_junit_xml,  # Uses JUnit XML output
+        "telescope": parse_telescope_text,
+        "lust": parse_lust_text,
+        "minitest": parse_minitest_text,  # Neovim mini.nvim test framework
+        # TAP (Test Anything Protocol) - used by tape, node-tap
+        "tap": parse_tap_text,
+        "tape": parse_tap_text,
+        # Hardhat (Solidity) - uses Mocha console output
+        "hardhat": parse_hardhat_mocha_text,
+    }
+
+    parser = parsers.get(framework)
+    if parser:
+        result = parser(output)
+        # Fallback for common frameworks if their primary parser returns None/empty
+        if not result:
+            if framework in ["junit", "maven"]:
+                result = parse_maven_text_output(output)
+            elif framework == "pytest":
+                # Pytest often outputs plain text, not JUnit XML
+                result = parse_pytest_text(output)
+            elif framework == "mocha":
+                # Mocha might output text instead of JSON (console reporter)
+                result = parse_hardhat_mocha_text(output)
+        return result or {}
+
+    # Try auto-detection for unknown frameworks
+    # Check for TAP output
+    if "TAP version" in output or re.search(r"^(?:not )?ok\s+\d+", output, re.MULTILINE):
+        return parse_tap_text(output) or {}
+
+    # Check for Mocha/Hardhat console output
+    if "Contract:" in output or re.search(r"^\s*[✓✔]\s+", output, re.MULTILINE):
+        return parse_hardhat_mocha_text(output) or {}
+
+    return {}
diff --git a/resources_servers/swe_bench/parsing/utils.py b/resources_servers/swe_bench/parsing/utils.py
new file mode 100644
index 0000000000..4ef21aad79
--- /dev/null
+++ b/resources_servers/swe_bench/parsing/utils.py
@@ -0,0 +1,194 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""SWE-Bench-Ext test output parsing utilities.
+
+Provides the high-level grading entry point that parses raw test output,
+normalizes test IDs, fuzzy-matches the expected FAIL_TO_PASS / PASS_TO_PASS
+tests, and reports whether the task was resolved. Example usage::
+
+    from resources_servers.swe_bench.parsing import parse_and_check_tests
+
+    result = parse_and_check_tests(
+        test_output=log_text,
+        test_framework="pytest",
+        fail_to_pass=["test_a", "test_b"],
+        pass_to_pass=["test_c"],
+        instance_id="my-task-123",
+    )
+    # result["resolved"] -> bool
+"""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+from resources_servers.swe_bench.parsing.parsing import (
+    normalize_test_id,
+    parse_test_output,
+)
+
+
+# Marker strings used to delimit structured output in the raw test log.
+_TEST_OUTPUT_START = "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>"
+_TEST_OUTPUT_END = "<<<SWE_BENCH_EXT_TEST_OUTPUT_END>>>"
+_RESULT_FILE_START = "<<<SWE_BENCH_EXT_RESULT_FILE_START>>>"
+_RESULT_FILE_END = "<<<SWE_BENCH_EXT_RESULT_FILE_END>>>"
+
+
+def _extract_between_markers(text: str, start: str, end: str) -> Optional[str]:
+    """Extract the substring between two marker strings.
+
+    Args:
+        text: Text to search within.
+        start: Opening marker; the result begins after it.
+        end: Closing marker; the result ends before it.
+
+    Returns:
+        The stripped text between the markers, or None if either marker is
+        missing or they appear out of order.
+    """
+    s = text.find(start)
+    e = text.find(end, s + len(start)) if s != -1 else -1
+    if s != -1 and e != -1 and s < e:
+        return text[s + len(start) : e].strip()
+    return None
+
+
+def _match_test_with_fuzzy(
+    test_id: str,
+    parsed_results: Dict[str, str],
+    build_failed_packages: set,
+) -> str:
+    """Resolve the status of a single test ID against parsed results.
+
+    Tries, in order: a direct lookup, a check for membership in a package that
+    failed to build, a substring match, and a match on the final ``::``
+    component.
+
+    Args:
+        test_id: Normalized test identifier to look up.
+        parsed_results: Mapping of parsed test ID to status string.
+        build_failed_packages: Set of package names whose build failed; any
+            test ID prefixed by one of these is treated as failed.
+
+    Returns:
+        The matched status string, ``"FAILED"`` if the test's package failed to
+        build, or ``"NOT_FOUND"`` if no match is found.
+    """
+    # Direct match
+    if test_id in parsed_results:
+        return parsed_results[test_id]
+
+    # Check if this test belongs to a package that failed to build
+    for pkg in build_failed_packages:
+        if test_id.startswith(pkg):
+            return "FAILED"
+
+    # Substring match (normalized IDs may differ in prefix)
+    for parsed_id, status in parsed_results.items():
+        if test_id in parsed_id or parsed_id in test_id:
+            return status
+
+    # Try matching by last component (after last ::)
+    if "::" in test_id:
+        suffix = test_id.rsplit("::", 1)[-1]
+        for parsed_id, status in parsed_results.items():
+            if "::" in parsed_id and parsed_id.rsplit("::", 1)[-1] == suffix:
+                return status
+
+    return "NOT_FOUND"
+
+
+def parse_and_check_tests(
+    test_output: str,
+    test_framework: str,
+    fail_to_pass: List[str],
+    pass_to_pass: List[str],
+    instance_id: str = "",
+) -> Dict[str, Any]:
+    """Parse test output and check FAIL_TO_PASS / PASS_TO_PASS resolution.
+
+    The pipeline parses the result-file content between markers (or the full log
+    when that yields nothing), normalizes parsed and expected test IDs, and
+    fuzzy-matches each expected test. ``resolved`` is True only if FAIL_TO_PASS
+    is non-empty and every FAIL_TO_PASS and PASS_TO_PASS test is ``"PASSED"``.
+
+    Args:
+        test_output: Raw test log to parse.
+        test_framework: Name of the test framework (e.g. ``"pytest"``) used to
+            select the parser and normalize IDs.
+        fail_to_pass: Test IDs expected to transition from failing to passing.
+        pass_to_pass: Test IDs expected to remain passing.
+        instance_id: Optional task identifier, accepted for caller convenience.
+
+    Returns:
+        A report dict with the ``resolved`` flag, ``patch_exists`` and
+        ``patch_successfully_applied`` (always True here), per-test results and
+        pass/total counts for each group, the parsed-test count, and the framework.
+    """
+    # Try to extract result file content from the markers.
+    result_file_content = _extract_between_markers(test_output, _RESULT_FILE_START, _RESULT_FILE_END)
+
+    if result_file_content:
+        parsed = parse_test_output(result_file_content, test_framework)
+        if not parsed:
+            parsed = parse_test_output(test_output, test_framework)
+    else:
+        parsed = parse_test_output(test_output, test_framework)
+
+    if parsed is None:
+        parsed = {}
+
+    # Normalize parsed test IDs
+    parsed = {normalize_test_id(tid, test_framework): status for tid, status in parsed.items()}
+
+    # Normalize expected test IDs
+    norm_f2p = [normalize_test_id(tid, test_framework) for tid in fail_to_pass]
+    norm_p2p = [normalize_test_id(tid, test_framework) for tid in pass_to_pass]
+
+    # Handle synthetic build/compile tests
+    for tid in norm_f2p + norm_p2p:
+        if (tid.endswith("::build") or tid.endswith("::compile")) and tid not in parsed:
+            parsed[tid] = "PASSED"
+
+    # Identify packages that failed to build
+    build_failed_packages = {pkg for pkg, status in parsed.items() if status == "FAILED" and "::" not in pkg}
+
+    # Match FAIL_TO_PASS
+    f2p_results = {}
+    for tid in norm_f2p:
+        f2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages)
+
+    # Match PASS_TO_PASS
+    p2p_results = {}
+    for tid in norm_p2p:
+        p2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages)
+
+    all_f2p_passed = all(v == "PASSED" for v in f2p_results.values()) if f2p_results else False
+    all_p2p_passed = all(v == "PASSED" for v in p2p_results.values())
+    resolved = all_f2p_passed and all_p2p_passed
+
+    return {
+        "resolved": resolved,
+        "patch_exists": True,
+        "patch_successfully_applied": True,
+        "fail_to_pass_results": f2p_results,
+        "pass_to_pass_results": p2p_results,
+        "f2p_passed": sum(1 for v in f2p_results.values() if v == "PASSED"),
+        "f2p_total": len(f2p_results),
+        "p2p_passed": sum(1 for v in p2p_results.values() if v == "PASSED"),
+        "p2p_total": len(p2p_results),
+        "parsed_count": len(parsed),
+        "framework": test_framework,
+    }
diff --git a/resources_servers/swe_bench/prepare.py b/resources_servers/swe_bench/prepare.py
new file mode 100644
index 0000000000..505ad263cc
--- /dev/null
+++ b/resources_servers/swe_bench/prepare.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+    python prepare.py                          # full SWE-bench Verified + all SIFs
+    python prepare.py --limit 5                # 5 instances + their 5 SIFs (smoke test)
+    python prepare.py --instance-id django__django-13741
+    python prepare.py --no-images              # dataset only, skip image builds
+    python prepare.py --no-dataset --sif-dir PATH # build images only
+
+Output schema expected by anyswe_agent: each JSONL line has
+`responses_create_params.metadata` with `instance_id`, `dataset_name`, `split`,
+`problem_statement`, and `instance_dict` (the full SWE-bench instance the eval
+harness needs, serialized as a JSON string). Images are Apptainer SIFs named
+`{instance_id}.sif`, so the agent's container_formatter is `<sif-dir>/{instance_id}.sif`.
+
+Prerequisites for image builds: `apptainer` on PATH and network access to the
+SWE-bench image registry. Each SIF is several GB, so building all of SWE-bench
+Verified (500 tasks) needs hundreds of GB of disk. Use --limit to start small.
+"""
+
+import argparse
+import json
+import subprocess
+import sys
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+
+
+HF_DATASET = "princeton-nlp/SWE-bench_Verified"
+DEFAULT_SPLIT = "test"
+# SWE-bench publishes eval images under tags with `__` replaced by `_1776_` and the ID lowercased.
+DOCKER_IMAGE_TMPL = "docker://swebench/sweb.eval.x86_64.{tag}:latest"
+DEFAULT_MODEL = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
+
+_THIS_DIR = Path(__file__).parent
+
+
+def _docker_tag(instance_id: str) -> str:
+    return instance_id.replace("__", "_1776_").lower()
+
+
+def _to_gym_row(inst: dict, split: str, sampling: dict) -> dict:
+    swe_meta = {
+        "instance_id": inst["instance_id"],
+        "dataset_name": HF_DATASET,
+        "split": split,
+        "problem_statement": inst["problem_statement"],
+        "instance_dict": json.dumps(inst),
+    }
+    user_text = inst["problem_statement"]
+    return {
+        "responses_create_params": {
+            "input": [{"role": "user", "content": user_text}],
+            **sampling,
+            "metadata": swe_meta,
+        },
+        "verifier_metadata": swe_meta,
+    }
+
+
+def build_dataset(output: Path, split: str, limit: int | None, instance_id: str | None, sampling: dict) -> list[str]:
+    try:
+        from datasets import load_dataset
+    except ImportError:
+        sys.exit("`datasets` is required for dataset prep: pip install datasets")
+
+    print(f"Loading {HF_DATASET} [{split}]...", flush=True)
+    rows = load_dataset(HF_DATASET, split=split)
+
+    if instance_id:
+        rows = [r for r in rows if r["instance_id"] == instance_id]
+        if not rows:
+            sys.exit(f"instance_id {instance_id!r} not found in {HF_DATASET}")
+    elif limit:
+        rows = rows.select(range(min(limit, len(rows))))
+
+    output.parent.mkdir(parents=True, exist_ok=True)
+    ids: list[str] = []
+    with output.open("w") as f:
+        for inst in rows:
+            inst = dict(inst)
+            f.write(json.dumps(_to_gym_row(inst, split, sampling)) + "\n")
+            ids.append(inst["instance_id"])
+    print(f"Wrote {len(ids)} rows -> {output}", flush=True)
+    return ids
+
+
+def _build_one_sif(instance_id: str, sif_dir: Path, force: bool) -> tuple[str, bool, str]:
+    sif_path = sif_dir / f"{instance_id}.sif"
+    if sif_path.exists() and not force:
+        return instance_id, True, "exists"
+    image = DOCKER_IMAGE_TMPL.format(tag=_docker_tag(instance_id))
+    proc = subprocess.run(
+        ["apptainer", "build", "--force", str(sif_path), image],
+        capture_output=True,
+        text=True,
+        errors="replace",
+    )
+    if proc.returncode != 0:
+        return instance_id, False, proc.stderr.strip()[-500:]
+    return instance_id, True, "built"
+
+
+def build_images(instance_ids: list[str], sif_dir: Path, jobs: int, force: bool) -> None:
+    if not _which("apptainer"):
+        sys.exit("`apptainer` not found on PATH. Install it or pass --no-images")
+    sif_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Building {len(instance_ids)} SIF(s) into {sif_dir} with {jobs} worker(s)...", flush=True)
+    failures: list[str] = []
+    with ThreadPoolExecutor(max_workers=jobs) as pool:
+        futures = {pool.submit(_build_one_sif, iid, sif_dir, force): iid for iid in instance_ids}
+        for done in as_completed(futures):
+            iid, ok, detail = done.result()
+            print(f"  [{'ok' if ok else 'FAIL'}] {iid}: {detail}", flush=True)
+            if not ok:
+                failures.append(iid)
+    if failures:
+        print(f"\n{len(failures)} image build(s) failed:", flush=True)
+        for iid in failures:
+            print(f"  - {iid}", flush=True)
+        sys.exit(1)
+    print(f"All images ready. Use: container_formatter='{sif_dir}/{{instance_id}}.sif'", flush=True)
+
+
+def _which(name: str) -> bool:
+    from shutil import which
+
+    return which(name) is not None
+
+
+def main() -> None:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--output", type=Path, default=_THIS_DIR / "data" / "swebench_verified.jsonl")
+    p.add_argument("--split", default=DEFAULT_SPLIT)
+    p.add_argument("--limit", type=int, default=None, help="Only the first N instances (default: all)")
+    p.add_argument("--instance-id", default=None, help="Only this instance")
+    p.add_argument("--sif-dir", type=Path, default=_THIS_DIR / "data" / "sifs")
+    p.add_argument("--no-dataset", action="store_true", help="Skip dataset build")
+    p.add_argument("--no-images", action="store_true", help="Skip image build")
+    p.add_argument("--jobs", type=int, default=4, help="Parallel image builds")
+    p.add_argument("--force", action="store_true", help="Rebuild SIFs that already exist")
+    p.add_argument("--model", default=DEFAULT_MODEL, help="Default model baked into each row")
+    p.add_argument("--temperature", type=float, default=0.7)
+    p.add_argument("--top-p", type=float, default=0.8)
+    p.add_argument("--max-output-tokens", type=int, default=12288)
+    args = p.parse_args()
+
+    sampling = {
+        "model": args.model,
+        "temperature": args.temperature,
+        "top_p": args.top_p,
+        "max_output_tokens": args.max_output_tokens,
+    }
+
+    instance_ids: list[str]
+    if args.no_dataset:
+        if not args.output.exists():
+            sys.exit(f"--no-dataset but {args.output} does not exist")
+        instance_ids = [
+            json.loads(line)["responses_create_params"]["metadata"]["instance_id"]
+            for line in args.output.read_text().splitlines()
+            if line.strip()
+        ]
+    else:
+        instance_ids = build_dataset(args.output, args.split, args.limit, args.instance_id, sampling)
+
+    if not args.no_images:
+        build_images(instance_ids, args.sif_dir, args.jobs, args.force)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/resources_servers/swe_bench/requirements.txt b/resources_servers/swe_bench/requirements.txt
new file mode 100644
index 0000000000..cef7e1d96d
--- /dev/null
+++ b/resources_servers/swe_bench/requirements.txt
@@ -0,0 +1,2 @@
+swebench
+datasets>=2.14.0
diff --git a/resources_servers/swe_bench/sandbox.py b/resources_servers/swe_bench/sandbox.py
new file mode 100644
index 0000000000..99dafcd50c
--- /dev/null
+++ b/resources_servers/swe_bench/sandbox.py
@@ -0,0 +1,230 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Async SWE sandbox: an environment wrapper plus its acquire/teardown lifecycle.
+
+``AsyncSweEnvironment`` is a thin async wrapper around a started sandbox that any
+agent or the verifier uses to run commands and move files in and out.
+``acquire_sandbox`` starts a fresh sandbox and always tears it down on exit
+(normal return, exception, or cancellation).
+"""
+
+from __future__ import annotations
+
+import os
+import tempfile
+from contextlib import asynccontextmanager
+from pathlib import Path
+from typing import Any, AsyncIterator, Mapping
+
+from nemo_gym.sandbox import AsyncSandbox, SandboxProvider, SandboxSpec
+
+
+class AsyncSweEnvironment:
+    """Thin async wrapper around a started ``AsyncSandbox``.
+
+    Agents drive their own loop with ``execute``/``upload``/``download``; the
+    verifier uses the same surface to run eval recipes. The environment never
+    owns trajectory capture or grading logic — only sandbox I/O.
+    """
+
+    def __init__(self, sandbox: AsyncSandbox) -> None:
+        """Wrap an already-started sandbox.
+
+        Args:
+            sandbox (AsyncSandbox): A started sandbox to drive I/O against.
+        """
+        self._sandbox = sandbox
+        self._closed = False
+
+    @classmethod
+    async def start(
+        cls,
+        provider: Mapping[str, Any] | SandboxProvider,
+        spec: SandboxSpec,
+    ) -> "AsyncSweEnvironment":
+        """Create and start a fresh sandbox and return the environment.
+
+        Args:
+            provider (Mapping[str, Any] | SandboxProvider): The sandbox provider
+                config or instance to launch the sandbox with.
+            spec (SandboxSpec): The sandbox spec describing image, workdir, env,
+                and other launch options.
+
+        Returns:
+            AsyncSweEnvironment: An environment wrapping the started sandbox.
+        """
+        sandbox = AsyncSandbox(provider, spec)
+        await sandbox.start()
+        return cls(sandbox)
+
+    @property
+    def sandbox(self) -> AsyncSandbox:
+        """The wrapped sandbox.
+
+        Returns:
+            AsyncSandbox: The underlying sandbox instance.
+        """
+        return self._sandbox
+
+    @property
+    def sandbox_id(self) -> str | None:
+        """The provider-assigned sandbox identifier.
+
+        Returns:
+            str | None: The sandbox id, or ``None`` if the sandbox has no handle.
+        """
+        handle = getattr(self._sandbox, "_handle", None)
+        return handle.sandbox_id if handle is not None else None
+
+    @property
+    def provider_name(self) -> str | None:
+        """The name of the provider backing the sandbox.
+
+        Returns:
+            str | None: The provider name, or ``None`` if the sandbox has no handle.
+        """
+        handle = getattr(self._sandbox, "_handle", None)
+        return handle.provider_name if handle is not None else None
+
+    async def execute(
+        self,
+        command: str,
+        *,
+        cwd: str | None = None,
+        user: str | int | None = "root",
+        timeout_s: int | float | None = None,
+        is_eval: bool = False,
+    ) -> dict[str, Any]:
+        """Run a command in the sandbox and return a normalized result.
+
+        Args:
+            command (str): The shell command to execute.
+            cwd (str | None): Working directory for the command, or ``None`` to
+                use the sandbox default.
+            user (str | int | None): User to run the command as. Defaults to
+                ``"root"``.
+            timeout_s (int | float | None): Optional timeout in seconds.
+            is_eval (bool): Accepted for caller bookkeeping; it does not affect
+                how the command is executed.
+
+        Returns:
+            dict[str, Any]: A dict with ``output`` (combined stdout and stderr),
+                ``returncode``, ``stdout``, ``stderr``, and ``error_type``.
+        """
+        result = await self._sandbox.exec(command, cwd=cwd, env=None, timeout_s=timeout_s, user=user)
+        stdout = result.stdout or ""
+        stderr = result.stderr or ""
+        output = "\n".join(part for part in (stdout, stderr) if part)
+        return {
+            "output": output,
+            "returncode": result.return_code,
+            "stdout": stdout,
+            "stderr": stderr,
+            "error_type": result.error_type,
+        }
+
+    async def upload(self, local_path: Path | str, remote_path: str) -> None:
+        """Upload a local file into the sandbox.
+
+        Args:
+            local_path (Path | str): Path to the file on the host.
+            remote_path (str): Destination path inside the sandbox.
+        """
+        await self._sandbox.upload(local_path, remote_path)
+
+    async def download(self, remote_path: str, local_path: Path | str) -> None:
+        """Download a file from the sandbox to the host.
+
+        Args:
+            remote_path (str): Source path inside the sandbox.
+            local_path (Path | str): Destination path on the host.
+        """
+        await self._sandbox.download(remote_path, local_path)
+
+    async def write_text(self, remote_path: str, content: str) -> None:
+        """Write a string to a file inside the sandbox via a temporary upload.
+
+        Args:
+            remote_path (str): Destination path inside the sandbox.
+            content (str): The text content to write.
+        """
+        tmp = tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8")
+        try:
+            # Close (and flush) before uploading; the ``with`` also closes the handle if the write fails.
+            with tmp:
+                tmp.write(content)
+            await self._sandbox.upload(tmp.name, remote_path)
+        finally:
+            os.unlink(tmp.name)
+
+    async def cleanup(self) -> None:
+        """Stop the sandbox. Idempotent: subsequent calls are no-ops."""
+        if self._closed:
+            return
+        self._closed = True
+        await self._sandbox.stop()
+
+    async def __aenter__(self) -> "AsyncSweEnvironment":
+        """Enter the async context manager.
+
+        Returns:
+            AsyncSweEnvironment: This environment instance.
+        """
+        return self
+
+    async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None:
+        """Exit the async context manager and stop the sandbox.
+
+        Args:
+            exc_type (Any): The exception type, if one was raised.
+            exc_val (Any): The exception instance, if one was raised.
+            exc_tb (Any): The traceback, if an exception was raised.
+        """
+        await self.cleanup()
+
+
+# --- sandbox acquire/teardown lifecycle ---------------
+
+
+@asynccontextmanager
+async def acquire_sandbox(
+    provider: Mapping[str, Any] | SandboxProvider,
+    spec: SandboxSpec,
+    *,
+    instance_id: str = "",
+) -> AsyncIterator[AsyncSweEnvironment]:
+    """Start a fresh sandbox, yield it, and always stop it on exit.
+
+    Args:
+        provider: Either a ``SandboxProvider`` instance or a mapping describing
+            the provider configuration used to create the sandbox.
+        spec: The ``SandboxSpec`` describing how to provision the sandbox.
+        instance_id: Identifier accepted for logging/telemetry; it does not
+            affect behavior.
+
+    Yields:
+        AsyncSweEnvironment: The started environment wrapping the sandbox,
+        which is cleaned up when the context manager exits.
+    """
+    env: AsyncSweEnvironment | None = None
+    try:
+        env = await AsyncSweEnvironment.start(provider, spec)
+        yield env
+    finally:
+        if env is not None:
+            try:
+                await env.cleanup()
+            except Exception:
+                pass
diff --git a/resources_servers/swe_bench/self_drive.py b/resources_servers/swe_bench/self_drive.py
new file mode 100644
index 0000000000..2e00cfac69
--- /dev/null
+++ b/resources_servers/swe_bench/self_drive.py
@@ -0,0 +1,392 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Provider-neutral self-driving scaffolding for SWE agents.
+
+Any agent that runs to completion inside a sandbox (editing the repo at the task's
+working directory) can reuse these helpers: provision a working sandbox via a
+``SandboxProvider``, inject a sandbox-reachable model endpoint and/or extra
+environment for egress, run an opaque agent launch command, and extract the
+resulting unified-diff patch. Grading is decoupled — callers grade the patch
+in-process via :func:`run_self_driving` (or ``verify_task`` directly) in a fresh
+sandbox. The agent launch command, staged files, and patch-output location are
+caller-supplied, so nothing here is specific to any one agent harness.
+
+This module also defines the in-sandbox model-server egress primitive
+(``ModelEndpoint`` / ``resolve``), used to inject a sandbox-reachable endpoint
+into the agent's environment.
+"""
+
+from __future__ import annotations
+
+import dataclasses
+import json
+import shlex
+from collections.abc import Mapping
+from dataclasses import dataclass
+from typing import Any
+
+from nemo_gym.sandbox import SandboxProvider
+from resources_servers.swe_bench.harness import SweTask, get_harness, reward_from_report
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+
+
+def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str:
+    """Return the name of a sandbox provider.
+
+    Args:
+        provider: Either a mapping keyed by provider name, or a ``SandboxProvider``
+            instance with a ``name`` attribute.
+
+    Returns:
+        The provider name, or ``"?"`` if it cannot be determined.
+    """
+    if isinstance(provider, Mapping):
+        return next(iter(provider), "?")
+    return getattr(provider, "name", "?")
+
+
+async def _read_output_jsonl_row(env, output_glob: str) -> dict[str, Any]:
+    """Return the last row of the newest matching ``output.jsonl`` (or ``{}`` if absent).
+
+    Some self-driving harnesses write their result row to an ``output.jsonl`` file under an
+    output directory rather than to the working tree, so a plain ``git diff`` would miss the
+    patch. When several files match (e.g. a re-run left a stale one), the newest by mtime is
+    picked. ``find -printf "%T@ %p"`` emits ``<mtime> <path>`` per match; ``sort -n | tail -1``
+    selects the most-recently-modified, and the leading float timestamp plus single space is
+    stripped back off (so paths containing spaces survive).
+
+    Args:
+        env: The sandbox handle exposing ``execute`` for running shell commands.
+        output_glob: Directory searched recursively for ``output.jsonl``; it is shell-quoted, so wildcards do not expand.
+
+    Returns:
+        The parsed last JSON row of the newest matching ``output.jsonl`` as a dict, or an
+        empty dict if no file or content is found.
+    """
+    found = await env.execute(
+        f'find {shlex.quote(output_glob)} -name output.jsonl -printf "%T@ %p\\n" 2>/dev/null | sort -n | tail -1'
+    )
+    newest = (found.get("stdout", "") or "").strip()
+    # newest is "<mtime> <path>"; the path may contain spaces, so split only on the first one.
+    path = newest.split(" ", 1)[1].strip() if " " in newest else ""
+    if not path:
+        return {}
+    catted = await env.execute(f"cat {shlex.quote(path)}")
+    raw = (catted.get("stdout", "") or "").strip()
+    if not raw:
+        return {}
+    return json.loads(raw.splitlines()[-1])
+
+
+async def _extract_patch_from_output_jsonl(env, output_glob: str) -> str:
+    """Read the unified-diff patch from the newest matching ``output.jsonl``.
+
+    Args:
+        env: The sandbox handle exposing ``execute`` for running shell commands.
+        output_glob: Directory searched recursively for ``output.jsonl``; it is shell-quoted, so wildcards do not expand.
+
+    Returns:
+        The patch string from ``row["test_result"]["git_patch"]``, or an empty string if
+        absent.
+    """
+    row = await _read_output_jsonl_row(env, output_glob)
+    return (row.get("test_result") or {}).get("git_patch", "") or ""
+
+
+def _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env):
+    """Build the agent sandbox spec, injecting egress env (model endpoint and/or extra env).
+
+    Args:
+        task: The SWE task whose benchmark selects the harness and seeds the spec.
+        provider: The sandbox provider, used to resolve the model endpoint for egress.
+        model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+            is resolved and merged into the spec's environment.
+        opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+            model endpoint.
+        extra_env: Optional environment variables merged verbatim into the spec.
+
+    Returns:
+        The sandbox spec with egress environment variables applied.
+    """
+    harness = get_harness(task.benchmark)
+    spec = harness.build_spec(task)
+    # Model-server egress: inject only a sandbox-reachable endpoint (never the global dict).
+    if model_server is not None:
+        endpoint = resolve(_provider_name(provider), model_server, opensandbox_service_url=opensandbox_service_url)
+        spec = dataclasses.replace(spec, env={**spec.env, **endpoint.to_sandbox_env()})
+    # Any extra in-sandbox env (e.g. a NeMo-Gym ServerClient config dict, ANTHROPIC_* vars).
+    if extra_env:
+        spec = dataclasses.replace(spec, env={**spec.env, **dict(extra_env)})
+    return spec
+
+
+async def provision_and_collect(
+    task: SweTask,
+    *,
+    provider: Mapping[str, Any] | SandboxProvider,
+    agent_launch_command: str,
+    model_server: Mapping[str, Any] | None = None,
+    opensandbox_service_url: str | None = None,
+    extra_env: Mapping[str, str] | None = None,
+    stage_files: Mapping[str, str] | None = None,
+    patch_output_glob: str | None = None,
+    agent_timeout_s: int | float = 1800,
+) -> dict[str, Any]:
+    """Provision and self-drive the agent, returning the patch and error signals.
+
+    Provisions a writable sandbox from the task image, stages any caller-supplied files,
+    runs the opaque ``agent_launch_command`` at the repo working directory, then extracts the
+    unified-diff patch. No grading happens here.
+
+    Two egress styles are supported and composable:
+
+    * ``model_server`` -> a sandbox-reachable OpenAI ``base_url`` (via ``resolve``),
+      for agents that call the model via a standard OpenAI/litellm client.
+    * ``extra_env`` -> injected verbatim, for agents wired to NeMo Gym's ``ServerClient`` or to
+      a CLI that reads its endpoint from environment variables.
+
+    ``env.execute`` does not raise on timeout; it reports the failure through ``error_type``
+    instead. Callers must therefore check the returned ``"error_type"`` to set
+    ``agent_timed_out``; otherwise a timed-out agent's sample would wrongly go unmasked.
+
+    Args:
+        task: The SWE task describing the instance, image, and working directory.
+        provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+        agent_launch_command: The shell command that runs the agent inside the sandbox.
+        model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+            is resolved and injected into the agent's environment.
+        opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+            model endpoint.
+        extra_env: Optional environment variables injected verbatim into the sandbox.
+        stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+            before launch.
+        patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+            path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+        agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+    Returns:
+        A dict with keys ``"patch"`` (the unified-diff string), ``"agent_error"`` (the
+        harness error field or ``None``), and ``"error_type"`` (``"timeout"``, ``"sandbox"``,
+        or ``None``).
+    """
+    spec = _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env)
+    async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env:
+        for remote_path, content in (stage_files or {}).items():
+            await env.write_text(remote_path, content)
+        run = await env.execute(agent_launch_command, cwd=task.repo_workdir, timeout_s=agent_timeout_s)
+        error_type = run.get("error_type")
+        if patch_output_glob:
+            row = await _read_output_jsonl_row(env, patch_output_glob)
+            patch = (row.get("test_result") or {}).get("git_patch", "") or ""
+            return {"patch": patch, "agent_error": row.get("error"), "error_type": error_type}
+        diff = await env.execute("git add -A && git diff --cached", cwd=task.repo_workdir)
+        return {"patch": diff.get("stdout", "") or "", "agent_error": None, "error_type": error_type}
+
+
+async def provision_and_extract_patch(
+    task: SweTask,
+    *,
+    provider: Mapping[str, Any] | SandboxProvider,
+    agent_launch_command: str,
+    model_server: Mapping[str, Any] | None = None,
+    opensandbox_service_url: str | None = None,
+    extra_env: Mapping[str, str] | None = None,
+    stage_files: Mapping[str, str] | None = None,
+    patch_output_glob: str | None = None,
+    agent_timeout_s: int | float = 1800,
+) -> str:
+    """Provision a working sandbox, self-drive the agent, and return the unified-diff patch.
+
+    A thin wrapper over :func:`provision_and_collect` returning only the patch. No grading
+    happens here.
+
+    Args:
+        task: The SWE task describing the instance, image, and working directory.
+        provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+        agent_launch_command: The shell command that runs the agent inside the sandbox.
+        model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+            is resolved and injected into the agent's environment.
+        opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+            model endpoint.
+        extra_env: Optional environment variables injected verbatim into the sandbox.
+        stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+            before launch.
+        patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+            path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+        agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+    Returns:
+        The extracted unified-diff patch as a string (empty if none was produced).
+    """
+    result = await provision_and_collect(
+        task,
+        provider=provider,
+        agent_launch_command=agent_launch_command,
+        model_server=model_server,
+        opensandbox_service_url=opensandbox_service_url,
+        extra_env=extra_env,
+        stage_files=stage_files,
+        patch_output_glob=patch_output_glob,
+        agent_timeout_s=agent_timeout_s,
+    )
+    return result["patch"]
+
+
+async def run_self_driving(
+    task: SweTask,
+    *,
+    provider: Mapping[str, Any] | SandboxProvider,
+    agent_launch_command: str,
+    model_server: Mapping[str, Any] | None = None,
+    opensandbox_service_url: str | None = None,
+    extra_env: Mapping[str, str] | None = None,
+    stage_files: Mapping[str, str] | None = None,
+    patch_output_glob: str | None = None,
+    agent_timeout_s: int | float = 1800,
+) -> dict[str, Any]:
+    """Provision, self-drive, extract the patch, then grade it in-process in a fresh sandbox.
+
+    Bundles provisioning and verification for standalone use and tests. The patch is graded by
+    ``verify_task`` in its OWN fresh sandbox (so grading is hermetic — never the agent's dirtied
+    tree). ``verify_task`` is imported lazily to avoid a circular import between this library and
+    the verifier module.
+
+    Args:
+        task: The SWE task describing the instance, image, and working directory.
+        provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``).
+        agent_launch_command: The shell command that runs the agent inside the sandbox.
+        model_server: Optional model-server config; when given, a sandbox-reachable endpoint
+            is resolved and injected into the agent's environment.
+        opensandbox_service_url: Optional OpenSandbox service URL used when resolving the
+            model endpoint.
+        extra_env: Optional environment variables injected verbatim into the sandbox.
+        stage_files: Optional ``{remote_path: content}`` files written into the live sandbox
+            before launch.
+        patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this
+            path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``.
+        agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``.
+
+    Returns:
+        A dict with the instance id, model patch, resolution status, reward, whether a patch exists,
+        whether the sample is masked (true only when the verifier reports an error kind), and that error kind.
+    """
+    from resources_servers.swe_bench.verify_task import verify_task
+
+    patch = await provision_and_extract_patch(
+        task,
+        provider=provider,
+        agent_launch_command=agent_launch_command,
+        model_server=model_server,
+        opensandbox_service_url=opensandbox_service_url,
+        extra_env=extra_env,
+        stage_files=stage_files,
+        patch_output_glob=patch_output_glob,
+        agent_timeout_s=agent_timeout_s,
+    )
+    # Score the patch in the verifier's OWN fresh sandbox (decoupled, hermetic verification).
+    report = await verify_task(provider, dataclasses.replace(task, model_patch=patch))
+    masked = report.error_kind is not None
+    return {
+        "instance_id": task.instance_id,
+        "model_patch": patch,
+        "resolved": report.resolved,
+        "reward": reward_from_report(report),
+        "patch_exists": bool(patch.strip()),
+        "mask_sample": masked,
+        "error_kind": report.error_kind,
+    }
+
+
+# --- in-sandbox model-server egress --------------
+
+
+class ModelEgressUnavailable(RuntimeError):
+    """Raised when no sandbox-reachable model endpoint can be resolved for a provider."""
+
+
+@dataclass(frozen=True)
+class ModelEndpoint:
+    """A sandbox-reachable model-server endpoint.
+
+    Attributes:
+        base_url: The base URL the in-sandbox agent uses to reach the model server.
+        api_key: Optional API key for authenticating to the model server.
+        model: Optional model name to use.
+    """
+
+    base_url: str
+    api_key: str = ""
+    model: str = ""
+
+    def to_sandbox_env(self) -> dict[str, str]:
+        """Build the minimal set of environment variables to inject into the sandbox.
+
+        Returns:
+            dict[str, str]: Environment variables carrying the base URL and,
+            when set, the API key and model name. The global config dict is
+            never included.
+        """
+        env = {"OPENAI_BASE_URL": self.base_url, "NEMO_GYM_MODEL_BASE_URL": self.base_url}
+        if self.api_key:
+            env["OPENAI_API_KEY"] = self.api_key
+        if self.model:
+            env["NEMO_GYM_MODEL"] = self.model
+        return env
+
+
+def resolve(
+    provider_name: str,
+    model_server: Mapping[str, Any],
+    *,
+    host_loopback_url: str = "http://127.0.0.1:8000/v1",
+    opensandbox_service_url: str | None = None,
+) -> ModelEndpoint:
+    """Resolve a sandbox-reachable model endpoint for a sandbox provider.
+
+    Args:
+        provider_name: The sandbox provider name (e.g. ``"apptainer"``,
+            ``"opensandbox"``, ``"docker"``).
+        model_server: Mapping describing the model server, read for the
+            ``api_key``, ``model``, and ``base_url`` keys.
+        host_loopback_url: Fallback URL used when the provider shares the host
+            network namespace and no base URL is configured.
+        opensandbox_service_url: Cluster-reachable Service/ingress URL used for
+            the opensandbox provider when no other base URL is configured.
+
+    Returns:
+        ModelEndpoint: The resolved endpoint carrying the base URL, API key,
+        and model name.
+
+    Raises:
+        ModelEgressUnavailable: If the opensandbox provider cannot resolve a
+            cluster-reachable model-server URL (e.g. only loopback is available).
+    """
+    api_key = str(model_server.get("api_key", "") or "")
+    model = str(model_server.get("model", "") or "")
+    configured_base = str(model_server.get("base_url", "") or "")
+
+    if provider_name == "opensandbox":
+        base_url = opensandbox_service_url or configured_base
+        if not base_url or "127.0.0.1" in base_url or "localhost" in base_url:
+            raise ModelEgressUnavailable(
+                "opensandbox needs a cluster-reachable model-server URL (k8s Service/ingress); "
+                "loopback is unreachable from the pod. Configure 'opensandbox_service_url', or "
+                "run the agent with the docker provider instead."
+            )
+    else:
+        # Host-network providers (e.g. apptainer; docker only with host networking): host loopback is reachable.
+        base_url = configured_base or host_loopback_url
+
+    return ModelEndpoint(base_url=base_url, api_key=api_key, model=model)
diff --git a/resources_servers/swe_bench/session.py b/resources_servers/swe_bench/session.py
new file mode 100644
index 0000000000..ca3112bc57
--- /dev/null
+++ b/resources_servers/swe_bench/session.py
@@ -0,0 +1,71 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""SessionDescriptor — Environment response after accepting a Task.
+
+The descriptor is **episode context**, not the Task itself: placement topology,
+sandbox spec, egress hints, and a round-trip verifier payload for ``/verify``.
+"""
+
+from __future__ import annotations
+
+from typing import Any, Literal, Optional
+
+from pydantic import BaseModel, ConfigDict, Field
+
+from nemo_gym.base_resources_server import (
+    BaseSeedSessionRequest,
+    BaseSeedSessionResponse,
+    BaseVerifyRequest,
+    BaseVerifyResponse,
+)
+from resources_servers.swe_bench.task import ENVIRONMENT_NAME, TaskPublic
+
+
+Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"]
+
+
+class PlacementDescriptor(BaseModel):
+    topology: Topology
+
+
+class SandboxDescriptor(BaseModel):
+    spec: dict[str, Any]
+
+
+class EgressDescriptor(BaseModel):
+    env: dict[str, str] = Field(default_factory=dict)
+
+
+class SessionDescriptor(BaseSeedSessionResponse):
+    """Environment-owned episode context returned from ``seed_session``."""
+
+    environment: str = ENVIRONMENT_NAME
+    task: TaskPublic
+    placement: PlacementDescriptor
+    sandbox: SandboxDescriptor
+    egress: EgressDescriptor
+    verifier_metadata: dict[str, Any]
+
+
+class SweBenchSeedSessionRequest(BaseSeedSessionRequest):
+    model_config = ConfigDict(extra="allow")
+    verifier_metadata: Optional[dict[str, Any]] = None
+
+
+class SweBenchVerifyRequest(BaseVerifyRequest):
+    model_config = ConfigDict(extra="allow")
+    verifier_metadata: Optional[dict[str, Any]] = None
+
+
+class SweBenchVerifyResponse(BaseVerifyResponse):
+    model_config = ConfigDict(extra="allow")
+    task_id: str = ""
+    environment: str = ENVIRONMENT_NAME
+    resolved: bool = False
+    patch_exists: bool = False
+    mask_sample: bool = False
+    error_kind: Optional[str] = None
+
+
+SweBenchSeedSessionResponse = SessionDescriptor
diff --git a/resources_servers/swe_bench/task.py b/resources_servers/swe_bench/task.py
new file mode 100644
index 0000000000..53d86b506e
--- /dev/null
+++ b/resources_servers/swe_bench/task.py
@@ -0,0 +1,256 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""First-class Task model for the ``swe_bench`` Environment.
+
+A **Task** (τ) is one problem instance from a benchmark's task distribution — not the
+Environment (``swe_bench`` resources server) and not the published benchmark name alone
+(e.g. *SWE-bench Verified*).
+
+Terminology:
+
+* ``task_id`` / ``instance_id`` — unique instance key (``django__django-13741``)
+* ``dataset_name`` — published benchmark product (HuggingFace id)
+* ``harness_family`` / ``benchmark`` — harness registry key inside this Environment
+  (``swe-bench``, ``r2e-gym``, …)
+* ``problem_statement`` — initial observation (user message) for the agent
+* ``metadata`` — privileged grading fields (``instance_dict``, etc.); Environment-only
+"""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field, replace
+from typing import Any, Protocol
+
+from pydantic import BaseModel, ConfigDict
+
+from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming
+
+
+ENVIRONMENT_NAME = "swe_bench"
+
+_HARNESS_FAMILY_ALIASES: list[tuple[str, str]] = [
+    ("R2E-Gym", "r2e-gym"),
+    ("SWE-bench_Multilingual", "swe-bench-multilingual"),
+    ("SWE-bench", "swe-bench"),
+]
+
+
+class TaskRunBody(Protocol):
+    """Minimal run/seed/verify request shape carrying task fields."""
+
+    responses_create_params: NeMoGymResponseCreateParamsNonStreaming | None
+    verifier_metadata: dict[str, Any] | None
+
+
+class TaskPublic(BaseModel):
+    """Agent-visible task identity returned from ``seed_session``."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    task_id: str
+    environment: str = ENVIRONMENT_NAME
+    dataset_name: str = ""
+    harness_family: str = ""
+    split: str = "test"
+
+
+class TaskSubmission(BaseModel):
+    """Agent-produced artifact graded at ``verify`` (Environment-owned scoring)."""
+
+    model_config = ConfigDict(extra="forbid")
+
+    model_patch: str = ""
+
+
+@dataclass
+class SweTask:
+    """One SWE Environment task instance — provisioning + grading input.
+
+    This is the Environment-internal task value. Harnesses consume ``SweTask``;
+    HTTP callers supply dataset rows that parse into this type.
+    """
+
+    instance_id: str
+    image: str | None = None
+    base_commit: str | None = None
+    repo_workdir: str = "/testbed"
+    test_command: str = ""
+    test_framework: str = ""
+    model_patch: str = ""
+    test_patch: str = ""
+    fail_to_pass: list[str] = field(default_factory=list)
+    pass_to_pass: list[str] = field(default_factory=list)
+    benchmark: str = "swe-bench-ext"
+    split: str = "test"
+    dataset_name: str = ""
+    problem_statement: str = ""
+    metadata: dict[str, Any] = field(default_factory=dict)
+
+    @property
+    def task_id(self) -> str:
+        return self.instance_id
+
+    @property
+    def harness_family(self) -> str:
+        return self.benchmark
+
+    def public_view(self, *, environment: str = ENVIRONMENT_NAME) -> TaskPublic:
+        """Return the agent-visible task identity (no privileged metadata)."""
+        return TaskPublic(
+            task_id=self.task_id,
+            environment=environment,
+            dataset_name=self.dataset_name,
+            harness_family=self.harness_family,
+            split=self.split,
+        )
+
+    def privileged_verifier_metadata(self, *, flat_eval: bool) -> dict[str, Any]:
+        """Privileged fields the Environment needs on verify (not for agent logic)."""
+        return {
+            "instance_id": self.instance_id,
+            "dataset_name": self.dataset_name,
+            "split": self.split,
+            "benchmark": self.benchmark,
+            "harness_family": self.harness_family,
+            "problem_statement": self.problem_statement,
+            "flat_eval": flat_eval,
+            "instance_dict": self.metadata.get("instance_dict"),
+        }
+
+    def with_submission(self, submission: TaskSubmission | None) -> SweTask:
+        """Return a copy carrying the agent's submitted patch (empty when there is no submission)."""
+        patch = (submission.model_patch if submission else "") or ""
+        return replace(self, model_patch=patch)
+
+
+def harness_family_key(dataset_name: str) -> str:
+    """Map a Hugging Face dataset name to a harness registry key (default ``swe-bench``)."""
+    for needle, key in _HARNESS_FAMILY_ALIASES:
+        if needle in dataset_name:
+            return key
+    return "swe-bench"
+
+
+def instance_image(container_formatter: Any, instance_id: str) -> str:
+    fmt = container_formatter[0] if isinstance(container_formatter, list) else container_formatter
+    fmt = fmt or "swebench/sweb.eval.x86_64.{instance_id}"
+    if fmt.endswith(".sif") or fmt.startswith(("/", ".")):
+        return fmt.format(instance_id=instance_id)
+    if fmt.startswith("docker://"):
+        fmt = fmt[len("docker://") :]
+    tag = instance_id.replace("__", "_1776_").lower()
+    image = fmt.format(instance_id=tag)
+    if ":" not in image.rsplit("/", 1)[-1]:
+        image += ":latest"
+    return image
+
+
+def _as_list(value: Any) -> list[str]:
+    if isinstance(value, str):
+        try:
+            return list(json.loads(value))
+        except (json.JSONDecodeError, TypeError):
+            return [value] if value else []
+    return list(value or [])
+
+
+def merge_row_metadata(
+    verifier_metadata: dict[str, Any] | None,
+    responses_metadata: dict[str, Any] | None,
+) -> dict[str, Any]:
+    """Merge dataset row fields; ``verifier_metadata`` overrides ``responses_metadata`` on key conflicts."""
+    return _merge_row_metadata(verifier_metadata, responses_metadata)
+
+
+def _merge_row_metadata(
+    verifier_metadata: dict[str, Any] | None,
+    responses_metadata: dict[str, Any] | None,
+) -> dict[str, Any]:
+    info: dict[str, Any] = {}
+    if responses_metadata:
+        info.update(responses_metadata)
+    if verifier_metadata:
+        info.update(verifier_metadata)
+    return info
+
+
+def _initial_observation(row: dict[str, Any], responses_metadata: dict[str, Any] | None) -> str:
+    if row.get("problem_statement"):
+        return str(row["problem_statement"])
+    params = row.get("responses_create_params")
+    if isinstance(params, dict):
+        raw_input = params.get("input")
+    else:
+        # Non-dict params (for example a pydantic model) are handled by the
+        # attribute lookup below.
+        raw_input = None
+    if raw_input is None and hasattr(row.get("responses_create_params"), "input"):
+        raw_input = row["responses_create_params"].input  # type: ignore[union-attr]
+    if isinstance(raw_input, str):
+        return raw_input
+    if isinstance(raw_input, list) and raw_input:
+        first = raw_input[0]
+        if isinstance(first, dict):
+            return str(first.get("content", ""))
+    return ""
+
+
+def build_task(
+    row: dict[str, Any],
+    *,
+    container_formatter: str,
+    flat_eval: bool = True,
+    responses_metadata: dict[str, Any] | None = None,
+) -> SweTask:
+    """Build a ``SweTask`` from merged dataset / verifier metadata."""
+    inst_raw = row.get("instance_dict")
+    inst = json.loads(inst_raw) if isinstance(inst_raw, str) else dict(inst_raw or {})
+    dataset_name = str(row.get("dataset_name", ""))
+    instance_id = row["instance_id"]
+    image = instance_image(row.get("container_formatter") or container_formatter, instance_id)
+
+    return SweTask(
+        instance_id=instance_id,
+        image=image,
+        base_commit=inst.get("base_commit"),
+        repo_workdir="/testbed",
+        test_patch=inst.get("test_patch", ""),
+        fail_to_pass=_as_list(inst.get("FAIL_TO_PASS") or inst.get("fail_to_pass")),
+        pass_to_pass=_as_list(inst.get("PASS_TO_PASS") or inst.get("pass_to_pass")),
+        benchmark=harness_family_key(dataset_name),
+        split=str(row.get("split", "test")),
+        dataset_name=dataset_name,
+        problem_statement=_initial_observation(row, responses_metadata),
+        metadata={"instance_dict": inst, "flat_eval": flat_eval, "dataset_name": dataset_name},
+    )
+
+
+def parse_task_from_request(
+    body: TaskRunBody,
+    *,
+    container_formatter: str,
+    flat_eval: bool = True,
+    environment: str = ENVIRONMENT_NAME,
+) -> SweTask:
+    """Parse a first-class Task from an agent ``/run`` or Environment HTTP body."""
+    responses_metadata = (body.responses_create_params.metadata or {}) if body.responses_create_params else {}
+    row = merge_row_metadata(body.verifier_metadata, responses_metadata)
+    if "instance_id" not in row:
+        raise ValueError(
+            "Task requires verifier_metadata.instance_id (or responses_create_params.metadata.instance_id)"
+        )
+    return build_task(
+        row,
+        container_formatter=container_formatter,
+        flat_eval=flat_eval,
+        responses_metadata=responses_metadata,
+    )
+
+
+def parse_submission(verifier_metadata: dict[str, Any] | None) -> TaskSubmission:
+    """Extract the agent submission from verify request metadata."""
+    meta = dict(verifier_metadata or {})
+    patch = meta.get("model_patch") or meta.get("git_patch") or ""
+    return TaskSubmission(model_patch=patch if isinstance(patch, str) else str(patch))
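Note on `instance_image` in `task.py` above: the function has no docstring, so the sketch below walks through its three naming paths. It is a standalone copy of the function as it appears in this diff, for illustration only. The formatter strings `/images/{instance_id}.sif` and `docker://registry.example/sweb.{instance_id}:v1` are hypothetical inputs, not values taken from the repository.

```python
from typing import Any


def instance_image(container_formatter: Any, instance_id: str) -> str:
    # Standalone copy of the function from task.py, reproduced for illustration.
    fmt = container_formatter[0] if isinstance(container_formatter, list) else container_formatter
    fmt = fmt or "swebench/sweb.eval.x86_64.{instance_id}"
    # Local paths and .sif files keep the raw instance id.
    if fmt.endswith(".sif") or fmt.startswith(("/", ".")):
        return fmt.format(instance_id=instance_id)
    # Registry images drop the docker:// scheme and use a lowercased tag
    # in which "__" is replaced by "_1776_".
    if fmt.startswith("docker://"):
        fmt = fmt[len("docker://"):]
    tag = instance_id.replace("__", "_1776_").lower()
    image = fmt.format(instance_id=tag)
    # Append ":latest" only when the last path segment has no explicit tag.
    if ":" not in image.rsplit("/", 1)[-1]:
        image += ":latest"
    return image


INSTANCE = "django__django-13741"

# Empty formatter: falls back to the default swebench image pattern.
default_image = instance_image(None, INSTANCE)

# Hypothetical local .sif path: the instance id is substituted unchanged.
sif_image = instance_image("/images/{instance_id}.sif", INSTANCE)

# Hypothetical registry formatter passed as a list with an explicit tag.
registry_image = instance_image(["docker://registry.example/sweb.{instance_id}:v1"], INSTANCE)

print(default_image)
print(sif_image)
print(registry_image)
```

The design point worth noticing is that only registry-style names are rewritten (`__` to `_1776_`, lowercased, `:latest` appended when untagged). Path-style formatters are passed through with the original instance id.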
diff --git a/resources_servers/swe_bench/task_builder.py b/resources_servers/swe_bench/task_builder.py
new file mode 100644
index 0000000000..3c4df6d181
--- /dev/null
+++ b/resources_servers/swe_bench/task_builder.py
@@ -0,0 +1,20 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Backward-compatible re-exports — prefer ``resources_servers.swe_bench.task``."""
+
+from resources_servers.swe_bench.task import (
+    SweTask,
+)
+from resources_servers.swe_bench.task import (
+    build_task as build_swetask,
+)
+from resources_servers.swe_bench.task import (
+    harness_family_key as benchmark_key,
+)
+from resources_servers.swe_bench.task import (
+    merge_row_metadata as problem_info_from_row,
+)
+
+
+__all__ = ["SweTask", "benchmark_key", "build_swetask", "problem_info_from_row"]
diff --git a/resources_servers/swe_bench/tests/__init__.py b/resources_servers/swe_bench/tests/__init__.py
new file mode 100644
index 0000000000..777f2341ac
--- /dev/null
+++ b/resources_servers/swe_bench/tests/__init__.py
@@ -0,0 +1 @@
+"""Test suite for the ``swe_bench`` resources server."""
diff --git a/resources_servers/swe_bench/tests/conftest.py b/resources_servers/swe_bench/tests/conftest.py
new file mode 100644
index 0000000000..5bc774e7c4
--- /dev/null
+++ b/resources_servers/swe_bench/tests/conftest.py
@@ -0,0 +1,26 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Pytest collection guard for the swe_env tests.
+
+The flat-eval parser fixtures are recorded eval logs whose lines begin with the
+SWE-bench ``>>>>>`` sentinels. Under doctest collection those look like
+(malformed) ``>>>`` prompts, so the fixtures directory is excluded from
+collection entirely. It holds only data, never tests.
+"""
+
+from __future__ import annotations
+
+
+collect_ignore_glob = ["fixtures/*"]
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt
new file mode 100644
index 0000000000..bb67958525
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt
@@ -0,0 +1,9 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+error: while searching for:
+    def format_signature(self):
+error: patch failed: sphinx/ext/autodoc/__init__.py:120
+error: sphinx/ext/autodoc/__init__.py: patch does not apply
+>>>>> Patch Apply Failed
++ git checkout abc123 tests/test_ext_autodoc.py
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt
new file mode 100644
index 0000000000..bc8d678e61
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt
@@ -0,0 +1,14 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+collected 3 items
+>>>>> End Test Output
+PASSED tests/test_ext_autodoc.py::test_format_signature
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+=================== 3 passed in 1.92s =========================================
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt
new file mode 100644
index 0000000000..c4f0e56654
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt
@@ -0,0 +1,11 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
++ git apply -v /tmp/test_patch.diff
+error: patch failed: tests/test_ext_autodoc.py:1
+error: tests/test_ext_autodoc.py: patch does not apply
++ python -m pytest tests/test_ext_autodoc.py
+ERROR: file or directory not found: tests/test_ext_autodoc.py
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt
new file mode 100644
index 0000000000..1d0ba6a53a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt
@@ -0,0 +1,25 @@
++ source /opt/miniconda3/bin/activate
++ conda activate testbed
++ git config --global --add safe.directory /testbed
++ cd /testbed
++ git status
++ git restore .
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
++ git apply -v /tmp/test_patch.diff
+Checking patch tests/test_ext_autodoc.py...
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+PASSED tests/test_ext_autodoc.py::test_format_signature
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+SKIPPED tests/test_ext_autodoc.py::test_optional_feature
+=================== 3 passed, 1 skipped in 2.41s ===============================
+>>>>> End Test Output
++ git checkout abc123 tests/test_ext_autodoc.py
+Updated 1 path from the index
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt
new file mode 100644
index 0000000000..0a27e668e1
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt
@@ -0,0 +1,10 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+>>>>> Tests Timed Out
diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt
new file mode 100644
index 0000000000..59dc10159f
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt
@@ -0,0 +1,16 @@
++ cd /testbed
++ git apply -v /tmp/patch.diff
+Checking patch sphinx/ext/autodoc/__init__.py...
+Applied patch sphinx/ext/autodoc/__init__.py cleanly.
+>>>>> Applied Patch
++ git apply -v /tmp/test_patch.diff
+Checking patch tests/test_ext_autodoc.py...
+Applied patch tests/test_ext_autodoc.py cleanly.
+>>>>> Start Test Output
+============================= test session starts ==============================
+FAILED tests/test_ext_autodoc.py::test_format_signature - AssertionError: signature mismatch
+PASSED tests/test_ext_autodoc.py::test_autodoc_inherited
+PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members
+=================== 2 passed, 1 failed in 2.10s ================================
+>>>>> End Test Output
++ git checkout abc123 tests/test_ext_autodoc.py
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt
new file mode 100644
index 0000000000..5f1200be91
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt
@@ -0,0 +1,6 @@
+{"Time":"2026-06-23T00:00:00Z","Action":"run","Package":"github.com/acme/widget","Test":"TestAlpha"}
+{"Time":"2026-06-23T00:00:00Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestAlpha","Elapsed":0.01}
+{"Time":"2026-06-23T00:00:01Z","Action":"run","Package":"github.com/acme/widget","Test":"TestBeta"}
+{"Time":"2026-06-23T00:00:01Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestBeta","Elapsed":0.02}
+{"Time":"2026-06-23T00:00:02Z","Action":"run","Package":"github.com/acme/widget","Test":"TestGamma"}
+{"Time":"2026-06-23T00:00:02Z","Action":"fail","Package":"github.com/acme/widget","Test":"TestGamma","Elapsed":0.01}
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml
new file mode 100644
index 0000000000..028b436db3
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml
@@ -0,0 +1,10 @@
+<?xml version="1.0" encoding="utf-8"?>
+<testsuites>
+  <testsuite name="pytest" tests="3" failures="1" errors="0" skipped="0">
+    <testcase classname="tests.test_core" name="test_fix_applied" time="0.01"/>
+    <testcase classname="tests.test_core" name="test_regression_guard" time="0.02"/>
+    <testcase classname="tests.test_core" name="test_unrelated_broken" time="0.01">
+      <failure message="AssertionError">boom</failure>
+    </testcase>
+  </testsuite>
+</testsuites>
diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt
new file mode 100644
index 0000000000..d566714983
--- /dev/null
+++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt
@@ -0,0 +1,15 @@
+============================= test session starts ==============================
+platform linux -- Python 3.12.0, pytest-8.0.0
+collected 3 items
+
+src/pkg/tests/test_widget.py::test_alpha PASSED                          [ 33%]
+src/pkg/tests/test_widget.py::test_beta PASSED                           [ 66%]
+src/pkg/tests/test_widget.py::test_gamma FAILED                          [100%]
+
+=================================== FAILURES ===================================
+________________________________ test_gamma ___________________________________
+    assert 1 == 2
+E   assert 1 == 2
+=========================== short test summary info ============================
+FAILED src/pkg/tests/test_widget.py::test_gamma - assert 1 == 2
+========================= 2 passed, 1 failed in 0.12s ==========================
diff --git a/resources_servers/swe_bench/tests/test_app.py b/resources_servers/swe_bench/tests/test_app.py
new file mode 100644
index 0000000000..3e50958ab7
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_app.py
@@ -0,0 +1,95 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+from unittest.mock import MagicMock
+
+import pytest
+
+import resources_servers.swe_bench.tests.test_swe_env  # noqa: F401  — registers fake-swe provider
+from nemo_gym.openai_utils import NeMoGymResponse, NeMoGymResponseCreateParamsNonStreaming
+from nemo_gym.server_utils import ServerClient
+from resources_servers.swe_bench.app import (
+    SweBenchResourcesServer,
+    SweBenchResourcesServerConfig,
+    SweBenchSeedSessionRequest,
+    SweBenchVerifyRequest,
+)
+
+
+@pytest.fixture
+def server() -> SweBenchResourcesServer:
+    return SweBenchResourcesServer(
+        config=SweBenchResourcesServerConfig(
+            host="127.0.0.1",
+            port=12346,
+            entrypoint="app.py",
+            name="swe_bench",
+            sandbox_provider={"fake-swe": {}},
+        ),
+        server_client=MagicMock(spec=ServerClient),
+    )
+
+
+def _sample_row() -> dict:
+    inst = {
+        "instance_id": "astropy__astropy-12907",
+        "base_commit": "abc123",
+        "test_patch": "",
+        "FAIL_TO_PASS": '["tests/test_x.py::a"]',
+        "PASS_TO_PASS": '["tests/test_x.py::b"]',
+    }
+    meta = {
+        "instance_id": "astropy__astropy-12907",
+        "dataset_name": "princeton-nlp/SWE-bench_Verified",
+        "split": "test",
+        "problem_statement": "Fix the bug.",
+        "instance_dict": json.dumps(inst),
+    }
+    return {
+        "responses_create_params": NeMoGymResponseCreateParamsNonStreaming(
+            input=[{"role": "user", "content": "Fix the bug."}],
+            metadata=meta,
+        ),
+        "verifier_metadata": meta,
+    }
+
+
+@pytest.mark.asyncio
+async def test_seed_session_agent_in_env(server: SweBenchResourcesServer) -> None:
+    body = SweBenchSeedSessionRequest(**_sample_row())
+    resp = await server.seed_session(body)
+    assert resp.environment == "swe_bench"
+    assert resp.placement.topology == "agent_in_env"
+    assert resp.sandbox.spec["image"].startswith("swebench/")
+    assert resp.task.task_id == "astropy__astropy-12907"
+    assert resp.task.harness_family == "swe-bench"
+    assert resp.task.dataset_name == "princeton-nlp/SWE-bench_Verified"
+    assert resp.verifier_metadata["instance_id"] == "astropy__astropy-12907"
+
+
+@pytest.mark.asyncio
+async def test_verify_empty_patch(server: SweBenchResourcesServer) -> None:
+    row = _sample_row()
+    row["verifier_metadata"] = {**row["verifier_metadata"], "model_patch": ""}
+    body = SweBenchVerifyRequest(
+        **row,
+        response=NeMoGymResponse(
+            id="r1",
+            created_at=0,
+            model="m",
+            object="response",
+            output=[],
+            parallel_tool_calls=False,
+            tool_choice="auto",
+            tools=[],
+        ),
+    )
+    resp = await server.verify(body)
+    assert resp.task_id == "astropy__astropy-12907"
+    assert resp.environment == "swe_bench"
+    assert resp.reward == 0.0
+    assert resp.patch_exists is False
+    assert resp.resolved is False
diff --git a/resources_servers/swe_bench/tests/test_flat_eval.py b/resources_servers/swe_bench/tests/test_flat_eval.py
new file mode 100644
index 0000000000..35a00fc580
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_flat_eval.py
@@ -0,0 +1,594 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the opt-in flat (host-graded) eval mode of the nested families.
+
+The suite has two layers:
+
+* Parser unit tests on recorded fixture logs cover the SWE-bench eval-script log
+  parser (``parse_eval_log``) on a success log, a failure log, the bad-code logs
+  (patch-apply-failed / timeout), a no-markers log, and the
+  output-outside-markers fallback. The fixtures use the
+  ``>>>>> Start/End Test Output`` shape the SWE-bench eval script emits.
+
+* Flat ``run_eval`` and grade tests use a FakeSandbox to drive the flat path of
+  both nested harnesses (``swe-bench``, ``r2e-gym``) end-to-end. A scripted provider
+  returns a fixture log, and the tests assert that ``resolved`` is computed from
+  ``FAIL_TO_PASS`` / ``PASS_TO_PASS``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from pathlib import Path
+
+import pytest
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses import flat_eval
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+_FIXTURES = Path(__file__).parent / "fixtures" / "flat_eval"
+
+
+def _fixture(name: str) -> str:
+    """Read a recorded fixture log by name.
+
+    Fixtures are stored with a ``.txt`` suffix. A caller may pass the real
+    ``.txt`` name, or the same name with a ``.log`` suffix, which falls back to ``.txt``.
+
+    Args:
+        name: The fixture file name, with either a ``.log`` or ``.txt`` suffix.
+
+    Returns:
+        The fixture file contents as text.
+    """
+    path = _FIXTURES / name
+    if not path.exists() and path.suffix == ".log":
+        path = path.with_suffix(".txt")
+    return path.read_text()
+
+
+# ---- parser: recorded fixture logs (CI) -------------------------------------
+
+
+def test_parse_success_log_all_pass():
+    """A success log parses to a status map with the expected passed and skipped tests."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("resolved_success.log"))
+    assert applied is True
+    assert status_map == {
+        "tests/test_ext_autodoc.py::test_format_signature": "PASSED",
+        "tests/test_ext_autodoc.py::test_autodoc_inherited": "PASSED",
+        "tests/test_ext_autodoc.py::test_autodoc_exclude_members": "PASSED",
+        "tests/test_ext_autodoc.py::test_optional_feature": "SKIPPED",
+    }
+    assert sorted(flat_eval.passed_tests(status_map)) == [
+        "tests/test_ext_autodoc.py::test_autodoc_exclude_members",
+        "tests/test_ext_autodoc.py::test_autodoc_inherited",
+        "tests/test_ext_autodoc.py::test_format_signature",
+    ]
+
+
+def test_parse_failure_log_strips_failed_reason():
+    """A failure log parses with the failure reason stripped down to the node id."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("unresolved_failure.log"))
+    assert applied is True
+    # The "FAILED <id> - <reason>" line keeps only the node id.
+    assert status_map["tests/test_ext_autodoc.py::test_format_signature"] == "FAILED"
+    assert "tests/test_ext_autodoc.py::test_autodoc_inherited" in flat_eval.passed_tests(status_map)
+
+
+def test_parse_apply_patch_failed_is_untrusted():
+    """A patch-apply-failed log yields an empty status map and patch_applied False."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("apply_patch_failed.log"))
+    assert status_map == {}
+    assert applied is False
+
+
+def test_parse_timeout_is_untrusted():
+    """A timeout log yields an empty status map and patch_applied False."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("tests_timeout.log"))
+    assert status_map == {}
+    assert applied is False
+
+
+def test_parse_no_markers_is_untrusted():
+    """A log with no test-output markers yields an empty status map and patch_applied False."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("no_markers.log"))
+    assert status_map == {}
+    assert applied is False
+
+
+def test_parse_fallback_outside_markers():
+    """Per-test lines appearing after the End marker are recovered by the whole-log fallback."""
+    status_map, applied = flat_eval.parse_eval_log(_fixture("fallback_outside_markers.log"))
+    assert applied is True
+    assert len(flat_eval.passed_tests(status_map)) == 3
+
+
+def test_parse_duplicate_node_last_status_wins():
+    """For a duplicated node id the last reported status wins.
+
+    A node first reported FAILED then re-reported PASSED (e.g. via a rerun plugin)
+    ends up PASSED, and vice versa.
+    """
+    log = "\n".join(
+        [
+            flat_eval.APPLY_PATCH_PASS,
+            flat_eval.START_TEST_OUTPUT,
+            "FAILED tests/test_x.py::test_flaky",
+            "PASSED tests/test_x.py::test_flaky",
+            "PASSED tests/test_x.py::test_regressed",
+            "FAILED tests/test_x.py::test_regressed",
+            flat_eval.END_TEST_OUTPUT,
+        ]
+    )
+    status_map, applied = flat_eval.parse_eval_log(log)
+    assert applied is True
+    # Last line wins for each node, not the first.
+    assert status_map["tests/test_x.py::test_flaky"] == "PASSED"
+    assert status_map["tests/test_x.py::test_regressed"] == "FAILED"
+    assert flat_eval.passed_tests(status_map) == ["tests/test_x.py::test_flaky"]
+
+
+def test_parse_xfail_counts_as_pass():
+    """An XFAIL node counts as a passed test."""
+    log = "\n".join(
+        [
+            flat_eval.APPLY_PATCH_PASS,
+            flat_eval.START_TEST_OUTPUT,
+            "XFAIL tests/test_x.py::test_known_bug",
+            "PASSED tests/test_x.py::test_ok",
+            flat_eval.END_TEST_OUTPUT,
+        ]
+    )
+    status_map, applied = flat_eval.parse_eval_log(log)
+    assert applied is True
+    assert set(flat_eval.passed_tests(status_map)) == {
+        "tests/test_x.py::test_known_bug",
+        "tests/test_x.py::test_ok",
+    }
+
+
+# ---- flat_grade over parsed fixtures (CI) -----------------------------------
+
+
+def _task(benchmark: str = "swe-bench", **overrides) -> SweTask:
+    """Build a SweTask with sensible defaults, overridable per keyword.
+
+    Args:
+        benchmark: The benchmark name for the task.
+        **overrides: Field overrides merged onto the default task fields.
+
+    Returns:
+        A SweTask configured for the given benchmark.
+    """
+    base = dict(
+        instance_id="repo__inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        model_patch="diff --git a/x b/x\n",
+        fail_to_pass=["tests/test_ext_autodoc.py::test_format_signature"],
+        pass_to_pass=["tests/test_ext_autodoc.py::test_autodoc_inherited"],
+        benchmark=benchmark,
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+def _flat_artifacts(log: str) -> EvalArtifacts:
+    """Wrap an eval log in flat-eval EvalArtifacts.
+
+    Args:
+        log: The eval-script log text.
+
+    Returns:
+        EvalArtifacts carrying the log with a clean (non-error) flat raw payload.
+    """
+    return EvalArtifacts(test_output=log, return_code=0, patch_applied=True, raw={"error_type": None, "flat": True})
+
+
+def test_flat_grade_resolved_on_success():
+    """Flat grading resolves a success log with reward 1.0."""
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("resolved_success.log")))
+    assert report.resolved is True
+    assert report.patch_applied is True
+    assert report.patch_exists is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_unresolved_on_failure():
+    """Flat grading leaves a failure log unresolved with reward 0.0."""
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("unresolved_failure.log")))
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_unresolved_on_apply_failed():
+    """A failed patch apply grades as a legitimate unresolved, not an infra mask."""
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("apply_patch_failed.log")))
+    assert report.resolved is False
+    assert report.patch_applied is False
+    assert report.error_kind is None
+    assert reward_from_report(report) == 0.0
+
+
+# ---- consistency of flat grading --------------------------------------------
+#
+# Flat grading takes ``resolved`` straight from the parser's verdict (all F2P +
+# all P2P passed) and never re-gates it on ``patch_applied``. The parser's
+# ``log_patch_applied`` flag never changes ``resolved`` relative to a pure
+# ``compute_resolved`` verdict: whenever ``parse_eval_log`` reports
+# ``patch_applied=False`` it also returns an empty status map, so
+# ``compute_resolved`` already yields False. These tests lock in that invariant
+# so a future edit cannot reintroduce a divergent gate.
+
+
+@pytest.mark.parametrize(
+    "fixture_name",
+    [
+        "resolved_success.log",
+        "unresolved_failure.log",
+        "apply_patch_failed.log",
+        "tests_timeout.log",
+        "no_markers.log",
+        "fallback_outside_markers.log",
+    ],
+)
+def test_flat_grade_resolved_matches_ungated_compute_resolved(fixture_name):
+    """``flat_grade``'s resolved verdict agrees with a bare ``compute_resolved`` over the parsed passed-set.
+
+    The patch-applied gate is redundant and never flips the verdict True<->False.
+
+    Args:
+        fixture_name: The recorded fixture log to parse and grade.
+    """
+    from resources_servers.swe_bench.harness import compute_resolved
+
+    task = _task()
+    log = _fixture(fixture_name)
+    status_map, _applied = flat_eval.parse_eval_log(log)
+    ungated = compute_resolved(
+        fail_to_pass=task.fail_to_pass,
+        pass_to_pass=task.pass_to_pass,
+        passed=flat_eval.passed_tests(status_map),
+    )
+    report = flat_eval.flat_grade(task, _flat_artifacts(log))
+    assert report.resolved is ungated
+
+
+@pytest.mark.parametrize(
+    "bad_code_attr",
+    ["APPLY_PATCH_FAIL", "RESET_FAILED", "TESTS_ERROR", "TESTS_TIMEOUT"],
+)
+def test_parse_eval_log_bad_code_empties_status_map_even_with_status_lines(bad_code_attr):
+    """A bad code forces an empty status map and patch_applied False even with per-test status lines.
+
+    This is what makes the flat_grade patch-applied gate redundant: no path yields
+    patch_applied=False together with a non-empty status map.
+
+    Args:
+        bad_code_attr: Name of the bad-code marker attribute on ``flat_eval``.
+    """
+    bad_code = getattr(flat_eval, bad_code_attr)
+    log = "\n".join(
+        [
+            bad_code,
+            flat_eval.START_TEST_OUTPUT,
+            "PASSED tests/test_ext_autodoc.py::test_format_signature",
+            "PASSED tests/test_ext_autodoc.py::test_autodoc_inherited",
+            flat_eval.END_TEST_OUTPUT,
+        ]
+    )
+    status_map, applied = flat_eval.parse_eval_log(log)
+    assert applied is False
+    assert status_map == {}
+    # And it grades as a legitimate unresolved (not an infra mask): error_kind
+    # stays None, resolved False -> reward 0.0, matching the flat families.
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+    assert report.resolved is False
+    assert report.error_kind is None
+    assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_resolved_does_not_gate_on_artifact_patch_applied():
+    """Flat ``resolved`` is the parser's verdict only and ignores the artifact's patch_applied flag.
+
+    Even if the EvalArtifacts carries patch_applied False (e.g. the model patch
+    did not cleanly apply), a passing eval log still resolves, since grading is
+    based on the tests rather than the apply status.
+    """
+    artifacts = EvalArtifacts(
+        test_output=_fixture("resolved_success.log"),
+        return_code=0,
+        patch_applied=False,
+        raw={"error_type": None, "flat": True},
+    )
+    report = flat_eval.flat_grade(_task(), artifacts)
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_neutral_skipped_required_test_is_not_a_failure():
+    """A required test reported SKIPPED is neutral (excluded), not a failure.
+
+    This mirrors swebench's ``get_eval_tests_report`` + ``get_resolution_status``: a
+    required test counts as a failure only when absent or FAILED/ERROR. A neutral
+    status (SKIPPED/XPASS) is excluded from both the success and failure tallies, so a
+    run whose only "non-pass" required test is SKIPPED still resolves. A bare
+    ``passed``-set membership check (the prior behavior) would have treated the
+    SKIPPED test as a failure and wrongly graded it unresolved.
+    """
+    log = "\n".join(
+        [
+            flat_eval.APPLY_PATCH_PASS,
+            flat_eval.START_TEST_OUTPUT,
+            "PASSED tests/test_ext_autodoc.py::test_format_signature",
+            "SKIPPED tests/test_ext_autodoc.py::test_autodoc_inherited",
+            flat_eval.END_TEST_OUTPUT,
+        ]
+    )
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+    # F2P passed; the SKIPPED P2P is neutral (excluded) -> zero failures -> resolved.
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_flat_grade_absent_required_test_is_a_failure():
+    """A required test absent from the status map is a failure (not neutral).
+
+    Per swebench's ``test_failed`` (``case not in sm``), an absent required test counts
+    as a failure, so the run must grade unresolved.
+    """
+    log = "\n".join(
+        [
+            flat_eval.APPLY_PATCH_PASS,
+            flat_eval.START_TEST_OUTPUT,
+            "PASSED tests/test_ext_autodoc.py::test_format_signature",
+            flat_eval.END_TEST_OUTPUT,
+        ]
+    )
+    # P2P (test_autodoc_inherited) is absent from the log -> failure -> unresolved.
+    report = flat_eval.flat_grade(_task(), _flat_artifacts(log))
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_masks_infra_error():
+    """Flat grading masks an infra timeout to reward 0.0 with a timeout error kind."""
+    artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout", "flat": True})
+    report = flat_eval.flat_grade(_task(), artifacts)
+    assert report.error_kind == "timeout"
+    assert reward_from_report(report) == 0.0
+
+
+def test_flat_grade_unbuildable_eval_script_is_unmasked_unresolved():
+    """An unbuildable / missing eval script grades UNMASKED unresolved (reward 0), not eval_error.
+
+    As on the main branch, only genuine sandbox/timeout infra failures are masked; an empty/unbuildable eval
+    spec produces no test markers and so grades as a legitimate unresolved (``error_kind`` None).
+    """
+    artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "eval_error", "flat": True})
+    report = flat_eval.flat_grade(_task(), artifacts)
+    assert report.error_kind is None
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+# ---- gating (CI) ------------------------------------------------------------
+
+
+def test_flat_eval_enabled_harness_flag():
+    """The harness-level flat-eval flag enables flat eval."""
+    assert flat_eval.flat_eval_enabled(True, _task()) is True
+
+
+def test_flat_eval_enabled_task_metadata():
+    """Per-task ``flat_eval`` metadata enables flat eval."""
+    assert flat_eval.flat_eval_enabled(False, _task(metadata={"flat_eval": True})) is True
+
+
+def test_flat_eval_disabled_by_default():
+    """Flat eval is disabled when neither the harness flag nor task metadata enables it."""
+    assert flat_eval.flat_eval_enabled(False, _task()) is False
+
+
+def test_swebench_supports_provider_gating():
+    """The swe-bench harness is host-graded (flat), so it runs on any exec-capable provider."""
+    harness = SweBenchHarness("swe-bench")
+    assert harness.supports_provider("docker") is True
+    assert harness.supports_provider("apptainer") is True
+    assert harness.supports_provider("opensandbox") is True
+    assert harness.grade_strategy == "flat-host-grade"
+
+
+def test_r2egym_supports_provider_gating():
+    """The r2e-gym harness is host-graded (flat), so it runs on any exec-capable provider."""
+    harness = R2EGymHarness()
+    assert harness.supports_provider("docker") is True
+    assert harness.supports_provider("apptainer") is True
+    assert harness.supports_provider("opensandbox") is True
+    assert harness.grade_strategy == "flat-host-grade"
+
+
+# ---- flat run_eval end-to-end via FakeSandbox (CI) --------------------------
+
+
+class _FakeFlatProvider:
+    """Scripted provider: ``bash eval.sh ...`` streams a fixture log; ``cat`` echoes it."""
+
+    name = "fake-flat-eval"
+
+    def __init__(self, *, log_text="", run_rc=0, error_type=None, stream_empty=False, **_):
+        """Configure the scripted flat-eval provider's responses.
+
+        Args:
+            log_text: The eval-script log text returned by the run and ``cat``.
+            run_rc: Return code returned for the eval-script run.
+            error_type: Optional error type attached to the run result.
+            stream_empty: When True, the eval-script run streams empty stdout so
+                the harness falls back to reading the tee'd log file.
+            **_: Ignored extra keyword arguments.
+        """
+        self._log_text = log_text
+        self._run_rc = run_rc
+        self._error_type = error_type
+        self._stream_empty = stream_empty
+        self.commands: list[str] = []
+        self.uploaded: dict[str, str] = {}
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        self.commands.append(command)
+        if command.startswith("cat "):
+            return SandboxExecResult(stdout=self._log_text, stderr="", return_code=0)
+        # The eval script run.
+        stdout = "" if self._stream_empty else self._log_text
+        return SandboxExecResult(stdout=stdout, stderr="", return_code=self._run_rc, error_type=self._error_type)
+
+    async def upload_file(self, handle, local_path, remote_path):
+        try:
+            with open(local_path, encoding="utf-8") as fh:
+                self.uploaded[remote_path] = fh.read()
+        except OSError:
+            self.uploaded[remote_path] = ""
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-flat-eval", _FakeFlatProvider, override=True)
+
+
+def _drive_flat(harness, task, *, log_text, run_rc=0, error_type=None, stream_empty=False):
+    """Drive materialize -> run_eval -> grade for a flat harness via the scripted provider.
+
+    Args:
+        harness: The flat-capable harness under test.
+        task: The SweTask to evaluate.
+        log_text: The eval-script log text the provider returns.
+        run_rc: Return code returned for the eval-script run.
+        error_type: Optional error type attached to the run result.
+        stream_empty: When True, the run streams empty stdout so the harness falls
+            back to reading the tee'd log file.
+
+    Returns:
+        A tuple of the graded report, the EvalArtifacts, and the provider instance.
+    """
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def _go():
+        provider = {
+            "fake-flat-eval": {
+                "log_text": log_text,
+                "run_rc": run_rc,
+                "error_type": error_type,
+                "stream_empty": stream_empty,
+            }
+        }
+        env = await AsyncSweEnvironment.start(provider, harness.build_spec(task))
+        try:
+            await harness.materialize(env, task)
+            artifacts = await harness.run_eval(env, task)
+            return harness.grade(task, artifacts), artifacts, env.sandbox._provider
+        finally:
+            await env.cleanup()
+
+    return asyncio.run(_go())
+
+
+def test_swebench_flat_run_eval_resolved():
+    """The swe-bench flat path resolves a success run and uploads the eval script."""
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"eval_script": "echo running", "flat_eval": True})
+    report, artifacts, provider = _drive_flat(harness, task, log_text=_fixture("resolved_success.log"))
+    assert artifacts.raw["flat"] is True
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+    # The eval script was uploaded into the sandbox.
+    assert provider.uploaded.get(flat_eval.EVAL_SCRIPT_PATH, "").startswith("echo running")
+
+
+def test_swebench_flat_run_eval_unresolved():
+    """The swe-bench flat path leaves a failure run unresolved."""
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"eval_script": "echo running"})
+    report, _artifacts, _ = _drive_flat(harness, task, log_text=_fixture("unresolved_failure.log"))
+    assert report.resolved is False
+
+
+def test_swebench_flat_run_eval_stream_empty_uses_log_file():
+    """When streamed output is empty, run_eval reads back the tee'd log file."""
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"eval_script": "echo running"})
+    report, _artifacts, provider = _drive_flat(
+        harness, task, log_text=_fixture("resolved_success.log"), stream_empty=True
+    )
+    assert any(cmd.startswith("cat ") for cmd in provider.commands)
+    assert report.resolved is True
+
+
+def test_swebench_flat_run_eval_masks_sandbox_error():
+    """The swe-bench flat path masks a sandbox error reported by the run."""
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"eval_script": "echo running"})
+    report, artifacts, _ = _drive_flat(harness, task, log_text="", run_rc=1, error_type="sandbox")
+    assert artifacts.raw["error_type"] == "sandbox"
+    assert report.error_kind == "sandbox"
+
+
+def test_swebench_flat_run_eval_missing_script_is_unmasked_unresolved():
+    """A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error.
+
+    ``flat_run_eval`` still tags the artifact ``error_type == "eval_error"`` (so callers can log
+    it), but grading no longer masks on it: as on the main branch, only genuine sandbox/timeout infra failures
+    are masked, and an empty spec simply produces no test markers and grades unresolved.
+    """
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={})  # no eval_script
+    report, artifacts, _ = _drive_flat(harness, task, log_text="")
+    assert artifacts.raw["error_type"] == "eval_error"
+    assert report.error_kind is None
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_r2egym_flat_run_eval_resolved_via_task_metadata():
+    """The r2e-gym flat path reaches a resolved run from per-task metadata that sets only ``eval_script``."""
+    harness = R2EGymHarness()
+    task = _task(benchmark="r2e-gym", instance_id="r2e__pkg-1", metadata={"eval_script": "echo run"})
+    report, artifacts, _ = _drive_flat(harness, task, log_text=_fixture("resolved_success.log"))
+    assert artifacts.raw["flat"] is True
+    assert report.resolved is True
diff --git a/resources_servers/swe_bench/tests/test_lifecycle.py b/resources_servers/swe_bench/tests/test_lifecycle.py
new file mode 100644
index 0000000000..dbde539ada
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_lifecycle.py
@@ -0,0 +1,164 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Sandbox lifecycle (``acquire_sandbox``) and ``verify_task`` happy/timeout/empty paths.
+
+These tests cover always-teardown on context exit and the fresh-sandbox verify
+sequence: the resolved case, the empty-patch fast path, and the eval-timeout case.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+import pytest
+
+import resources_servers.swe_bench.harnesses  # noqa: F401  (register harnesses)
+from nemo_gym.sandbox import SandboxExecResult, SandboxHandle, SandboxStatus
+from resources_servers.swe_bench.harness import SweTask
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+from resources_servers.swe_bench.verify_task import verify_task
+
+
+class _CountingProvider:
+    """Provider instance passed directly so the test can count create/close calls.
+
+    Args:
+        exec_sleep: Seconds to sleep inside each ``exec`` call, used to simulate a
+            slow evaluation that triggers the eval timeout.
+        test_output: Stdout returned for pytest commands. The trailing-status
+            pytest format is the shape the test parser recognizes, and the ``.py``
+            path normalizes to the F2P id in ``_task``.
+    """
+
+    name = "fake-life"
+
+    def __init__(self, *, exec_sleep=0.0, test_output="tests/test_x.py::a PASSED\n"):
+        self.create_count = 0
+        self.close_count = 0
+        self._exec_sleep = exec_sleep
+        self._test_output = test_output
+
+    async def create(self, spec):
+        self.create_count += 1
+        return SandboxHandle(
+            sandbox_id=f"sb-{self.create_count}", provider_name=self.name, raw={"workdir": spec.workdir}
+        )
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        if self._exec_sleep:
+            await asyncio.sleep(self._exec_sleep)
+        if "pytest" in command:
+            return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, *a, **k):
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        self.close_count += 1
+
+    async def aclose(self):
+        return None
+
+
+def _task(**kw) -> SweTask:
+    """Build a SweTask with sensible defaults, overridable per keyword.
+
+    Args:
+        **kw: Field overrides merged onto the default task fields.
+
+    Returns:
+        A SweTask configured for the swe-bench-ext benchmark.
+    """
+    base = dict(
+        instance_id="inst-1",
+        image="img:tag",
+        base_commit="HEAD",
+        test_command="python -m pytest -rA -q",
+        model_patch="diff --git a/x b/x\n",
+        test_framework="pytest",
+        fail_to_pass=["tests/test_x.py::a"],
+        benchmark="swe-bench-ext",
+    )
+    base.update(kw)
+    return SweTask(**base)
+
+
+# ---- acquire_sandbox: starts an env, ALWAYS stops it ------------------------
+
+
+def test_acquire_sandbox_starts_and_cleans_up():
+    """``acquire_sandbox`` creates one sandbox and tears it down on normal exit."""
+    provider = _CountingProvider()
+
+    async def run():
+        spec = SweBenchExtHarness().build_spec(_task())
+        async with acquire_sandbox(provider, spec, instance_id="inst-1") as env:
+            assert env.sandbox_id is not None
+        return provider.create_count, provider.close_count
+
+    created, closed = asyncio.run(run())
+    assert created == 1
+    assert closed == 1  # torn down on normal exit
+
+
+def test_acquire_sandbox_cleans_up_on_exception():
+    """``acquire_sandbox`` tears down the sandbox even when the body raises."""
+    provider = _CountingProvider()
+
+    async def run():
+        spec = SweBenchExtHarness().build_spec(_task())
+        with pytest.raises(RuntimeError):
+            async with acquire_sandbox(provider, spec) as env:
+                assert env.sandbox_id is not None
+                raise RuntimeError("boom")
+
+    asyncio.run(run())
+    assert provider.close_count == 1  # torn down even on exception
+
+
+# ---- verify_task: resolved / empty-patch fast path / eval-timeout mask -------
+
+
+def test_verify_task_resolved_in_fresh_sandbox():
+    """``verify_task`` resolves a passing task in a freshly created sandbox."""
+    provider = _CountingProvider()
+    report = asyncio.run(verify_task(provider, _task()))
+    assert report.resolved is True
+    assert provider.create_count == 1
+    assert provider.close_count == 1
+
+
+def test_verify_task_empty_patch_fast_path_no_create():
+    """An empty model patch short-circuits to unresolved without creating a sandbox."""
+    provider = _CountingProvider()
+    report = asyncio.run(verify_task(provider, _task(model_patch="")))
+    assert report.patch_exists is False
+    assert report.resolved is False
+    assert provider.create_count == 0  # no sandbox spun up for an empty patch
+
+
+def test_verify_task_eval_timeout_masks():
+    """An evaluation that exceeds the eval timeout is masked as an eval_timeout error."""
+    provider = _CountingProvider(exec_sleep=0.5)
+    report = asyncio.run(verify_task(provider, _task(), eval_timeout_s=0.05))
+    assert report.error_kind == "eval_timeout"
diff --git a/resources_servers/swe_bench/tests/test_model_endpoint.py b/resources_servers/swe_bench/tests/test_model_endpoint.py
new file mode 100644
index 0000000000..c55232d863
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_model_endpoint.py
@@ -0,0 +1,57 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the model-server egress primitive that resolves a model endpoint per provider."""
+
+from __future__ import annotations
+
+import pytest
+
+from resources_servers.swe_bench.self_drive import ModelEgressUnavailable, ModelEndpoint, resolve
+
+
+def test_apptainer_uses_host_loopback_by_default():
+    """Apptainer resolves to the host loopback base URL when none is configured."""
+    ep = resolve("apptainer", {"model": "qwen"})
+    assert ep.base_url == "http://127.0.0.1:8000/v1"
+    assert ep.model == "qwen"
+
+
+def test_docker_uses_configured_base_when_present():
+    """Docker uses the explicitly configured base URL."""
+    ep = resolve("docker", {"base_url": "http://10.0.0.5:8000/v1"})
+    assert ep.base_url == "http://10.0.0.5:8000/v1"
+
+
+def test_opensandbox_requires_service_url():
+    """Opensandbox raises when no reachable service URL is supplied."""
+    with pytest.raises(ModelEgressUnavailable):
+        resolve("opensandbox", {"base_url": "http://127.0.0.1:8000/v1"})
+
+
+def test_opensandbox_with_service_url_ok():
+    """Opensandbox resolves to the provided service URL."""
+    ep = resolve("opensandbox", {"model": "m"}, opensandbox_service_url="http://gym-model.svc.cluster.local/v1")
+    assert ep.base_url == "http://gym-model.svc.cluster.local/v1"
+
+
+def test_to_sandbox_env_is_minimal():
+    """The sandbox env carries the base URL, API key, and model name, and never the global config dict."""
+    ak_value = "abc-test"
+    env = ModelEndpoint(base_url="http://h/v1", api_key=ak_value, model="m").to_sandbox_env()
+    assert env["OPENAI_BASE_URL"] == "http://h/v1"
+    assert env["OPENAI_API_KEY"] == ak_value
+    assert env["NEMO_GYM_MODEL"] == "m"
+    # never leaks a full global-config dict
+    assert "NEMO_GYM_CONFIG_DICT" not in env
diff --git a/resources_servers/swe_bench/tests/test_nv_internal.py b/resources_servers/swe_bench/tests/test_nv_internal.py
new file mode 100644
index 0000000000..3b1f049f2d
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_nv_internal.py
@@ -0,0 +1,547 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the nv-internal-1 harness, driven by a FakeSandbox provider.
+
+nv-internal-1 is flat + host-graded, so it runs on any exec-capable provider.
+The scripted provider returns the parsing_script ``output.json`` report on the
+``cat /root/output.json`` hop; grading is a pure host-side parse.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.nv_internal import (
+    NV_DEFAULT_WORKDIR,
+    NVInternalHarness,
+    _coerce_test_list,
+    _format_test_files,
+    _nv_workdir,
+    _parse_dockerfile_env,
+    _resolve_required_tests,
+    parse_passed_tests,
+)
+from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+
+class _FakeProvider:
+    """Scripted provider: ``cat /root/output.json`` returns a canned report."""
+
+    name = "fake-nv"
+
+    def __init__(self, *, report="", apply_rc=0, **_):
+        """Configure the scripted provider's responses.
+
+        Args:
+            report: JSON report stdout returned for ``cat /root/output.json``.
+            apply_rc: Return code returned for ``git apply`` commands.
+            **_: Ignored extra keyword arguments.
+        """
+        self._report = report
+        self._apply_rc = apply_rc
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        if "cat /root/output.json" in command:
+            return SandboxExecResult(stdout=self._report, stderr="", return_code=0)
+        if "git apply" in command:
+            return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, *a, **k):
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-nv", _FakeProvider, override=True)
+
+
+class _RecordingProvider:
+    """Provider that records exec ``cwd`` per command and captures uploads.
+
+    Uploads are captured as ``{target_path: content}`` by reading the temp file
+    that ``write_text`` hands to ``upload_file``; execs are captured as a list of
+    ``(command, cwd)`` so tests can assert which directory each hop ran in.
+    """
+
+    name = "fake-nv-rec"
+
+    def __init__(self, *, report="", **_):
+        """Configure the recording provider's canned report.
+
+        Args:
+            report: JSON report stdout returned for ``cat /root/output.json``.
+            **_: Ignored extra keyword arguments.
+        """
+        self._report = report
+        self.execs: list[tuple[str, str | None]] = []
+        self.uploads: dict[str, str] = {}
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        self.execs.append((command, cwd))
+        if "cat /root/output.json" in command:
+            return SandboxExecResult(stdout=self._report, stderr="", return_code=0)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, handle, source_path, target_path):
+        with open(source_path, encoding="utf-8") as fh:
+            self.uploads[target_path] = fh.read()
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-nv-rec", _RecordingProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+    """Build an nv-internal-1 SweTask with sensible defaults, overridable per keyword.
+
+    Args:
+        **overrides: Field overrides merged onto the default task fields.
+
+    Returns:
+        A SweTask configured for the nv-internal-1 benchmark.
+    """
+    base = dict(
+        instance_id="nv-inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/app",
+        model_patch="diff --git a/x b/x\n",
+        fail_to_pass=["pkg/test_x.py::a"],
+        pass_to_pass=["pkg/test_x.py::b"],
+        benchmark="nv-internal-1",
+        metadata={
+            "run_script": "echo run\n",
+            "parsing_script": "import sys\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+def _report(*passed, failed=()):
+    """Build a JSON test report with the given passed and failed test names.
+
+    Args:
+        *passed: Names of tests reported as PASSED.
+        failed: Names of tests reported as FAILED.
+
+    Returns:
+        The report serialized as a JSON string under a ``tests`` key.
+    """
+    tests = [{"name": name, "status": "PASSED"} for name in passed]
+    tests += [{"name": name, "status": "FAILED"} for name in failed]
+    return json.dumps({"tests": tests})
+
+
+async def _run(provider_cfg, task) -> SweEvalReport:
+    """Drive reset -> materialize -> run_eval -> grade against a scripted provider.
+
+    Args:
+        provider_cfg: Provider configuration mapping for the ``fake-nv`` provider.
+        task: The SweTask to evaluate.
+
+    Returns:
+        The graded SweEvalReport for the run.
+    """
+    harness = NVInternalHarness()
+    env = await AsyncSweEnvironment.start({"fake-nv": provider_cfg}, harness.build_spec(task))
+    try:
+        await harness.reset_repo(env, task)
+        await harness.materialize(env, task)
+        artifacts = await harness.run_eval(env, task)
+    finally:
+        await env.cleanup()
+    return harness.grade(task, artifacts)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_parse_passed_tests():
+    """``parse_passed_tests`` returns only PASSED names and ignores malformed entries."""
+    report = {"tests": [{"name": "a", "status": "PASSED"}, {"name": "b", "status": "FAILED"}]}
+    assert parse_passed_tests(report) == ["a"]
+    assert parse_passed_tests({}) == []
+    # Malformed entries are ignored, not crashed on.
+    assert parse_passed_tests({"tests": ["junk", {"status": "PASSED"}]}) == []
+
+
+def test_format_test_files():
+    """``_format_test_files`` joins list/JSON/CSV inputs into a comma-separated string."""
+    assert _format_test_files(["a", "b"]) == "a,b"
+    assert _format_test_files('["a", "b"]') == "a,b"
+    assert _format_test_files("a,b") == "a,b"
+    assert _format_test_files(None) == ""
+
+
+def test_format_test_files_single_quoted_list():
+    """``_format_test_files`` parses repr-style single-quoted lists.
+
+    Single-quoted lists are not valid JSON, so they are parsed with
+    ``ast.literal_eval``; unparseable bracketed text falls back to the raw string.
+    """
+    assert _format_test_files("['pkg/test_x.py', 'pkg/test_y.py']") == "pkg/test_x.py,pkg/test_y.py"
+    # A single-element single-quoted list.
+    assert _format_test_files("['only.py']") == "only.py"
+    # Unparseable bracketed text falls back to the raw string, not a crash.
+    assert _format_test_files("[not a list") == "[not a list"
+
+
+def test_build_spec():
+    """The nv-internal-1 harness builds a sandbox spec from a task."""
+    harness = NVInternalHarness()
+    assert harness.name == "nv-internal-1"
+    assert harness.grade_strategy == "flat-host-grade"
+    spec = harness.build_spec(_task())
+    assert spec.image == "img:tag"
+    assert spec.workdir == "/app"
+    assert spec.metadata["instance_id"] == "nv-inst-1"
+
+
+def test_supports_any_provider():
+    """The nv-internal-1 harness supports any exec-capable provider."""
+    assert NVInternalHarness().supports_provider("docker") is True
+    assert NVInternalHarness().supports_provider("apptainer") is True
+
+
+def test_grade_masks_on_infra_error():
+    """Grading masks an infra timeout to reward 0.0 and records its error kind."""
+    harness = NVInternalHarness()
+    report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+    assert report.error_kind == "timeout"
+    assert reward_from_report(report) == 0.0
+
+
+def test_grade_masks_on_sandbox_error():
+    """Grading masks a sandbox error to reward 0.0 and records its error kind."""
+    harness = NVInternalHarness()
+    report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "sandbox"}))
+    assert report.error_kind == "sandbox"
+    assert reward_from_report(report) == 0.0
+
+
+def test_grade_empty_report_is_unresolved():
+    """An empty report grades as unresolved."""
+    harness = NVInternalHarness()
+    report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=0, patch_applied=True))
+    assert report.resolved is False
+
+
+def test_grade_malformed_report_is_unresolved():
+    """A malformed (non-JSON) report grades as unresolved."""
+    harness = NVInternalHarness()
+    report = harness.grade(_task(), EvalArtifacts(test_output="not json", return_code=0, patch_applied=True))
+    assert report.resolved is False
+
+
+# ---- full reset -> materialize -> run_eval -> grade -------------------------
+
+
+def test_resolved():
+    """A run with all required tests passing resolves with reward 1.0."""
+    report = _report("pkg/test_x.py::a", "pkg/test_x.py::b")
+    result = asyncio.run(_run({"report": report}, _task()))
+    assert result.patch_applied is True
+    assert result.resolved is True
+    assert reward_from_report(result) == 1.0
+
+
+def test_unresolved_failing_required_test():
+    """A failing fail-to-pass test leaves the run unresolved with reward 0.0."""
+    report = _report("pkg/test_x.py::b", failed=["pkg/test_x.py::a"])
+    result = asyncio.run(_run({"report": report}, _task()))
+    assert result.resolved is False
+    assert reward_from_report(result) == 0.0
+
+
+def test_unresolved_missing_required_test():
+    """A required test missing from the report leaves the run unresolved."""
+    report = _report("pkg/test_x.py::a")
+    result = asyncio.run(_run({"report": report}, _task()))
+    assert result.resolved is False
+
+
+def test_patch_apply_rc_does_not_gate_resolved():
+    """A non-zero patch-apply return code does not gate ``resolved``.
+
+    Grading derives ``resolved`` from the tests alone, so a rejected patch
+    (apply_rc != 0) with all required tests passing is still resolved.
+    """
+    report = _report("pkg/test_x.py::a", "pkg/test_x.py::b")
+    result = asyncio.run(_run({"report": report, "apply_rc": 1}, _task()))
+    assert result.patch_applied is False
+    assert result.resolved is True
+    assert reward_from_report(result) == 1.0
+
+
+# ---- *_select precedence ----------------------------------------------------
+
+
+def test_resolve_required_tests_prefers_select_keys():
+    """``fail_to_pass_select`` / ``pass_to_pass_select`` take precedence over the plain keys."""
+    task = _task(
+        fail_to_pass=["plain::f2p"],
+        pass_to_pass=["plain::p2p"],
+        metadata={
+            "fail_to_pass_select": ["sel::f2p"],
+            "pass_to_pass_select": ["sel::p2p"],
+        },
+    )
+    f2p, p2p = _resolve_required_tests(task)
+    assert f2p == ["sel::f2p"]
+    assert p2p == ["sel::p2p"]
+
+
+def test_resolve_required_tests_falls_back_to_plain_keys():
+    """Without ``*_select`` keys, the plain fail_to_pass / pass_to_pass keys are used."""
+    task = _task(fail_to_pass=["plain::f2p"], pass_to_pass=["plain::p2p"], metadata={})
+    f2p, p2p = _resolve_required_tests(task)
+    assert f2p == ["plain::f2p"]
+    assert p2p == ["plain::p2p"]
+
+
+def test_resolve_required_tests_parses_stringified_select():
+    """A ``*_select`` value given as a repr-style stringified list is parsed."""
+    task = _task(
+        metadata={
+            "fail_to_pass_select": "['sel::f2p']",
+            "pass_to_pass_select": "['sel::p2p']",
+        },
+    )
+    f2p, p2p = _resolve_required_tests(task)
+    assert f2p == ["sel::f2p"]
+    assert p2p == ["sel::p2p"]
+
+
+def test_coerce_test_list():
+    """``_coerce_test_list`` accepts lists and stringified lists, returning [] on bad input."""
+    assert _coerce_test_list(["a", "b"]) == ["a", "b"]
+    assert _coerce_test_list("['a', 'b']") == ["a", "b"]
+    assert _coerce_test_list('["a", "b"]') == ["a", "b"]
+    assert _coerce_test_list("not a list") == []
+    assert _coerce_test_list("[broken") == []
+
+
+def test_resolved_uses_select_tests_end_to_end():
+    """End to end, ``*_select`` precedence resolves a run whose report has only the select tests."""
+    # The report only contains the *_select tests; the plain keys would be unmet.
+    report = _report("sel::f2p", "sel::p2p")
+    task = _task(
+        fail_to_pass=["plain::f2p"],
+        pass_to_pass=["plain::p2p"],
+        metadata={
+            "run_script": "echo run\n",
+            "parsing_script": "import sys\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+            "fail_to_pass_select": ["sel::f2p"],
+            "pass_to_pass_select": ["sel::p2p"],
+        },
+    )
+    result = asyncio.run(_run({"report": report}, task))
+    assert result.resolved is True
+
+
+# ---- dockerfile ENV replay --------------------------------------------------
+
+
+def test_parse_dockerfile_env_equals_and_space_forms():
+    """``_parse_dockerfile_env`` parses both ``ENV K=V`` and ``ENV K V`` forms, skipping non-ENV lines."""
+    task = _task(
+        metadata={
+            "base_dockerfile": "FROM ubuntu\nENV FOO=bar\nENV SPACED  spaced_value\n",
+            "instance_dockerfile": "ENV BAZ = qux\nRUN echo hi\n",
+        },
+    )
+    env = _parse_dockerfile_env(task)
+    assert env["FOO"] == "bar"
+    assert env["SPACED"] == "spaced_value"
+    assert env["BAZ"] == "qux"
+    assert "RUN" not in env
+
+
+def test_parse_dockerfile_env_absent_is_noop():
+    """``_parse_dockerfile_env`` returns an empty mapping when no dockerfile is present."""
+    assert _parse_dockerfile_env(_task(metadata={})) == {}
+
+
+def test_build_spec_injects_dockerfile_env():
+    """``build_spec`` injects dockerfile ENV entries while preserving the existing git env."""
+    task = _task(metadata={"base_dockerfile": "ENV PATH=/custom/bin:$PATH\n"})
+    spec = NVInternalHarness().build_spec(task)
+    # Existing git env preserved; dockerfile ENV injected.
+    assert spec.env["GIT_CONFIG_GLOBAL"] == "/dev/null"
+    assert spec.env["PATH"] == "/custom/bin:$PATH"
+
+
+# ---- dotted script keys are uploaded ----------------------------------------
+
+
+async def _run_recording(task) -> _RecordingProvider:
+    """Drive reset -> materialize -> run_eval with a recording provider.
+
+    Args:
+        task: The SweTask to evaluate.
+
+    Returns:
+        The recording provider, so tests can inspect captured execs and uploads.
+    """
+    provider = _RecordingProvider(report=_report("pkg/test_x.py::a", "pkg/test_x.py::b"))
+    harness = NVInternalHarness()
+    env = await AsyncSweEnvironment.start(provider, harness.build_spec(task))
+    try:
+        await harness.reset_repo(env, task)
+        await harness.materialize(env, task)
+        await harness.run_eval(env, task)
+    finally:
+        await env.cleanup()
+    return provider
+
+
+def test_materialize_reads_dotted_script_keys():
+    """``materialize`` uploads scripts stored under the dotted keys ``run_script.sh`` / ``parsing_script.py``."""
+    task = _task(
+        repo_workdir="/app",
+        metadata={
+            "run_script.sh": "echo DOTTED_RUN\n",
+            "parsing_script.py": "print('DOTTED_PARSE')\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    provider = asyncio.run(_run_recording(task))
+    assert provider.uploads["/root/run_script.sh"] == "echo DOTTED_RUN\n"
+    assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED_PARSE')\n"
+
+
+def test_materialize_dotted_keys_take_precedence_over_extensionless():
+    """When both dotted and extensionless script keys are present, the dotted keys win."""
+    task = _task(
+        repo_workdir="/app",
+        metadata={
+            "run_script.sh": "echo DOTTED\n",
+            "run_script": "echo EXTLESS\n",
+            "parsing_script.py": "print('DOTTED')\n",
+            "parsing_script": "print('EXTLESS')\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    provider = asyncio.run(_run_recording(task))
+    assert provider.uploads["/root/run_script.sh"] == "echo DOTTED\n"
+    assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED')\n"
+
+
+def test_materialize_falls_back_to_extensionless_keys():
+    """When only the extensionless script keys are present, they are used."""
+    task = _task(
+        repo_workdir="/app",
+        metadata={
+            "run_script": "echo EXTLESS_RUN\n",
+            "parsing_script": "print('EXTLESS_PARSE')\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    provider = asyncio.run(_run_recording(task))
+    assert provider.uploads["/root/run_script.sh"] == "echo EXTLESS_RUN\n"
+    assert provider.uploads["/root/parsing_script.py"] == "print('EXTLESS_PARSE')\n"
+
+
+# ---- hops run in /app -------------------------------------------------------
+
+
+def test_nv_workdir_defaults_to_app():
+    """``_nv_workdir`` maps the generic /testbed default (or empty) to /app, honoring pinned paths."""
+    assert _nv_workdir(_task(repo_workdir="/testbed")) == NV_DEFAULT_WORKDIR
+    assert _nv_workdir(_task(repo_workdir="")) == NV_DEFAULT_WORKDIR
+    # A row that pins a non-default workdir is honored.
+    assert _nv_workdir(_task(repo_workdir="/srv/repo")) == "/srv/repo"
+    assert _nv_workdir(_task(repo_workdir="/app")) == "/app"
+
+
+def test_build_spec_workdir_defaults_to_app_for_generic_default():
+    """``build_spec`` rewrites the generic /testbed default workdir to /app."""
+    spec = NVInternalHarness().build_spec(_task(repo_workdir="/testbed"))
+    assert spec.workdir == NV_DEFAULT_WORKDIR
+
+
+def test_all_hops_run_in_app_for_generic_default():
+    """With the generic /testbed default, every reset/apply/run/parse/cat hop runs in /app."""
+    task = _task(
+        repo_workdir="/testbed",
+        metadata={
+            "run_script.sh": "echo run\n",
+            "parsing_script.py": "import sys\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    provider = asyncio.run(_run_recording(task))
+    cwds = {cwd for _, cwd in provider.execs}
+    assert cwds == {NV_DEFAULT_WORKDIR}
+    # Spot-check that the key hops were exercised in /app.
+    by_cwd = {cmd: cwd for cmd, cwd in provider.execs}
+    assert any("git reset --hard" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+    assert any("git apply" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+    assert any("run_script.sh" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+    assert any("parsing_script.py" in cmd and cwd == "/app" for cmd, cwd in provider.execs)
+    assert by_cwd["cat /root/output.json"] == "/app"
+
+
+def test_all_hops_honor_explicit_non_default_workdir():
+    """A row that pins ``repo_workdir`` to a non-default path runs every hop there."""
+    task = _task(
+        repo_workdir="/srv/repo",
+        metadata={
+            "run_script.sh": "echo run\n",
+            "parsing_script.py": "import sys\n",
+            "selected_test_files_to_run": ["pkg/test_x.py"],
+        },
+    )
+    provider = asyncio.run(_run_recording(task))
+    assert {cwd for _, cwd in provider.execs} == {"/srv/repo"}
diff --git a/resources_servers/swe_bench/tests/test_r2egym.py b/resources_servers/swe_bench/tests/test_r2egym.py
new file mode 100644
index 0000000000..b8a671b42a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_r2egym.py
@@ -0,0 +1,185 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the r2e-gym flat (host-graded) harness.
+
+r2e-gym now grades host-side via the shared flat-eval path (the apptainer-only nested
+``run_local_evaluation`` grader was removed when PR #1694 took over the apptainer provider; the
+nested re-wiring is tracked for a follow-up PR). These tests cover provisioning, the agent-phase
+test-hiding command shape, ``reset_repo``, and the flat ``run_eval`` + ``grade`` path against a
+scripted ``_FakeProvider``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness
+
+
+_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n"
+
+
+class _FakeProvider:
+    """Scripted provider: returns a canned eval log for the eval-script run; records uploads."""
+
+    name = "fake-r2egym"
+
+    def __init__(self, *, log_text="", exec_rc=0, **_):
+        self._log_text = log_text
+        self._exec_rc = exec_rc
+        self.uploaded: dict[str, str] = {}
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        rc = 0 if command.startswith("cat ") else self._exec_rc
+        return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc)
+
+    async def upload_file(self, handle, local_path, remote_path):
+        try:
+            with open(local_path, encoding="utf-8") as fh:
+                self.uploaded[remote_path] = fh.read()
+        except OSError:
+            self.uploaded[remote_path] = ""
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-r2egym", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+    """Build an r2e-gym ``SweTask`` with sensible defaults."""
+    base = dict(
+        instance_id="repo__inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        model_patch="diff --git a/x b/x\n",
+        fail_to_pass=["t::a"],
+        pass_to_pass=["t::b"],
+        benchmark="r2e-gym",
+        split="test",
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+def test_harness_identity():
+    harness = R2EGymHarness()
+    assert harness.name == "r2e-gym"
+    assert harness.grade_strategy == "flat-host-grade"
+
+
+def test_build_spec_image_workdir_metadata():
+    spec = R2EGymHarness().build_spec(_task())
+    assert spec.image == "img:tag"
+    assert spec.workdir == "/testbed"
+    assert spec.metadata["harness"] == "r2e-gym"
+
+
+def test_build_spec_truncates_long_instance_id():
+    spec = R2EGymHarness().build_spec(_task(instance_id="x" * 100))
+    assert len(spec.metadata["instance_id"]) == 63
+
+
+def test_supports_provider_any_exec_capable():
+    harness = R2EGymHarness()
+    assert harness.supports_provider("docker") is True
+    assert harness.supports_provider("apptainer") is True
+
+
+def test_hide_eval_tests_commands_shape():
+    commands = R2EGymHarness().hide_eval_tests_commands()
+    assert len(commands) == 3
+    assert all("r2e_tests" in c for c in commands)
+
+
+def test_materialize_writes_patch_diff():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = R2EGymHarness()
+        task = _task()
+        env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+        await harness.materialize(env, task)
+        return env.sandbox._provider
+
+    provider = asyncio.run(run())
+    assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n"
+
+
+def test_run_eval_then_grade_flat_resolved():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = R2EGymHarness()
+        task = _task(metadata={"eval_script": "echo run"})
+        env = await AsyncSweEnvironment.start({"fake-r2egym": {"log_text": _PASSING_LOG}}, harness.build_spec(task))
+        artifacts = await harness.run_eval(env, task)
+        return harness.grade(task, artifacts)
+
+    report = asyncio.run(run())
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_run_eval_missing_eval_script_is_unmasked_unresolved():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    # A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error:
+    # only genuine sandbox/timeout infra failures are masked.
+    async def run():
+        harness = R2EGymHarness()
+        task = _task()
+        env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+        artifacts = await harness.run_eval(env, task)
+        return harness.grade(task, artifacts)
+
+    report = asyncio.run(run())
+    assert report.error_kind is None
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_reset_repo_is_noop():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = R2EGymHarness()
+        task = _task()
+        env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task))
+        await harness.reset_repo(env, task)  # must not raise
+
+    asyncio.run(run())
diff --git a/resources_servers/swe_bench/tests/test_swe_bench_ext.py b/resources_servers/swe_bench/tests/test_swe_bench_ext.py
new file mode 100644
index 0000000000..673689ed6e
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_bench_ext.py
@@ -0,0 +1,491 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the swe-bench-ext harness grading.
+
+These cover two grading behaviors:
+
+* ``grade`` delegates to the vendored lighthouse parser
+  (``parse_and_check_tests``) — so junit-xml parsing, ``normalize_test_id`` plus
+  4-stage fuzzy matching, the 20+ framework dispatch, and the
+  ``::build``/``::compile`` synthetic-PASS injection all drive ``resolved``.
+  Recorded fixture logs (one per parser path) anchor the expectation.
+* ``resolved`` is the parser's verdict only; a failed ``git apply`` is recorded
+  in ``patch_applied`` but never gates ``resolved``.
+
+The harness is flat / host-graded (no nested container), so ``run_eval`` runs
+against the scripted ``_FakeExtProvider`` rather than a real image.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from pathlib import Path
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+
+
+_FIXTURES = Path(__file__).parent / "fixtures" / "swe_bench_ext"
+
+
+def _fixture(name: str) -> str:
+    """Read a recorded fixture log by file name.
+
+    Args:
+        name: The fixture file name under the ``swe_bench_ext`` fixtures dir.
+
+    Returns:
+        str: The fixture file contents.
+    """
+    return (_FIXTURES / name).read_text()
+
+
+def _task(**overrides) -> SweTask:
+    """Build a swe-bench-ext ``SweTask`` with sensible defaults.
+
+    Args:
+        **overrides: Field values overriding the defaults.
+
+    Returns:
+        SweTask: A task populated from the defaults merged with overrides.
+    """
+    base = dict(
+        instance_id="repo__inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        test_command="python -m pytest -rA -q",
+        test_framework="pytest",
+        model_patch="diff --git a/x b/x\n",
+        fail_to_pass=["tests/test_core.py::test_fix_applied"],
+        pass_to_pass=["tests/test_core.py::test_regression_guard"],
+        benchmark="swe-bench-ext",
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+def _artifacts(test_output: str, *, patch_applied: bool = True, error_type=None) -> EvalArtifacts:
+    """Build ``EvalArtifacts`` for a graded run.
+
+    Args:
+        test_output: The captured test transcript handed to the parser.
+        patch_applied: Whether the model patch applied cleanly.
+        error_type: Infrastructure error kind, or None for a clean run.
+
+    Returns:
+        EvalArtifacts: The artifacts passed to ``grade``.
+    """
+    return EvalArtifacts(
+        test_output=test_output,
+        return_code=0,
+        patch_applied=patch_applied,
+        raw={"error_type": error_type},
+    )
+
+
+# --- vendored parser drives resolved ----------------------------------------
+
+
+def test_grade_junit_xml_resolved():
+    """junit-xml parsing + fuzzy id matching resolves a clean F2P/P2P pass."""
+    harness = SweBenchExtHarness()
+    report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml")))
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+    # The parser report is surfaced for inspection.
+    assert report.tests_status["framework"] == "pytest"
+    assert report.tests_status["f2p_passed"] == 1
+    assert report.tests_status["p2p_passed"] == 1
+
+
+def test_grade_junit_xml_unresolved_when_p2p_fails():
+    harness = SweBenchExtHarness()
+    task = _task(pass_to_pass=["tests/test_core.py::test_unrelated_broken"])
+    report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml")))
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_grade_pytest_text_fuzzy_id_match():
+    """Normalized/fuzzy id matching: a ``src/pkg/...py::test`` log id matches a
+    differently delimited expected id via ``normalize_test_id``."""
+    harness = SweBenchExtHarness()
+    task = _task(
+        fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha"],
+        pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+    )
+    report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_grade_pytest_text_unresolved_when_f2p_fails():
+    harness = SweBenchExtHarness()
+    task = _task(
+        fail_to_pass=["src/pkg/tests/test_widget.py::test_gamma"],
+        pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+    )
+    report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+    assert report.resolved is False
+
+
+def test_grade_build_synthetic_pass_injection():
+    """An F2P entry ending in ``::build`` that is absent from the parsed output is
+    injected as PASSED (synthetic build/compile handling)."""
+    harness = SweBenchExtHarness()
+    task = _task(
+        fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha", "mypkg::build"],
+        pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"],
+    )
+    report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt")))
+    assert report.resolved is True
+    assert report.tests_status["fail_to_pass_results"]["mypkg::build"] == "PASSED"
+
+
+def test_grade_non_pytest_framework_go_json():
+    """A non-pytest framework (``go``) dispatches to the go-json parser."""
+    harness = SweBenchExtHarness()
+    task = _task(
+        test_framework="go",
+        fail_to_pass=["github.com/acme/widget::TestAlpha"],
+        pass_to_pass=["github.com/acme/widget::TestBeta"],
+    )
+    report = harness.grade(task, _artifacts(_fixture("go_json.txt")))
+    assert report.resolved is True
+    assert report.tests_status["framework"] == "go"
+
+
+def test_grade_non_pytest_framework_go_json_unresolved():
+    harness = SweBenchExtHarness()
+    task = _task(
+        test_framework="go",
+        fail_to_pass=["github.com/acme/widget::TestGamma"],
+        pass_to_pass=["github.com/acme/widget::TestBeta"],
+    )
+    report = harness.grade(task, _artifacts(_fixture("go_json.txt")))
+    assert report.resolved is False
+
+
+# --- empty framework is passed VERBATIM (NOT coerced to pytest) --------------
+
+
+def test_grade_empty_framework_passed_verbatim_not_coerced_to_pytest():
+    """``test_framework`` is passed through UNCHANGED — an empty framework reaches
+    ``parse_and_check_tests`` as ``""`` and hits the parser's auto-detect path, NOT
+    the pytest junit-xml parser.
+
+    Coercing ``""`` -> ``"pytest"`` would let junit-xml parse and report
+    ``resolved`` for an instance that should auto-detect. We assert the framework
+    reaches the parser verbatim (recorded in ``report.tests_status["framework"]``) and that
+    junit-xml is therefore NOT parsed under an empty framework.
+    """
+    harness = SweBenchExtHarness()
+    task = _task(test_framework="")
+    report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml")))
+    # Framework recorded verbatim — not silently rewritten to "pytest".
+    assert report.tests_status["framework"] == ""
+    # Auto-detect path does not understand junit-xml -> nothing parsed -> unresolved.
+    assert report.tests_status["parsed_count"] == 0
+    assert report.resolved is False
+
+
+def test_grade_empty_framework_uses_autodetect_path():
+    """An empty framework grades via parse_test_output's auto-detect path (TAP /
+    Mocha-Hardhat console) when the instance ships no framework. Here a TAP
+    transcript resolves without any framework hint."""
+    harness = SweBenchExtHarness()
+    tap_output = (
+        "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>\n"
+        "TAP version 13\n"
+        "1..2\n"
+        "ok 1 - test_fix_applied\n"
+        "ok 2 - test_regression_guard\n"
+        "<<<SWE_BENCH_EXT_TEST_OUTPUT_END>>>\n"
+    )
+    task = _task(
+        test_framework="",
+        fail_to_pass=["test_fix_applied"],
+        pass_to_pass=["test_regression_guard"],
+    )
+    report = harness.grade(task, _artifacts(tap_output))
+    assert report.tests_status["framework"] == ""
+    assert report.tests_status["parsed_count"] >= 2
+    assert report.resolved is True
+
+
+def test_run_eval_and_grade_share_framework_value():
+    """run_eval (flag/result-file selection) and grade (parsing) use the SAME
+    framework. With an empty framework, run_eval must NOT inject pytest's
+    ``--junitxml`` flag and must wrap the bare command, and grade must parse under
+    ``""`` — proving the two share ``_resolve_framework`` rather than diverging on a
+    pytest default."""
+    task = _task(test_framework="", test_command="run-my-tests")
+    _, _, provider = _run_eval(task, test_output="", run_cmd="run-my-tests")
+    eval_cmds = [c for c in provider.commands if "run-my-tests" in c]
+    assert eval_cmds, "expected the bare framework command to be wrapped"
+    wrapped = eval_cmds[-1]
+    # Empty framework => default framework config => no output flag, no result file.
+    assert "--junitxml" not in wrapped
+    assert "<<<SWE_BENCH_EXT_RESULT_FILE_START>>>" not in wrapped
+    # The mkdir parent-dir creation is present regardless.
+    assert "mkdir -p /workspace/test-results" in wrapped
+
+
+def test_run_eval_command_less_row_injects_no_default_runner():
+    """A command-less row runs NO test runner, matching main's SweBenchExtDatasetProcessor.
+
+    Main uses ``inst.get("test_command", "")`` verbatim (empty when absent), so a row that
+    ships no command runs nothing and grades unresolved. The harness must not fabricate a
+    ``python -m pytest`` default that would diverge from main by manufacturing results.
+    """
+    task = _task(test_command="", test_framework="")
+    _, _, provider = _run_eval(task, test_output="", run_cmd="__never__")
+    eval_cmds = [c for c in provider.commands if "git apply" not in c and "cat " not in c]
+    wrapped = eval_cmds[-1]
+    assert "pytest" not in wrapped  # no default runner injected
+    # The command slot between the START/END marker echoes is empty (no runner line).
+    assert 'echo "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>"\n\necho "<<<SWE_BENCH_EXT_TEST_OUTPUT_END>>>"' in wrapped
+
+
+# --- patch_applied does not gate resolved -----------------------------------
+
+
+def test_grade_resolved_even_when_patch_apply_failed():
+    """Grading is on tests ONLY; a failed apply is recorded but never flips a
+    tests-passing run to unresolved."""
+    harness = SweBenchExtHarness()
+    report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml"), patch_applied=False))
+    assert report.patch_applied is False
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+# --- infra masking ----------------------------------------------------------
+
+
+def test_grade_masks_on_infra_error():
+    harness = SweBenchExtHarness()
+    report = harness.grade(_task(), _artifacts("", error_type="timeout"))
+    assert report.error_kind == "timeout"
+    assert reward_from_report(report) == 0.0
+
+
+# --- run_eval against a scripted _FakeExtProvider ---------------------------
+
+
+class _FakeExtProvider:
+    """Scripted provider that records git-apply attempts and returns a transcript.
+
+    Args:
+        test_output: The transcript returned for the wrapped eval command.
+        apply_rc: Return code for ``git apply`` commands.
+        run_cmd: Substring identifying the wrapped eval command.
+        git_dir: Directory whose ``.git`` probe succeeds; ``None`` means every
+            probed dir reports a checkout (so the first ladder entry wins).
+    """
+
+    name = "fake-ext"
+
+    def __init__(self, *, test_output="", apply_rc=0, run_cmd="pytest", git_dir=None, **_):
+        self._test_output = test_output
+        self._apply_rc = apply_rc
+        # Marker that identifies the wrapped eval command (defaults to the pytest
+        # command); tests with a custom command pass run_cmd.
+        self._run_cmd = run_cmd
+        # Which directory holds the repo checkout: a ``test -d "<dir>/.git"`` probe
+        # succeeds only for this dir. None => every probed dir reports a checkout
+        # (so the first ladder entry, /testbed, wins).
+        self._git_dir = git_dir
+        self.commands: list[str] = []
+        self.exec_cwds: list[str | None] = []
+        self.uploaded: dict[str, str] = {}
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        self.commands.append(command)
+        self.exec_cwds.append(cwd)
+        if command.startswith("test -d "):
+            # The repo-workdir probe: succeed only for the configured git dir (or any
+            # dir when unconfigured).
+            if self._git_dir is None or f'"{self._git_dir}/.git"' in command:
+                return SandboxExecResult(stdout="", stderr="", return_code=0)
+            return SandboxExecResult(stdout="", stderr="", return_code=1)
+        if "git apply" in command:
+            return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+        if self._run_cmd in command:
+            return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, handle, local_path, remote_path):
+        try:
+            with open(local_path, encoding="utf-8") as fh:
+                self.uploaded[remote_path] = fh.read()
+        except OSError:
+            self.uploaded[remote_path] = ""
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-ext", _FakeExtProvider, override=True)
+
+
+def _run_eval(task: SweTask, *, test_output: str, apply_rc: int = 0, run_cmd: str = "pytest", git_dir=None):
+    """Run the harness through a scripted provider and return the run outputs.
+
+    Args:
+        task: The task to evaluate.
+        test_output: The transcript the provider returns for the eval command.
+        apply_rc: Return code for ``git apply`` commands.
+        run_cmd: Substring identifying the wrapped eval command.
+        git_dir: Directory whose ``.git`` probe succeeds (None => any dir).
+
+    Returns:
+        tuple: The harness, the produced ``EvalArtifacts``, and the provider
+        instance (for command inspection).
+    """
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = SweBenchExtHarness()
+        env = await AsyncSweEnvironment.start(
+            {"fake-ext": {"test_output": test_output, "apply_rc": apply_rc, "run_cmd": run_cmd, "git_dir": git_dir}},
+            harness.build_spec(task),
+        )
+        await harness.materialize(env, task)
+        artifacts = await harness.run_eval(env, task)
+        return harness, artifacts, env.sandbox._provider
+
+    return asyncio.run(run())
+
+
+def test_run_eval_uses_legacy_apply_flags_and_grades_resolved():
+    task = _task()
+    harness, artifacts, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"))
+    apply_cmds = [c for c in provider.commands if "git apply" in c]
+    assert apply_cmds, "expected a git-apply attempt"
+    # The git-apply flag set, with no --3way fallback.
+    assert all("--reject --recount --ignore-space-change --ignore-whitespace" in c for c in apply_cmds)
+    assert all("--3way" not in c for c in apply_cmds)
+    assert artifacts.patch_applied is True
+    report = harness.grade(task, artifacts)
+    assert report.resolved is True
+
+
+def test_run_eval_apply_failure_still_resolves_on_tests():
+    # End-to-end through run_eval -> grade: a failed apply records
+    # patch_applied=False but a tests-passing run still resolves.
+    task = _task()
+    harness, artifacts, _ = _run_eval(task, test_output=_fixture("pytest_junit.xml"), apply_rc=1)
+    assert artifacts.patch_applied is False
+    report = harness.grade(task, artifacts)
+    assert report.patch_applied is False
+    assert report.resolved is True
+
+
+def test_run_eval_wraps_command_with_structured_output_and_markers():
+    # run_eval wraps the command: it adds the structured-output flag (--junitxml) via
+    # get_test_command_with_output and runs the result between the SWE_BENCH_EXT
+    # markers (plus a result-file dump), so parse_and_check_tests receives junit-xml /
+    # marked output rather than raw "-rA" text it cannot parse.
+    task = _task()
+    _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"))
+    eval_cmds = [c for c in provider.commands if "pytest" in c and "git apply" not in c]
+    assert eval_cmds, "expected a wrapped pytest eval command"
+    wrapped = eval_cmds[-1]
+    assert "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>" in wrapped
+    assert "<<<SWE_BENCH_EXT_TEST_OUTPUT_END>>>" in wrapped
+    assert "--junitxml=" in wrapped  # structured-output flag from get_test_command_with_output
+    assert "<<<SWE_BENCH_EXT_RESULT_FILE_START>>>" in wrapped  # junit result-file dumped for the parser
+    # The result-file parent dir is created first.
+    assert "mkdir -p /workspace/test-results" in wrapped
+
+
+# --- repo-workdir fallback ladder (matches main's cd /testbed||/workspace/repo||/app) ----
+
+
+def _eval_cwd(provider) -> str | None:
+    """Return the cwd of the wrapped eval command (the command holding the markers)."""
+    for command, cwd in zip(provider.commands, provider.exec_cwds):
+        if "<<<SWE_BENCH_EXT_TEST_OUTPUT_START>>>" in command:
+            return cwd
+    return None
+
+
+def test_run_eval_resolves_workdir_from_ladder_when_repo_not_at_testbed():
+    """A repo at /workspace/repo (not /testbed) is found via main's fallback ladder.
+
+    Main's eval script runs ``cd /testbed || cd /workspace/repo || cd /app``; the harness
+    must reproduce that so the patches and tests run in the real checkout rather than the
+    hardcoded /testbed default.
+    """
+    task = _task()  # default repo_workdir == /testbed
+    _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/workspace/repo")
+    # The patch-apply and the wrapped eval command run in the located checkout.
+    apply_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if "git apply" in cmd]
+    assert apply_cwds and all(cwd == "/workspace/repo" for cwd in apply_cwds)
+    assert _eval_cwd(provider) == "/workspace/repo"
+
+
+def test_run_eval_prefers_explicit_non_default_row_workdir():
+    """An explicit, non-default ``repo_workdir`` holding a checkout wins over the ladder."""
+    task = _task(repo_workdir="/srv/project")
+    _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/srv/project")
+    assert _eval_cwd(provider) == "/srv/project"
+
+
+def test_run_eval_defaults_to_testbed_when_present():
+    """When /testbed holds the checkout it wins (first ladder entry), preserving prior behavior."""
+    task = _task()
+    _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/testbed")
+    assert _eval_cwd(provider) == "/testbed"
+
+
+def test_reset_repo_resolves_workdir_from_ladder():
+    """reset_repo runs ``git reset --hard`` in the located checkout, not a hardcoded /testbed."""
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = SweBenchExtHarness()
+        task = _task()
+        env = await AsyncSweEnvironment.start(
+            {"fake-ext": {"git_dir": "/app"}},
+            harness.build_spec(task),
+        )
+        await harness.reset_repo(env, task)
+        return env.sandbox._provider
+
+    provider = asyncio.run(run())
+    reset_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if cmd.startswith("git reset --hard")]
+    assert reset_cwds == ["/app"]
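Note on the repo-workdir tests above: they pin down a lookup order (an explicit non-default ``repo_workdir`` first, then ``/testbed``, ``/workspace/repo``, ``/app``). The sketch below restates that order as a pure function so it can be read at a glance. It is a minimal illustration, not the harness's code: ``sketch_resolve_workdir``, ``has_checkout``, and the final fallback to the row value when no directory holds a checkout are all assumptions made for this example.

```python
from typing import Callable

# Ladder order taken from the tests' description of main's eval script:
#   cd /testbed || cd /workspace/repo || cd /app
DEFAULT_WORKDIR = "/testbed"
LADDER = ("/testbed", "/workspace/repo", "/app")


def sketch_resolve_workdir(row_workdir: str, has_checkout: Callable[[str], bool]) -> str:
    """Hypothetical helper: pick the directory that holds the repo checkout.

    ``has_checkout`` stands in for the sandbox probe (``test -d "<dir>/.git"``).
    """
    # An explicit, non-default row value wins when it really holds a checkout.
    if row_workdir != DEFAULT_WORKDIR and has_checkout(row_workdir):
        return row_workdir
    # Otherwise the first ladder entry with a checkout wins.
    for candidate in LADDER:
        if has_checkout(candidate):
            return candidate
    # Assumption for this sketch only: fall back to the row value when nothing matches.
    return row_workdir
```

Passing the probe as a callable keeps the ordering logic testable without a sandbox, which is the same idea the scripted provider's ``git_dir`` argument serves in the tests.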
diff --git a/resources_servers/swe_bench/tests/test_swe_env.py b/resources_servers/swe_bench/tests/test_swe_env.py
new file mode 100644
index 0000000000..e6be92588a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_env.py
@@ -0,0 +1,414 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe_env library, driven by a FakeSandbox provider."""
+
+from __future__ import annotations
+
+import ast
+import asyncio
+from pathlib import Path
+
+import resources_servers.swe_bench.harnesses  # noqa: F401  (registers harnesses)
+from nemo_gym.sandbox import (
+    SandboxCreateError,
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench import (
+    compute_resolved,
+    get_harness,
+    list_harnesses,
+    reward_from_report,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask
+from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness
+from resources_servers.swe_bench.verify_task import ProviderCapabilityError, verify_task
+
+
+# Trailing-status pytest text (``<node_id> PASSED``) is the format the test
+# parser recognizes; node ids carry a ``.py`` path so they normalize to the
+# F2P/P2P ids below.
+_PASS_OUTPUT = "tests/test_x.py::a PASSED\ntests/test_x.py::b PASSED\n"
+_F2P_FAIL_OUTPUT = "tests/test_x.py::a FAILED\ntests/test_x.py::b PASSED\n"
+
+
+class _FakeProvider:
+    """Scripted provider: pytest commands return a canned transcript."""
+
+    name = "fake-swe"
+
+    def __init__(self, *, test_output="", test_rc=0, apply_rc=0, create_error=False, sink=None, **_):
+        """Configure the scripted provider's responses.
+
+        Args:
+            test_output: Stdout returned for pytest commands.
+            test_rc: Return code returned for pytest commands.
+            apply_rc: Return code returned for ``git apply`` commands.
+            create_error: When True, ``create`` raises a SandboxCreateError.
+            sink: Optional list each created spec is appended to, for asserting on what
+                ``verify_task`` passed the provider (e.g. the stamped ``ttl_s``).
+            **_: Ignored extra keyword arguments.
+        """
+        self._test_output = test_output
+        self._test_rc = test_rc
+        self._apply_rc = apply_rc
+        self._create_error = create_error
+        self._sink = sink
+
+    async def create(self, spec):
+        if self._sink is not None:
+            self._sink.append(spec)
+        if self._create_error:
+            raise SandboxCreateError("simulated create failure")
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        if "pytest" in command:
+            return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+        if "git apply" in command:
+            return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, *a, **k):
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-swe", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+    """Build a SweTask with sensible defaults, overridable per keyword.
+
+    Args:
+        **overrides: Field overrides merged onto the default task fields.
+
+    Returns:
+        A SweTask configured for the swe-bench-ext benchmark.
+    """
+    base = dict(
+        instance_id="inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        test_command="python -m pytest -rA -q",
+        model_patch="diff --git a/x b/x\n",
+        test_framework="pytest",
+        fail_to_pass=["tests/test_x.py::a"],
+        pass_to_pass=["tests/test_x.py::b"],
+        benchmark="swe-bench-ext",
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_compute_resolved():
+    """``compute_resolved`` is True only when the required set is non-empty and every required test passed."""
+    assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a", "b"]) is True
+    assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a"]) is False
+    assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=["a"]) is False
+
+
+def test_compute_resolved_fail_only():
+    """The ``fail_only`` eval type mirrors swebench's ``check_fail_only``.
+
+    A required test counts as a success UNLESS it is present in the status map with status
+    FAILED, so an absent test (silent success) still resolves; a present-and-FAILED test does not.
+    """
+    # Required test absent from the status map -> success (silent) -> resolved.
+    assert (
+        compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=[], eval_type="fail_only", status_map={})
+        is True
+    )
+    # A present-and-FAILED required test -> failure -> unresolved.
+    assert (
+        compute_resolved(
+            fail_to_pass=["a"],
+            pass_to_pass=["b"],
+            passed=["b"],
+            eval_type="fail_only",
+            status_map={"a": "FAILED", "b": "PASSED"},
+        )
+        is False
+    )
+    # Present but not FAILED (e.g. SKIPPED/ERROR) -> success under fail_only -> resolved.
+    assert (
+        compute_resolved(
+            fail_to_pass=["a"],
+            pass_to_pass=["b"],
+            passed=[],
+            eval_type="fail_only",
+            status_map={"a": "SKIPPED", "b": "ERROR"},
+        )
+        is True
+    )
+    # Empty required set is still unresolved under fail_only (the validated edge).
+    assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=[], eval_type="fail_only") is False
+
+
+def test_compute_resolved_pass_and_fail_status_map():
+    """The default ``pass_and_fail`` rule with a populated status_map mirrors swebench.
+
+    This is the path that runs for SWE-bench Verified: a required test is a failure only when it
+    is absent or its status is FAILED/ERROR; PASSED/XFAIL pass and any other status (SKIPPED/XPASS)
+    is neutral (excluded, not a failure). Locking it in guards the swebench-equivalence this PR
+    depends on.
+    """
+    f2p, p2p = ["a"], ["b"]
+    # All required tests PASSED -> resolved.
+    assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED", "b": "PASSED"})
+    # A required test FAILED -> unresolved.
+    assert not compute_resolved(
+        fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "FAILED", "b": "PASSED"}
+    )
+    # A required test ERROR -> unresolved.
+    assert not compute_resolved(
+        fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "ERROR", "b": "PASSED"}
+    )
+    # A required test absent from the status_map -> unresolved.
+    assert not compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED"})
+    # XFAIL passes; SKIPPED/XPASS are neutral (not failures) -> resolved.
+    assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "XFAIL", "b": "SKIPPED"})
+
+
+def test_agent_adapters_do_not_call_grading_methods():
+    """Agent-facing swe_env modules never call the grader-only harness methods.
+
+    ``harness.py`` documents a trust boundary: ``reset_repo`` / ``run_eval`` / ``grade`` are used
+    ONLY by the grader (``verify_task``). This AST guard enforces it — the agent adapters
+    (``self_drive``, ``sandbox``) must reach grading through ``verify_task``, never by calling
+    those methods directly — so the boundary the docstring promises cannot silently regress.
+    """
+    grading_only = {"reset_repo", "run_eval", "grade"}
+    adapter_dir = Path(__file__).resolve().parent.parent
+    for module in ("self_drive.py", "sandbox.py"):
+        tree = ast.parse((adapter_dir / module).read_text())
+        referenced = sorted(
+            node.attr for node in ast.walk(tree) if isinstance(node, ast.Attribute) and node.attr in grading_only
+        )
+        assert not referenced, f"{module} references grader-only methods {referenced}; route grading via verify_task"
+
+
+def test_reward_from_report():
+    """``reward_from_report`` is 1.0 for a resolved, unmasked report and 0.0 when unresolved or masked."""
+    assert reward_from_report(SweEvalReport(instance_id="i", resolved=True)) == 1.0
+    assert reward_from_report(SweEvalReport(instance_id="i", resolved=False)) == 0.0
+    assert reward_from_report(SweEvalReport(instance_id="i", resolved=True, error_kind="sandbox")) == 0.0
+
+
+def test_registry_and_build_spec():
+    """The swe-bench-ext harness is registered and builds the expected sandbox spec."""
+    assert "swe-bench-ext" in list_harnesses()
+    harness = get_harness("swe-bench-ext")
+    assert isinstance(harness, SweBenchExtHarness)
+    spec = harness.build_spec(_task())
+    assert spec.image == "img:tag"
+    assert spec.workdir == "/testbed"
+    assert spec.metadata["instance_id"] == "inst-1"
+
+
+def test_grade_masks_on_infra_error():
+    """Grading masks an infra error to reward 0.0 and records its error kind."""
+    harness = get_harness("swe-bench-ext")
+    report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+    assert report.error_kind == "timeout"
+    assert reward_from_report(report) == 0.0
+
+
+# ---- verify_task orchestrator (fresh-sandbox, FakeProvider) -----------------
+
+
+def test_verify_task_resolved():
+    """``verify_task`` resolves a task whose required tests all pass."""
+    provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}}
+    report = asyncio.run(verify_task(provider, _task()))
+    assert report.resolved is True
+    assert report.patch_applied is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_verify_task_unresolved():
+    """``verify_task`` leaves a task unresolved when a required test fails."""
+    provider = {"fake-swe": {"test_output": _F2P_FAIL_OUTPUT, "test_rc": 1}}
+    report = asyncio.run(verify_task(provider, _task()))
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_verify_task_empty_patch_fast_path():
+    """An empty model patch short-circuits to an unresolved report."""
+    report = asyncio.run(verify_task({"fake-swe": {}}, _task(model_patch="")))
+    assert report.patch_exists is False
+    assert report.resolved is False
+
+
+def test_verify_task_non_timeout_eval_failure_unmasked():
+    """A non-timeout eval-stage failure is unmasked: resolved=False, reward 0.0.
+
+    Mirrors main's app.py, which catches any eval exception, returns no report file
+    (resolved=False) and leaves eval_timed_out False (so mask_sample stays False).
+    Only a genuine wall-clock eval timeout is masked.
+    """
+    report = asyncio.run(verify_task({"fake-swe": {"create_error": True}}, _task()))
+    assert report.error_kind is None
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_verify_task_golden():
+    """Running with ``run_golden`` applies the golden patch and resolves the task."""
+    provider = {"fake-swe": {"test_output": _PASS_OUTPUT}}
+    task = _task(model_patch="", metadata={"golden_patch": "diff --git a/x b/x\n"})
+    report = asyncio.run(verify_task(provider, task, run_golden=True))
+    assert report.resolved is True
+
+
+def test_verify_task_patch_apply_failure_does_not_gate_resolved():
+    """A failed patch apply is recorded but does not gate ``resolved``.
+
+    The patch is applied best-effort and grading is based on the tests only, so a
+    failed apply (patch_applied=False) does not flip a tests-passing run to
+    unresolved.
+    """
+    provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "apply_rc": 1}}
+    report = asyncio.run(verify_task(provider, _task()))
+    assert report.patch_applied is False
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_unsupported_provider_raises():
+    """``verify_task`` raises when the harness does not support the given provider."""
+
+    class _NestedOnly(SweBenchExtHarness):
+        name = "nested-only-test"
+
+        def supports_provider(self, provider_name: str) -> bool:
+            """Report support for every provider except ``fake-swe``.
+
+            Args:
+                provider_name: The provider name being checked.
+
+            Returns:
+                True for any provider other than ``fake-swe``.
+            """
+            return provider_name != "fake-swe"
+
+    from resources_servers.swe_bench.harness import register_harness
+
+    register_harness(_NestedOnly(), override=True)
+    task = _task(benchmark="nested-only-test")
+    try:
+        asyncio.run(verify_task({"fake-swe": {}}, task))
+    except ProviderCapabilityError:
+        return
+    raise AssertionError("expected ProviderCapabilityError")
+
+
+def test_verify_task_propagates_grader_dependency_error():
+    """``verify_task`` propagates ``GraderDependencyError`` instead of swallowing it to reward-0.
+
+    A missing grading dependency (e.g. swebench for a SWE-bench instance) must fail loud rather
+    than silently degrade the resolve rate, so it is re-raised, not caught by the unmasked
+    eval-stage handler.
+    """
+    from resources_servers.swe_bench.harness import GraderDependencyError, register_harness
+
+    class _MissingGrader(SweBenchExtHarness):
+        name = "missing-grader-test"
+
+        def grade(self, task, artifacts):
+            """Simulate a harness whose required grading dependency is unavailable.
+
+            Args:
+                task: The task being graded.
+                artifacts: The eval artifacts (unused).
+
+            Raises:
+                GraderDependencyError: Always, to exercise the propagation path.
+            """
+            raise GraderDependencyError("grading dependency missing")
+
+    register_harness(_MissingGrader(), override=True)
+    try:
+        asyncio.run(verify_task({"fake-swe": {"test_output": _PASS_OUTPUT}}, _task(benchmark="missing-grader-test")))
+    except GraderDependencyError:
+        return
+    raise AssertionError("expected GraderDependencyError to propagate")
+
+
+def test_verify_task_flat_eval_metadata():
+    """``metadata['flat_eval']`` routes grading through the harness's flat variant."""
+    provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}}
+    report = asyncio.run(verify_task(provider, _task(metadata={"flat_eval": True})))
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_verify_task_stamps_ttl_when_unset():
+    """``verify_task`` stamps ``ttl_s = eval_timeout_s + slack`` when the harness leaves it unset.
+
+    The stamp lets TTL-honoring backends (opensandbox) self-expire an eval sandbox orphaned by a
+    hard crash; harnesses that already set ``ttl_s`` (e.g. swe-bench-ext) keep their own value.
+    """
+    import dataclasses
+
+    from resources_servers.swe_bench.harness import register_harness
+    from resources_servers.swe_bench.verify_task import _TTL_SLACK_S
+
+    class _NoTtl(SweBenchExtHarness):
+        name = "no-ttl-test"
+
+        def build_spec(self, task):
+            """Build the swe-bench-ext spec but clear ``ttl_s`` so verify_task must stamp it.
+
+            Args:
+                task: The task to build a spec for.
+
+            Returns:
+                The base spec with ``ttl_s`` reset to None.
+            """
+            return dataclasses.replace(super().build_spec(task), ttl_s=None)
+
+    register_harness(_NoTtl(), override=True)
+    captured: list = []
+    provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "sink": captured}}
+    asyncio.run(verify_task(provider, _task(benchmark="no-ttl-test"), eval_timeout_s=120))
+    assert captured, "expected create() to be called with a stamped spec"
+    assert captured[-1].ttl_s == 120 + _TTL_SLACK_S
+
+
+def test_report_to_reward_wrapper():
+    """``report_to_reward`` is a thin wrapper that scores a report like ``reward_from_report``."""
+    from resources_servers.swe_bench.verify_task import report_to_reward
+
+    assert report_to_reward(SweEvalReport(instance_id="i", resolved=True)) == 1.0
+    assert report_to_reward(SweEvalReport(instance_id="i", resolved=False)) == 0.0
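Note on the ``compute_resolved`` tests in this file: their docstrings describe two resolve rules, ``pass_and_fail`` (the default) and ``fail_only``. The sketch below encodes only what those tests assert. It is a minimal illustration under stated assumptions, not the repository's implementation: the function name ``sketch_compute_resolved`` is hypothetical, and two behaviors are guesses made for this example, namely falling back to the ``passed`` list when no status map is given and treating neutral statuses as non-failures.

```python
from typing import Iterable, Mapping, Optional


def sketch_compute_resolved(
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
    passed: Iterable[str],
    eval_type: str = "pass_and_fail",
    status_map: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical restatement of the resolve rules the tests exercise."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # An empty required set never resolves, under either rule.
    if not required:
        return False
    if eval_type == "fail_only":
        # A required test fails only when it is present AND marked FAILED;
        # absent tests and other statuses (SKIPPED, ERROR, ...) count as success.
        statuses = status_map or {}
        return all(statuses.get(test) != "FAILED" for test in required)
    if status_map:
        # pass_and_fail with statuses: absent, FAILED, or ERROR is a failure;
        # PASSED/XFAIL pass and other statuses are neutral (assumed non-failing here).
        return all(
            test in status_map and status_map[test] not in {"FAILED", "ERROR"}
            for test in required
        )
    # Assumption for this sketch: with no status map, use the passed list directly.
    passed_set = set(passed)
    return all(test in passed_set for test in required)
```

The asymmetry between the two rules is the point worth noticing: an absent required test is a failure under ``pass_and_fail`` but a silent success under ``fail_only``.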
diff --git a/resources_servers/swe_bench/tests/test_swe_rebench.py b/resources_servers/swe_bench/tests/test_swe_rebench.py
new file mode 100644
index 0000000000..9d6faa1cbf
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swe_rebench.py
@@ -0,0 +1,483 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe-rebench harness (FakeSandbox provider).
+
+A tiny fake ``agent/log_parsers.py`` is written to a tmp dir so the real
+``_load_rebench_log_parsers`` import and ``NAME_TO_PARSER`` resolution path is
+exercised end to end, then the resolved / unresolved / masked grade paths are
+driven.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import textwrap
+from pathlib import Path
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask
+from resources_servers.swe_bench.harnesses.swe_rebench import (
+    SweRebenchHarness,
+    _normalize_test_name,
+)
+
+
+class _FakeProvider:
+    """Scripted provider: test command returns a canned transcript."""
+
+    name = "fake-rebench"
+
+    def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_):
+        """Initialize the scripted provider.
+
+        Args:
+            test_output: Transcript returned for the test command.
+            test_rc: Return code for the test command.
+            apply_rc: Return code for ``git apply`` commands.
+        """
+        self._test_output = test_output
+        self._test_rc = test_rc
+        self._apply_rc = apply_rc
+
+    async def create(self, spec):
+        raw = {"workdir": spec.workdir, "env": spec.env}
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw=raw)
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        if "git apply" in command:
+            return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+        if "pytest" in command or "test" in command:
+            return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, *a, **k):
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-rebench", _FakeProvider, override=True)
+
+
+class _RecordingProvider:
+    """Scripted provider that records every exec command, in order."""
+
+    name = "recording-rebench"
+    commands: list[str] = []
+    # (command, timeout_s) for every exec, so tests can assert the eval timeout
+    # is threaded into the test exec.
+    exec_calls: list[tuple[str, object]] = []
+
+    def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_):
+        """Initialize the recording provider.
+
+        Args:
+            test_output: Transcript returned for the test command.
+            test_rc: Return code for the test command.
+            apply_rc: Return code for ``git apply`` commands.
+        """
+        self._test_output = test_output
+        self._test_rc = test_rc
+        self._apply_rc = apply_rc
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="rec", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        type(self).commands.append(command)
+        type(self).exec_calls.append((command, timeout_s))
+        if "git apply" in command:
+            return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc)
+        if "pytest" in command or "test" in command:
+            return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc)
+        return SandboxExecResult(stdout="", stderr="", return_code=0)
+
+    async def upload_file(self, *a, **k):
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("recording-rebench", _RecordingProvider, override=True)
+
+
+# A standalone log_parsers module the harness imports dynamically. The parser
+# splits "<node> <STATUS>" lines into {node: STATUS} and exposes a
+# NAME_TO_PARSER registry of callables, matching the shape the harness expects.
+_FAKE_LOG_PARSERS = textwrap.dedent(
+    """
+    def parse_simple(log):
+        results = {}
+        for line in log.splitlines():
+            line = line.strip()
+            if not line:
+                continue
+            node, _, status = line.rpartition(" ")
+            if node and status:
+                results[node] = status
+        return results
+
+    NAME_TO_PARSER = {"simple": parse_simple}
+    """
+)
+
+
+def _write_fake_parsers(tmp_path: Path) -> Path:
+    """Write the fake ``agent/log_parsers.py`` module under a tmp repo dir.
+
+    Args:
+        tmp_path: The pytest tmp dir to create the repo under.
+
+    Returns:
+        Path: The created ``SWE-rebench-V2`` repo directory.
+    """
+    repo_dir = tmp_path / "SWE-rebench-V2"
+    (repo_dir / "agent").mkdir(parents=True)
+    (repo_dir / "agent" / "log_parsers.py").write_text(_FAKE_LOG_PARSERS)
+    return repo_dir
+
+
+def _task(**overrides) -> SweTask:
+    """Build a swe-rebench ``SweTask`` with sensible defaults.
+
+    Args:
+        **overrides: Field values overriding the defaults.
+
+    Returns:
+        SweTask: A task populated from the defaults merged with overrides.
+    """
+    base = dict(
+        instance_id="rebench-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        test_command="python -m pytest -rA -q",
+        model_patch="diff --git a/x b/x\n",
+        test_patch="diff --git a/t b/t\n",
+        fail_to_pass=["t::a"],
+        pass_to_pass=["t::b"],
+        benchmark="swe-rebench",
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+# ---- pure helpers -----------------------------------------------------------
+
+
+def test_normalize_test_name_strips_timing():
+    assert _normalize_test_name("t::a [ 12 ms ]") == "t::a"
+    assert _normalize_test_name("t::a [0.3s]") == "t::a"
+    assert _normalize_test_name("t::a in 1.2 sec") == "t::a"
+    assert _normalize_test_name("t::a (5 ms)") == "t::a"
+    assert _normalize_test_name("  t::a  ") == "t::a"
+    # No timing suffix -> unchanged.
+    assert _normalize_test_name("pkg::mod::test_x") == "pkg::mod::test_x"
+
+
+def test_build_spec_sets_java_env():
+    harness = SweRebenchHarness()
+    spec = harness.build_spec(_task())
+    assert spec.env["_JAVA_OPTIONS"] == "-Djava.net.preferIPv6Addresses=false"
+    assert spec.metadata["harness"] == "swe-rebench"
+    assert spec.image == "img:tag"
+
+
+# ---- grade paths (real dynamic-import of the fake parser) --------------------
+
+
+def test_grade_resolved(tmp_path):
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+    )
+    # Both required tests pass; timing suffix on one exercises normalization.
+    artifacts = EvalArtifacts(test_output="t::a [ 12 ms ] PASSED\nt::b PASSED\n", patch_applied=True)
+    report = harness.grade(task, artifacts)
+    assert report.resolved is True
+    assert report.error_kind is None
+    assert set(report.tests_status["passed"]) == {"t::a", "t::b"}
+
+
+def test_grade_unresolved_missing_pass_to_pass(tmp_path):
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+    )
+    artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b FAILED\n", patch_applied=True)
+    report = harness.grade(task, artifacts)
+    assert report.resolved is False
+    assert report.error_kind is None
+
+
+def test_grade_no_patch_applied_gate(tmp_path):
+    """``resolved`` is the test verdict ONLY and does not gate on patch_applied.
+    So even when the model patch failed to apply (``patch_applied=False``), a run
+    where every F2P/P2P test passes scores resolved=True."""
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+    )
+    artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b PASSED\n", patch_applied=False)
+    report = harness.grade(task, artifacts)
+    assert report.resolved is True
+    assert report.error_kind is None
+
+
+def test_grade_masks_missing_clone():
+    harness = SweRebenchHarness()
+    # No rebench_repo_dir in metadata -> the clone is not provisioned.
+    report = harness.grade(_task(), EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True))
+    assert report.error_kind == "eval_error"
+    assert report.resolved is False
+
+
+def test_grade_masks_unknown_parser(tmp_path):
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "does_not_exist"}},
+    )
+    report = harness.grade(task, EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True))
+    assert report.error_kind == "eval_error"
+
+
+def test_grade_masks_on_infra_error():
+    harness = SweRebenchHarness()
+    report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"}))
+    assert report.error_kind == "timeout"
+
+
+# ---- run_eval (FakeSandbox) -------------------------------------------------
+
+
+def test_run_eval_then_grade_resolved(tmp_path):
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={
+            "rebench_repo_dir": str(repo_dir),
+            "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+        },
+    )
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "test_rc": 0}}
+
+    async def _run():
+        spec = harness.build_spec(task)
+        env = await AsyncSweEnvironment.start(provider, spec)
+        try:
+            await harness.reset_repo(env, task)
+            await harness.materialize(env, task)
+            artifacts = await harness.run_eval(env, task)
+        finally:
+            await env.cleanup()
+        return artifacts
+
+    artifacts = asyncio.run(_run())
+    assert artifacts.patch_applied is True
+    report = harness.grade(task, artifacts)
+    assert report.resolved is True
+
+
+def test_run_eval_patch_not_applied_still_grades_on_tests(tmp_path):
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}})
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    # apply_rc=1 -> model patch fails to apply -> patch_applied False, but grading
+    # is on the tests only (no patch_applied gate), so a run where every F2P/P2P
+    # test passes is still resolved=True.
+    provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "apply_rc": 1}}
+
+    async def _run():
+        spec = harness.build_spec(task)
+        env = await AsyncSweEnvironment.start(provider, spec)
+        try:
+            await harness.materialize(env, task)
+            return await harness.run_eval(env, task)
+        finally:
+            await env.cleanup()
+
+    artifacts = asyncio.run(_run())
+    assert artifacts.patch_applied is False
+    assert harness.grade(task, artifacts).resolved is True
+
+
+# ---- apply order ------------------------------------------------------------
+
+
+def test_run_eval_applies_model_patch_before_test_patch(tmp_path):
+    """The model patch (/root/patch.diff) is applied BEFORE the test patch
+    (/root/test_patch.diff)."""
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}})
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    _RecordingProvider.commands = []
+    _RecordingProvider.exec_calls = []
+    provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+    async def _run():
+        spec = harness.build_spec(task)
+        env = await AsyncSweEnvironment.start(provider, spec)
+        try:
+            await harness.run_eval(env, task)
+        finally:
+            await env.cleanup()
+
+    asyncio.run(_run())
+    applies = [c for c in _RecordingProvider.commands if "git apply" in c]
+    assert len(applies) == 2
+    assert "/root/patch.diff" in applies[0], applies
+    assert "/root/test_patch.diff" in applies[1], applies
+
+
+# ---- eval timeout threaded into the test exec -------------------------------
+
+
+def _rebench_test_exec_timeout(commands_and_timeouts):
+    """Return the timeout_s passed to the test exec (the one running the tests).
+
+    The test block is the only exec that is neither a ``git apply`` nor an
+    install command; in these tests the test command always contains ``pytest``.
+
+    Args:
+        commands_and_timeouts: An iterable of ``(command, timeout_s)`` pairs.
+
+    Returns:
+        The ``timeout_s`` value recorded for the test exec.
+
+    Raises:
+        AssertionError: If no test exec is found in the recorded calls.
+    """
+    for command, timeout_s in commands_and_timeouts:
+        if "git apply" not in command and ("pytest" in command or "test" in command):
+            return timeout_s
+    raise AssertionError(f"no test exec found in {commands_and_timeouts!r}")
+
+
+def test_run_eval_threads_tests_timeout_into_test_exec(tmp_path):
+    """The test exec receives timeout_s = task.metadata['tests_timeout'] when
+    present so a stuck run is bounded instead of hanging the verifier. Uses a
+    non-default value (600) so this distinguishes an explicit override from the
+    1800 default."""
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={
+            "rebench_repo_dir": str(repo_dir),
+            "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+            "tests_timeout": 600,
+        },
+    )
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    _RecordingProvider.commands = []
+    _RecordingProvider.exec_calls = []
+    provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+    async def _run():
+        spec = harness.build_spec(task)
+        env = await AsyncSweEnvironment.start(provider, spec)
+        try:
+            await harness.run_eval(env, task)
+        finally:
+            await env.cleanup()
+
+    asyncio.run(_run())
+    assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 600
+
+
+def test_run_eval_tests_timeout_absent_defaults_to_1800(tmp_path):
+    """The timeout (default 30*60) is applied to every swe-rebench run. Rows that
+    carry no tests_timeout (including SWE-bench-Verified) still get the 1800s
+    bound rather than an unbounded (None) run."""
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        metadata={
+            "rebench_repo_dir": str(repo_dir),
+            "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"},
+        },
+    )
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    _RecordingProvider.commands = []
+    _RecordingProvider.exec_calls = []
+    provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}}
+
+    async def _run():
+        spec = harness.build_spec(task)
+        env = await AsyncSweEnvironment.start(provider, spec)
+        try:
+            await harness.run_eval(env, task)
+        finally:
+            await env.cleanup()
+
+    asyncio.run(_run())
+    assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 1800
+
+
+# ---- grading parity / empty-required ----------------------------------------
+
+
+def test_grade_empty_required_resolves_true(tmp_path):
+    """``resolved`` is purely (fail_to_pass_set <= passed) and
+    (pass_to_pass_set <= passed). With no required tests, both empty sets are
+    subsets of any passed set, so resolved=True — there is no bool(required)
+    requirement."""
+    repo_dir = _write_fake_parsers(tmp_path)
+    harness = SweRebenchHarness()
+    task = _task(
+        fail_to_pass=[],
+        pass_to_pass=[],
+        metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}},
+    )
+    artifacts = EvalArtifacts(test_output="something PASSED\n", patch_applied=True)
+    report = harness.grade(task, artifacts)
+    assert report.resolved is True
+    assert report.error_kind is None
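The grading rule these tests pin down can be sketched on its own: `resolved` is a pure subset check over the normalized names of passing tests, so an empty required set resolves to true. This is a minimal sketch, not the harness code. `normalize_name` and `is_resolved` are hypothetical names, the regex only approximates `_normalize_test_name` (whose implementation is not shown in this diff), and treating the literal status `PASSED` as the passing status is an assumption.

```python
import re

# Assumption: this regex only approximates the real _normalize_test_name,
# whose implementation is not part of this diff.
_TIMING_SUFFIX = re.compile(
    r"\s*(?:\[\s*[\d.]+\s*(?:ms|s)\s*\]"  # "[ 12 ms ]" or "[0.3s]"
    r"|\(\s*[\d.]+\s*(?:ms|s)\s*\)"       # "(5 ms)"
    r"|in\s+[\d.]+\s*(?:sec|s))\s*$"      # "in 1.2 sec"
)


def normalize_name(name: str) -> str:
    """Strip surrounding whitespace and a trailing timing suffix (hypothetical helper)."""
    return _TIMING_SUFFIX.sub("", name.strip()).strip()


def parse_simple(log: str) -> dict:
    """Split "<node> <STATUS>" lines into {node: STATUS}, like the fake parser in the tests."""
    results = {}
    for line in log.splitlines():
        line = line.strip()
        if not line:
            continue
        node, _, status = line.rpartition(" ")
        if node and status:
            results[node] = status
    return results


def is_resolved(log: str, fail_to_pass: list, pass_to_pass: list) -> bool:
    """Subset check only: no patch_applied gate and no non-empty requirement."""
    # Assumption: "PASSED" is the status string that counts as passing.
    passed = {normalize_name(node) for node, status in parse_simple(log).items() if status == "PASSED"}
    return set(fail_to_pass) <= passed and set(pass_to_pass) <= passed
```

Because both checks are subset comparisons, `is_resolved("something PASSED\n", [], [])` is true, which matches the behavior asserted by `test_grade_empty_required_resolves_true`.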
diff --git a/resources_servers/swe_bench/tests/test_swebench.py b/resources_servers/swe_bench/tests/test_swebench.py
new file mode 100644
index 0000000000..282049630a
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_swebench.py
@@ -0,0 +1,234 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the swe-bench / swe-bench-multilingual flat (host-graded) harness.
+
+The harness runs the instance's eval script in the sandbox and grades the produced log
+host-side (swebench's per-repo parser, falling back to the generic flat parser), so it runs on
+any exec-capable provider. These tests validate provisioning (``build_spec`` / ``materialize``),
+the flat ``run_eval`` + ``grade`` path, and family validation, against a scripted ``_FakeProvider``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+
+import pytest
+
+from nemo_gym.sandbox import (
+    SandboxExecResult,
+    SandboxHandle,
+    SandboxStatus,
+    register_provider,
+)
+from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report
+from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness
+
+
+# Canned eval-script log with the SWE-bench sentinels + pytest-style passing lines.
+_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n"
+
+
+class _FakeProvider:
+    """Scripted provider: returns a canned eval log for the eval-script run; records uploads.
+
+    Args:
+        log_text: Text returned as stdout for every exec call (the eval-script and ``cat`` commands included).
+        exec_rc: Return code for every command except ``cat``, which always returns 0.
+    """
+
+    name = "fake-swebench"
+
+    def __init__(self, *, log_text="", exec_rc=0, **_):
+        self._log_text = log_text
+        self._exec_rc = exec_rc
+        self.uploaded: dict[str, str] = {}
+        self.commands: list[str] = []
+
+    async def create(self, spec):
+        return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir})
+
+    async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None):
+        self.commands.append(command)
+        rc = 0 if command.startswith("cat ") else self._exec_rc
+        return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc)
+
+    async def upload_file(self, handle, local_path, remote_path):
+        try:
+            with open(local_path, encoding="utf-8") as fh:
+                self.uploaded[remote_path] = fh.read()
+        except OSError:
+            self.uploaded[remote_path] = ""
+        return None
+
+    async def download_file(self, *a, **k):
+        return None
+
+    async def status(self, handle):
+        return SandboxStatus.RUNNING
+
+    async def close(self, handle):
+        return None
+
+    async def aclose(self):
+        return None
+
+
+register_provider("fake-swebench", _FakeProvider, override=True)
+
+
+def _task(**overrides) -> SweTask:
+    """Build a swe-bench ``SweTask`` with sensible defaults."""
+    base = dict(
+        instance_id="repo__inst-1",
+        image="img:tag",
+        base_commit="abc123",
+        repo_workdir="/testbed",
+        model_patch="diff --git a/x b/x\n",
+        fail_to_pass=["t::a"],
+        pass_to_pass=["t::b"],
+        benchmark="swe-bench",
+        split="test",
+    )
+    base.update(overrides)
+    return SweTask(**base)
+
+
+def test_grade_strategy_is_flat():
+    assert SweBenchHarness("swe-bench").grade_strategy == "flat-host-grade"
+    assert SweBenchHarness("swe-bench-multilingual").grade_strategy == "flat-host-grade"
+
+
+def test_unknown_family_rejected():
+    with pytest.raises(ValueError):
+        SweBenchHarness("not-a-family")
+
+
+def test_build_spec_image_workdir_metadata():
+    spec = SweBenchHarness("swe-bench").build_spec(_task())
+    assert spec.image == "img:tag"
+    assert spec.workdir == "/testbed"
+    assert spec.metadata["instance_id"] == "repo__inst-1"
+    assert spec.metadata["harness"] == "swe-bench"
+
+
+def test_build_spec_preserves_task_provider_options():
+    spec = SweBenchHarness("swe-bench").build_spec(_task(metadata={"provider_options": {"network": "host"}}))
+    assert spec.provider_options.get("network") == "host"
+
+
+def test_supports_provider_any_exec_capable():
+    harness = SweBenchHarness("swe-bench")
+    assert harness.supports_provider("docker") is True
+    assert harness.supports_provider("apptainer") is True
+    assert harness.supports_provider("opensandbox") is True
+
+
+def test_with_flat_eval_is_self():
+    harness = SweBenchHarness("swe-bench")
+    assert harness.with_flat_eval() is harness
+
+
+def test_materialize_writes_patch_diff():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = SweBenchHarness("swe-bench")
+        task = _task()
+        env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+        await harness.materialize(env, task)
+        return env.sandbox._provider
+
+    provider = asyncio.run(run())
+    assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n"
+
+
+def test_materialize_empty_patch_writes_nothing():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    async def run():
+        harness = SweBenchHarness("swe-bench")
+        task = _task(model_patch="")
+        env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+        await harness.materialize(env, task)
+        return env.sandbox._provider
+
+    provider = asyncio.run(run())
+    assert "/root/patch.diff" not in provider.uploaded
+
+
+def test_run_eval_then_grade_flat_resolved():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    # eval_script preset so flat_run_eval executes it; no instance_dict -> grade falls back to
+    # the generic flat parser over the canned passing log.
+    async def run():
+        harness = SweBenchHarness("swe-bench")
+        task = _task(metadata={"eval_script": "echo run"})
+        env = await AsyncSweEnvironment.start({"fake-swebench": {"log_text": _PASSING_LOG}}, harness.build_spec(task))
+        artifacts = await harness.run_eval(env, task)
+        return harness.grade(task, artifacts)
+
+    report = asyncio.run(run())
+    assert report.resolved is True
+    assert reward_from_report(report) == 1.0
+
+
+def test_run_eval_missing_eval_script_is_unmasked_unresolved():
+    from resources_servers.swe_bench.sandbox import AsyncSweEnvironment
+
+    # No instance_dict + no preset eval_script -> _flat_eval_script returns "" -> the run tags an
+    # eval_error, but grading no longer masks it: per main an unbuildable/empty spec grades as a
+    # legitimate unmasked unresolved (reward 0), not an eval_error mask.
+    async def run():
+        harness = SweBenchHarness("swe-bench")
+        task = _task()
+        env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task))
+        artifacts = await harness.run_eval(env, task)
+        return harness.grade(task, artifacts)
+
+    report = asyncio.run(run())
+    assert report.error_kind is None
+    assert report.resolved is False
+    assert reward_from_report(report) == 0.0
+
+
+def test_grade_masks_on_infra_error():
+    report = SweBenchHarness("swe-bench").grade(_task(), EvalArtifacts(raw={"error_type": "timeout"}))
+    assert report.error_kind == "timeout"
+    assert reward_from_report(report) == 0.0
+
+
+def test_flat_eval_script_empty_without_instance_dict():
+    assert SweBenchHarness("swe-bench")._flat_eval_script(_task()) == ""
+
+
+def test_grade_fails_loud_when_swebench_unavailable(monkeypatch):
+    """Grading a SWE-bench instance without ``swebench`` installed fails loudly, not silently.
+
+    Degrading to the generic pytest-only parser would mis-score non-pytest repos (e.g. django) as
+    unresolved, silently skewing the resolve rate. Instead grading raises ``GraderDependencyError``
+    so the misconfiguration surfaces.
+    """
+    import sys
+
+    from resources_servers.swe_bench.harness import GraderDependencyError
+
+    # Simulate a missing / broken swebench install for the import inside _swebench_flat_grade.
+    monkeypatch.setitem(sys.modules, "swebench.harness.constants", None)
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"instance_dict": {"instance_id": "repo__inst-1", "repo": "x/y"}})
+    artifacts = EvalArtifacts(test_output=_PASSING_LOG, return_code=0, raw={})
+    with pytest.raises(GraderDependencyError):
+        harness.grade(task, artifacts)
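The last test above relies on a standard Python behavior: a `None` entry in `sys.modules` makes any import of that name raise `ImportError`. The sketch below shows that mechanism together with the fail-loud conversion, under stated assumptions: `GraderDependencyError` here is a local stand-in for the class imported in the test, and `import_grader_dependency`, `simulate_missing`, and the module name `hypothetical_grader_dep` are illustrative names, not part of the harness.

```python
import importlib
import sys


class GraderDependencyError(RuntimeError):
    """Hypothetical local stand-in for the error type imported in the test."""


def import_grader_dependency(module_name: str):
    """Import a grading dependency, converting ImportError into a loud failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise GraderDependencyError(f"{module_name} is required for exact grading") from exc


def simulate_missing(module_name: str) -> bool:
    """Return True if a None entry in sys.modules makes the import fail loudly."""
    sentinel = object()
    previous = sys.modules.get(module_name, sentinel)
    sys.modules[module_name] = None  # a None entry makes the import raise ImportError
    try:
        import_grader_dependency(module_name)
    except GraderDependencyError:
        return True
    finally:
        # Restore the original state so later imports are unaffected.
        if previous is sentinel:
            del sys.modules[module_name]
        else:
            sys.modules[module_name] = previous
    return False
```

In the test itself, `monkeypatch.setitem(sys.modules, ..., None)` does the save-and-restore step that `simulate_missing` performs by hand.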
diff --git a/resources_servers/swe_bench/tests/test_task.py b/resources_servers/swe_bench/tests/test_task.py
new file mode 100644
index 0000000000..4888746d86
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_task.py
@@ -0,0 +1,81 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+
+import pytest
+
+from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    TaskSubmission,
+    build_task,
+    harness_family_key,
+    parse_submission,
+    parse_task_from_request,
+)
+
+
+def _sample_row() -> dict:
+    inst = {
+        "instance_id": "astropy__astropy-12907",
+        "base_commit": "abc123",
+        "test_patch": "",
+        "FAIL_TO_PASS": '["tests/test_x.py::a"]',
+        "PASS_TO_PASS": '["tests/test_x.py::b"]',
+    }
+    return {
+        "instance_id": "astropy__astropy-12907",
+        "dataset_name": "princeton-nlp/SWE-bench_Verified",
+        "split": "test",
+        "problem_statement": "Fix the bug.",
+        "instance_dict": json.dumps(inst),
+        "responses_create_params": NeMoGymResponseCreateParamsNonStreaming(
+            input=[{"role": "user", "content": "Fix the bug."}],
+        ),
+    }
+
+
+def test_harness_family_key_from_dataset_name() -> None:
+    assert harness_family_key("princeton-nlp/SWE-bench_Verified") == "swe-bench"
+    assert harness_family_key("something/R2E-Gym/foo") == "r2e-gym"
+
+
+def test_build_task_sets_benchmark_fields() -> None:
+    task = build_task(_sample_row(), container_formatter="swebench/sweb.eval.x86_64.{instance_id}")
+    assert task.task_id == "astropy__astropy-12907"
+    assert task.harness_family == "swe-bench"
+    assert task.dataset_name == "princeton-nlp/SWE-bench_Verified"
+    assert task.problem_statement == "Fix the bug."
+    assert task.metadata["instance_dict"]["base_commit"] == "abc123"
+
+
+def test_public_view_excludes_privileged_metadata() -> None:
+    task = build_task(_sample_row(), container_formatter="x.{instance_id}")
+    public = task.public_view()
+    assert public.task_id == task.task_id
+    assert public.environment == ENVIRONMENT_NAME
+    assert public.harness_family == "swe-bench"
+    assert not hasattr(public, "instance_dict")
+
+
+def test_parse_task_from_request_requires_instance_id() -> None:
+    class Body:
+        responses_create_params = None
+        verifier_metadata = {}
+
+    with pytest.raises(ValueError, match="instance_id"):
+        parse_task_from_request(Body(), container_formatter="x.{instance_id}")
+
+
+def test_with_submission() -> None:
+    task = SweTask(instance_id="x", benchmark="swe-bench")
+    updated = task.with_submission(TaskSubmission(model_patch="diff"))
+    assert updated.model_patch == "diff"
+
+
+def test_parse_submission_accepts_git_patch_alias() -> None:
+    assert parse_submission({"git_patch": "p"}).model_patch == "p"
diff --git a/resources_servers/swe_bench/verify_task.py b/resources_servers/swe_bench/verify_task.py
new file mode 100644
index 0000000000..3c08c5a3cf
--- /dev/null
+++ b/resources_servers/swe_bench/verify_task.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Verification orchestrator for the SWE environment.
+
+Grades an agent patch via the ``swe_bench`` resources server ``/verify`` endpoint.
+Runs a fresh-only sequence via ``acquire_sandbox`` (always-teardown), bounded by a
+per-call eval timeout.
+
+Every eval spec is stamped with a ``ttl_s`` so TTL-honoring backends (such as
+opensandbox) self-expire orphaned sandboxes.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import dataclasses
+from collections.abc import Mapping
+from typing import Any
+
+# Importing this package registers the swe_bench harnesses; the docker/apptainer
+# providers are built into nemo_gym.sandbox and resolve lazily (no import needed).
+import resources_servers.swe_bench.harnesses  # noqa: F401
+from nemo_gym.sandbox import SandboxProvider
+from resources_servers.swe_bench.harness import (
+    GraderDependencyError,
+    SweEvalReport,
+    SweTask,
+    get_harness,
+    reward_from_report,
+)
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+
+
+#: Slack added to the eval timeout when stamping a sandbox TTL (covers spin-up +
+#: teardown so a TTL-honoring backend does not expire a still-running eval).
+_TTL_SLACK_S = 600.0
+
+
+class ProviderCapabilityError(RuntimeError):
+    """Raised when a task's harness does not support the configured provider."""
+
+
+def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str:
+    """Return the provider's name.
+
+    Args:
+        provider: Either a single-key provider mapping or a ``SandboxProvider``
+            instance.
+
+    Returns:
+        str: The provider name, or ``"?"`` if it cannot be determined.
+    """
+    if isinstance(provider, Mapping):
+        return next(iter(provider), "?")
+    return getattr(provider, "name", "?")
+
+
+async def verify_task(
+    provider: Mapping[str, Any] | SandboxProvider,
+    task: SweTask,
+    *,
+    run_golden: bool = False,
+    eval_timeout_s: float | None = None,
+) -> SweEvalReport:
+    """Grade a task's patch in a fresh sandbox and return a report.
+
+    Selects the harness for the task's benchmark, optionally substitutes the
+    golden patch, then resets the repo, materializes the patch, runs the eval,
+    and grades the artifacts. An empty patch short-circuits without spinning up
+    a sandbox. A genuine wall-clock eval timeout is returned as a report carrying
+    ``error_kind="eval_timeout"``; any other eval-stage failure except
+    ``GraderDependencyError`` is returned unmasked (``resolved=False``,
+    ``error_kind=None``) to mirror main, rather than raised.
+
+    Args:
+        provider: Single-key provider mapping or ``SandboxProvider`` selecting
+            the sandbox backend.
+        task: The task whose patch is graded.
+        run_golden: When True, grade the task's golden patch instead of the
+            model patch.
+        eval_timeout_s: Optional override for the per-call eval timeout in
+            seconds; falls back to ``task.metadata["eval_timeout_s"]``, then to 1800.
+
+    Returns:
+        SweEvalReport: The grading outcome, with ``error_kind="eval_timeout"`` set
+            only on a genuine wall-clock eval timeout; non-timeout eval-stage
+            failures are reported unmasked (``resolved=False``, ``error_kind=None``).
+
+    Raises:
+        ProviderCapabilityError: If the task's harness does not support the provider.
+        GraderDependencyError: If a required grading dependency is unavailable for an
+            instance the harness must grade exactly (propagated, not swallowed).
+    """
+    harness = get_harness(task.benchmark)
+    if task.metadata.get("flat_eval"):
+        # Grade host-side (flat) so nested families (swe-bench / r2e-gym) can be graded on
+        # exec-only providers like docker; a no-op for already-flat families.
+        harness = harness.with_flat_eval()
+
+    if run_golden:
+        task = dataclasses.replace(task, model_patch=task.metadata.get("golden_patch", ""))
+
+    # Empty/falsy-patch fast path: skip eval spin-up entirely.
+    if not (task.model_patch or "").strip():
+        return SweEvalReport(instance_id=task.instance_id, patch_exists=False, resolved=False)
+
+    provider_name = _provider_name(provider)
+    if not harness.supports_provider(provider_name):
+        raise ProviderCapabilityError(
+            f"Harness {harness.name!r} does not support provider {provider_name!r} "
+            f"(grade_strategy={harness.grade_strategy})"
+        )
+
+    spec = harness.build_spec(task)
+    timeout = eval_timeout_s if eval_timeout_s is not None else float(task.metadata.get("eval_timeout_s", 1800))
+    # Stamp a TTL so backends that honor it (opensandbox) self-expire an eval sandbox
+    # orphaned by a hard crash. docker ignores ttl_s; its finally-teardown covers it.
+    if spec.ttl_s is None:
+        spec = dataclasses.replace(spec, ttl_s=timeout + _TTL_SLACK_S)
+
+    try:
+        async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env:
+
+            async def _sequence() -> SweEvalReport:
+                await harness.reset_repo(env, task)
+                await harness.materialize(env, task)
+                artifacts = await harness.run_eval(env, task)
+                return harness.grade(task, artifacts)
+
+            return await asyncio.wait_for(_sequence(), timeout=timeout)
+    except GraderDependencyError:
+        # A required grader dependency is missing (e.g. swebench for a SWE-bench instance).
+        # Propagate rather than degrading to an unmasked reward-0 so the misconfiguration is
+        # loud (a crash in the standalone path; every sample masked in the anyswe path) instead
+        # of silently skewing the resolve rate.
+        raise
+    except (asyncio.TimeoutError, TimeoutError):
+        # Genuine wall-clock eval timeout: mask via error_kind. This mirrors main's
+        # app.py, which sets eval_timed_out (-> mask_sample) only when the final eval
+        # elapsed time reaches the configured tests timeout.
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            error_kind="eval_timeout",
+            tests_status={"timeout_s": timeout},
+        )
+    except Exception as exc:
+        # A non-timeout eval-stage crash is NOT masked. main's app.py catches any eval
+        # exception, produces no report file (resolved=False), and leaves eval_timed_out
+        # False, so the sample stays in the gradient at reward 0. Returning
+        # error_kind=None keeps mask_sample aligned with main, which does not mask
+        # infra crashes.
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            resolved=False,
+            error_kind=None,
+            tests_status={"exception": repr(exc)},
+        )
+
+
+def report_to_reward(report: SweEvalReport) -> float:
+    """Convert an eval report into a scalar reward.
+
+    Args:
+        report: The grading outcome to score.
+
+    Returns:
+        float: The reward derived from the report.
+    """
+    return reward_from_report(report)
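Reviewer note: the three `except` arms above encode a policy that is easy to misread, so here is a minimal, self-contained sketch of it. `Report`, `MissingGraderDependency`, `evaluate`, and `is_masked` are hypothetical stand-ins (for `SweEvalReport`, `GraderDependencyError`, and the function in this hunk), and the rule "masked exactly when `error_kind` is set" is an assumption inferred from the comments, not something this diff shows.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


class MissingGraderDependency(RuntimeError):
    """Hypothetical stand-in for GraderDependencyError."""


@dataclass
class Report:
    """Hypothetical, trimmed-down stand-in for SweEvalReport."""

    resolved: bool = False
    error_kind: Optional[str] = None


async def evaluate(sequence: Callable[[], Awaitable[Report]], timeout: float) -> Report:
    """Run the eval sequence under a wall-clock timeout and classify failures."""
    try:
        return await asyncio.wait_for(sequence(), timeout=timeout)
    except MissingGraderDependency:
        raise  # misconfiguration: propagate so it fails loudly
    except (asyncio.TimeoutError, TimeoutError):
        return Report(error_kind="eval_timeout")  # timeout: flagged for masking
    except Exception:
        return Report(resolved=False, error_kind=None)  # crash: unmasked, unresolved


def is_masked(report: Report) -> bool:
    # Assumption: a sample is masked exactly when error_kind is set.
    return report.error_kind is not None


async def _slow() -> Report:
    await asyncio.sleep(1)
    return Report(resolved=True)


async def _crash() -> Report:
    raise ValueError("infra failure")


timed_out = asyncio.run(evaluate(_slow, timeout=0.01))
crashed = asyncio.run(evaluate(_crash, timeout=1.0))
```

The order of the `except` clauses matters: the dependency error is a subclass of `Exception`, so it has to be re-raised before the catch-all arm can swallow it.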
diff --git a/responses_api_agents/claude_code_agent/app.py b/responses_api_agents/claude_code_agent/app.py
index 6970d92ce1..9a399a5094 100644
--- a/responses_api_agents/claude_code_agent/app.py
+++ b/responses_api_agents/claude_code_agent/app.py
@@ -15,9 +15,11 @@
 
 import asyncio
 import copy
+import dataclasses
 import json
 import logging
 import os
+import shlex
 import shutil
 import subprocess
 import tempfile
@@ -50,6 +52,7 @@
     NeMoGymResponseOutputTokensDetails,
     NeMoGymResponseUsage,
 )
+from nemo_gym.sandbox import AsyncSandbox, SandboxResources, SandboxSpec
 from nemo_gym.server_utils import get_response_json, raise_for_status
 from nemo_gym.skills import stage_skills
 from responses_api_agents.claude_code_agent.setup_claude_code import ensure_claude_code
@@ -237,10 +240,13 @@ class ClaudeCodeAgentConfig(BaseResponsesAPIAgentConfig):
     bare: bool = True
     mcp_config: Optional[str] = None
     settings: Optional[str] = None
+    sandbox_provider: Optional[dict[str, Any]] = None
+    in_box_timeout_s: int = 1800
 
 
 class ClaudeCodeAgentRunRequest(BaseRunRequest):
     model_config = ConfigDict(extra="allow")
+    verifier_metadata: Optional[dict[str, Any]] = None
 
 
 class ClaudeCodeAgentVerifyResponse(BaseVerifyResponse):
@@ -490,6 +496,131 @@ def _write_rollout_mcp_config(self, seed_response_json: dict[str, Any], output_d
         config_path.write_text(json.dumps(config, indent=2, sort_keys=True))
         return str(config_path)
 
+    @staticmethod
+    def _sandbox_spec_from_descriptor(spec_dict: dict[str, Any]) -> SandboxSpec:
+        payload = dict(spec_dict)
+        resources = payload.pop("resources", None)
+        if resources is None:
+            resources = SandboxResources()
+        elif not isinstance(resources, SandboxResources):
+            resources = SandboxResources.from_mapping(resources)
+        return SandboxSpec(**payload, resources=resources)
+
+    def _anthropic_env(self) -> tuple[dict[str, str], str]:
+        base_url = self._resolve_base_url()
+        model = self.config.model if base_url else self.config.model.split("/")[-1]
+        api_key = self.config.anthropic_api_key
+        env = {
+            "ANTHROPIC_API_KEY": api_key,  # pragma: allowlist secret
+            "ANTHROPIC_MODEL": model,
+            "ANTHROPIC_DEFAULT_HAIKU_MODEL": model,
+            "ANTHROPIC_DEFAULT_SONNET_MODEL": model,
+            "ANTHROPIC_DEFAULT_OPUS_MODEL": model,
+            "CLAUDE_CODE_SUBAGENT_MODEL": model,
+            "IS_SANDBOX": "1",
+        }
+        if base_url:
+            env["ANTHROPIC_BASE_URL"] = base_url
+            env["ANTHROPIC_AUTH_TOKEN"] = api_key or "local"
+        return env, model
+
+    async def _run_in_box(
+        self,
+        body: ClaudeCodeAgentRunRequest,
+        seed_resp_json: dict[str, Any],
+        *,
+        skills_path: Optional[str] = None,
+    ) -> tuple[dict[str, Any], str]:
+        spec_dict = (seed_resp_json.get("sandbox") or {}).get("spec") or {}
+        workdir = spec_dict.get("workdir") or "/testbed"
+        spec = self._sandbox_spec_from_descriptor(spec_dict)
+        egress_env = (seed_resp_json.get("egress") or {}).get("env") or {}
+        anthropic_env, model = self._anthropic_env()
+        spec = dataclasses.replace(spec, env={**spec.env, **egress_env, **anthropic_env})
+
+        provider = self.config.sandbox_provider or {"docker": {}}
+        sandbox = AsyncSandbox(provider, spec)
+        await sandbox.start()
+        claude_config_dir: Path | None = None
+        try:
+            claude_config_dir = self._setup_config_dir(skills_path=skills_path)
+            remote_cfg = "/tmp/nemo_gym_claude"
+            await sandbox.exec(f"mkdir -p {shlex.quote(remote_cfg)}", cwd=workdir, timeout_s=60)
+            await sandbox.upload(str(claude_config_dir / "settings.json"), f"{remote_cfg}/settings.json")
+
+            params = body.responses_create_params.model_copy(deep=True)
+            if isinstance(params.input, str):
+                params.input = [NeMoGymEasyInputMessage(role="user", content=params.input)]
+            user_message, input_system = _extract_instruction(params.input)
+            system_parts = [p for p in [self.config.system_prompt, input_system] if p]
+            system_prompt = "\n\n".join(system_parts) if system_parts else None
+
+            cmd_parts = self._build_command(
+                model,
+                user_message,
+                system_prompt=system_prompt,
+                skills_active=bool(skills_path),
+            )
+            env_prefix = " ".join(f"{shlex.quote(k)}={shlex.quote(v)}" for k, v in spec.env.items() if v is not None)
+            remote_cmd = f"{env_prefix} CLAUDE_CONFIG_DIR={shlex.quote(remote_cfg)} {shlex.join(cmd_parts)}"
+            result = await sandbox.exec(remote_cmd, cwd=workdir, timeout_s=self.config.in_box_timeout_s)
+            stdout = result.stdout or ""
+            if result.error_type == "timeout":
+                LOG.warning("claude-code in-box timed out after %ss", self.config.in_box_timeout_s)
+            elif result.return_code not in (0, None) and stdout.strip() == "":
+                LOG.warning(
+                    "claude-code in-box exited %s: %s",
+                    result.return_code,
+                    (result.stderr or "")[:500],
+                )
+
+            output_items, usage = parse_stream_json(stdout)
+            if not any(
+                getattr(item, "type", None) == "message" and getattr(item, "role", None) == "assistant"
+                for item in output_items
+            ):
+                output_items.append(
+                    NeMoGymResponseOutputMessage(
+                        id=f"msg_{uuid4().hex}",
+                        content=[NeMoGymResponseOutputText(text="", annotations=[])],
+                        role="assistant",
+                        status="completed",
+                        type="message",
+                    )
+                )
+
+            input_tokens = usage.get("input_tokens", 0)
+            output_tokens = usage.get("output_tokens", 0)
+            agent_resp = NeMoGymResponse(
+                id=f"resp_{uuid4().hex}",
+                created_at=int(time()),
+                model=model,
+                object="response",
+                output=output_items,
+                tool_choice=params.tool_choice,
+                tools=params.tools,
+                parallel_tool_calls=params.parallel_tool_calls,
+                usage=NeMoGymResponseUsage(
+                    input_tokens=input_tokens,
+                    input_tokens_details=NeMoGymResponseInputTokensDetails(cached_tokens=0),
+                    output_tokens=output_tokens,
+                    output_tokens_details=NeMoGymResponseOutputTokensDetails(reasoning_tokens=0),
+                    total_tokens=input_tokens + output_tokens,
+                ),
+            )
+
+            patch_result = await sandbox.exec(
+                f"cd {shlex.quote(workdir)} && git add -A && git diff --cached",
+                cwd=workdir,
+                timeout_s=120,
+            )
+            patch = patch_result.stdout or ""
+            return agent_resp.model_dump(mode="json"), patch
+        finally:
+            if claude_config_dir is not None:
+                shutil.rmtree(claude_config_dir, ignore_errors=True)
+            await sandbox.stop()
+
     async def _create_response(
         self,
         body: NeMoGymResponseCreateParamsNonStreaming,
@@ -569,23 +700,32 @@ async def run(self, request: Request, body: ClaudeCodeAgentRunRequest) -> Claude
             cookies = seed_resp.cookies
             seed_resp_json = await get_response_json(seed_resp)
 
-            # The run-level skills_ref (stamped by rollout collection) rides on the request body
-            # (extra="allow"). Pass its path straight into _create_response so the CLI invocation
-            # can stage the skills into its per-request CLAUDE_CONFIG_DIR. run() calls _create_response
-            # in-process, so no metadata side-channel is needed (unlike the schema-forbidden HTTP path).
             skills_path = ((body.model_extra or {}).get(SKILLS_REF_KEY_NAME) or {}).get("path")
-
-            with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
-                mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
-                agent_resp = await self._create_response(
-                    body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
-                )
-                agent_resp_json = agent_resp.model_dump(mode="json")
+            topology = (seed_resp_json.get("placement") or {}).get("topology") or "none"
+
+            if topology == "agent_in_env":
+                agent_resp_json, model_patch = await self._run_in_box(body, seed_resp_json, skills_path=skills_path)
+                verifier_metadata = {
+                    **(body.verifier_metadata or {}),
+                    **(seed_resp_json.get("verifier_metadata") or {}),
+                    "model_patch": model_patch,
+                }
+            else:
+                with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
+                    mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
+                    agent_resp = await self._create_response(
+                        body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
+                    )
+                    agent_resp_json = agent_resp.model_dump(mode="json")
+                verifier_metadata = {
+                    **(body.verifier_metadata or {}),
+                    **(seed_resp_json.get("verifier_metadata") or {}),
+                }
 
             verify_resp = await self.server_client.post(
                 server_name=self.config.resources_server.name,
                 url_path="/verify",
-                json=body.model_dump() | {"response": agent_resp_json},
+                json=body.model_dump() | {"response": agent_resp_json, "verifier_metadata": verifier_metadata},
                 cookies=cookies,
             )
             await raise_for_status(verify_resp)
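Reviewer note: `_run_in_box` builds the in-box command line by quoting each environment assignment with `shlex.quote` and joining the argv with `shlex.join`. A minimal sketch of that quoting step follows. `build_remote_command` is a hypothetical helper written for illustration, not a function in this PR, and the sample environment and argv are made up.

```python
import shlex
from typing import Dict, List


def build_remote_command(env: Dict[str, str], argv: List[str]) -> str:
    """Return 'K=V ... cmd args' with every value and argument shell-quoted."""
    prefix = " ".join(f"{shlex.quote(k)}={shlex.quote(v)}" for k, v in env.items())
    return f"{prefix} {shlex.join(argv)}".strip()


# Values with spaces, semicolons, or quotes survive as single shell words.
cmd = build_remote_command(
    {"EXAMPLE_MODEL": "my model"},
    ["echo", "it's; fine"],
)
# Splitting with POSIX rules recovers the original words.
words = shlex.split(cmd)
```

One caveat worth keeping in mind: `shlex.quote` only accepts strings, so a `None` value in the environment mapping raises `TypeError` before the command is ever sent.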
diff --git a/tests/unit_tests/test_docker_provider.py b/tests/unit_tests/test_docker_provider.py
new file mode 100644
index 0000000000..46062a60d6
--- /dev/null
+++ b/tests/unit_tests/test_docker_provider.py
@@ -0,0 +1,213 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the local Docker ``SandboxProvider`` (CLI mocked, no docker required)."""
+
+import asyncio
+from pathlib import Path
+from typing import Any, Callable
+
+import pytest
+
+from nemo_gym.sandbox.providers.base import (
+    SandboxCreateError,
+    SandboxResources,
+    SandboxSpec,
+    SandboxStatus,
+)
+from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider
+
+
+class RunRecorder:
+    """Stand-in for ``DockerSandboxProvider._run`` that records argv and returns canned output.
+
+    The responder maps the captured ``docker`` args to a ``(rc, stdout, stderr)`` tuple, and may
+    raise (e.g. ``TimeoutError``) to simulate a CLI failure.
+    """
+
+    def __init__(self, responder: Callable[[list[str]], tuple[int, str, str]]) -> None:
+        self.calls: list[dict[str, Any]] = []
+        self._responder = responder
+
+    async def __call__(self, *args: str, timeout_s: float | None = None) -> tuple[int, str, str]:
+        self.calls.append({"args": list(args), "timeout_s": timeout_s})
+        return self._responder(list(args))
+
+
+def _make_provider(
+    monkeypatch: pytest.MonkeyPatch, responder: Callable[[list[str]], tuple[int, str, str]], **kwargs: Any
+) -> tuple[DockerSandboxProvider, RunRecorder]:
+    provider = DockerSandboxProvider(**kwargs)
+    rec = RunRecorder(responder)
+    monkeypatch.setattr(provider, "_run", rec)
+    return provider, rec
+
+
+def _ran(rec: RunRecorder, *prefix: str) -> bool:
+    """True if any recorded call's args start with ``prefix`` (e.g. ``"rm", "-f"``)."""
+    return any(call["args"][: len(prefix)] == list(prefix) for call in rec.calls)
+
+
+# --------------------------------------------------------------------------- #
+# Construction
+# --------------------------------------------------------------------------- #
+def test_concurrency_must_be_positive() -> None:
+    """A non-positive concurrency is rejected up front."""
+    with pytest.raises(ValueError):
+        DockerSandboxProvider(concurrency=0)
+
+
+def test_concurrency_bounds_the_semaphore() -> None:
+    """The provider's shared semaphore is sized to the configured concurrency."""
+    assert DockerSandboxProvider(concurrency=4)._semaphore._value == 4
+
+
+# --------------------------------------------------------------------------- #
+# create()
+# --------------------------------------------------------------------------- #
+def test_create_returns_handle_with_last_line_id(monkeypatch: pytest.MonkeyPatch) -> None:
+    """create() uses the LAST stdout line as the container id and pre-assigns a unique name."""
+    provider, rec = _make_provider(
+        monkeypatch, lambda args: (0, "WARNING: noise\ncontainer-abc\n", ""), network="host"
+    )
+    handle = asyncio.run(provider.create(SandboxSpec(image="img:tag", workdir="/testbed", env={"A": "1"})))
+    assert handle.sandbox_id == "container-abc"
+    run_args = rec.calls[0]["args"]
+    assert run_args[:3] == ["run", "-d", "--init"]
+    assert "--name" in run_args and run_args[run_args.index("--name") + 1].startswith("nemo-gym-")
+    assert ["--network", "host"] == run_args[run_args.index("--network") : run_args.index("--network") + 2]
+    assert "img:tag" in run_args
+
+
+def test_create_requires_image() -> None:
+    """A spec without an image is rejected before any docker call."""
+    with pytest.raises(SandboxCreateError):
+        asyncio.run(DockerSandboxProvider().create(SandboxSpec(image=None)))
+
+
+def test_create_empty_stdout_guard_and_reap(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 0 with empty stdout raises (no IndexError) and reaps the pre-assigned name."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "   \n", "") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="did not return a container id"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_nonzero_rc_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A non-zero ``docker run`` reaps the orphan and raises with the stderr."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (125, "", "boom") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="boom"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_timeout_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out ``docker run`` reaps the (possibly daemon-started) orphan by name."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        if args[0] == "run":
+            raise asyncio.TimeoutError
+        return (0, "", "")
+
+    provider, rec = _make_provider(monkeypatch, responder)
+    with pytest.raises(SandboxCreateError, match="timed out"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_applies_resource_limits(monkeypatch: pytest.MonkeyPatch) -> None:
+    """Resource requests become ``--memory``/``--cpus``/``--gpus`` run args."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "cid\n", ""))
+    spec = SandboxSpec(image="img:tag", resources=SandboxResources(cpu=2, memory_mib=512, gpu=1))
+    asyncio.run(provider.create(spec))
+    run_args = rec.calls[0]["args"]
+    assert "--memory=512m" in run_args
+    assert "--cpus=2" in run_args
+    assert "--gpus=all" in run_args
+
+
+# --------------------------------------------------------------------------- #
+# exec()
+# --------------------------------------------------------------------------- #
+def test_exec_classifies_docker_level_failure(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 125/126/127 with no stdout is a docker-level (``sandbox``) failure."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (125, "", "no such container"))
+    res = asyncio.run(provider.exec(_handle(), "echo hi"))
+    assert res.return_code == 125
+    assert res.error_type == "sandbox"
+
+
+def test_exec_success_has_no_error_type(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A successful exec carries stdout and no error type."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "ok", ""))
+    res = asyncio.run(provider.exec(_handle(), "true"))
+    assert res.return_code == 0 and res.stdout == "ok" and res.error_type is None
+
+
+def test_exec_timeout_returns_124(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out exec returns rc 124 + ``timeout`` error type rather than raising."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise asyncio.TimeoutError
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    res = asyncio.run(provider.exec(_handle(), "sleep 1", timeout_s=0.01))
+    assert res.return_code == 124 and res.error_type == "timeout"
+
+
+# --------------------------------------------------------------------------- #
+# status / close / file transfer
+# --------------------------------------------------------------------------- #
+def test_status_running_and_stopped(monkeypatch: pytest.MonkeyPatch) -> None:
+    """status() maps docker inspect output to RUNNING/STOPPED/UNKNOWN."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "true\n", ""))
+    assert asyncio.run(provider.status(_handle())) is SandboxStatus.RUNNING
+    provider2, _ = _make_provider(monkeypatch, lambda args: (0, "false\n", ""))
+    assert asyncio.run(provider2.status(_handle())) is SandboxStatus.STOPPED
+    provider3, _ = _make_provider(monkeypatch, lambda args: (1, "", "gone"))
+    assert asyncio.run(provider3.status(_handle())) is SandboxStatus.UNKNOWN
+
+
+def test_close_force_removes(monkeypatch: pytest.MonkeyPatch) -> None:
+    """close() force-removes the container by id."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "", ""))
+    asyncio.run(provider.close(_handle()))
+    assert _ran(rec, "rm", "-f", "cid")
+
+
+def test_upload_failure_raises(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    """A failed ``docker cp`` upload raises a clear RuntimeError."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "", "") if args[0] == "exec" else (1, "", "nope"))
+    src = tmp_path / "f.txt"
+    src.write_text("x")
+    with pytest.raises(RuntimeError, match="upload failed"):
+        asyncio.run(provider.upload_file(_handle(), src, "/dst/f.txt"))
+
+
+def test_reap_orphan_swallows_errors(monkeypatch: pytest.MonkeyPatch) -> None:
+    """_reap_orphan never raises, even when the ``docker rm`` itself fails/raises."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise RuntimeError("docker daemon down")
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    asyncio.run(provider._reap_orphan("nemo-gym-x"))  # must not raise
+
+
+def _handle():
+    """A minimal docker SandboxHandle for exec/status/close tests."""
+    from nemo_gym.sandbox.providers.base import SandboxHandle
+
+    return SandboxHandle(sandbox_id="cid", provider_name="docker", raw={"workdir": "/testbed"})
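Reviewer note: the tests above rely on swapping the provider's `_run` coroutine for a recording stub, so no Docker CLI is needed. Below is a minimal sketch of that pattern without pytest. `TinyProvider` and `Recorder` are hypothetical stand-ins for `DockerSandboxProvider` and `RunRecorder`, and the plain attribute assignment stands in for `monkeypatch.setattr`.

```python
import asyncio
from typing import Any, Callable, Dict, List, Optional, Tuple

Responder = Callable[[List[str]], Tuple[int, str, str]]


class Recorder:
    """Async callable that records each call and returns canned (rc, stdout, stderr)."""

    def __init__(self, responder: Responder) -> None:
        self.calls: List[Dict[str, Any]] = []
        self._responder = responder

    async def __call__(self, *args: str, timeout_s: Optional[float] = None) -> Tuple[int, str, str]:
        self.calls.append({"args": list(args), "timeout_s": timeout_s})
        return self._responder(list(args))


class TinyProvider:
    """Hypothetical provider that funnels every CLI call through self._run."""

    async def _run(self, *args: str, timeout_s: Optional[float] = None) -> Tuple[int, str, str]:
        raise RuntimeError("would invoke a real CLI")

    async def close(self, container_id: str) -> bool:
        rc, _, _ = await self._run("rm", "-f", container_id, timeout_s=30)
        return rc == 0


provider = TinyProvider()
rec = Recorder(lambda args: (0, "", ""))
# The instance attribute shadows the method, so the stub is called without self.
provider._run = rec
closed = asyncio.run(provider.close("cid"))
```

Because every CLI interaction goes through one seam, a test can assert on the exact argv that would have been executed and also inject failures by making the responder raise.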