diff --git a/fern/versions/latest/pages/about/concepts/key-terminology.mdx b/fern/versions/latest/pages/about/concepts/key-terminology.mdx index 0a90334638..c69943494d 100644 --- a/fern/versions/latest/pages/about/concepts/key-terminology.mdx +++ b/fern/versions/latest/pages/about/concepts/key-terminology.mdx @@ -63,7 +63,11 @@ The FastAPI service (under `resources_servers/`) that holds per-task state, expo **Agent Server (Responses API Agent)** -The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the harness that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own. +The FastAPI service (under `responses_api_agents/`) that drives the model through a task — the **agent harness** that runs the multi-step / tool-calling loop against a resources server. Gym ships several built-in harnesses (e.g. `simple_agent`, `aviary_agent`, and others under `responses_api_agents/`); pick whichever fits your control flow, or bring your own. + +**Harness (disambiguation)** + +“Harness” is overloaded in the agent-eval community. In Gym docs it usually means **agent harness** (orchestration in an agent server). In [SWE-bench](https://www.swebench.com/SWE-bench/reference/harness/), it often means the **grading pipeline** (`swebench.harness`). See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology). **Model Server (Responses API Model)** diff --git a/fern/versions/latest/pages/evaluation/index.mdx b/fern/versions/latest/pages/evaluation/index.mdx index 850ddd0cff..412c359025 100644 --- a/fern/versions/latest/pages/evaluation/index.mdx +++ b/fern/versions/latest/pages/evaluation/index.mdx @@ -40,6 +40,10 @@ Harness changes often affect metrics, and some models are better tuned to use sp NeMo Gym treats an agent as model plus harness. 
The model server stays stateless; the agent server owns the loop that calls the model, routes tool calls, manages conversation state, and asks the resources server to verify the final attempt. + +Outside Gym, “harness” often means the SWE-bench **grading pipeline** (`swebench.harness`) rather than agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology). + + diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx new file mode 100644 index 0000000000..c99929d2f6 --- /dev/null +++ b/fern/versions/latest/pages/infrastructure/engineering-notes/claude-code-agent-protocol-stack.mdx @@ -0,0 +1,250 @@ +--- +title: "Claude Code Agent — Protocol Stack & Data Contracts" +description: "How claude_code_agent, model servers, PR #1627 /v1/messages, and NeMoGymResponse fit together." +--- +This engineering note summarizes the protocol stack behind the `claude_code_agent` harness: which Gym entities participate in a rollout, which data contracts apply at each hop, what [PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added, and where RL metadata lives. + +## The four Gym server types + +An environment decomposes into four concepts. Each maps to a FastAPI server type: + +| Concept | Component | Key endpoints | +| --- | --- | --- | +| Dataset | JSONL rows | `responses_create_params` per task | +| Agent harness | `responses_api_agents/` | `POST /run`, `POST /v1/responses` | +| Verifier + state | `resources_servers/` | `POST /seed_session`, `POST /verify` | +| Model | `responses_api_models/` | `POST /v1/responses`, `/v1/chat/completions`, `/v1/messages` | + +The **Claude Code CLI** (`claude -p`) is *not* a Gym server. It is a black-box subprocess spawned by `claude_code_agent` that only speaks the **Anthropic Messages API**. 
+ +## Gym's canonical data contracts + +Gym standardizes on the **OpenAI Responses API shape** (with NeMo-specific extensions). Two types form the request/response pair: + +| Type | Role | +| --- | --- | +| `NeMoGymResponseCreateParamsNonStreaming` | **Request** — input messages, tools, sampling params (from dataset JSONL) | +| `NeMoGymResponse` | **Response** — accumulated trajectory in `output[]`, plus `usage` | + +`NeMoGymResponse` is **not** owned exclusively by models or agents. It is Gym's **shared trajectory contract**: + +- **Model servers** produce it on `POST /v1/responses` +- **Agents** produce it on `POST /v1/responses` +- **Resources servers** consume it in `POST /verify` (`BaseVerifyRequest.response`) +- **Rollout harness** reads it from agent `/run` results + +The trajectory building block is **`NeMoGymResponse.output[]`** — a sequence of messages, tool calls, tool results, and reasoning items. + +### Training variants (`*ForTraining`) + +RL metadata lives on **individual output items**, not on the top-level response envelope: + +```python +class TokenIDLogProbMixin(BaseModel): + prompt_token_ids: List[int] + generation_token_ids: List[int] + generation_log_probs: List[float] +``` + +Training subclasses (`NeMoGymResponseOutputMessageForTraining`, etc.) mix this in. A rollout JSONL row with RL data looks like: + +```json +{ + "output": [ + { + "type": "message", + "role": "assistant", + "content": [...], + "prompt_token_ids": [1, 2, 3], + "generation_token_ids": [4, 5, 6], + "generation_log_probs": [-0.1, -0.2, -0.3] + } + ] +} +``` + +## Alternate wire formats (not the Gym trajectory contract) + +Model servers expose **three** HTTP endpoints. Only one returns `NeMoGymResponse` on the wire: + +| Endpoint | Wire format | Gym trajectory? 
| +| --- | --- | --- | +| `POST /v1/responses` | `NeMoGymResponse` JSON | **Yes** | +| `POST /v1/chat/completions` | `NeMoGymChatCompletion` JSON | No — one chat turn; converted internally or by caller | +| `POST /v1/messages` | Anthropic Message JSON or SSE | No — foreign protocol adapter ([PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627)) | + +`NeMoGymChatCompletion` is a **backend/wire format** (one assistant turn in `choices[0].message`). `vllm_model` uses it internally: `responses()` converts to chat params, calls `chat_completions()`, then converts back to `NeMoGymResponse.output[]`. + +Agents like `simple_agent` call model `POST /v1/responses` directly and never see chat completion. `harbor_agent` calls chat completions directly and converts its trajectory to `NeMoGymResponse` output items at the end. + +## What PR #1627 added (and did not add) + +[PR #1627](https://github.com/NVIDIA-NeMo/Gym/pull/1627) added a **third spoke** on model servers — not changes to the agent: + +**Before:** `SimpleResponsesAPIModel` exposed `/v1/chat/completions` and `/v1/responses` only. + +**After:** Every model server also exposes `POST /v1/messages` with a default handler that: + +1. Converts Anthropic request → `NeMoGymResponseCreateParams` +2. Delegates to the server's own `responses()` → internal `NeMoGymResponse` +3. Converts `NeMoGymResponse` → Anthropic response (JSON or synthesized SSE) + +**Not in PR #1627:** + +- `claude_code_agent` itself (from #1336) — already had `model_server` ref and `_resolve_base_url()` +- Default `reasoning_gym_claude_code_agent.yaml` — still points at real Anthropic API +- RL side-channel plumbing — converter explicitly drops token IDs before Anthropic conversion + +## End-to-end rollout flow (linear) + +This is the full stack when using `reasoning_gym_claude_code_agent_model_server.yaml` + a model server (e.g. 
`vllm_model`): + +```mermaid +flowchart TD + RC["ng_collect_rollouts"] + RUN["agent POST /run"] + SEED["resources_server POST /seed_session"] + RESP["agent POST /v1/responses"] + CLI["claude -p subprocess"] + MSG["model_server POST /v1/messages"] + MSRESP["model_server responses()"] + BACKEND["inference backend POST /v1/chat/completions"] + NOTE["↻ repeat for each Claude LLM turn"] + PARSE["agent builds NeMoGymResponse from stream-json"] + VERIFY["resources_server POST /verify"] + OUT["rollout JSONL"] + + RC -->|"NeMoGymResponseCreateParams"| RUN + RUN --> SEED --> RESP + RESP -->|"shell + ANTHROPIC_BASE_URL"| CLI + CLI -->|"Anthropic Messages SSE"| MSG + MSG --> MSRESP --> BACKEND + BACKEND --> MSG --> CLI + CLI --> NOTE + NOTE --> PARSE + PARSE -->|"NeMoGymResponse"| VERIFY + VERIFY --> OUT +``` + +### Message types at each hop + +| Step | From → To | Format | +| --- | --- | --- | +| 1 | Rollout → agent `/run` | Task row with `responses_create_params` | +| 2 | Agent `/run` → resources `/seed_session` | Same task row | +| 3 | Agent `/run` → agent `/v1/responses` | `NeMoGymResponseCreateParamsNonStreaming` | +| 4 | Agent → Claude subprocess | Shell env (`ANTHROPIC_BASE_URL`, etc.) | +| 5 | Claude ↔ model `/v1/messages` | **Anthropic Messages** (many turns) | +| 6 | Inside model server | Anthropic → `NeMoGymResponse` → Anthropic (internal) | +| 7 | Model server ↔ vLLM | OpenAI Chat Completions (internal) | +| 8 | Claude → agent | **stream-json stdout** (full session) | +| 9 | Agent `/v1/responses` return | **`NeMoGymResponse`** (episode-level, for scoring) | +| 10 | Agent → resources `/verify` | Task row + `NeMoGymResponse` | + +## Two NeMoGymResponse lifetimes (model-server path) + +This is a common source of confusion. 
On the model-server path there are **two separate** `NeMoGymResponse` objects: + +### Per-turn (internal, inside model server) + +Each Claude LLM call triggers: + +``` +Anthropic request → NeMoGymResponseCreateParams → responses() → NeMoGymResponse + → stripped → Anthropic response back to Claude +``` + +This object can carry RL fields when `return_token_id_information=True`, but Claude never sees them and Gym rollouts do not receive them today. + +### Episode-level (what Gym scoring uses) + +After Claude finishes the full session, the agent parses stream-json and **constructs one** `NeMoGymResponse` in `claude_code_agent.responses()`. That is what `/verify` reads. + +Today this episode-level response uses plain `NeMoGymResponseOutputMessage` items — **no RL fields**, even if the model server produced them per turn. + +## Direct Anthropic path (shorter) + +With the default `reasoning_gym_claude_code_agent.yaml` (`anthropic_base_url: null`), steps involving the Gym model server drop out: + +```mermaid +flowchart TD + RC["ng_collect_rollouts"] + RUN["agent POST /run"] + SEED["resources_server POST /seed_session"] + RESP["agent POST /v1/responses"] + CLI["claude -p subprocess"] + ANTH["api.anthropic.com POST /v1/messages"] + VERIFY["resources_server POST /verify"] + OUT["rollout JSONL"] + + RC --> RUN --> SEED --> RESP --> CLI + CLI <-->|"Anthropic Messages"| ANTH + CLI -->|"stream-json"| RESP + RESP -->|"NeMoGymResponse"| VERIFY --> OUT +``` + +PR #1627 is **invisible** on this path. + +## Why Anthropic format for Claude? + +Claude Code CLI is hard-wired to `POST /v1/messages`. It cannot call Gym's `/v1/responses` or OpenAI chat completions. When you point Claude at a Gym model server, the server must **speak Anthropic on the wire** even though it implements `responses()` internally. 
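
The adapter role described above can be sketched with plain dictionaries. This is a minimal illustration, not Gym's implementation: the helper names (`anthropic_to_responses_params`, `responses_to_anthropic`, `handle_messages`, `fake_responses`) and the exact field handling are assumptions made for the example. Only the three-step shape (Anthropic request in, internal `responses()` call, Anthropic response out with per-item RL fields dropped) comes from this note.

```python
from typing import Any, Callable, Dict, List


def anthropic_to_responses_params(req: Dict[str, Any]) -> Dict[str, Any]:
    """Step 1 (hypothetical helper): map an Anthropic request to Responses-style params."""
    params: Dict[str, Any] = {
        "model": req["model"],
        "input": [{"role": m["role"], "content": m["content"]} for m in req["messages"]],
        "max_output_tokens": req["max_tokens"],
    }
    if "system" in req:
        params["instructions"] = req["system"]
    return params


def responses_to_anthropic(resp: Dict[str, Any]) -> Dict[str, Any]:
    """Step 3 (hypothetical helper): keep only text; RL fields on items are not copied."""
    blocks: List[Dict[str, str]] = []
    for item in resp["output"]:
        if item.get("type") != "message":
            continue  # tool calls and reasoning items are out of scope for this sketch
        for part in item["content"]:
            if part.get("type") == "output_text":
                blocks.append({"type": "text", "text": part["text"]})
    return {
        "type": "message",
        "role": "assistant",
        "model": resp["model"],
        "content": blocks,
        "stop_reason": "end_turn",
    }


def handle_messages(
    req: Dict[str, Any],
    responses_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Steps 1-3 chained: convert, delegate to responses(), convert back."""
    internal = responses_fn(anthropic_to_responses_params(req))
    return responses_to_anthropic(internal)


def fake_responses(params: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a model server's responses(); returns one training-style item."""
    return {
        "model": params["model"],
        "output": [
            {
                "type": "message",
                "role": "assistant",
                "content": [{"type": "output_text", "text": "4"}],
                "prompt_token_ids": [1, 2, 3],
                "generation_token_ids": [4],
                "generation_log_probs": [-0.1],
            }
        ],
    }


request = {"model": "demo", "max_tokens": 16, "messages": [{"role": "user", "content": "2+2?"}]}
reply = handle_messages(request, fake_responses)
```

In this sketch the token IDs and log probabilities exist only on the internal response; `reply` carries text blocks alone, which matches the note's point that Claude never sees RL fields on the `/v1/messages` path.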
+ +Think of `/v1/messages` as a **protocol adapter**: + +``` +Claude (USB-C / Anthropic) ↔ Gym model server adapter ↔ vLLM (HDMI / Chat Completions) +``` + +Gym's rollout pipeline only cares about the final **`NeMoGymResponse`** the agent builds from stream-json — not the per-turn Anthropic exchanges. + +## RL metadata: where it exists and where it is lost + +| Location | RL fields present? | +| --- | --- | +| `vllm_model` internal `NeMoGymResponse` (`return_token_id_information=True`) | Yes — on `*ForTraining` output items | +| Model server `POST /v1/responses` wire response | Yes (when configured) | +| Model server `POST /v1/messages` wire response | **No** — stripped in `responses_to_anthropic_response()` | +| `claude_code_agent` episode `NeMoGymResponse` | **No** — `parse_stream_json()` builds plain messages | +| Resources server `/verify` | Reads text from `NeMoGymResponse.output[]`; RL fields unused for scoring | + +The planned RL path (not yet wired for Claude Code) would **side-channel** per-turn token IDs from the model server's internal `NeMoGymResponse` and merge them into the agent's episode-level `NeMoGymResponse` as existing `*ForTraining` types — not invent a new schema. + +## Protocol layers on a model server + +```mermaid +flowchart TD + subgraph gym["Gym trajectory layer"] + NR["NeMoGymResponse\n(building block for rollouts / verify)"] + end + + subgraph convert["Conversion (inside model server or agent)"] + C["AnthropicConverter / VLLMConverter"] + end + + subgraph wire["Backend wire formats"] + CC["NeMoGymChatCompletion\n/v1/chat/completions"] + AM["Anthropic Message\n/v1/messages"] + OR["OpenAI native Response\nopenai_model upstream"] + end + + NR --> C + C --> CC + C --> AM + C --> OR +``` + +## Config cheat sheet + +| Config | Model path | PR #1627 involved? 
| +| --- | --- | --- | +| `claude_code_agent/configs/claude_code_agent.yaml` (template) | Direct Anthropic | No | +| `reasoning_gym_claude_code_agent.yaml` | Direct Anthropic via env vars | No | +| `reasoning_gym_claude_code_agent_model_server.yaml` + `vllm_model.yaml` | Claude → Gym model `/v1/messages` → vLLM | **Yes** | + +## Key takeaways + +1. **`NeMoGymResponse.output[]` is the trajectory building block** — shared across models, agents, verifiers, and rollouts. +2. **`POST /v1/responses` is the Gym contract boundary** — not `/v1/messages` or `/v1/chat/completions`. +3. **PR #1627 adds an Anthropic adapter on model servers** so Claude Code can target any Gym backend without a separate proxy process. +4. **The agent wraps Claude as a black box** — Gym HTTP stops at `/v1/responses`; Claude's multi-turn loop uses Anthropic internally. +5. **RL metadata is schema-ready** (`*ForTraining` types) but **not yet plumbed** through the Claude Code + `/v1/messages` path. diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx new file mode 100644 index 0000000000..2c43260968 --- /dev/null +++ b/fern/versions/latest/pages/infrastructure/engineering-notes/harness-terminology.mdx @@ -0,0 +1,103 @@ +--- +title: "Harness Terminology" +description: "How the agent-eval and SWE-bench communities use “harness,” and how NeMo Gym disambiguates agent orchestration from benchmark grading." +--- +**“Harness” is overloaded.** In agentic coding and SWE-bench discussions, the same word often means either (a) the agent-side orchestration that runs a model on tasks, or (b) the benchmark-side pipeline that grades patches. There is no single community-wide definition — context disambiguates. This note maps the dominant usages and the vocabulary NeMo Gym uses to keep them separate. 
+ +## Three meanings in the wild + +| Context | “Harness” usually means | Agent included? | Example | +| --- | --- | --- | --- | +| **SWE-bench docs / `swebench.harness`** | Grading pipeline: Docker, apply patch, run tests, report resolved | No | `python -m swebench.harness.run_evaluation` | +| **SWE-bench leaderboard / blogs** | Agent orchestration: prompt, tool loop, patch extraction | Yes | “mini-SWE-agent + Claude 4.5 Opus” | +| **NeMo Gym docs** | Agent server orchestration around the model | Yes | `claude_code_agent`, OpenHands, `simple_agent` | +| **Generic ML eval** | Benchmark runner infrastructure | Maybe | “evaluation harnesses” in NeMo Evaluator | + +The collision is sharpest in SWE-bench: **official docs** call the grader “the harness,” while **leaderboard rows** read like “harness + model” where harness is the agent stack. + +## SWE-bench: two halves of one eval + +The [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) documents **`swebench.harness`** — the module that: + +1. Prepares per-instance Docker images +2. Applies a `model_patch` from a predictions file +3. Runs the repository test suite +4. Grades pass/fail and aggregates `% resolved` + +The submission contract is intentionally narrow: produce JSONL with `instance_id` and `model_patch`; the harness scores it. Community tutorials often describe this as a unit-test-shaped split: + +- **Arrange** — harness prepares the instance environment (images, checkout, deps) +- **Act** — *your* agent edits the repo and emits a patch (not part of `swebench.harness`) +- **Assert** — harness applies the patch, runs hidden tests, reports resolved/unresolved + +So in SWE-bench **technical** vocabulary, harness ≈ **Environment + grading authority**. The agent is external. + +What upstream bundles (without always naming it cleanly) is **task world + grading** inside one pipeline. 
What it does **not** model is **how** the agent produced the patch — but leaderboard prose often treats “harness + model” as the evaluated product anyway. + +## Leaderboard and product language + +The [SWE-bench leaderboard](https://www.swebench.com/) reports results as combinations like: + +- **“bash only”** — a specific agent-side setup across models +- **“mini-SWE-agent + Claude 4.5 Opus”** — harness (orchestration) + model + +That usage matches NeMo Gym’s [SWE RL case study](/infrastructure/engineering-notes/swe-rl-case-study): *a harness is a system prompt plus orchestration to execute one attempt at the task.* Here harness ≈ **agent harness**, not `run_evaluation`. + +When reading papers, blog posts, or vendor announcements, assume **harness = agent-side** unless the text explicitly points at Docker grading or `swebench.harness`. + +## NeMo Gym vocabulary (intentional split) + +Gym separates roles that colloquial “harness” often merges: + +| Gym term | Role | SWE-bench analogue | +| --- | --- | --- | +| **Task** | One dataset row / instance (`SweTask`, `TaskPublic`) | `django__django-13741` | +| **Benchmark** | Fixed eval product: split, metric, protocol, baselines | SWE-bench Verified | +| **Environment** | Resources server: `seed_session`, `verify`, state, tools | `swe_bench` RS | +| **Agent harness** | Agent server: multi-step loop, tools, when to stop | Claude Code, OpenHands, mini-SWE-agent | +| **Model** | Stateless inference | vLLM / OpenAI endpoint | + +See [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks) and the [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) note for how `SessionDescriptor`, topology C, and hermetic `verify` fit in. 
+ +### The awkward `harnesses/` directory + +Under [`resources_servers/swe_bench/harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses), **`harness` means something else again**: benchmark-**family eval plugins** owned by the Environment — provision recipes and grading adapters keyed by dataset family (e.g. SWE-bench vs multilingual). They wrap pieces of upstream `swebench.harness`; they are **not** agent orchestration. + +| Name in repo | Meaning | Prefer saying | +| --- | --- | --- | +| Agent server / `responses_api_agents/` | Agent harness | **agent harness** or **agent server** | +| `swebench.harness` (upstream) | Official grading pipeline | **SWE-bench eval harness** or **grading pipeline** | +| `swe_bench/harnesses/` | Environment eval plugins | **benchmark-family plugin** or **eval plugin** (when disambiguation matters) | +| `swe_bench` RS overall | MDP authority for SWE tasks | **Environment** or **resources server** | + +We keep `harness.py` / `harnesses/` in `swe_bench` to align with upstream module naming (`swebench.harness`) and prior `swe_env` convention — not because Gym equates “harness” with agent orchestration. + +## Practical guidance + +**When writing docs or PRs** + +- Say **agent harness** (or **agent server**) for orchestration in `responses_api_agents/`. +- Say **SWE-bench eval harness** or **grading pipeline** for `swebench.harness.run_evaluation`. +- Say **Environment** / **`swe_bench` resources server** for `seed_session` + `verify` + hermetic grading. +- Say **benchmark** for published eval products (Verified, Lite), not for the server binary. +- Say **task** / **instance** for one problem row — not “environment” and not “harness.” + +**When comparing to leaderboard numbers** + +- Identify both **model** and **agent harness** (leaderboard row). +- Confirm **benchmark split** and **grading protocol** (Environment config, `verified:` marker). 
+- Do not conflate upstream `swebench.harness` with the agent named on the leaderboard. + +**When designing new environments** + +- Keep **grading authority** on the resources server (Environment). +- Keep **orchestration** on the agent server (agent harness). +- Use **benchmark-family plugins** only for dataset-specific provision/grade logic — not for agent loops. + +## Related reading + +- [Evaluation — agent harness](/evaluation#agent-harness) — Gym’s primary “harness” definition +- [Key Terminology — Agent Server](/about/concepts/key-terminology#architecture-terms) — architecture glossary +- [SWE-bench Environment Server](/infrastructure/engineering-notes/swe-bench-environment-server) — Task / Benchmark / Environment split for SWE +- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — harness + model on the leaderboard +- [SWE-bench harness reference](https://www.swebench.com/SWE-bench/reference/harness/) — upstream grading pipeline diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx index ce48f79b15..eb92e8bf31 100644 --- a/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx +++ b/fern/versions/latest/pages/infrastructure/engineering-notes/index.mdx @@ -25,6 +25,24 @@ Infrastructure challenges and deployment topology for SWE RL training. swe-rl case-study + +How “harness” is used in SWE-bench vs agent eval vs NeMo Gym — and recommended naming. + +terminology swe-bench + + + +Environment resources server for SWE-bench: session descriptors, topology C, and hermetic verify. + +swe-bench architecture + + + +Responses API, `/v1/messages`, and rollout data contracts for `claude_code_agent`. + +claude-code api-design + + Why NeMo Gym uses aiohttp instead of httpx for async HTTP. 
diff --git a/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx new file mode 100644 index 0000000000..b3a97cc340 --- /dev/null +++ b/fern/versions/latest/pages/infrastructure/engineering-notes/swe-bench-environment-server.mdx @@ -0,0 +1,267 @@ +--- +title: "SWE-bench Environment Server" +description: "Restoring the Environment as a resources server — seed_session descriptors, topology C, and hermetic verify for SWE-bench." +--- +This engineering note documents the **`swe_bench` resources server**: what problem it solves, how it differs from earlier SWE integrations in Gym, and how to run evaluation with a black-box agent server such as `claude_code_agent`. + + +This server ships with `verified: false` — it is a working prototype, not yet baselined on gold patches. See [Adding a Benchmark](/contribute/environments/adding-a-benchmark) for the path to `verified: true`. + + +## Background: why a separate Environment server? + +Earlier SWE convergence work ([PR #1738](https://github.com/NVIDIA-NeMo/Gym/pull/1738)) moved grading and sandbox spec **into the agent server** (`responses_api_agents/swe_env/`, inline `verify_task`). That pattern works for a single bundled agent, but it breaks composability: + +- **Black-box agent servers** (Claude Code, OpenHands, Harbor, …) should not import SWE grading code or choose docker vs OpenSandbox themselves. +- **The Environment** should own task authority: sandbox spec, benchmark grading, and the `verified:` marker on the resources server. +- **Agents** should connect through a small HTTP contract (`seed_session` → run → `verify`), not an `anyswe`-style wrapper per agent. + +The `swe_bench` resources server restores that boundary. 
Grading harnesses, parsing, and `verify_task` live as **private modules** under [`resources_servers/swe_bench/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench) — not under `responses_api_agents/`. + +For cluster-scale SWE RL training topology (Apptainer, CPU sizing), see the older [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study). This note focuses on the **Environment server + agent-server wiring** pattern. + +## Three roles (orthogonal) + +| Role | Gym component | SWE-bench example | +| --- | --- | --- | +| **Environment** | Resources server | `swe_bench` — `seed_session`, `verify`, benchmark harnesses | +| **Agent server** | `responses_api_agents/` | `claude_code_agent` — runs Claude in the instance sandbox | +| **Sandbox runtime** | `nemo_gym/sandbox/` | Docker provider (OpenSandbox / Apptainer as needed) | + + +**“Harness” overload.** In Gym docs, *agent harness* means orchestration inside an agent server. In SWE-bench, `swebench.harness` is the upstream eval stack. Under `swe_bench`, **`harness.py` / `harnesses/`** are **benchmark-family plugins** (provision + grade recipes keyed by `task.benchmark`). They are Environment-owned, not agent orchestration. See [Harness Terminology](/infrastructure/engineering-notes/harness-terminology) for the full map. 
+ + +## Environment vs Benchmark vs Task + +These names refer to different layers (see also [Environments vs Benchmarks](/about/concepts/evaluation#environments-vs-benchmarks)): + +| Layer | SWE-bench example | What it is | +| --- | --- | --- | +| **Benchmark** | *SWE-bench Verified* | Fixed eval product: 500-task test split, `% resolved` metric, comparison protocol, leaderboard baselines | +| **Environment** | `swe_bench` resources server | Executable engine: `seed_session`, `verify`, harness registry, hermetic grading | +| **Task** | `django__django-13741` | One problem instance in the benchmark (prompt + privileged grading metadata) | + +- **`swe_bench` is the Environment** — you can train, dev, or eval with it; it is not the same as the published benchmark. +- **SWE-bench Verified is a Benchmark** built on that Environment (frozen JSONL + eval config + reporting). +- **`verified: true`** on the RS means this Environment configuration is **benchmark-grade** (gold-patch baseline, protocol locked) — not merely that the server exists. + +One Environment supports multiple benchmarks (Verified, Lite, Multilingual) by swapping **tasks** (dataset) and harness keys — no new resources server per publication. + +## What `swe_bench` exposes + +The HTTP surface is intentionally thin ([`app.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/app.py)). Heavy logic stays in private modules. + +| Endpoint | Responsibility | +| --- | --- | +| `POST /seed_session` | Build a **`SessionDescriptor`**: placement topology, per-instance `SandboxSpec`, merged `verifier_metadata` | +| `POST /verify` | Grade `verifier_metadata.model_patch` in a **fresh** eval sandbox (hermetic twin) | + +### SessionDescriptor (response shape) + +`seed_session` returns: + +```json +{ + "placement": { "topology": "agent_in_env" }, + "sandbox": { "spec": { "image": "swebench/sweb.eval.x86_64....", "workdir": "/testbed", ... 
} }, + "egress": { "env": {} }, + "verifier_metadata": { + "instance_id": "django__django-13741", + "benchmark": "swe-bench", + "dataset_name": "princeton-nlp/SWE-bench_Verified", + "flat_eval": true + } +} +``` + +The agent server reads **`placement.topology`** and **`sandbox.spec`** — it never imports `swe_bench.harness` or picks a provider on its own (beyond what its config already declares). + +### Topology C (`agent_in_env`) + +| Topology | Who owns the working sandbox | Typical agent | +| --- | --- | --- | +| `none` | No in-box work; MCP / host-side tools | Default Claude Code + MCP resources | +| `agent_in_env` | Agent starts the descriptor's sandbox and runs inside it | **`claude_code_swe_bench`** | +| `env_sandboxed` | Environment brokers box lifecycle (future broker RS) | Planned | +| `whole_interaction` | Single box for agent + eval (legacy) | `swe_agents` style | + +**Topology C** is the target for SWE-bench Verified with Claude Code: + +1. Environment returns image + workdir from the benchmark harness. +2. Agent server starts that sandbox, runs `claude -p` **inside** the instance image. +3. Agent harvests `git diff --cached` as `model_patch`. +4. Environment grades the patch in a **separate fresh container** (no agent pollution). 
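
The four steps above can be sketched from the agent server's side. This is an illustration under stated assumptions: `run_topology_c` and the injected callables (`start_sandbox`, `run_agent`, `harvest_patch`, `verify`) are hypothetical stand-ins rather than Gym APIs. Only the descriptor keys and the step order come from this note.

```python
from typing import Any, Callable, Dict, List


def run_topology_c(
    descriptor: Dict[str, Any],
    start_sandbox: Callable[[Dict[str, Any]], Any],
    run_agent: Callable[[Any], None],
    harvest_patch: Callable[[Any], str],
    verify: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Drive one attempt for a descriptor whose topology is agent_in_env."""
    topology = descriptor["placement"]["topology"]
    if topology != "agent_in_env":
        raise ValueError(f"this sketch only handles agent_in_env, got {topology!r}")

    # Steps 1-2: the Environment chose image + workdir; the agent starts that box
    # and runs inside it.
    box = start_sandbox(descriptor["sandbox"]["spec"])
    run_agent(box)

    # Step 3: harvest the agent's edits as a unified diff.
    patch = harvest_patch(box)

    # Step 4: hand the patch back; grading happens elsewhere, in a fresh sandbox.
    metadata = dict(descriptor["verifier_metadata"], model_patch=patch)
    return verify(metadata)


# Fakes that record what happened instead of touching Docker or Claude.
descriptor = {
    "placement": {"topology": "agent_in_env"},
    "sandbox": {"spec": {"image": "example/image", "workdir": "/testbed"}},
    "verifier_metadata": {"instance_id": "django__django-13741", "benchmark": "swe-bench"},
}
seen: List[Any] = []
result = run_topology_c(
    descriptor,
    start_sandbox=lambda spec: seen.append(spec["workdir"]) or "box-1",
    run_agent=lambda box: seen.append(box),
    harvest_patch=lambda box: "diff --git a/x b/x\n",
    verify=lambda meta: {"reward": 1.0 if meta["model_patch"] else 0.0},
)
```

Passing the collaborators in as callables keeps the sketch free of any sandbox provider; the agent never decides how grading is done, it only forwards `model_patch` with the metadata it received.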
+ +```mermaid +flowchart TD + RC["gym eval run"] + RUN["agent POST /run"] + SEED["swe_bench POST /seed_session"] + BOX["Agent: AsyncSandbox from descriptor"] + CLAUDE["claude -p in /testbed"] + PATCH["git diff --cached → model_patch"] + VERIFY["swe_bench POST /verify"] + FRESH["verify_task: fresh sandbox"] + OUT["rollout JSONL reward 0/1"] + + RC --> RUN --> SEED + SEED -->|"topology=agent_in_env, sandbox.spec"| BOX + BOX --> CLAUDE --> PATCH + PATCH --> VERIFY + VERIFY --> FRESH --> OUT +``` + +## Benchmark harness layer (private) + +Each SWE dataset family registers a harness under [`harnesses/`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harnesses): + +| Registry key | Class | Notes | +| --- | --- | --- | +| `swe-bench` | `SweBenchHarness("swe-bench")` | Uses upstream `swebench` `make_test_spec` + `get_logs_eval` | +| `swe-bench-multilingual` | `SweBenchHarness("swe-bench-multilingual")` | Same class, different family name | +| `swe-bench-ext` | `SweBenchExtHarness` | Extended / fuzzy parsers | +| `swe-rebench` | `SweRebenchHarness` | SWE-rebench family | +| `r2e-gym` | `R2EGymHarness` | R2E-Gym | +| `nv-internal-1` | `NVInternalHarness` | Internal NV format | + +The harness contract ([`harness.py`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/harness.py)) splits provisioning from grading: + +- **Agent-visible:** `build_spec`, `supports_provider`, `materialize` +- **Verifier-only:** `reset_repo`, `run_eval`, `grade` (called only from `verify_task`) + +For official SWE-bench instances, grading delegates to the external [`swebench`](https://github.com/SWE-bench/SWE-bench) package — Gym runs the official per-instance `eval_script` in the sandbox and parses logs with `swebench.harness.grading.get_logs_eval`. 
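
As a rough sketch of the verifier-side half of that contract: the method names below come from this note, while the signatures, the `HarnessSketch` and `FakeHarness` classes, the `verify_patch` function, and the report shape are assumptions made for illustration. The call order and the reward rule follow the "Hermetic verify" section of this note; treating an empty patch as the short-circuit trigger is also an assumption.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class HarnessSketch(ABC):
    """Methods used during verify; signatures are assumed, not Gym's."""

    @abstractmethod
    def reset_repo(self, box: Any) -> None: ...

    @abstractmethod
    def materialize(self, box: Any, model_patch: str) -> None: ...

    @abstractmethod
    def run_eval(self, box: Any) -> str: ...

    @abstractmethod
    def grade(self, logs: str) -> Dict[str, Any]: ...


def verify_patch(harness: HarnessSketch, box: Any, model_patch: str) -> Dict[str, Any]:
    """Grade one patch; `box` stands in for a freshly acquired eval sandbox."""
    if not model_patch.strip():
        # Assumed trigger for the short-circuit: nothing to grade, no sandbox work.
        return {"patch_exists": False, "resolved": False, "reward": 0.0}
    harness.reset_repo(box)
    harness.materialize(box, model_patch)
    report = harness.grade(harness.run_eval(box))
    resolved = bool(report.get("resolved"))
    # Reward is 1.0 only when resolved and no infra error was recorded.
    ok = resolved and not report.get("error_kind")
    return {"patch_exists": True, "resolved": resolved, "reward": 1.0 if ok else 0.0}


class FakeHarness(HarnessSketch):
    """Records call order instead of running a real eval script."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def reset_repo(self, box: Any) -> None:
        self.calls.append("reset_repo")

    def materialize(self, box: Any, model_patch: str) -> None:
        self.calls.append("materialize")

    def run_eval(self, box: Any) -> str:
        self.calls.append("run_eval")
        return "PASSED"

    def grade(self, logs: str) -> Dict[str, Any]:
        self.calls.append("grade")
        return {"resolved": logs == "PASSED"}


harness = FakeHarness()
outcome = verify_patch(harness, box=None, model_patch="diff --git a/x b/x\n")
empty = verify_patch(FakeHarness(), box=None, model_patch="")
```

A real flow would acquire the fresh sandbox before these calls and tear it down afterwards; the sketch leaves that out so the ordering of the four harness calls stays visible.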
+ +## Dataset format + +Each JSONL row needs SWE instance metadata in **`verifier_metadata`** (and typically mirrored in `responses_create_params.metadata`): + +| Field | Purpose | +| --- | --- | +| `instance_id` | SWE-bench instance key (e.g. `django__django-13741`) | +| `dataset_name` | HuggingFace dataset id (selects harness family) | +| `split` | Usually `test` | +| `problem_statement` | User message / issue text for the agent | +| `instance_dict` | Full SWE-bench instance record (JSON string or object) — required for faithful grading | + +Optional per-row `container_formatter` overrides the server default image template. + +### Prepare SWE-bench Verified rows + +```bash +python resources_servers/swe_bench/prepare.py --limit 5 --no-images +``` + +This writes `resources_servers/swe_bench/data/swebench_verified.jsonl`. Use `--no-images` for dataset-only smoke tests; full eval needs Docker images `swebench/sweb.eval.x86_64.{tag}` (see `prepare.py` for tag normalization: `__` → `_1776_`, lowercased). 
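
The tag normalization mentioned above is small enough to show directly. The following is a sketch, assuming the rule is exactly "replace `__` with `_1776_`, then lowercase"; the function name `eval_image_for` is hypothetical and `prepare.py` may structure this differently.

```python
def eval_image_for(
    instance_id: str,
    template: str = "swebench/sweb.eval.x86_64.{tag}",
) -> str:
    """Build the eval image name for one instance (illustrative helper)."""
    # Double underscores are rewritten before lowercasing the whole tag.
    tag = instance_id.replace("__", "_1776_").lower()
    return template.format(tag=tag)


image = eval_image_for("django__django-13741")
```

For `django__django-13741` this yields `swebench/sweb.eval.x86_64.django_1776_django-13741`.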
+ +## Configuration + +Server config: [`resources_servers/swe_bench/configs/swe_bench.yaml`](https://github.com/NVIDIA-NeMo/Gym/tree/main/resources_servers/swe_bench/configs/swe_bench.yaml) + +```yaml +swe_bench: + resources_servers: + swe_bench: + sandbox_provider: + docker: {} + container_formatter: swebench/sweb.eval.x86_64.{instance_id} + eval_timeout_s: 1800 + flat_eval: true + default_topology: agent_in_env + +claude_code_swe_bench: + responses_api_agents: + claude_code_agent: + resources_server: + type: resources_servers + name: swe_bench + sandbox_provider: + docker: {} + in_box_timeout_s: 1800 + bare: true +``` + +Key knobs: + +| Config field | Effect | +| --- | --- | +| `sandbox_provider` | Passed to `verify_task` and agent in-box binding | +| `container_formatter` | Docker image template for instance sandboxes | +| `flat_eval` | Host-side grading (runs on any exec-capable provider) | +| `default_topology` | Returned from `seed_session` (`agent_in_env` for topology C) | +| `in_box_timeout_s` | Agent-side Claude run timeout inside the sandbox | + +## Quickstart: evaluation rollouts + +**1. Install and test the server** (unit tests use a fake sandbox — no Docker required): + +```bash +gym env test --resources-server swe_bench +``` + +**2. Start servers** (Anthropic API key for Claude Code): + +```bash +gym env start \ + --resources-server swe_bench \ + --agent claude_code_swe_bench \ + --model-type openai_model +``` + +**3. Run rollouts** on prepared JSONL: + +```bash +gym eval run --no-serve --agent claude_code_swe_bench \ + --input resources_servers/swe_bench/data/swebench_verified.jsonl \ + --output results/swe_bench_rollouts.jsonl +``` + +The agent passes **`verifier_metadata.model_patch`** (unified diff) on `POST /verify`. The server returns `reward` ∈ `{0.0, 1.0}`, plus `resolved`, `patch_exists`, and optional `error_kind` / `mask_sample` for infra failures. + +## Hermetic verify + +`verify` **never** reuses the agent's working sandbox. 
`verify_task`: + +1. Selects the harness for `task.benchmark` +2. Acquires a **fresh** sandbox via `acquire_sandbox` (always torn down afterward) +3. Runs `reset_repo` → `materialize(model_patch)` → `run_eval` → `grade` +4. Maps the report to reward (`1.0` if resolved and no `error_kind`) + +This mirrors SWE-bench's separation between “agent edits” and “official eval script in a clean tree,” and prevents agent artifacts from affecting the score. + + +When the submission contains no patch (`patch_exists=false`), `verify` short-circuits with `resolved=false` and `reward=0.0`, without spinning up an eval sandbox. + + + +[`responses_api_agents/swe_agents`](https://github.com/NVIDIA-NeMo/Gym/tree/main/responses_api_agents/swe_agents) still shells out to `swebench.harness.run_local_evaluation` inside Apptainer-oriented rollouts. That path bundles agent + grading. **`swe_bench` + `claude_code_agent`** is the composable replacement: one Environment resources server wired to many agent servers via the descriptor contract, without per-agent SWE wrappers. + + +## Module map + +```text +resources_servers/swe_bench/ +├── app.py # HTTP: seed_session → SessionDescriptor, verify +├── task.py # First-class Task (SweTask, TaskPublic, parse helpers) +├── session.py # SessionDescriptor wire models +├── harness.py # SweTaskHarness ABC, registry, compute_resolved +├── harnesses/ # Per-family grading plugins +├── verify_task.py # Fresh-sandbox grading orchestrator +├── sandbox.py # AsyncSweEnvironment + acquire_sandbox +├── prepare.py # HF dataset → Gym JSONL +└── configs/swe_bench.yaml +``` + +## Key takeaways + +1. **`swe_bench` is the Environment**: it, rather than the agent server, owns benchmark authority. +2. **`seed_session` returns a descriptor**, not opaque session state; agents bind sandboxes from `placement` + `sandbox.spec`. +3. **Topology C** runs Claude inside the instance image; **verify** always uses a hermetic twin sandbox. +4. 
**`harnesses/`** are benchmark eval plugins aligned with upstream `swebench.harness` — distinct from Gym “agent harness” orchestration. +5. **Any agent server** that implements `/run` → `seed_session` → work → `verify` with `model_patch` can plug in; no SWE-specific wrapper required. + +## Related docs + +- [Claude Code Agent — Protocol Stack](/infrastructure/engineering-notes/claude-code-agent-protocol-stack) — Responses API, `/v1/messages`, and rollout data contracts +- [SWE RL Case Study](/infrastructure/engineering-notes/swe-rl-case-study) — training-scale Apptainer topology +- [Real-World Environment tutorial](/environment-tutorials/real-world-environment/resources-server-implementation) — `seed_session` / `verify` patterns for resources servers diff --git a/nemo_gym/sandbox/providers/apptainer/provider.py b/nemo_gym/sandbox/providers/apptainer/provider.py index 0605b1a805..bc2b70c499 100644 --- a/nemo_gym/sandbox/providers/apptainer/provider.py +++ b/nemo_gym/sandbox/providers/apptainer/provider.py @@ -148,6 +148,7 @@ class _ApptainerInstance: mount_point: str # where the folder shows up inside image: str # what it was built from env: dict[str, str] = field(default_factory=dict) + overlay_dir: Path | None = None # per-instance disk overlay (cleaned on close) def _resource_flags(resources: SandboxResources) -> list[str]: @@ -386,6 +387,14 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle: for key, value in spec.env.items(): argv += ["--env", f"{key}={value}"] start_args = list(self._create_config.extra_start_args) + # --writable-tmpfs caps the writable layer at apptainer's `sessiondir max size` + # (default 64 MiB), which ENOSPCs for repos that rebuild on apply/eval (e.g. astropy's + # C extensions). Swap it for a per-instance DISK-backed overlay (bounded by host disk). 
+ overlay_dir: Path | None = None + if "--writable-tmpfs" in start_args: + start_args = [a for a in start_args if a != "--writable-tmpfs"] + overlay_dir = Path(tempfile.mkdtemp(prefix="nemo-gym-apptainer-ovl-")) + start_args += ["--overlay", str(overlay_dir)] resource_limit_flags = _resource_limit_flags(spec.resources) if resource_limit_flags and self._create_config.apply_resource_limits: if "--fakeroot" in start_args: @@ -403,9 +412,13 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle: code, _out, err = await self._run(argv, timeout_s=self._create_config.start_timeout_s, daemonize=True) except TimeoutError as e: shutil.rmtree(staging_dir, ignore_errors=True) + if overlay_dir: + shutil.rmtree(overlay_dir, ignore_errors=True) raise ApptainerCreateError(f"apptainer instance start timed out for image={image!r}: {e}") from e if code != 0: shutil.rmtree(staging_dir, ignore_errors=True) + if overlay_dir: + shutil.rmtree(overlay_dir, ignore_errors=True) raise ApptainerCreateError( f"apptainer instance start failed (code={code}) for image={image!r}: {err.strip()}" ) @@ -420,6 +433,7 @@ async def create(self, spec: SandboxSpec) -> SandboxHandle: mount_point=mount_point, image=image, env=dict(spec.env), + overlay_dir=overlay_dir, ), ) @@ -485,6 +499,8 @@ async def _cleanup_failed_create_handle(self, handle: SandboxHandle) -> None: timeout_s=self._exec_config.default_timeout_s, ) shutil.rmtree(inst.staging_dir, ignore_errors=True) + if inst.overlay_dir: + shutil.rmtree(inst.overlay_dir, ignore_errors=True) async def exec( self, @@ -667,6 +683,8 @@ async def close(self, handle: SandboxHandle) -> None: shutil.rmtree(inst.staging_dir, ignore_errors=False) except OSError as e: LOGGER.warning("failed to remove staging dir %s: %s", inst.staging_dir, e) + if inst.overlay_dir: + shutil.rmtree(inst.overlay_dir, ignore_errors=True) if stop_error is not None: raise stop_error diff --git a/nemo_gym/sandbox/providers/docker/__init__.py 
b/nemo_gym/sandbox/providers/docker/__init__.py new file mode 100644 index 0000000000..a339158b99 --- /dev/null +++ b/nemo_gym/sandbox/providers/docker/__init__.py @@ -0,0 +1,20 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Docker sandbox provider package.""" + +from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider + + +__all__ = ["DockerSandboxProvider"] diff --git a/nemo_gym/sandbox/providers/docker/provider.py b/nemo_gym/sandbox/providers/docker/provider.py new file mode 100644 index 0000000000..7af8fe6ffa --- /dev/null +++ b/nemo_gym/sandbox/providers/docker/provider.py @@ -0,0 +1,324 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Local Docker-backed ``SandboxProvider`` implementation. 
+ +Implements the ``nemo_gym.sandbox`` provider Protocol via the ``docker`` CLI so +SWE environments can be provisioned and graded on any machine with Docker +installed, making end-to-end SWE-bench verification runnable on a single +workstation. +""" + +from __future__ import annotations + +import asyncio +import posixpath +import shlex +import uuid +from collections.abc import Mapping +from pathlib import Path +from typing import Any + +from nemo_gym.sandbox import ( + SandboxCreateError, + SandboxExecResult, + SandboxHandle, + SandboxResources, + SandboxSpec, + SandboxStatus, +) + + +class DockerSandboxProvider: + """Run sandboxes as long-lived Docker containers via the ``docker`` CLI.""" + + name = "docker" + + def __init__( + self, + *, + docker_bin: str = "docker", + default_user: str | int | None = None, + network: str | None = None, + run_args: list[str] | None = None, + keep_alive_command: str = "sleep infinity", + concurrency: int = 32, + **_: Any, + ) -> None: + """Configure the Docker sandbox provider. + + Args: + docker_bin: Name or path of the ``docker`` executable to invoke. + default_user: Default user (name or UID) to run ``exec`` commands as + when no per-call user is given; None leaves the image default. + network: Docker network to attach containers to; None uses the + Docker default. + run_args: Extra arguments appended to every ``docker run`` + invocation. + keep_alive_command: Command run as the container's entrypoint to keep + it alive for subsequent ``exec`` calls. + concurrency: Maximum number of concurrent ``docker`` CLI subprocesses, + bounded by a shared semaphore (matches the apptainer provider). + **_: Additional keyword arguments are accepted and ignored. + + Raises: + ValueError: If ``concurrency`` is less than 1. 
+ """ + if concurrency < 1: + raise ValueError("concurrency must be >= 1") + self._bin = docker_bin + self._default_user = default_user + self._network = network + self._run_args = list(run_args or []) + self._keep_alive = keep_alive_command + self._semaphore = asyncio.Semaphore(concurrency) + + async def _run(self, *args: str, timeout_s: int | float | None = None) -> tuple[int, str, str]: + """Run the ``docker`` CLI with the given arguments and capture output. + + Concurrency is bounded by the provider's shared semaphore so a busy SWE hot + path (one sandbox per rollout, many ``exec`` each) cannot spawn unbounded + ``docker`` subprocesses. + + Args: + *args: Arguments passed to the ``docker`` executable. + timeout_s: Optional timeout in seconds; the process is killed and the + timeout error re-raised if it is exceeded. + + Returns: + A tuple of ``(return_code, stdout, stderr)`` with output decoded as + text using ``errors="replace"``. + """ + async with self._semaphore: + proc = await asyncio.create_subprocess_exec( + self._bin, + *args, + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE, + ) + try: + out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_s) + except (asyncio.TimeoutError, TimeoutError): + proc.kill() + await proc.wait() + raise + return ( + proc.returncode if proc.returncode is not None else -1, + out.decode(errors="replace"), + err.decode(errors="replace"), + ) + + @staticmethod + def _resources(spec: SandboxSpec) -> SandboxResources: + """Coerce a spec's resource request into a ``SandboxResources``. + + Args: + spec: Sandbox spec whose ``resources`` field is a + ``SandboxResources`` or a mapping. + + Returns: + The spec's ``SandboxResources`` if already one, otherwise a + ``SandboxResources`` built from the mapping (or empty defaults). 
+ """ + if isinstance(spec.resources, SandboxResources): + return spec.resources + return SandboxResources.from_mapping(spec.resources if isinstance(spec.resources, Mapping) else {}) + + async def create(self, spec: SandboxSpec) -> SandboxHandle: + """Start a detached container and return a handle to it. + + Applies resource limits, network, working directory, environment, and + extra run args from the spec, then launches the image running the + keep-alive command so the container persists for later ``exec`` calls. + + Args: + spec: Sandbox spec describing the image, resources, workdir, env, and + readiness timeout. + + Returns: + A ``SandboxHandle`` whose ``sandbox_id`` is the container id. + + Raises: + SandboxCreateError: If no image is given, ``docker run`` times out or + fails, or no container id is returned. + """ + if not spec.image: + raise SandboxCreateError("DockerSandboxProvider requires spec.image") + # Pre-assign a unique name so a container the daemon may have started can still be reaped + # if the CLI client dies (e.g. on timeout) before we capture its id (mirrors apptainer's + # uuid-named instances). 
+ name = f"nemo-gym-{uuid.uuid4().hex}" + args = ["run", "-d", "--init", "--name", name] + if self._network: + args += ["--network", self._network] + res = self._resources(spec) + if res.memory_mib: + args.append(f"--memory={int(res.memory_mib)}m") + if res.cpu: + args.append(f"--cpus={res.cpu}") + if res.gpu: + args.append("--gpus=all") + if spec.workdir: + args += ["-w", spec.workdir] + for key, value in (spec.env or {}).items(): + args += ["-e", f"{key}={value}"] + args += self._run_args + args += [spec.image, "bash", "-c", self._keep_alive] + try: + rc, out, err = await self._run(*args, timeout_s=spec.ready_timeout_s or 600) + except (asyncio.TimeoutError, TimeoutError) as exc: + await self._reap_orphan(name) + raise SandboxCreateError(f"docker run timed out for image {spec.image!r}") from exc + if rc != 0: + await self._reap_orphan(name) + raise SandboxCreateError(f"docker run failed (rc={rc}) for {spec.image!r}: {err.strip() or out.strip()}") + lines = out.strip().splitlines() + container_id = lines[-1].strip() if lines else "" + if not container_id: + await self._reap_orphan(name) + raise SandboxCreateError("docker run did not return a container id") + return SandboxHandle( + sandbox_id=container_id, + provider_name=self.name, + raw={"image": spec.image, "workdir": spec.workdir}, + ) + + async def _reap_orphan(self, name: str) -> None: + """Best-effort force-remove a container by its pre-assigned name. + + Used to clean up a ``docker run`` that may have started a container on the daemon even + though the CLI client failed (timeout / non-zero rc / no id returned) before a handle was + captured. Swallows all errors and bounds itself with a short timeout — a missing or + already-gone container is fine. + + Args: + name: The pre-assigned ``--name`` of the container to remove. 
+ """ + try: + await self._run("rm", "-f", name, timeout_s=30) + except Exception: + pass + + async def exec( + self, + handle: SandboxHandle, + command: str, + *, + cwd: str | None = None, + env: dict[str, str] | None = None, + timeout_s: int | float | None = None, + user: str | int | None = None, + ) -> SandboxExecResult: + """Run a shell command inside the container. + + Args: + handle: Handle identifying the target container. + command: Shell command executed via ``bash -c``. + cwd: Working directory for the command; falls back to the workdir + recorded at create time. + env: Extra environment variables for the command. + timeout_s: Optional timeout in seconds; on expiry a result with + return code 124 and ``error_type="timeout"`` is returned. + user: User (name or UID) to run as; falls back to the provider's + default user. + + Returns: + A ``SandboxExecResult`` with stdout, stderr, return code, and an + ``error_type`` of ``"sandbox"`` for docker-level failures (125/126/ + 127 with no stdout), ``"timeout"`` on timeout, or None otherwise. + """ + args = ["exec"] + workdir = cwd or handle.raw.get("workdir") + if workdir: + args += ["-w", workdir] + eff_user = user if user is not None else self._default_user + if eff_user is not None: + args += ["-u", str(eff_user)] + for key, value in (env or {}).items(): + args += ["-e", f"{key}={value}"] + args += [handle.sandbox_id, "bash", "-c", command] + try: + rc, out, err = await self._run(*args, timeout_s=timeout_s) + except (asyncio.TimeoutError, TimeoutError): + return SandboxExecResult( + stdout=None, + stderr=f"command timed out after {timeout_s}s", + return_code=124, + error_type="timeout", + ) + # docker exec returns 125/126/127 for docker-level failures (container gone, not executable). 
+ error_type = "sandbox" if rc in (125, 126, 127) and not out else None + return SandboxExecResult(stdout=out, stderr=err, return_code=rc, error_type=error_type) + + async def upload_file(self, handle: SandboxHandle, source_path: Path, target_path: str) -> None: + """Copy a host file into the container, creating parent dirs as needed. + + Args: + handle: Handle identifying the target container. + source_path: Path to the file on the host. + target_path: Destination path inside the container. + + Raises: + RuntimeError: If the ``docker cp`` upload fails. + """ + parent = posixpath.dirname(target_path) + if parent: + await self.exec(handle, f"mkdir -p {shlex.quote(parent)}") + rc, out, err = await self._run("cp", str(source_path), f"{handle.sandbox_id}:{target_path}") + if rc != 0: + raise RuntimeError(f"docker cp upload failed: {err.strip() or out.strip()}") + + async def download_file(self, handle: SandboxHandle, source_path: str, target_path: Path) -> None: + """Copy a file out of the container to the host. + + Args: + handle: Handle identifying the source container. + source_path: Path to the file inside the container. + target_path: Destination path on the host; parent dirs are created. + + Raises: + RuntimeError: If the ``docker cp`` download fails. + """ + target = Path(target_path) + target.parent.mkdir(parents=True, exist_ok=True) + rc, out, err = await self._run("cp", f"{handle.sandbox_id}:{source_path}", str(target)) + if rc != 0: + raise RuntimeError(f"docker cp download failed: {err.strip() or out.strip()}") + + async def status(self, handle: SandboxHandle) -> SandboxStatus: + """Report whether the container is running. + + Args: + handle: Handle identifying the container to inspect. + + Returns: + ``RUNNING`` or ``STOPPED`` based on the container's running state, + or ``UNKNOWN`` if the inspect command fails. 
+ """ + rc, out, _ = await self._run("inspect", "-f", "{{.State.Running}}", handle.sandbox_id) + if rc != 0: + return SandboxStatus.UNKNOWN + return SandboxStatus.RUNNING if out.strip() == "true" else SandboxStatus.STOPPED + + async def close(self, handle: SandboxHandle) -> None: + """Force-remove the container. + + Args: + handle: Handle identifying the container to remove. + """ + await self._run("rm", "-f", handle.sandbox_id) + + async def aclose(self) -> None: + """Release provider-level resources; this provider holds none.""" + return None diff --git a/nemo_gym/sandbox/providers/registry.py b/nemo_gym/sandbox/providers/registry.py index 8c4e39e577..451056d470 100644 --- a/nemo_gym/sandbox/providers/registry.py +++ b/nemo_gym/sandbox/providers/registry.py @@ -75,11 +75,18 @@ def _load_opensandbox_provider() -> ProviderClass: return OpenSandboxProvider +def _load_docker_provider() -> ProviderClass: + from nemo_gym.sandbox.providers.docker import DockerSandboxProvider + + return DockerSandboxProvider + + def _load_apptainer_provider() -> ProviderClass: from nemo_gym.sandbox.providers.apptainer import ApptainerProvider return ApptainerProvider -_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider _BUILTIN_PROVIDER_LOADERS["opensandbox"] = _load_opensandbox_provider +_BUILTIN_PROVIDER_LOADERS["docker"] = _load_docker_provider +_BUILTIN_PROVIDER_LOADERS["apptainer"] = _load_apptainer_provider diff --git a/resources_servers/swe_bench/README.md b/resources_servers/swe_bench/README.md new file mode 100644 index 0000000000..b60972b80a --- /dev/null +++ b/resources_servers/swe_bench/README.md @@ -0,0 +1,51 @@ +# swe_bench resources server + +SWE-bench **Environment** resources server: `seed_session` returns a `SessionDescriptor` (topology **C**, per-instance sandbox spec); `verify` grades a model patch in a **fresh** eval sandbox (hermetic twin). 
+ +Grading eval harnesses, parsing, and `verify_task` live as **private modules** under this directory (relocated from `responses_api_agents/swe_env/`). + +Key modules: + +- `task.py` — first-class **Task** (`SweTask`, `TaskPublic`, parse helpers) +- `session.py` — **SessionDescriptor** returned from `seed_session` +- `app.py` — thin HTTP surface (`seed_session`, `verify`) +- `harness.py` / `harnesses/` — benchmark-family grading plugins + +## Wiring + +```yaml +responses_api_agents: + claude_code_agent: + resources_server: + type: resources_servers + name: swe_bench +``` + +## Tests + +```bash +gym env test --resources-server swe_bench +``` + +Unit tests use a fake sandbox provider (no Docker required). + +## Dataset + +Prepare SWE-bench Verified rows with `verifier_metadata` (see `prepare.py`): + +```bash +python resources_servers/swe_bench/prepare.py --limit 5 --no-images +``` + +Each JSONL row includes `verifier_metadata.instance_id`, `instance_dict`, `dataset_name`, and optional `container_formatter`. + +## Rollouts + +```bash +gym env start --resources-server swe_bench --agent claude_code_swe_bench --model-type openai_model +gym eval run --no-serve --agent claude_code_swe_bench \ + --input resources_servers/swe_bench/data/swebench_verified.jsonl \ + --output results/swe_bench_rollouts.jsonl +``` + +Agent servers pass `verifier_metadata.model_patch` (git unified diff) on `POST /verify`. diff --git a/resources_servers/swe_bench/__init__.py b/resources_servers/swe_bench/__init__.py new file mode 100644 index 0000000000..ffd5d25501 --- /dev/null +++ b/resources_servers/swe_bench/__init__.py @@ -0,0 +1,58 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""SWE-bench Environment resources server modules. + +Grading harnesses, parsing, and verify_task implement the Environment MDP +authority. Agent servers connect via HTTP ``seed_session`` / ``verify`` only. +""" + +from resources_servers.swe_bench.harness import ( + EvalArtifacts, + SweEvalReport, + SweTaskHarness, + compute_resolved, + get_harness, + list_harnesses, + register_harness, + reward_from_report, +) +from resources_servers.swe_bench.sandbox import AsyncSweEnvironment +from resources_servers.swe_bench.session import SessionDescriptor +from resources_servers.swe_bench.task import ( + ENVIRONMENT_NAME, + SweTask, + TaskPublic, + TaskSubmission, + parse_task_from_request, +) + + +__all__ = [ + "AsyncSweEnvironment", + "ENVIRONMENT_NAME", + "EvalArtifacts", + "SessionDescriptor", + "SweEvalReport", + "SweTask", + "SweTaskHarness", + "TaskPublic", + "TaskSubmission", + "compute_resolved", + "get_harness", + "list_harnesses", + "parse_task_from_request", + "register_harness", + "reward_from_report", +] diff --git a/resources_servers/swe_bench/app.py b/resources_servers/swe_bench/app.py new file mode 100644 index 0000000000..214189c99e --- /dev/null +++ b/resources_servers/swe_bench/app.py @@ -0,0 +1,113 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +"""SWE-bench Environment resources server.""" + +from __future__ import annotations + +import dataclasses +from typing import Any, Literal + +from pydantic import Field + +import resources_servers.swe_bench.harnesses # noqa: F401 +from nemo_gym.base_resources_server import ( + BaseResourcesServerConfig, + SimpleResourcesServer, +) +from nemo_gym.sandbox import SandboxSpec +from resources_servers.swe_bench.harness import get_harness +from resources_servers.swe_bench.session import ( + EgressDescriptor, + PlacementDescriptor, + SandboxDescriptor, + SessionDescriptor, + SweBenchSeedSessionRequest, + SweBenchVerifyRequest, + SweBenchVerifyResponse, +) +from resources_servers.swe_bench.task import ( + ENVIRONMENT_NAME, + SweTask, + parse_submission, + parse_task_from_request, +) +from resources_servers.swe_bench.verify_task import report_to_reward, verify_task + + +Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"] + + +class SweBenchResourcesServerConfig(BaseResourcesServerConfig): + sandbox_provider: dict[str, Any] = Field(default_factory=lambda: {"docker": {}}) + container_formatter: str = "swebench/sweb.eval.x86_64.{instance_id}" + eval_timeout_s: float = 1800.0 + flat_eval: bool = True + default_topology: Topology = "agent_in_env" + + +def _spec_to_dict(spec: SandboxSpec) -> dict[str, Any]: + payload = dataclasses.asdict(spec) + resources = payload.get("resources") + if resources is not None and hasattr(resources, "__dataclass_fields__"): + payload["resources"] = dataclasses.asdict(resources) + return payload + + +class SweBenchResourcesServer(SimpleResourcesServer): + config: SweBenchResourcesServerConfig + + def _parse_task(self, body: SweBenchSeedSessionRequest | SweBenchVerifyRequest) -> SweTask: + return parse_task_from_request( + body, + container_formatter=self.config.container_formatter, + flat_eval=self.config.flat_eval, + environment=ENVIRONMENT_NAME, + ) + + async def 
seed_session(self, body: SweBenchSeedSessionRequest) -> SessionDescriptor: + task = self._parse_task(body) + harness = get_harness(task.harness_family) + if self.config.flat_eval and hasattr(harness, "with_flat_eval"): + harness = harness.with_flat_eval() + spec = harness.build_spec(task) + + verifier_metadata = task.privileged_verifier_metadata(flat_eval=self.config.flat_eval) + if body.verifier_metadata: + verifier_metadata = {**body.verifier_metadata, **verifier_metadata} + + return SessionDescriptor( + environment=ENVIRONMENT_NAME, + task=task.public_view(environment=ENVIRONMENT_NAME), + placement=PlacementDescriptor(topology=self.config.default_topology), + sandbox=SandboxDescriptor(spec=_spec_to_dict(spec)), + egress=EgressDescriptor(env={}), + verifier_metadata=verifier_metadata, + ) + + async def verify(self, body: SweBenchVerifyRequest) -> SweBenchVerifyResponse: + task = self._parse_task(body) + task = task.with_submission(parse_submission(body.verifier_metadata)) + + report = await verify_task( + self.config.sandbox_provider, + task, + eval_timeout_s=self.config.eval_timeout_s, + ) + reward = report_to_reward(report) + masked = report.error_kind is not None + + return SweBenchVerifyResponse( + **body.model_dump(), + task_id=task.task_id, + environment=ENVIRONMENT_NAME, + reward=reward, + resolved=report.resolved, + patch_exists=report.patch_exists, + mask_sample=masked, + error_kind=report.error_kind, + ) + + +if __name__ == "__main__": + SweBenchResourcesServer.run_webserver() diff --git a/resources_servers/swe_bench/configs/swe_bench.yaml b/resources_servers/swe_bench/configs/swe_bench.yaml new file mode 100644 index 0000000000..f3a60d11c2 --- /dev/null +++ b/resources_servers/swe_bench/configs/swe_bench.yaml @@ -0,0 +1,33 @@ +swe_bench: + resources_servers: + swe_bench: + entrypoint: app.py + domain: coding + verified: false + description: SWE-bench Environment (seed_session + hermetic verify) + sandbox_provider: + docker: {} + container_formatter: 
swebench/sweb.eval.x86_64.{instance_id} + eval_timeout_s: 1800 + flat_eval: true + default_topology: agent_in_env + +claude_code_swe_bench: + responses_api_agents: + claude_code_agent: + entrypoint: app.py + resources_server: + type: resources_servers + name: swe_bench + concurrency: 16 + model: claude-sonnet-4-6 + anthropic_api_key: ${anthropic_api_key} + anthropic_base_url: null + max_turns: 30 + timeout: 1800 + in_box_timeout_s: 1800 + sandbox_provider: + docker: {} + bare: true + mcp_config: null + settings: null diff --git a/resources_servers/swe_bench/data/.gitignore b/resources_servers/swe_bench/data/.gitignore new file mode 100644 index 0000000000..ac481ac55b --- /dev/null +++ b/resources_servers/swe_bench/data/.gitignore @@ -0,0 +1,3 @@ +data/ +__pycache__/ +*.pyc diff --git a/resources_servers/swe_bench/harness.py b/resources_servers/swe_bench/harness.py new file mode 100644 index 0000000000..59467432bb --- /dev/null +++ b/resources_servers/swe_bench/harness.py @@ -0,0 +1,365 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Task model and harness contract for the SWE environment library. + +The first-class **Task** value lives in ``task.py`` (``SweTask``). This module holds +the harness registry and grading helpers. 
+ +The harness contract is intentionally split across a trust boundary: + +* ``build_spec`` / ``supports_provider`` / ``materialize`` are **provisioning** + methods imported and called by *agents* (and the verifier). +* ``reset_repo`` / ``run_eval`` / ``grade`` are **grading** methods used + **only** by the grader (``verify_task``). A test asserts agent adapters never + reference them. + +This module also holds the name->harness registry +(``register_harness``/``get_harness``/``list_harnesses``) and the pure grading +helpers (``compute_resolved``/``reward_from_report``), merged here so the harness +contract, its dispatch, and its scoring live in one place. +""" + +from __future__ import annotations + +from abc import ABC, abstractmethod +from collections.abc import Iterable +from dataclasses import dataclass, field +from typing import TYPE_CHECKING, Any + +from nemo_gym.sandbox import SandboxSpec +from resources_servers.swe_bench.task import SweTask + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +class GraderDependencyError(RuntimeError): + """A required grading dependency is unavailable for a task the harness must grade exactly. + + Raised by a harness when it cannot grade an instance faithfully (e.g. ``swebench`` is + missing for a SWE-bench instance) and degrading to a generic parser would silently skew + the result. ``verify_task`` propagates this rather than swallowing it into an unmasked + reward-0, so a misconfigured grader fails loudly instead of quietly degrading scores. + """ + + +@dataclass +class EvalArtifacts: + """Raw evaluation output retrieved from the sandbox, before grading.""" + + test_output: str = "" + return_code: int = 0 + patch_applied: bool = False + raw: dict[str, Any] = field(default_factory=dict) + + +@dataclass +class SweEvalReport: + """Graded result of a single task. ``error_kind`` masks a sample. + + ``error_kind`` is ``None`` for a clean grade. A non-``None`` value (e.g. 
+ ``"sandbox"`` / ``"eval_error"``) marks an infra failure: the sample is + masked via this flag and ``reward_from_report`` returns ``0.0`` — **never** + ``None`` (the wire ``reward`` field is a non-nullable ``float``). + """ + + instance_id: str + resolved: bool = False + patch_applied: bool = False + patch_exists: bool = False + error_kind: str | None = None + tests_status: dict[str, Any] = field(default_factory=dict) + + +class SweTaskHarness(ABC): + """Per-family provisioning + (server-private) grading recipe.""" + + #: registry key, e.g. ``"swe-bench-ext"``. + name: str = "" + #: ``"flat-host-grade"`` (parse host-side) or ``"nested-harness"`` (in-container grader). + grade_strategy: str = "flat-host-grade" + + # --- provisioning (agent-facing + verifier) ------------------------------ + + @abstractmethod + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox spec for a task. + + Args: + task (SweTask): The task to provision a sandbox for. + + Returns: + SandboxSpec: The spec describing image, workdir, env, ttl, and + provider options for the task. + """ + + def supports_provider(self, provider_name: str) -> bool: + """Report whether this harness can run on the named provider. + + The base harness accepts every provider; flat host-graded families work on any + exec-capable provider. + + Args: + provider_name (str): The name of the sandbox provider. + + Returns: + bool: ``True`` if the provider is supported. + """ + return True + + def with_flat_eval(self) -> "SweTaskHarness": + """Return a variant that grades host-side (flat) on any exec-capable provider. + + All families already grade host-side, so the base implementation returns ``self``. + + Returns: + SweTaskHarness: A harness whose grading runs host-side. + """ + return self + + async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Upload the model patch and test patch into the started sandbox. 
+ + Args: + env (AsyncSweEnvironment): The started environment to write into. + task (SweTask): The task whose patches are uploaded. + """ + if task.model_patch: + await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch)) + if task.test_patch: + await env.write_text("/root/test_patch.diff", _ensure_trailing_newline(task.test_patch)) + + # --- server-private grading (verifier only) ------------------------------ + + async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Reset the in-sandbox checkout to ``base_commit`` for hermetic grading. + + Uses only ``git reset --hard``, never ``git clean -fdx``: verification + runs in a fresh sandbox (no agent edits to scrub), and a clean would + delete the image's prebuilt artifacts (compiled C extensions, installed + environment) and break the tests. + + Args: + env (AsyncSweEnvironment): The started environment to reset. + task (SweTask): The task whose ``base_commit`` and ``repo_workdir`` + are used. + """ + if task.base_commit: + await env.execute(f"git reset --hard {task.base_commit}", cwd=task.repo_workdir) + + @abstractmethod + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Apply the patches and run the evaluation, returning raw artifacts. + + Args: + env (AsyncSweEnvironment): The started environment to evaluate in. + task (SweTask): The task being evaluated. + + Returns: + EvalArtifacts: The raw evaluation output retrieved from the sandbox. + """ + + @abstractmethod + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Parse raw artifacts host-side into a graded report. + + Args: + task (SweTask): The task that was evaluated. + artifacts (EvalArtifacts): The raw evaluation output to parse. + + Returns: + SweEvalReport: The graded result for the task. + """ + + +def _ensure_trailing_newline(text: str) -> str: + """Return the text with a single trailing newline. 
+ + Args: + text (str): The input text. + + Returns: + str: The text unchanged if it already ends in a newline, otherwise the + text with a newline appended. + """ + return text if text.endswith("\n") else text + "\n" + + +# --- name->harness registry ---------------------------- + +_HARNESSES: dict[str, SweTaskHarness] = {} + + +def register_harness(harness: SweTaskHarness, *, override: bool = False) -> None: + """Register a harness under its ``name``. + + Args: + harness (SweTaskHarness): The harness to register. Its ``name`` must be + non-empty. + override (bool): If ``True``, replace an existing harness with the same + name instead of raising. + + Raises: + ValueError: If the harness name is empty, or a harness with the same name + is already registered and ``override`` is ``False``. + """ + if not harness.name: + raise ValueError("Harness must define a non-empty 'name'") + if not override and harness.name in _HARNESSES: + raise ValueError(f"Harness {harness.name!r} is already registered") + _HARNESSES[harness.name] = harness + + +# HuggingFace dataset names don't match registry keys; map by substring (most-specific first) +# so callers can pass a raw ``dataset_name`` (e.g. "princeton-nlp/SWE-bench_Verified"). +_HF_NAME_ALIASES: list[tuple[str, str]] = [ + ("SWE-bench_Multilingual", "swe-bench-multilingual"), + ("R2E-Gym", "r2e-gym"), + ("SWE-rebench", "swe-rebench"), + ("SWE-bench", "swe-bench"), +] + + +def _ensure_registered() -> None: + """Lazily register the built-in harnesses if the registry is empty. + + Importing ``resources_servers.swe_bench.harnesses`` registers all families, but a fresh + process (e.g. a Ray worker running the decoupled agent) may call ``get_harness`` before that + import has run. Registering on demand keeps lookups robust regardless of import order. 
+ """ + if _HARNESSES: + return + from resources_servers.swe_bench.harnesses import register_builtin_harnesses + + register_builtin_harnesses() + + +def get_harness(name: str) -> SweTaskHarness: + """Look up a harness by registry key, or by HuggingFace dataset-name substring. + + Built-in harnesses are registered on first use (robust to import order). An exact key + match wins; otherwise a HuggingFace ``dataset_name`` substring is resolved to its key (e.g. + ``"princeton-nlp/SWE-bench_Verified"`` -> ``"swe-bench"``). + + Args: + name (str): The registry key, or a HuggingFace dataset name. + + Returns: + SweTaskHarness: The registered harness. + + Raises: + KeyError: If no harness matches ``name``. + """ + _ensure_registered() + if name in _HARNESSES: + return _HARNESSES[name] + for needle, key in _HF_NAME_ALIASES: + if needle in name and key in _HARNESSES: + return _HARNESSES[key] + available = ", ".join(sorted(_HARNESSES)) or "(none)" + raise KeyError(f"Unknown SWE harness {name!r}. Registered: {available}") + + +def list_harnesses() -> list[str]: + """List the names of all registered harnesses. + + Returns: + list[str]: The registered harness names, sorted alphabetically. + """ + return sorted(_HARNESSES) + + +# --- pure grading helpers ------------------------------- + + +def compute_resolved( + *, + fail_to_pass: Iterable[str], + pass_to_pass: Iterable[str], + passed: Iterable[str], + eval_type: str = "pass_and_fail", + status_map: dict[str, str] | None = None, +) -> bool: + """Apply the SWE-bench resolution rule. + + Two eval types are supported, mirroring swebench's per-repo selection + (``swebench.harness.grading.get_eval_report`` / + ``get_eval_tests_report`` + ``get_resolution_status``): + + * ``"pass_and_fail"`` (default): mirrors swebench's ``check_pass_and_fail`` + classification combined with the ratio-based ``get_resolution_status``. 
When a + ``status_map`` is supplied, each required test is a **success** when present and + PASSED/XFAIL (``test_passed``), a **failure** when absent or FAILED/ERROR + (``test_failed``), and **neutral** (excluded from both counts) for any other + status (e.g. SKIPPED/XPASS). A task is resolved only when there are zero + failures across FAIL_TO_PASS and PASS_TO_PASS (each ratio ``== 1``; an + all-neutral category with total ``0`` counts as ``1``). Without a + ``status_map`` it falls back to plain ``passed``-set membership. + * ``"fail_only"``: used for the JS multilingual repos in swebench's + ``FAIL_ONLY_REPOS`` (chartjs/Chart.js, processing/p5.js, markedjs/marked). A + required test counts as success **unless** it is present in ``status_map`` + **and** its status is ``FAILED``. This mirrors swebench's ``check_fail_only``. + + Args: + fail_to_pass (Iterable[str]): Tests that must transition from failing to + passing. + pass_to_pass (Iterable[str]): Tests that must remain passing. + passed (Iterable[str]): The tests that actually passed. + eval_type (str): ``"pass_and_fail"`` or ``"fail_only"`` (selected by the + caller from ``test_spec.repo``). + status_map (dict[str, str] | None): Full per-test status map. Required for + the ``"fail_only"`` rule (to detect a present-and-FAILED required test) + and used by ``"pass_and_fail"`` to exclude neutral-status required tests + exactly as swebench does. + + Returns: + bool: ``True`` if all required tests passed under the selected rule, + ``False`` if there are no required tests or any required test did not + pass. + """ + required = list(fail_to_pass) + list(pass_to_pass) + if not required: + return False + if eval_type == "fail_only": + sm = status_map or {} + # Mirror swebench's check_fail_only: a required test is a failure only when + # present in the status map AND explicitly FAILED; anything else is success. 
+ return all(not (test in sm and sm[test] == "FAILED") for test in required) + if status_map is not None: + # Mirror swebench's check_pass_and_fail + get_resolution_status: a required + # test is a failure only when it is absent or its status is FAILED/ERROR; + # PASSED/XFAIL are successes and any other status (SKIPPED/XPASS) is neutral + # (excluded). Resolution requires zero failures in BOTH categories. + return all(not (test not in status_map or status_map[test] in ("FAILED", "ERROR")) for test in required) + passed_set = set(passed) + return all(test in passed_set for test in required) + + +def reward_from_report(report: SweEvalReport) -> float: + """Map a graded report to a reward. + + An infra or eval failure (``error_kind`` set) yields ``0.0`` and is masked + via the flag downstream; the result is always a ``float`` and never ``None``. + + Args: + report (SweEvalReport): The graded result to convert. + + Returns: + float: ``1.0`` if the task resolved with no error, otherwise ``0.0``. + """ + if report.error_kind is not None: + return 0.0 + return 1.0 if report.resolved else 0.0 diff --git a/resources_servers/swe_bench/harnesses/__init__.py b/resources_servers/swe_bench/harnesses/__init__.py new file mode 100644 index 0000000000..e55c86e926 --- /dev/null +++ b/resources_servers/swe_bench/harnesses/__init__.py @@ -0,0 +1,64 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""SWE dataset-family harnesses. Importing this package registers all families. + +Every built-in family is flat and host-graded: it runs the instance's evaluation +inside a single sandbox, parses the output host-side, and works on any +exec-capable provider (including docker). The registered families are +``swe-bench-ext``, ``nv-internal-1``, ``swe-rebench``, ``swe-bench``, +``swe-bench-multilingual``, and ``r2e-gym``. (The previously apptainer-only nested +grading for ``swe-bench``/``swe-bench-multilingual``/``r2e-gym`` was removed when +PR #1694 took ownership of the apptainer provider.) +""" + +from resources_servers.swe_bench.harness import list_harnesses, register_harness +from resources_servers.swe_bench.harnesses.nv_internal import NVInternalHarness +from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness +from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness +from resources_servers.swe_bench.harnesses.swe_rebench import SweRebenchHarness +from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness + + +def register_builtin_harnesses() -> None: + """Register every built-in SWE dataset-family harness. + + Constructs each built-in harness and registers it under its name, skipping + any name that is already registered so the call is safe to run more than once. 
+ """ + builtins = [ + SweBenchExtHarness(), + NVInternalHarness(), + SweRebenchHarness(), + SweBenchHarness("swe-bench"), + SweBenchHarness("swe-bench-multilingual"), + R2EGymHarness(), + ] + existing = set(list_harnesses()) + for harness in builtins: + if harness.name not in existing: + register_harness(harness) + + +register_builtin_harnesses() + + +__all__ = [ + "NVInternalHarness", + "R2EGymHarness", + "SweBenchExtHarness", + "SweBenchHarness", + "SweRebenchHarness", + "register_builtin_harnesses", +] diff --git a/resources_servers/swe_bench/harnesses/flat_eval.py b/resources_servers/swe_bench/harnesses/flat_eval.py new file mode 100644 index 0000000000..37a6a1727d --- /dev/null +++ b/resources_servers/swe_bench/harnesses/flat_eval.py @@ -0,0 +1,280 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Flat (host-graded) eval-script mode for SWE dataset families. + +Flat mode runs an instance's eval script directly in the sandbox and parses the +produced log host-side, computing ``resolved`` from ``FAIL_TO_PASS`` / +``PASS_TO_PASS`` via :func:`compute_resolved`. Because there is no nested +container, this runs on any exec-capable provider (docker / opensandbox). + +The eval script resets the repo, applies the gold/model patch plus the test +patch, runs the repo's test command, and wraps the test output between two +sentinel markers:: + + >>>>> Start Test Output + ... 
per-test "PASSED node_id" / "FAILED node_id" lines ... + >>>>> End Test Output + +It also emits patch-apply / reset / timeout status codes +(``>>>>> Applied Patch`` etc.). The host-side parser in this module recognises +these markers and per-test status tokens without importing ``swebench``, so +grading can run in environments where that package (and its Docker +dependencies) is absent. + +``flat_eval_enabled`` reports whether flat mode applies to a task: when the harness +selects it or the task opts in via ``SweTask.metadata["flat_eval"]``. The verifier +honors that per-task key by calling ``SweTaskHarness.with_flat_eval()``, a no-op for +the built-in families, which already grade host-side. (A previously apptainer-only +nested grading path for swe-bench / r2e-gym was removed in PR #1694.) +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, compute_resolved + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +# SWE-bench eval-log sentinels, kept here so we never import swebench at grade +# time. +APPLY_PATCH_FAIL = ">>>>> Patch Apply Failed" +APPLY_PATCH_PASS = ">>>>> Applied Patch" +RESET_FAILED = ">>>>> Reset Failed" +TESTS_ERROR = ">>>>> Tests Errored" +TESTS_TIMEOUT = ">>>>> Tests Timed Out" +START_TEST_OUTPUT = ">>>>> Start Test Output" +END_TEST_OUTPUT = ">>>>> End Test Output" + +# Codes that mean the harness/patch/test setup failed before tests could be +# trusted; their presence forces an empty status map + patch_applied=False. +_BAD_CODES = (APPLY_PATCH_FAIL, RESET_FAILED, TESTS_ERROR, TESTS_TIMEOUT) + +# Per-test status tokens a pytest-style test runner emits at the start of a line +# ("PASSED tests/test_x.py::test_a"). XFAIL counts as a pass.
+_PASS_TOKENS = ("PASSED", "XFAIL") +_FAIL_TOKENS = ("FAILED", "ERROR") +_STATUS_TOKENS = _PASS_TOKENS + _FAIL_TOKENS + ("SKIPPED",) + +# Where the flat path writes the eval script and its captured log inside the +# sandbox. +EVAL_SCRIPT_PATH = "/root/eval.sh" +EVAL_LOG_PATH = "/root/eval_output.log" + + +def parse_eval_log(log: str) -> tuple[dict[str, str], bool]: + """Parse a SWE-bench eval-script log host-side. + + For the common pytest-style runner: + + 1. If any "bad code" (patch-apply / reset / tests-error / timeout) is + present, the run is untrustworthy -> return ``({}, False)``. + 2. If the ``Start``/``End`` test-output markers are missing, the test patch + never applied -> return ``({}, False)``. + 3. Otherwise extract the slice between the markers and parse per-test + ``"STATUS node_id"`` lines into a ``{node_id: STATUS}`` map. As a + fallback (output sometimes escapes the markers, e.g. to stderr) the whole + log is scanned when the slice yields nothing. + + Args: + log: The combined stdout/stderr captured from running the eval script. + + Returns: + A tuple ``(status_map, patch_applied)``. ``status_map`` maps each test + node id to its status token. ``patch_applied`` is ``True`` only when the + markers were found and no bad code fired. + """ + if any(code in log for code in _BAD_CODES): + return {}, False + if START_TEST_OUTPUT not in log or END_TEST_OUTPUT not in log: + return {}, False + + between = log.split(START_TEST_OUTPUT, 1)[1].split(END_TEST_OUTPUT, 1)[0] + status_map = _parse_pytest_status_lines(between) + if not status_map: + # Fallback: some runners emit per-test lines outside the markers. + status_map = _parse_pytest_status_lines(log) + return status_map, True + + +def _parse_pytest_status_lines(text: str) -> dict[str, str]: + """Parse ``"STATUS node_id"`` pytest-style lines into a status map. + + A status line starts with one of the recognised status tokens, and the node + id is the second whitespace field.
FAILED lines may read + ``"FAILED node_id - reason"``; the ``" - "`` separator is rewritten to + ``" "`` and only the second field is kept, so the trailing reason is ignored. + + Args: + text: Text containing zero or more per-test status lines. + + Returns: + A mapping from each test node id to its status token. When a node id + appears more than once, the last occurrence wins. + """ + status_map: dict[str, str] = {} + for raw_line in text.split("\n"): + line = raw_line.strip() + token = next((t for t in _STATUS_TOKENS if line.startswith(t)), None) + if token is None: + continue + if token == "FAILED": + line = line.replace(" - ", " ") + fields = line.split() + if len(fields) <= 1: + continue + node_id = fields[1] + # Last status wins for a duplicated node id: a later line overwrites an + # earlier one, so a runner that re-reports a node (e.g. a rerun plugin) + # ends up with its final status. + status_map[node_id] = fields[0] + return status_map + + +def passed_tests(status_map: dict[str, str]) -> list[str]: + """Return node ids whose status counts as a pass (PASSED or XFAIL). + + Args: + status_map: A mapping from test node id to its status token. + + Returns: + The list of node ids whose status is a passing token. + """ + return [node for node, status in status_map.items() if status in _PASS_TOKENS] + + +async def flat_run_eval(env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Run the instance's eval script in the sandbox and capture its log. + + The eval script must be supplied on the task via + ``task.metadata["eval_script"]``. It is written into the sandbox and run, + teeing its combined output to :data:`EVAL_LOG_PATH`; the captured + stdout/stderr already contain the ``>>>>>`` markers, so ``test_output`` is + graded directly. The log file is read back as a fallback when the streamed + output is empty. + + Args: + env: The SWE environment used to write files and execute commands in the + sandbox. + task: The task whose ``metadata["eval_script"]`` is run.
+ + Returns: + An :class:`EvalArtifacts` holding the captured test output, the script's + return code, whether a model patch existed, and raw metadata. When no + eval script is present the artifacts carry an ``eval_error``. + """ + eval_script = task.metadata.get("eval_script", "") + if not eval_script: + # No script to run -> tag the artifacts as an eval error. flat_grade() does + # not mask this; it grades the task as unresolved (reward 0). + return EvalArtifacts( + test_output="", + return_code=1, + patch_applied=False, + raw={"error_type": "eval_error", "flat": True}, + ) + + await env.write_text(EVAL_SCRIPT_PATH, eval_script if eval_script.endswith("\n") else eval_script + "\n") + # The script is self-contained (it resets + applies patches + runs tests). + # Piping through tee keeps the captured log even on a non-zero test exit so + # grade() can parse per-test status, and `exit ${PIPESTATUS[0]}` propagates + # the script's own exit status rather than tee's. + result = await env.execute( + f"bash {EVAL_SCRIPT_PATH} 2>&1 | tee {EVAL_LOG_PATH}; exit ${{PIPESTATUS[0]}}", + cwd=task.repo_workdir, + is_eval=True, + timeout_s=task.metadata.get("tests_timeout"), + ) + log_text = result["output"] + if not log_text.strip() and result.get("error_type") not in {"sandbox", "timeout"}: + # Streamed output was empty; fall back to the tee'd log file. + cat = await env.execute(f"cat {EVAL_LOG_PATH}", cwd=task.repo_workdir) + if cat["returncode"] == 0: + log_text = cat["output"] + + return EvalArtifacts( + test_output=log_text, + return_code=result["returncode"], + patch_applied=bool(task.model_patch), + raw={"error_type": result.get("error_type"), "flat": True}, + ) + + +def flat_grade(task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade a flat eval-script log host-side. + + Only genuine infra failures (sandbox/timeout) are masked via ``error_kind``. + An unbuildable / missing / empty eval spec (``error_type == "eval_error"``) is + NOT masked: it falls through to the parser, which finds no markers and grades + unmasked ``resolved=False`` (reward 0), matching main's behavior.
A log with a + bad code or missing markers likewise grades as unresolved with + ``patch_applied`` set from the parse, since a failed setup is a legitimate + unresolved rather than an infra mask. + + Args: + task: The task being graded, supplying the instance id, expected + ``fail_to_pass`` / ``pass_to_pass`` tests, and model patch. + artifacts: The eval artifacts produced by :func:`flat_run_eval`. + + Returns: + A :class:`SweEvalReport` describing whether the task was resolved, + whether the patch applied and existed, any masking ``error_kind``, and + the per-test status breakdown. + """ + if artifacts.raw.get("error_type") in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=artifacts.raw["error_type"], + ) + + status_map, log_patch_applied = parse_eval_log(artifacts.test_output) + passed = passed_tests(status_map) + # Thread the full status_map so compute_resolved mirrors swebench's + # get_eval_tests_report semantics: a required test counts as a failure only when + # absent or FAILED/ERROR, while neutral statuses (SKIPPED/XPASS) are excluded + # rather than treated as failures (which a bare passed-set membership check would). + resolved = log_patch_applied and compute_resolved( + fail_to_pass=task.fail_to_pass, + pass_to_pass=task.pass_to_pass, + passed=passed, + status_map=status_map, + ) + return SweEvalReport( + instance_id=task.instance_id, + resolved=resolved, + patch_applied=log_patch_applied, + patch_exists=bool(task.model_patch), + tests_status={"passed": passed, "all": status_map}, + ) + + +def flat_eval_enabled(harness_flag: bool, task: SweTask) -> bool: + """Return whether flat (host-side) mode should be used for this task. + + Flat mode applies when the harness flag selects it or the task opts in via + ``metadata["flat_eval"]``. This is a pure predicate; it neither swaps the + harness nor changes provider support. 
+ + Args: + harness_flag: Whether the harness itself selects flat grading. + task: The task whose ``metadata["flat_eval"]`` is consulted. + + Returns: + ``True`` when either source selects flat mode, otherwise ``False``. + """ + return bool(harness_flag) or bool(task.metadata.get("flat_eval", False)) diff --git a/resources_servers/swe_bench/harnesses/nv_internal.py b/resources_servers/swe_bench/harnesses/nv_internal.py new file mode 100644 index 0000000000..3eb7fac4f0 --- /dev/null +++ b/resources_servers/swe_bench/harnesses/nv_internal.py @@ -0,0 +1,426 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""nv-internal-1 harness: flat, host-graded NVIDIA-internal family. + +This family does not run any in-container grading harness: it ships a per-instance +``run_script.sh`` + ``parsing_script.py`` that emit a structured ``output.json`` +test report. The recipe is a 3-hop sequence: + + 1. ``bash run_script.sh > stdout.log 2> stderr.log`` (keep streams separate) + 2. ``python parsing_script.py stdout.log stderr.log output.json`` (parse to JSON report) + 3. read ``output.json`` back host-side + +Grading is then a pure host-side parse of that report's ``{tests: [{name, status}]}`` +shape. Because the family is flat and host-graded, it runs on any exec-capable +provider (e.g. docker). The run script, parsing script, and model patch are +uploaded by ``materialize``. 
+""" + +from __future__ import annotations + +import ast +import json +import re +from typing import TYPE_CHECKING, Any + +from nemo_gym.sandbox import SandboxResources, SandboxSpec +from resources_servers.swe_bench.harness import ( + EvalArtifacts, + SweEvalReport, + SweTask, + SweTaskHarness, + _ensure_trailing_newline, + compute_resolved, +) + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +#: nv-internal default working directory. +NV_DEFAULT_WORKDIR = "/app" +#: The generic ``build_task`` default workdir; means "the row didn't set one". +_GENERIC_DEFAULT_WORKDIR = "/testbed" + + +def _nv_workdir(task: SweTask) -> str: + """Resolve the working directory for nv-internal hops. + + The generic ``build_task`` defaults ``repo_workdir`` to ``/testbed``, which is + not the nv-internal convention. A row that explicitly sets a non-default + ``repo_workdir`` is honored; otherwise the nv-internal default ``/app`` is used. + + Args: + task: The task whose ``repo_workdir`` is consulted. + + Returns: + The working directory path (str) to run every nv-internal hop in. + """ + workdir = task.repo_workdir + if not workdir or workdir == _GENERIC_DEFAULT_WORKDIR: + return NV_DEFAULT_WORKDIR + return workdir + + +def parse_passed_tests(report: dict[str, Any]) -> list[str]: + """Extract PASSED test names from a parsing_script ``output.json`` report. + + The report shape is ``{"tests": [{"name": ..., "status": "PASSED"|...}, ...]}``. + + Args: + report: The parsed ``output.json`` report mapping. + + Returns: + The list of test names (list[str]) whose status is ``"PASSED"``. + """ + return [ + test["name"] + for test in report.get("tests", []) + if isinstance(test, dict) and test.get("status") == "PASSED" and "name" in test + ] + + +class NVInternalHarness(SweTaskHarness): + """Flat, host-graded harness for the NVIDIA-internal task family. 
+ + Tasks ship their own ``run_script.sh`` and ``parsing_script.py`` that produce + a structured ``output.json`` report, which is graded entirely host-side. The + harness runs on any exec-capable provider. + """ + + name = "nv-internal-1" + grade_strategy = "flat-host-grade" + + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox spec for an nv-internal task. + + Environment variables parsed from the task's dockerfiles are injected into + ``spec.env`` so the provider applies them to every exec hop. This is a + no-op when the dataset does not carry the dockerfiles. + + Args: + task: The task to build a sandbox spec for. + + Returns: + A :class:`SandboxSpec` describing the image, workdir, timeouts, + environment, metadata, resources, and provider options. + """ + env = {"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"} + env.update(_parse_dockerfile_env(task)) + return SandboxSpec( + image=task.image, + workdir=_nv_workdir(task), + ttl_s=task.metadata.get("ttl_s", 1800), + ready_timeout_s=task.metadata.get("ready_timeout_s", 600), + env=env, + metadata={ + "instance_id": task.instance_id[:63], + "benchmark": task.benchmark, + "harness": self.name, + }, + resources=SandboxResources.from_mapping(task.metadata.get("resources", {})), + provider_options=task.metadata.get("provider_options", {}), + ) + + def supports_provider(self, provider_name: str) -> bool: + """Report whether this harness supports the named provider. + + The family is flat and host-graded, so every exec-capable provider is + supported. + + Args: + provider_name: The provider name being checked. + + Returns: + ``True`` for every provider. + """ + return True # flat, host-graded: works on any exec-capable provider + + async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Upload run_script.sh, parsing_script.py, and the model patch. + + The scripts live in ``task.metadata``. 
The dataset stores them under + dotted keys (``"run_script.sh"`` / ``"parsing_script.py"``), which are read + first, falling back to the extensionless keys only if the dotted ones are + absent. + + Args: + env: The environment used to write files into the sandbox. + task: The task carrying the patch and scripts to upload. + """ + if task.model_patch: + await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch)) + run_script = task.metadata.get("run_script.sh") or task.metadata.get("run_script", "") + parsing_script = task.metadata.get("parsing_script.py") or task.metadata.get("parsing_script", "") + if run_script: + await env.write_text("/root/run_script.sh", _ensure_trailing_newline(run_script)) + if parsing_script: + await env.write_text("/root/parsing_script.py", _ensure_trailing_newline(parsing_script)) + + async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Reset the checkout to ``base_commit``. + + Runs ``git reset --hard`` followed by ``git checkout`` of the base commit + (not ``git clean``) in the nv-internal working directory. + + Args: + env: The environment used to execute commands in the sandbox. + task: The task carrying the ``base_commit`` to reset to. + """ + if task.base_commit: + await env.execute( + f"git reset --hard {task.base_commit} && git checkout {task.base_commit}", + cwd=_nv_workdir(task), + ) + + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Run the 3-hop evaluation recipe and collect its artifacts. + + Applies the model patch, runs the optional per-instance repo setup hook, + then executes the run/parse/read sequence. Sandbox or timeout failures in + any hop short-circuit and are surfaced via ``raw["error_type"]``. + + Args: + env: The environment used to execute commands in the sandbox. + task: The task being evaluated. 
+ + Returns: + An :class:`EvalArtifacts` holding the report output, return code, + whether the patch applied cleanly, and any infra error type. + """ + workdir = _nv_workdir(task) + # Apply the model patch with `--reject` to tolerate conflicts: hunks that + # apply are kept and rejected hunks are written to .rej files. A non-zero + # exit only clears `patch_applied`; the evaluation continues regardless. + patch_applied = True + if task.model_patch: + applied = await env.execute( + "git apply --ignore-space-change --ignore-whitespace --reject -v /root/patch.diff", + cwd=workdir, + ) + patch_applied = applied["returncode"] == 0 + + # Optional per-instance repo setup hook; only its last line is executed. + repo_cmd = task.metadata.get("before_repo_set_cmd", "").strip() + if repo_cmd: + repo_cmd = repo_cmd.split("\n")[-1] + setup = await env.execute(repo_cmd, cwd=workdir, is_eval=True) + if setup.get("error_type") in {"sandbox", "timeout"}: + return EvalArtifacts( + test_output=setup["output"], + return_code=setup["returncode"], + patch_applied=patch_applied, + raw={"error_type": setup.get("error_type")}, + ) + + # Hop 1: run the per-instance script, keeping stdout/stderr separate. + # The selected test files are passed positionally. + test_files = _format_test_files(task.metadata.get("selected_test_files_to_run", [])) + run = await env.execute( + f"bash /root/run_script.sh {test_files} > /root/stdout.log 2> /root/stderr.log || true", + cwd=workdir, + is_eval=True, + ) + if run.get("error_type") in {"sandbox", "timeout"}: + return EvalArtifacts( + test_output=run["output"], + return_code=run["returncode"], + patch_applied=patch_applied, + raw={"error_type": run.get("error_type")}, + ) + + # Hop 2: parse the logs into a JSON report.
+ parse = await env.execute( + "python /root/parsing_script.py /root/stdout.log /root/stderr.log /root/output.json", + cwd=workdir, + is_eval=True, + ) + if parse.get("error_type") in {"sandbox", "timeout"}: + return EvalArtifacts( + test_output=parse["output"], + return_code=parse["returncode"], + patch_applied=patch_applied, + raw={"error_type": parse.get("error_type")}, + ) + + # Hop 3: read the report back host-side. + report = await env.execute("cat /root/output.json", cwd=workdir, is_eval=True) + return EvalArtifacts( + test_output=report["output"], + return_code=report["returncode"], + patch_applied=patch_applied, + raw={"error_type": report.get("error_type")}, + ) + + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade the evaluation artifacts into a report. + + Parses the host-side ``output.json`` report, extracts PASSED tests, and + derives resolution from the required FAIL_TO_PASS / PASS_TO_PASS sets. An + infra failure (sandbox or timeout) is masked via ``error_kind`` rather than + scored as unresolved. + + Args: + task: The task being graded. + artifacts: The artifacts produced by ``run_eval``. + + Returns: + A :class:`SweEvalReport` with resolution status, patch flags, and the + parsed test report. + """ + # Infra failure → mask via error_kind (never scored as "unresolved"). + if artifacts.raw.get("error_type") in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=artifacts.raw["error_type"], + ) + try: + report = json.loads(artifacts.test_output) if artifacts.test_output.strip() else {} + except (ValueError, TypeError): + report = {} + passed = parse_passed_tests(report) + f2p, p2p = _resolve_required_tests(task) + # Resolution is derived from tests alone and never gated on patch-apply rc. + # An empty report or no required tests → unresolved (compute_resolved + # returns False). 
+ resolved = compute_resolved( + fail_to_pass=f2p, + pass_to_pass=p2p, + passed=passed, + ) + return SweEvalReport( + instance_id=task.instance_id, + resolved=resolved, + patch_applied=artifacts.patch_applied, + patch_exists=bool(task.model_patch), + tests_status={"passed": passed, "report": report}, + ) + + +def _format_test_files(test_files: Any) -> str: + """Build the comma-joined test-files argument. + + Accepts a list, or a string that is either a comma-joined value or a + ``repr``-style list. A stringified list may use single quotes + (``['a', 'b']``) which ``json.loads`` rejects, so ``ast.literal_eval`` is used + (handling single-quoted and native lists) with a safe fallback to the raw + string. + + Args: + test_files: A list/tuple of names, or a string holding a comma-joined + value or a stringified list. + + Returns: + The comma-joined test-files argument (str); empty for unsupported inputs. + """ + if isinstance(test_files, (list, tuple)): + return ",".join(str(item) for item in test_files) + if isinstance(test_files, str): + stripped = test_files.strip() + if stripped.startswith("[") and stripped.endswith("]"): + try: + parsed = ast.literal_eval(stripped) + if isinstance(parsed, (list, tuple)): + return ",".join(str(item) for item in parsed) + except (ValueError, SyntaxError): + pass + return stripped + return "" + + +def _resolve_required_tests(task: SweTask) -> tuple[list[str], list[str]]: + """Resolve the FAIL_TO_PASS / PASS_TO_PASS required-test sets. + + The ``fail_to_pass_select`` / ``pass_to_pass_select`` keys on ``task.metadata`` + take precedence when present; otherwise the plain ``task.fail_to_pass`` / + ``task.pass_to_pass`` are used. Values may be lists or stringified lists. + + Args: + task: The task whose required-test sets are resolved. + + Returns: + A ``(fail_to_pass, pass_to_pass)`` tuple of test-name lists. 
+ """ + f2p = task.metadata.get("fail_to_pass_select") + f2p = _coerce_test_list(f2p) if f2p is not None else list(task.fail_to_pass) + p2p = task.metadata.get("pass_to_pass_select") + p2p = _coerce_test_list(p2p) if p2p is not None else list(task.pass_to_pass) + return f2p, p2p + + +def _coerce_test_list(value: Any) -> list[str]: + """Coerce a test-list value (list or stringified list) into a list of names. + + Args: + value: A list/tuple of names, or a string holding a stringified list. + + Returns: + The list of test names (list[str]); empty for unsupported inputs. + """ + if isinstance(value, (list, tuple)): + return [str(item) for item in value] + if isinstance(value, str): + stripped = value.strip() + if stripped.startswith("[") and stripped.endswith("]"): + try: + parsed = ast.literal_eval(stripped) + if isinstance(parsed, (list, tuple)): + return [str(item) for item in parsed] + except (ValueError, SyntaxError): + pass + return [] + + +def _parse_dockerfile_env(task: SweTask) -> dict[str, str]: + """Parse ``ENV`` lines from the task's dockerfiles into a name->value mapping. + + Scans ``base_dockerfile + instance_dockerfile`` for ``ENV`` directives and + converts them to environment variables. Handles both Docker forms: + + ENV KEY=VALUE (equals) + ENV KEY VALUE (space-separated) + + Returns ``{}`` when the dockerfiles are absent from metadata. + + Args: + task: The task whose dockerfile metadata is scanned. + + Returns: + A mapping (dict[str, str]) of environment variable names to values. + """ + base_dockerfile = str(task.metadata.get("base_dockerfile", "") or "") + instance_dockerfile = str(task.metadata.get("instance_dockerfile", "") or "") + env: dict[str, str] = {} + for raw_line in (base_dockerfile + "\n" + instance_dockerfile).split("\n"): + line = raw_line.strip() + if not line.startswith("ENV "): + continue + body = line[len("ENV ") :].strip() + if "=" in body: + # Format: ENV KEY=VALUE -> normalize spaces around the first `=`. 
+ key, _, value = body.partition("=") + key = re.sub(r"\s+", "", key) + value = value.strip() + else: + # Format: ENV KEY VALUE -> split into key + remainder value. + parts = body.split(None, 1) + if len(parts) < 2: + continue + key, value = parts[0], parts[1] + if key: + env[key] = value + return env diff --git a/resources_servers/swe_bench/harnesses/r2egym.py b/resources_servers/swe_bench/harnesses/r2egym.py new file mode 100644 index 0000000000..9b4f42f24c --- /dev/null +++ b/resources_servers/swe_bench/harnesses/r2egym.py @@ -0,0 +1,174 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""r2e-gym harness — host-side (flat) graded. + +Runs the instance's eval script in the sandbox and parses the log host-side via the shared +flat-eval path, so it runs on any exec-capable provider. + +NOTE: the apptainer-only nested ``run_local_evaluation`` path (which produced r2e-gym's own +``report.json`` in-container) was removed when PR #1694 took ownership of the apptainer +provider. Re-wiring r2e-gym's nested grading + ``.sif``/mounts onto #1694's provider is tracked +for a follow-up PR (see APPTAINER_PR3_TRACKER.md); until then r2e-gym grades flat (it needs an +``eval_script`` in task metadata, else the flat grader masks the sample as an eval error). 
+""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from nemo_gym.sandbox import SandboxResources, SandboxSpec +from resources_servers.swe_bench.harness import ( + EvalArtifacts, + SweEvalReport, + SweTask, + SweTaskHarness, + _ensure_trailing_newline, + compute_resolved, +) +from resources_servers.swe_bench.harnesses import flat_eval + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +class R2EGymHarness(SweTaskHarness): + """Harness for the r2e-gym family of SWE tasks (host-side / flat graded).""" + + name = "r2e-gym" + grade_strategy = "flat-host-grade" + + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox spec for an r2e-gym task. + + Args: + task: The SWE task whose metadata, image, and workdir describe the sandbox. + + Returns: + SandboxSpec: The populated sandbox spec (image, workdir, TTL, env, metadata, + resources, and any provider options carried on the task). + """ + return SandboxSpec( + image=task.image, + workdir=task.repo_workdir, + ttl_s=task.metadata.get("ttl_s", 1800), + ready_timeout_s=task.metadata.get("ready_timeout_s", 600), + env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"}, + metadata={ + "instance_id": task.instance_id[:63], + "benchmark": task.benchmark, + "harness": self.name, + }, + resources=SandboxResources.from_mapping(task.metadata.get("resources", {})), + provider_options=dict(task.metadata.get("provider_options", {})), + ) + + async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Write the bare ``/root/patch.diff`` the eval script applies. + + Args: + env: The active SWE environment used to write files into the sandbox. + task: The SWE task supplying the model patch (newline-normalized). 
+ """ + if task.model_patch: + await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch)) + + async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Reset the repository checkout (no-op for r2e-gym). + + Args: + env: The active SWE environment (unused). + task: The SWE task (unused). + """ + return None + + def hide_eval_tests_commands(self) -> list[str]: + """Build shell commands that strip the held-out eval tests from the agent's checkout. + + ``/r2e_tests`` holds the evaluation tests the agent must not see; ``run_tests.sh`` + launches them. ``run_tests.sh`` is deleted only when it references ``r2e_tests`` + (substring guard). The agent adapter runs these after ``materialize``. + + Returns: + list[str]: One shell command per checkout root (``""``, ``/root``, ``/testbed``). + """ + commands: list[str] = [] + for root_dir in ["", "/root", "/testbed"]: + commands.append( + f"rm -rf {root_dir}/r2e_tests && " + f"if grep -qs r2e_tests {root_dir}/run_tests.sh; then rm -rf {root_dir}/run_tests.sh; fi" + ) + return commands + + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Run the instance's eval script in-sandbox and grade the log host-side. + + Args: + env: The active SWE environment used to execute commands in the sandbox. + task: The SWE task whose ``metadata['eval_script']`` is run. + + Returns: + EvalArtifacts: The captured test output, return code, patch existence, and flat + markers (masked as ``eval_error`` when no eval script is present). + """ + return await flat_eval.flat_run_eval(env, task) + + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade an r2e-gym task from its evaluation artifacts (host-side, flat). 
+ + Unlike the SWE-bench flat grader, this path does NOT gate ``resolved`` on the + SWE-bench ``>>>>> Start/End Test Output`` marker pair: r2e-gym's ``run_tests.sh`` + does not emit those swebench sentinels, so requiring them would mask every r2e-gym + sample as unresolved. Per-test status lines are parsed from the whole log and the + node-ids are matched directly against the required ``fail_to_pass`` / ``pass_to_pass`` + sets (R2E-Gym uses pytest node-ids verbatim). Only genuine infra failures + (sandbox/timeout) are masked. + + Args: + task: The SWE task being graded. + artifacts: The evaluation artifacts produced by ``run_eval``. + + Returns: + SweEvalReport: The resolved/unresolved verdict with patch state and any error kind. + """ + if artifacts.raw.get("error_type") in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=artifacts.raw["error_type"], + ) + # Parse per-test status lines from the whole log (no swebench-marker gate). An + # unbuildable / empty log yields an empty status map -> no required test passes -> + # unmasked unresolved, and compute_resolved still returns False for an empty + # required set (the edge validated by main). + status_map = flat_eval._parse_pytest_status_lines(artifacts.test_output) + passed = flat_eval.passed_tests(status_map) + # Thread the full status_map so compute_resolved mirrors swebench's + # get_eval_tests_report semantics: neutral-status required tests (SKIPPED/XPASS) + # are excluded rather than treated as failures. 
+ resolved = compute_resolved( + fail_to_pass=task.fail_to_pass, + pass_to_pass=task.pass_to_pass, + passed=passed, + status_map=status_map, + ) + return SweEvalReport( + instance_id=task.instance_id, + resolved=resolved, + patch_applied=bool(status_map), + patch_exists=bool(task.model_patch), + tests_status={"passed": passed, "all": status_map}, + ) diff --git a/resources_servers/swe_bench/harnesses/swe_bench_ext.py b/resources_servers/swe_bench/harnesses/swe_bench_ext.py new file mode 100644 index 0000000000..7c925c1264 --- /dev/null +++ b/resources_servers/swe_bench/harnesses/swe_bench_ext.py @@ -0,0 +1,311 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""swe-bench-ext harness: flat, host-graded reference family. + +Applies the model patch (and test patch) against the repository checkout, runs +the framework test command, and grades host-side with the parser +(:func:`resources_servers.swe_bench.parsing.parse_and_check_tests`). + +Grading delegates the full per-framework logic to ``parse_and_check_tests``: +junit-xml parsing, test-id normalization, the fuzzy matcher, the framework +dispatch, the ``::build``/``::compile`` synthetic-PASS injection, and +build-failed-package propagation. + +``resolved`` is taken from the parser's verdict (all FAIL_TO_PASS passed AND all +PASS_TO_PASS passed). 
It does not depend on ``patch_applied``: the model and test +patches are applied best-effort and grading is on the tests only. +``patch_applied`` is still recorded for information. +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +from nemo_gym.sandbox import SandboxResources, SandboxSpec +from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness +from resources_servers.swe_bench.parsing import ( + get_framework_config, + get_test_command_with_output, + parse_and_check_tests, +) + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +# Default checkout locations probed (in order) when locating the repo, mirroring main's +# ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null`` ladder +# in SweBenchExtDatasetProcessor's eval script. +_REPO_WORKDIR_LADDER = ("/testbed", "/workspace/repo", "/app", "/root/repo") + + +# Output markers the parser (parse_and_check_tests) extracts content between. +_TEST_OUTPUT_START = "<<>>" +_TEST_OUTPUT_END = "<<>>" +_RESULT_FILE_START = "<<>>" +_RESULT_FILE_END = "<<>>" + + +class SweBenchExtHarness(SweTaskHarness): + """Flat, host-graded harness for the swe-bench-ext task family. + + Runs the task's framework test command inside a single sandbox and grades the + captured output on the host. Works on any exec-capable sandbox provider. + """ + + name = "swe-bench-ext" + grade_strategy = "flat-host-grade" + + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox specification for a task. + + Args: + task: The SWE task describing the image, working directory, and + per-task metadata (timeouts, resources, provider options). + + Returns: + SandboxSpec: The sandbox spec used to launch the task's container. 
+ """ + return SandboxSpec( + image=task.image, + workdir=task.repo_workdir, + ttl_s=task.metadata.get("ttl_s", 1800), + ready_timeout_s=task.metadata.get("ready_timeout_s", 600), + env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"}, + metadata={ + "instance_id": task.instance_id[:63], + "benchmark": task.benchmark, + "harness": self.name, + }, + resources=SandboxResources.from_mapping(task.metadata.get("resources", {})), + provider_options=task.metadata.get("provider_options", {}), + ) + + def supports_provider(self, provider_name: str) -> bool: + """Report whether this harness supports a sandbox provider. + + Being flat and host-graded, it works on any exec-capable provider. + + Args: + provider_name: The name of the sandbox provider. + + Returns: + bool: Always ``True``. + """ + return True + + async def _resolve_repo_workdir(self, env: "AsyncSweEnvironment", task: SweTask) -> str: + """Locate the repository checkout, mirroring main's ``cd`` fallback ladder. + + Main's ``SweBenchExtDatasetProcessor`` eval script runs + ``cd /testbed 2>/dev/null || cd /workspace/repo 2>/dev/null || cd /app 2>/dev/null`` + so a repo that is not at ``/testbed`` is still found. This reproduces that + host-side: a row-provided ``repo_workdir`` that differs from the default and holds a + ``.git`` checkout wins; otherwise the ladder (``/testbed``, ``/workspace/repo``, + ``/app``, ``/root/repo``) is probed for a ``.git`` directory. If nothing matches the + task's ``repo_workdir`` is returned unchanged (preserving prior behavior). + + Args: + env: The async environment used to probe the sandbox. + task: The SWE task whose ``repo_workdir`` is the preferred/default location. + + Returns: + str: The resolved repository working directory inside the sandbox. + """ + # Prefer an explicit, non-default row workdir holding a checkout. 
+ candidates: list[str] = [] + if task.repo_workdir and task.repo_workdir != "/testbed": + candidates.append(task.repo_workdir) + candidates.extend(d for d in _REPO_WORKDIR_LADDER if d not in candidates) + for candidate in candidates: + probe = await env.execute(f'test -d "{candidate}/.git"', cwd="/") + if probe["returncode"] == 0: + return candidate + return task.repo_workdir + + async def reset_repo(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Reset the located checkout to ``base_commit`` for hermetic grading. + + Resolves the repo workdir via the same ladder main uses (so a non-``/testbed`` + checkout is found), then defers to the base ``git reset --hard`` behavior. + + Args: + env: The started environment to reset. + task: The task whose ``base_commit`` is restored. + """ + if task.base_commit: + workdir = await self._resolve_repo_workdir(env, task) + await env.execute(f"git reset --hard {task.base_commit}", cwd=workdir) + + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Apply patches, run the test command, and capture the evaluation output. + + Applies the model patch (and test patch) best-effort, then runs the + framework test command wrapped between output markers so the parser can + extract the structured result file or marked stdout. + + Args: + env: The async environment used to execute commands in the sandbox. + task: The SWE task providing the patches, test command, and framework. + + Returns: + EvalArtifacts: The captured test output, return code, whether the + model patch applied, and the execution error type if any. + """ + # Resolve the checkout via main's cd ladder so a non-/testbed repo is found. + workdir = await self._resolve_repo_workdir(env, task) + patch_applied = True + # Best-effort apply: a bad apply never fails the run (grading is on the + # tests only); we still record whether the model patch applied for info. 
+ apply_flags = "--reject --recount --ignore-space-change --ignore-whitespace" + if task.model_patch: + applied = await env.execute( + f"git apply {apply_flags} /root/patch.diff", + cwd=workdir, + ) + patch_applied = applied["returncode"] == 0 + if task.test_patch: + await env.execute( + f"git apply {apply_flags} /root/test_patch.diff", + cwd=workdir, + ) + # Wrap the command's output: add structured-output flags (--junitxml/--json) + # via get_test_command_with_output, run it between the markers, and dump the + # framework result file so parse_and_check_tests receives junit-xml (preferred) + # or the marked stdout. + # + # The framework is passed through verbatim. An empty framework must NOT be + # coerced to "pytest": for a non-pytest instance whose framework is absent, the + # parser's auto-detect path is what grades correctly, and the default framework + # config adds no flags and no result file. grade() reuses this SAME value via + # _resolve_framework so the two stay in lockstep. + framework = self._resolve_framework(task) + # Use the row's test command verbatim, with NO default runner. Main's + # SweBenchExtDatasetProcessor uses ``inst.get("test_command", "")`` (empty when + # absent): a command-less row runs no runner and grades unresolved. Injecting a + # default ``python -m pytest`` here would diverge from main by fabricating results. + base_command = task.test_command + test_cmd = get_test_command_with_output(base_command, framework) + result_file = (get_framework_config(framework, base_command) or {}).get("result_file") + result = await env.execute(self._wrap_eval_command(test_cmd, result_file), cwd=workdir, is_eval=True) + return EvalArtifacts( + test_output=result["output"], + return_code=result["returncode"], + patch_applied=patch_applied, + raw={"error_type": result.get("error_type")}, + ) + + @staticmethod + def _resolve_framework(task: SweTask) -> str: + """Return the framework value used by both ``run_eval`` and ``grade``. 
+ + Returns the task's framework verbatim. An empty or unknown value is + intentionally passed through unchanged: coercing it to ``"pytest"`` would + mis-dispatch the parser for non-pytest instances that ship no framework. + Centralizing this guarantees ``run_eval`` (which selects the + structured-output flag and result file) and ``grade`` (which parses the + output) agree on the framework. + + Args: + task: The SWE task whose framework value is returned. + + Returns: + str: The task's test framework name (possibly empty). + """ + return task.test_framework + + @staticmethod + def _wrap_eval_command(test_cmd: str, result_file: str | None) -> str: + """Wrap the eval command in the output markers and a result-file dump. + + The parser prefers the junit/json result file (emitted between the + RESULT_FILE markers) and falls back to the marked stdout. The ``mkdir -p`` + ensures ``/workspace/test-results`` exists first, since some frameworks + (e.g. junit/gradle, xctest) write their result file there. + + Args: + test_cmd: The test command to run inside the markers. + result_file: Path or glob of the framework result file to dump, or + ``None`` when the framework produces no result file. + + Returns: + str: A shell script that runs the test command and emits the marked + output and result-file blocks. 
+ """ + mkdir_block = "mkdir -p /workspace/test-results\n" + if result_file and "*" in result_file: + result_block = ( + f'echo "{_RESULT_FILE_START}"\n' + f"for f in {result_file}; do\n" + f' if [ -f "$f" ]; then echo "=== FILE: $f ==="; cat "$f"; echo ""; fi\n' + f"done 2>/dev/null || true\n" + f'echo "{_RESULT_FILE_END}"\n' + ) + elif result_file: + result_block = ( + f'echo "{_RESULT_FILE_START}"\n' + f'if [ -f "{result_file}" ]; then cat "{result_file}"; fi\n' + f'echo "{_RESULT_FILE_END}"\n' + ) + else: + result_block = "" + return f'{mkdir_block}echo "{_TEST_OUTPUT_START}"\n{test_cmd}\n{result_block}echo "{_TEST_OUTPUT_END}"\n' + + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade captured evaluation artifacts into a report. + + Infrastructure failures are masked via ``error_kind`` and never scored as + unresolved. Otherwise the test output is handed to ``parse_and_check_tests`` + and ``resolved`` is taken from the parser's verdict. + + Args: + task: The SWE task providing the expected test sets and framework. + artifacts: The captured test output, return code, and error type. + + Returns: + SweEvalReport: The grading report, including ``resolved``, + ``patch_applied``, ``patch_exists``, and the parsed test status (or + ``error_kind`` on infrastructure failure). + """ + # Infra failure: mask via error_kind (never scored as "unresolved"). + if artifacts.raw.get("error_type") in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=artifacts.raw["error_type"], + ) + # Delegate to the parser, passing the framework verbatim via the SAME + # _resolve_framework value run_eval used. An empty/unknown framework falls + # through to the parser's auto-detect path; coercing it to "pytest" here would + # mis-grade non-pytest instances. 
+ test_framework = self._resolve_framework(task) + result = parse_and_check_tests( + test_output=artifacts.test_output, + test_framework=test_framework, + fail_to_pass=task.fail_to_pass, + pass_to_pass=task.pass_to_pass, + instance_id=task.instance_id, + ) + # resolved is the parser's verdict (all F2P passed AND all P2P passed); it + # does NOT gate on patch_applied (grading is on tests only). + return SweEvalReport( + instance_id=task.instance_id, + resolved=bool(result["resolved"]), + patch_applied=artifacts.patch_applied, + patch_exists=bool(task.model_patch), + tests_status=result, + ) diff --git a/resources_servers/swe_bench/harnesses/swe_rebench.py b/resources_servers/swe_bench/harnesses/swe_rebench.py new file mode 100644 index 0000000000..68b863182b --- /dev/null +++ b/resources_servers/swe_bench/harnesses/swe_rebench.py @@ -0,0 +1,375 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""swe-rebench harness: a flat, host-graded family with a vendored log parser. + +This is a flat host-graded family: reset to base, apply the model patch and test +patch, run the install/test commands, then parse the test log host-side. + +Two things distinguish swe-rebench: + +* **JAVA env** — SWE-rebench tasks need + ``_JAVA_OPTIONS=-Djava.net.preferIPv6Addresses=false``, surfaced via + ``build_spec.env`` so it is set for the whole sandbox session. 
+* **Dynamic log parser** — swe-rebench has no single uniform pytest summary; the + correct per-test PASSED/FAILED status comes from a repo-specific parser keyed + by ``log_parser`` and shipped in the cloned ``SWE-rebench-V2`` repo + (``lib/agent/log_parsers.py`` or ``agent/log_parsers.py``). It is imported + dynamically, guarded by try/except. + +The cloned ``SWE-rebench-V2`` directory must be provisioned out-of-band. When it +is absent or the named parser cannot be resolved, ``grade`` masks the sample via +``error_kind`` rather than scoring a misleading ``unresolved``. +""" + +from __future__ import annotations + +import importlib.util +import json +import re +import sys +from pathlib import Path +from typing import TYPE_CHECKING, Any, Callable + +from nemo_gym.sandbox import SandboxResources, SandboxSpec +from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, SweTaskHarness + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +# JAVA flag required for every SWE-rebench task. +_JAVA_OPTIONS = "-Djava.net.preferIPv6Addresses=false" + +# Patch-apply flags shared by the model and test patch; non-fatal +# ``git apply --reject`` style so a failed apply still runs the tests. +_APPLY_FLAGS = "--reject --recount --ignore-space-change --whitespace=nowarn" + +# Timing/duration suffixes some test runners append to node names; stripped so +# the parser output lines up with the (already-normalized) expected node ids. +_REBENCH_TIMING_NORMALIZE_RES = [ + re.compile(r"\s*\[\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\]\s*$", re.IGNORECASE), + re.compile(r"\s+in\s+\d+(?:\.\d+)?\s+(?:msec|sec)\b", re.IGNORECASE), + re.compile(r"\s*\(\s*\d+(?:\.\d+)?\s*(?:ms|s)\s*\)\s*$", re.IGNORECASE), +] + + +def _normalize_test_name(name: str) -> str: + """Strip trailing timing annotations from a test node name. + + Args: + name (str): The raw test node name, possibly carrying a trailing timing + or duration annotation. 
+ + Returns: + str: The node name with any timing suffix removed and surrounding + whitespace stripped. + """ + for pattern in _REBENCH_TIMING_NORMALIZE_RES: + name = pattern.sub("", name) + return name.strip() + + +def _load_rebench_log_parsers(rebench_repo_dir: Path): + """Dynamically import the cloned SWE-rebench-V2 ``log_parsers`` module. + + Prefers ``lib/agent/log_parsers.py`` and falls back to + ``agent/log_parsers.py``, temporarily prepending the repo (and its ``lib`` + directory) to ``sys.path`` so the module's intra-repo imports resolve. + + Args: + rebench_repo_dir (Path): Path to the cloned SWE-rebench-V2 repository. + + Returns: + ModuleType: The imported ``log_parsers`` module. + + Raises: + FileNotFoundError: If the cloned directory has not been provisioned and + no ``log_parsers.py`` can be located. + """ + lp_path = rebench_repo_dir / "lib" / "agent" / "log_parsers.py" + if not lp_path.exists(): + lp_path = rebench_repo_dir / "agent" / "log_parsers.py" + if not lp_path.exists(): + raise FileNotFoundError( + f"SWE-rebench-V2 log_parsers not found under {rebench_repo_dir}; " + "provision the clone via setup_scripts/swe_rebench.sh" + ) + + extra_paths = [str(rebench_repo_dir), str(rebench_repo_dir / "lib")] + added: list[str] = [] + for p in extra_paths: + if p not in sys.path: + sys.path.insert(0, p) + added.append(p) + try: + spec = importlib.util.spec_from_file_location("_rebench_log_parsers", str(lp_path)) + mod = importlib.util.module_from_spec(spec) + spec.loader.exec_module(mod) + return mod + finally: + for p in added: + try: + sys.path.remove(p) + except ValueError: + pass + + +def _resolve_parser(log_parsers, log_parser_name: str) -> Callable[[str], dict[str, str]] | None: + """Resolve a parser callable from the loaded module. + + Looks up the name in the module's ``NAME_TO_PARSER`` mapping first, then + falls back to a module-level attribute of the same name. + + Args: + log_parsers: The imported ``log_parsers`` module. 
+ log_parser_name (str): The name of the parser to resolve. + + Returns: + Callable[[str], dict[str, str]] | None: The resolved parser callable, or + ``None`` if no parser matches the name. + """ + name_to_parser = getattr(log_parsers, "NAME_TO_PARSER", {}) or {} + return name_to_parser.get(log_parser_name) or getattr(log_parsers, log_parser_name, None) + + +def _as_list(value: Any) -> list[str]: + """Coerce a test-command/install/list field to a list of strings. + + Accepts the value as a JSON-encoded string, a bare string, or a list. A + JSON-encoded string is parsed and coerced recursively; a bare string that + fails to parse is wrapped in a single-element list. + + Args: + value (Any): The field value to coerce. May be ``None``, a string, a + list, a tuple, or any other type. + + Returns: + list[str]: The value normalized to a list of strings. An empty list is + returned for ``None`` or an empty string. + """ + if value is None: + return [] + if isinstance(value, str): + text = value.strip() + if not text: + return [] + if text[0] in "[{": + try: + parsed = json.loads(text) + except (ValueError, TypeError): + return [value] + return _as_list(parsed) + return [value] + if isinstance(value, (list, tuple)): + return [str(v) for v in value] + return [str(value)] + + +class SweRebenchHarness(SweTaskHarness): + """Flat, host-graded harness for the swe-rebench benchmark family. + + Applies the model and test patches, runs the install/test commands, then + parses the test log host-side using a repo-specific parser loaded + dynamically from the cloned SWE-rebench-V2 repository. + """ + + name = "swe-rebench" + grade_strategy = "flat-host-grade" + + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox spec for a swe-rebench task. + + Sets the git and ``_JAVA_OPTIONS`` environment variables, merges any + task-provided env, and forwards TTL, readiness timeout, resources, and + provider options from the task metadata. 
+ + Args: + task (SweTask): The task to build a sandbox specification for. + + Returns: + SandboxSpec: The sandbox specification for running the task. + """ + # _JAVA_OPTIONS forces IPv4 for SWE-rebench tasks. + env = { + "GIT_CONFIG_GLOBAL": "/dev/null", + "GIT_PAGER": "cat", + "_JAVA_OPTIONS": _JAVA_OPTIONS, + } + env.update(task.metadata.get("env", {})) + return SandboxSpec( + image=task.image, + workdir=task.repo_workdir, + ttl_s=task.metadata.get("ttl_s", 1800), + ready_timeout_s=task.metadata.get("ready_timeout_s", 600), + env=env, + metadata={ + "instance_id": task.instance_id[:63], + "benchmark": task.benchmark, + "harness": self.name, + }, + resources=SandboxResources.from_mapping(task.metadata.get("resources", {})), + provider_options=task.metadata.get("provider_options", {}), + ) + + def supports_provider(self, provider_name: str) -> bool: + """Report whether the harness supports a given sandbox provider. + + Being flat and host-graded, it works on any exec-capable provider. + + Args: + provider_name (str): The name of the sandbox provider. + + Returns: + bool: Always ``True``. + """ + return True # flat, host-graded: works on any exec-capable provider + + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Apply patches, run install and test commands, and collect artifacts. + + Applies the model patch then the test patch (both best-effort), runs the + non-fatal install commands, then runs the test block with the eval + timeout. Records whether the model patch applied for informational + purposes only; grading does not gate on it. + + Args: + env (AsyncSweEnvironment): The environment used to execute commands + inside the sandbox. + task (SweTask): The task being evaluated. + + Returns: + EvalArtifacts: The captured test output, return code, model-patch + application status, and raw error metadata. 
+ """ + workdir = task.repo_workdir + install_config = task.metadata.get("install_config", {}) or {} + install_cmds = _as_list(install_config.get("install")) + test_cmds = _as_list(install_config.get("test_cmd")) or ([task.test_command] if task.test_command else []) + + # Apply the model patch first, then the test patch. Both are best-effort: + # a failed apply still runs the tests; model-patch application is recorded + # for info only (grading does not gate on it). + patch_applied = True + if task.model_patch: + applied = await env.execute( + f"git apply {_APPLY_FLAGS} /root/patch.diff", + cwd=workdir, + ) + patch_applied = applied["returncode"] == 0 + if task.test_patch: + await env.execute(f"git apply {_APPLY_FLAGS} /root/test_patch.diff", cwd=workdir) + + # Install commands are non-fatal; failures there should not abort the + # test run. + for cmd in install_cmds: + await env.execute(cmd, cwd=workdir) + + test_block = "\n".join(test_cmds) if test_cmds else "python -m pytest -rA -q" + # Thread the eval timeout into the test exec, defaulting to 1800s so a + # stuck swe-rebench run is bounded. A row that explicitly carries a + # ``tests_timeout`` overrides the default. + result = await env.execute( + test_block, + cwd=workdir, + is_eval=True, + timeout_s=task.metadata.get("tests_timeout", 1800), + ) + return EvalArtifacts( + test_output=result["output"], + return_code=result["returncode"], + patch_applied=patch_applied, + raw={"error_type": result.get("error_type")}, + ) + + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade a swe-rebench task from its evaluation artifacts. + + Masks infra failures (sandbox/timeout) and grading errors (missing clone, + unknown parser, parser crash) via ``error_kind`` rather than scoring them. + Otherwise parses the test output with the resolved repo-specific parser + and marks the task resolved when every FAIL_TO_PASS and PASS_TO_PASS test + is in the passed set. 
+ + Args: + task (SweTask): The task being graded. + artifacts (EvalArtifacts): The artifacts captured during evaluation. + + Returns: + SweEvalReport: The grading report, with ``resolved`` set on success + or ``error_kind`` set when the sample is masked. + """ + # Infra failure -> mask via error_kind (never scored as "unresolved"). + if artifacts.raw.get("error_type") in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=artifacts.raw["error_type"], + ) + + install_config = task.metadata.get("install_config", {}) or {} + log_parser_name = install_config.get("log_parser", "") + # The cloned SWE-rebench-V2 dir is provisioned out-of-band; its absence, + # an unknown parser name, or a parser crash all mask the sample via + # ``error_kind`` rather than mis-scoring it. + rebench_repo_dir = task.metadata.get("rebench_repo_dir") + if not rebench_repo_dir: + return self._masked(task, artifacts, "eval_error") + try: + log_parsers = _load_rebench_log_parsers(Path(rebench_repo_dir)) + parser = _resolve_parser(log_parsers, log_parser_name) + if parser is None: + return self._masked(task, artifacts, "eval_error") + results = parser(artifacts.test_output) + except Exception: + return self._masked(task, artifacts, "eval_error") + + results = {_normalize_test_name(k): v for k, v in (results or {}).items()} + passed_set = {k for k, v in results.items() if v == "PASSED"} + fail_to_pass_set = {_normalize_test_name(n) for n in task.fail_to_pass} + pass_to_pass_set = {_normalize_test_name(n) for n in task.pass_to_pass} + + # Resolution rule: every FAIL_TO_PASS and PASS_TO_PASS test must be in the + # passed set. Resolution is not gated on patch application, and the + # F2P/P2P sets are not required to be non-empty (an empty set is a subset + # of any set). 
+ resolved = (fail_to_pass_set <= passed_set) and (pass_to_pass_set <= passed_set) + return SweEvalReport( + instance_id=task.instance_id, + resolved=resolved, + patch_applied=artifacts.patch_applied, + patch_exists=bool(task.model_patch), + tests_status={"passed": sorted(passed_set), "all": results}, + ) + + @staticmethod + def _masked(task: SweTask, artifacts: EvalArtifacts, kind: str) -> SweEvalReport: + """Build a masked report that records a grading error instead of a score. + + Args: + task (SweTask): The task being graded. + artifacts (EvalArtifacts): The artifacts captured during evaluation. + kind (str): The error kind to record on the report. + + Returns: + SweEvalReport: A report with ``error_kind`` set and no resolution. + """ + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=kind, + ) diff --git a/resources_servers/swe_bench/harnesses/swebench.py b/resources_servers/swe_bench/harnesses/swebench.py new file mode 100644 index 0000000000..563c6ae614 --- /dev/null +++ b/resources_servers/swe_bench/harnesses/swebench.py @@ -0,0 +1,274 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""swe-bench / swe-bench-multilingual harness — host-side (flat) grading. + +A single parametrized class serves both families. 
It runs the instance's official SWE-bench +eval script (``swebench.make_test_spec(...).eval_script``) inside the sandbox and grades the +produced log host-side with swebench's per-repo log parser, so it runs on any exec-capable +provider (docker / opensandbox). + +NOTE: the apptainer-only nested ``run_local_evaluation`` path was removed when PR #1694 took +ownership of the apptainer provider. The swe_env-specific nested-apptainer grading (mounts/.sif +wiring + run_local_evaluation) is tracked for a follow-up PR (see APPTAINER_PR3_TRACKER.md). +""" + +from __future__ import annotations + +import dataclasses +import os +import tempfile +from typing import TYPE_CHECKING + +from nemo_gym.sandbox import SandboxResources, SandboxSpec +from resources_servers.swe_bench.harness import ( + EvalArtifacts, + GraderDependencyError, + SweEvalReport, + SweTask, + SweTaskHarness, + _ensure_trailing_newline, + compute_resolved, +) +from resources_servers.swe_bench.harnesses import flat_eval + + +if TYPE_CHECKING: + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +# Per-test status tokens swebench's repo parsers emit that count as a pass. +_SWEBENCH_PASS_STATUSES = frozenset({"PASSED", "XFAIL"}) + +# swe-bench families this harness serves. +_VALID_NAMES = frozenset({"swe-bench", "swe-bench-multilingual"}) + + +class SweBenchHarness(SweTaskHarness): + """SWE-bench (and multilingual) harness, host-side (flat) graded. + + Runs the instance's official eval script in the sandbox and parses the log host-side with + swebench's per-repo parser. Construct one instance per family + (``SweBenchHarness("swe-bench")`` / ``SweBenchHarness("swe-bench-multilingual")``). + """ + + grade_strategy = "flat-host-grade" + + def __init__(self, name: str = "swe-bench") -> None: + """Initialize the harness for a given swe-bench family. + + Args: + name: The swe-bench family to serve (``"swe-bench"`` or ``"swe-bench-multilingual"``). 
+ + Raises: + ValueError: If ``name`` is not a known swe-bench family. + """ + if name not in _VALID_NAMES: + raise ValueError(f"Unknown swe-bench family: {name!r} (expected one of {sorted(_VALID_NAMES)})") + self.name = name + + # --- provisioning -------------------------------------------------------- + + def build_spec(self, task: SweTask) -> SandboxSpec: + """Build the sandbox spec for a task. + + Args: + task: The task to provision a sandbox for. + + Returns: + A ``SandboxSpec`` describing the image, workdir, environment, and any provider + options carried on the task. Flat grading runs the eval script directly in the + instance image, so no host harness/venv mounts are needed. + """ + return SandboxSpec( + image=task.image, + workdir=task.repo_workdir, + ttl_s=task.metadata.get("ttl_s", 1800), + ready_timeout_s=task.metadata.get("ready_timeout_s", 600), + env={"GIT_CONFIG_GLOBAL": "/dev/null", "GIT_PAGER": "cat"}, + metadata={ + "instance_id": task.instance_id[:63], + "benchmark": task.benchmark, + "harness": self.name, + }, + resources=SandboxResources.from_mapping(task.metadata.get("resources", {})), + provider_options=dict(task.metadata.get("provider_options", {})), + ) + + async def materialize(self, env: "AsyncSweEnvironment", task: SweTask) -> None: + """Write the bare ``/root/patch.diff`` the eval script applies. + + Args: + env: The environment used to write files into the sandbox. + task: The task whose model patch is staged for the eval script (newline-normalized + so the upstream ``git apply`` succeeds). + """ + if task.model_patch: + await env.write_text("/root/patch.diff", _ensure_trailing_newline(task.model_patch)) + + def _flat_eval_script(self, task: SweTask) -> str: + """Build the official SWE-bench eval script for host-side (flat) grading. + + Uses the ``swebench`` library's ``make_test_spec(...).eval_script`` (the per-repo recipe), + prefixed with a step that applies the model patch from ``/root/patch.diff``. 
Returns an + empty string if the instance dict is unavailable or the spec cannot be built, in which + case the flat grader masks the sample as an eval error rather than scoring 0. + + Args: + task: The task whose ``metadata['instance_dict']`` describes the SWE-bench instance. + + Returns: + The eval-script text, or ``""`` when it cannot be constructed. + """ + instance = task.metadata.get("instance_dict") + if not instance: + return "" + try: + from swebench.harness.test_spec.test_spec import make_test_spec + + spec = make_test_spec(instance, namespace="swebench") + except Exception: + return "" + # Mirror main's GIT_APPLY ladder (swebench/harness/run_evaluation.py GIT_APPLY_CMDS): + # try each apply command in order, breaking on the first rc==0, and never write + # conflict markers into the tree (no --3way). The trailing `echo` only fires when + # every command failed. + apply_model = ( + "cd /testbed && " + "(git apply --verbose /root/patch.diff || " + "git apply --verbose --reject /root/patch.diff || " + "patch --batch --fuzz=5 -p1 -i /root/patch.diff || " + "echo 'NEMO_GYM_PATCH_APPLY_FAILED')\n" + ) + return apply_model + spec.eval_script + + # --- server-private grading ---------------------------------------------- + + async def run_eval(self, env: "AsyncSweEnvironment", task: SweTask) -> EvalArtifacts: + """Run the instance's eval script in-sandbox and collect its log. + + Args: + env: The environment used to execute commands in the sandbox. + task: The task to evaluate. + + Returns: + An ``EvalArtifacts`` carrying the captured test output, return code, whether a patch + existed, and the flat-eval markers. + """ + if not task.metadata.get("eval_script"): + task = dataclasses.replace(task, metadata={**task.metadata, "eval_script": self._flat_eval_script(task)}) + return await flat_eval.flat_run_eval(env, task) + + def grade(self, task: SweTask, artifacts: EvalArtifacts) -> SweEvalReport: + """Grade a task from its evaluation artifacts (host-side, flat). 
+ + The SWE-bench family spans repos with different test runners (pytest, django's unittest + runner, etc.). The generic flat parser is pytest-only and silently scores non-pytest + repos (e.g. django) unresolved — even the gold patch. Grade with swebench's official + per-repo log parser; if ``swebench`` cannot be imported for a real SWE-bench instance + this raises ``GraderDependencyError`` (fail loud) rather than silently mis-scoring. The + generic parser is used only for the legitimate cases where there is no instance dict or + the eval spec cannot be built (matching main's behavior for unbuildable instances). + + Args: + task: The task being graded. + artifacts: The evaluation artifacts produced by ``run_eval``. + + Returns: + A ``SweEvalReport`` recording resolution, patch state, and any error kind. + + Raises: + GraderDependencyError: If ``swebench`` is unavailable for a real SWE-bench instance. + """ + report = self._swebench_flat_grade(task, artifacts) + return report if report is not None else flat_eval.flat_grade(task, artifacts) + + def _swebench_flat_grade(self, task: SweTask, artifacts: EvalArtifacts) -> "SweEvalReport | None": + """Grade a flat eval log with swebench's official per-repo log parser. + + The generic :func:`flat_eval.flat_grade` parser only recognises pytest-style + ``PASSED `` lines, so repos with other test runners (e.g. django's unittest + runner) parse as zero passing tests and grade unresolved — even for the gold patch. + This path uses ``swebench.harness.grading.get_logs_eval`` (the same per-repo parser the + nested harness uses), keeping docker flat grading faithful to the official result. + + Args: + task: The task being graded (supplies the instance dict + fail/pass test ids). + artifacts: The artifacts produced by :func:`flat_eval.flat_run_eval`. + + Returns: + A ``SweEvalReport`` with the official verdict, or ``None`` when there is no instance + dict or the eval spec cannot be built (caller falls back to the generic parser). 
+ + Raises: + GraderDependencyError: If ``swebench`` cannot be imported for a real SWE-bench + instance (fail loud rather than silently degrading to the generic parser). + """ + # Mirror flat_grade's infra masks so a genuine sandbox/timeout never scores 0. An + # unbuildable/empty eval spec is NOT masked here (it grades unmasked unresolved via + # the generic parser fallback below), matching main's behavior. + error_type = artifacts.raw.get("error_type") + if error_type in {"sandbox", "timeout"}: + return SweEvalReport( + instance_id=task.instance_id, + patch_exists=bool(task.model_patch), + patch_applied=artifacts.patch_applied, + error_kind=error_type, + ) + instance = task.metadata.get("instance_dict") + if not instance: + return None + try: + from swebench.harness.constants import FAIL_ONLY_REPOS + from swebench.harness.grading import get_logs_eval + from swebench.harness.test_spec.test_spec import make_test_spec + except Exception as exc: + # Fail loud instead of degrading to the generic pytest-only parser, which mis-scores + # non-pytest repos (e.g. django) as unresolved even for a correct patch. swebench is a + # pinned hard dependency (requirements.txt: swebench==4.1.0); a missing/broken install + # is a misconfiguration that must surface, not silently skew the SWE-bench resolve rate. + raise GraderDependencyError( + "swebench is required to grade SWE-bench instances faithfully (per-repo log " + "parsers) but could not be imported; install the pinned 'swebench==4.1.0'." 
+ ) from exc + log_fp = None + try: + spec = make_test_spec(instance, namespace="swebench") + with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as handle: + handle.write(artifacts.test_output or "") + log_fp = handle.name + status_map, markers_found = get_logs_eval(spec, log_fp) + except Exception: + return None + finally: + if log_fp is not None and os.path.exists(log_fp): + os.unlink(log_fp) + passed = [node for node, status in status_map.items() if status in _SWEBENCH_PASS_STATUSES] + # Select the eval type per-repo exactly as swebench.harness.grading.get_eval_report: + # FAIL_ONLY_REPOS (the JS multilingual repos) use the fail-only resolution rule. + eval_type = "fail_only" if spec.repo in FAIL_ONLY_REPOS else "pass_and_fail" + resolved = bool(markers_found) and compute_resolved( + fail_to_pass=task.fail_to_pass, + pass_to_pass=task.pass_to_pass, + passed=passed, + eval_type=eval_type, + status_map=status_map, + ) + return SweEvalReport( + instance_id=task.instance_id, + resolved=resolved, + patch_applied=bool(markers_found), + patch_exists=bool(task.model_patch), + tests_status={"passed": passed, "all": status_map}, + ) diff --git a/resources_servers/swe_bench/parsing/__init__.py b/resources_servers/swe_bench/parsing/__init__.py new file mode 100644 index 0000000000..a9de18198d --- /dev/null +++ b/resources_servers/swe_bench/parsing/__init__.py @@ -0,0 +1,52 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""SWE-Bench-Ext test-output parser. + +Provides the per-framework parsers, framework output config, and the +resolution helper used by SWE harnesses for host-side grading. This +``__init__`` re-exports the public symbols so callers can import them from a +single location, e.g.:: + + from resources_servers.swe_bench.parsing import ( + parse_and_check_tests, + get_framework_config, + get_test_command_with_output, + ) +""" + +from resources_servers.swe_bench.parsing.frameworks import ( + FRAMEWORK_CONFIGS, + get_framework_config, + get_test_command_with_output, +) +from resources_servers.swe_bench.parsing.parsing import ( + normalize_test_id, + parse_test_output, +) +from resources_servers.swe_bench.parsing.utils import parse_and_check_tests + + +__all__ = [ + # High-level grading entry point (F2P/P2P resolution). + "parse_and_check_tests", + # Framework output config + command augmentation. + "FRAMEWORK_CONFIGS", + "get_framework_config", + "get_test_command_with_output", + # Framework dispatcher + test-id normalization. + "parse_test_output", + "normalize_test_id", +] diff --git a/resources_servers/swe_bench/parsing/frameworks.py b/resources_servers/swe_bench/parsing/frameworks.py new file mode 100644 index 0000000000..7de570c491 --- /dev/null +++ b/resources_servers/swe_bench/parsing/frameworks.py @@ -0,0 +1,174 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Test framework output configuration mapping.""" + +from typing import Dict + + +FRAMEWORK_CONFIGS: Dict[str, Dict] = { + "pytest": { + "output_flag": "--junitxml=/workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + "unittest": { + "output_flag": "--junitxml=/workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + "go": { + "output_flag": "-json", + "result_file": None, + }, + "jest": { + "output_flag": "--json --outputFile=/workspace/test-results/output.json", + "result_file": "/workspace/test-results/output.json", + }, + "vitest": { + "output_flag": "--reporter=json --outputFile=/workspace/test-results/output.json", + "result_file": "/workspace/test-results/output.json", + }, + "mocha": { + "output_flag": "--reporter json --reporter-options output=/workspace/test-results/output.json", + "result_file": "/workspace/test-results/output.json", + }, + "bun": { + "output_flag": None, # Bun doesn't have structured JSON output flag by default + "result_file": None, # Parse from stdout + }, + "junit": { + "output_flag": None, + "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml", + }, + "maven": { + "output_flag": None, + "result_file": "find:/workspace/repo:*/target/surefire-reports:TEST-*.xml", + }, + "gtest": { + "output_flag": "--gtest_output=json:/workspace/test-results/output.json", + "result_file": "/workspace/test-results/output.json", + }, + "cargo-nextest": { + "output_flag": None, # Profile is already in test_command + "result_file": None, 
# JUnit XML is output to repo/junit.xml by profile config + }, + "ctest": { + "output_flag": "--output-on-failure --output-junit /workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + "xctest": { + # For SwiftPM with XCTest framework + "output_flag": "--parallel --num-workers=1 --xunit-output /workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + "testing": { + # For SwiftPM with new Swift Testing framework (Swift 6+) + "output_flag": "--disable-xctest --parallel --xunit-output /workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + "cppunit": { + "output_flag": None, + "result_file": None, + }, + # Lua test frameworks - Tier 1 (Standard XML output) + "busted": { + "output_flag": "--output=junit", + "result_file": "/workspace/test-results/output.xml", + }, + "luaunit": { + "output_flag": "-o junit -n /workspace/test-results/output.xml", + "result_file": "/workspace/test-results/output.xml", + }, + # Lua test frameworks - Tier 2 (Custom parsers) + "telescope": { + "output_flag": None, + "result_file": None, + }, + "lust": { + "output_flag": None, + "result_file": None, + }, + "minitest": { + "output_flag": None, + "result_file": None, + }, + "bespoke_libgeos": { + "output_flag": None, + "result_file": None, + }, + # TAP (Test Anything Protocol) - used by tape, node-tap + "tap": { + "output_flag": None, # TAP outputs to stdout + "result_file": None, # Parse from stdout + }, + "tape": { + "output_flag": None, # tape outputs TAP to stdout + "result_file": None, # Parse from stdout + }, + # Hardhat (Solidity) - uses Mocha under the hood + "hardhat": { + "output_flag": None, # Uses Mocha console reporter by default + "result_file": None, # Parse from stdout + }, +} + + +def get_framework_config(framework: str, test_command: str = "") -> Dict: + """Get configuration for a test framework. 
+ + Args: + framework: Test framework name + test_command: The test command (optional, used to detect Gradle vs Maven) + """ + config = FRAMEWORK_CONFIGS.get( + framework, + { + "output_flag": None, + "result_file": None, + }, + ) + + # Special handling for JUnit: detect Gradle vs Maven from command + if framework == "junit" and test_command: + if "gradlew" in test_command or "gradle " in test_command: + # Gradle uses different output location than Maven + # Use */TEST-*.xml to match both standard Gradle (test/) and Android (testDebugUnitTest/) + config = { + "output_flag": None, + "result_file": "find:/workspace/repo:*/build/test-results*:TEST-*.xml", + } + + # Special handling for xctest: detect Swift Testing vs XCTest from command + # When --disable-xctest is used, the task is using Swift Testing, not XCTest + # Use the 'testing' framework config to avoid adding XCTest-only flags like --num-workers + if framework == "xctest" and test_command: + if "--disable-xctest" in test_command: + config = FRAMEWORK_CONFIGS.get("testing", config) + + return config + + +def get_test_command_with_output(base_command: str, framework: str) -> str: + """ + Add structured output flags to test command. + + Returns: command_with_output_flags + """ + config = get_framework_config(framework, base_command) + output_flag = config.get("output_flag") + + enhanced = f"{base_command} {output_flag}" if output_flag else base_command + + return enhanced diff --git a/resources_servers/swe_bench/parsing/parsing.py b/resources_servers/swe_bench/parsing/parsing.py new file mode 100644 index 0000000000..800586adad --- /dev/null +++ b/resources_servers/swe_bench/parsing/parsing.py @@ -0,0 +1,1606 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" +Test parsing utilities for build.py. + +Helper functions for: +- Separating test and gold patches +- Parsing JUnit XML and JSON test outputs +""" + +import json +import re +import xml.etree.ElementTree as ET +from pathlib import Path +from typing import Dict, Optional, Tuple + + +def read_patch(path: Path, skip_binary: bool = False) -> str: + """ + Read the text content of a patch file optionally skipping binary files + """ + parts = split_patch(path, skip_binary=skip_binary) + return "".join([diff for _, diff in parts]) + + +def split_patch(patch_path: Path, skip_binary: bool = False) -> list[Tuple[str, str]]: + """ + Read a patch and partition by file. + + Args: + patch_path (Path) - The patch file to split + skip_binary (bool) - Whether to exclude binary files + + Returns: List of (filename, patch content) tuples + """ + content = patch_path.read_text() + parts = [] + + # Split by file changes (each starts with "diff --git") + file_diffs = re.split(r"(diff --git.*?)(?=diff --git|\Z)", content, flags=re.DOTALL) + + for i in range(0, len(file_diffs), 2): + if i + 1 >= len(file_diffs): + continue + + header = file_diffs[i] + content = file_diffs[i + 1] + full_diff = header + content + + # Extract filename from diff header + file_match = re.search(r"diff --git a/(.*?) 
b/", full_diff) + if not file_match: + continue + + filepath = file_match.group(1) + + if skip_binary: + binary_match = re.search(r"^GIT binary patch$", full_diff, flags=re.MULTILINE) + if binary_match: + continue + + parts.append((filepath, full_diff)) + + return parts + + +def _parse_embedded_test_results(text_output: str, test_prefix: str = "") -> Dict[str, str]: + """Parse embedded test results from system-out text. + + This handles cases like wolfssl where a single ctest testcase runs many individual tests + and outputs them in a specific format within <system-out>. + + Expected formats: + - " 1: test_name : passed ( 0.00016)" + - " 2: test_name : failed ( 0.00016)" + - " 3: test_name : skipped" + - "HMAC-MD5 test passed!" + - "RSA test failed!" + + Args: + text_output: The text content from <system-out> + test_prefix: Prefix to add to test names (usually the testcase name) + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + """ + results = {} + + # Pattern 1: Numbered test format (wolfssl API tests) + # Format: " 1: test_name : passed ( 0.00016)" + numbered_pattern = re.compile( + r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*?)\s*:\s*(passed|failed|skipped)", re.MULTILINE | re.IGNORECASE + ) + + for match in numbered_pattern.finditer(text_output): + test_name = match.group(1).strip() + status = match.group(2).lower() + + # Build test ID with prefix + if test_prefix: + test_id = f"{test_prefix}::{test_name}" + else: + test_id = test_name + + if status == "passed": + results[test_id] = "PASSED" + elif status == "failed": + results[test_id] = "FAILED" + elif status == "skipped": + results[test_id] = "SKIPPED" + + # Pattern 2: Unit test format (wolfssl unit tests) + # Format: "HMAC-MD5 test passed!" 
+ # Only match lines that don't contain '---' (separator lines) + # Use [ \t] instead of \s to avoid matching newlines + unit_pattern = re.compile( + r"^([A-Za-z0-9_\-/]+(?:[ \t]+[A-Za-z0-9_\-/]+){0,5}?)[ \t]+test[ \t]+(passed|failed)!", + re.MULTILINE | re.IGNORECASE, + ) + + for match in unit_pattern.finditer(text_output): + test_name = match.group(1).strip() + status = match.group(2).lower() + + # Skip if the test name contains special characters indicating it's not a real test + if "---" in test_name or len(test_name) > 50: + continue + + # Build test ID with prefix + if test_prefix: + test_id = f"{test_prefix}::{test_name}" + else: + test_id = test_name + + if status == "passed": + results[test_id] = "PASSED" + elif status == "failed": + results[test_id] = "FAILED" + + # Pattern 3: FAILURES section (wolfssl API tests) + # Format: "FAILURES:\n 892: test_wolfSSL_CTX_load_verify_locations" + failures_section = re.search(r"FAILURES:\s*\n(.*?)(?:\n\s*End|$)", text_output, re.DOTALL) + if failures_section: + failure_pattern = re.compile(r"^\s*\d+:\s+([^\s:]+(?:\s+[^\s:]+)*)", re.MULTILINE) + for match in failure_pattern.finditer(failures_section.group(1)): + test_name = match.group(1).strip() + if test_prefix: + test_id = f"{test_prefix}::{test_name}" + else: + test_id = test_name + # Mark as failed (this overrides any previous 'passed' if it exists) + results[test_id] = "FAILED" + + return results + + +def parse_junit_xml(xml_content: str) -> Optional[Dict[str, str]]: + """Parse JUnit XML to extract test results. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (import errors, syntax errors, etc.), if we find valid + XML test results, we parse and return them. We only return None if we're certain + the framework didn't run (no test results + error indicators). 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If test framework failed to run (not the same as tests failing) + """ + results = {} + found_any_xml = False + + # PRIORITY 1 & 2: Try to parse XML documents (pure or mixed with other output) + # Handle multiple concatenated XML documents (from multiple test result files) + # Split by 0: + doc = "", xml_start) + if xml_end > xml_start: + xml_extracted = doc[xml_start : xml_end + len("")] + else: + xml_end = doc.find("", xml_start) + if xml_end > xml_start: + xml_extracted = doc[xml_start : xml_end + len("")] + else: + continue + + try: + tree = ET.fromstring(xml_extracted) + found_any_xml = True + except ET.ParseError: + continue + + # Parse all testcases from this document + for testcase in tree.iter("testcase"): + classname = testcase.get("classname", "") + name = testcase.get("name", "") + test_id = f"{classname}::{name}" if classname else name + + # Check if this testcase has system-out with embedded test results + # This handles cases like wolfssl where a single ctest executable runs many tests + system_out = testcase.find("system-out") + embedded_results = {} + if system_out is not None and system_out.text: + embedded_results = _parse_embedded_test_results(system_out.text, classname or name) + + if embedded_results: + # If we found embedded test results, use those instead of the testcase status + results.update(embedded_results) + elif testcase.find("failure") is not None or testcase.find("error") is not None: + results[test_id] = "FAILED" + elif testcase.find("skipped") is not None: + results[test_id] = "SKIPPED" + else: + results[test_id] = "PASSED" + + # PRIORITY 3: If we found NO valid XML and NO results, check for error indicators + # Only return None if we're certain the framework failed to run + if not found_any_xml and not results: + error_indicators = [ + "ERROR: ", # Generic error marker + "ImportError:", # Python import errors + 
"ModuleNotFoundError:", # Python module errors + "SyntaxError:", # Python syntax errors + "FAILED ", # Framework failure markers + "INTERNALERROR", # pytest internal errors + "collection errors", # pytest collection errors + "error: ", # Generic error (C++, Swift, etc.) + "fatal error:", # Fatal compilation errors + "cannot find symbol", # Java compilation errors + "error: build had", # Swift build errors (xctest) + "error: terminated", # Swift process crashes (xctest) + ] + has_errors = any(indicator in xml_content for indicator in error_indicators) + # Return None ONLY if: no XML found AND errors present + # Return empty dict if: no XML found AND no errors (rare but valid) + return None if has_errors else results + + return results + + +def parse_go_json(json_output: str) -> Dict[str, str]: + """Parse Go test -json output (newline-delimited JSON). + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (module errors, build errors, etc.), if we find valid + test results JSON, we parse and return it. We only return None if we're certain + the tests didn't run (no test results + error indicators). 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If Go tests failed to run (not the same as tests failing) + """ + results = {} + has_valid_json = False + + # PRIORITY 1: Try to parse newline-delimited JSON (valid test output) + for line in json_output.strip().split("\n"): + if not line.strip(): + continue + try: + event = json.loads(line) + has_valid_json = True # Found at least one valid JSON line + action = event.get("Action") + + # Handle test-level events + if "Test" in event and action in ["pass", "fail", "skip"]: + test_name = event.get("Test", "") + if test_name: + package = event.get("Package", "") + test_id = f"{package}::{test_name}" if package else test_name + + if action == "pass": + results[test_id] = "PASSED" + elif action == "fail": + results[test_id] = "FAILED" + elif action == "skip": + results[test_id] = "SKIPPED" + + # Handle package-level failures (no Test field) + elif "Package" in event and "Test" not in event and action == "fail": + package = event.get("Package", "") + test_id = f"{package}::package" + results[test_id] = "FAILED" + + except json.JSONDecodeError: + # PRIORITY 2: Handle plaintext build failures (legitimate failures) + # When tests can't compile/build, Go outputs plaintext "FAIL package [build failed]" + # This is a legitimate test failure, not a parsing error + build_fail_match = re.match(r"^FAIL\s+(\S+)\s+\[build failed\]", line) + if build_fail_match: + package_name = build_fail_match.group(1) + results[package_name] = "FAILED" + has_valid_json = True # Count build failures as valid results + + # PRIORITY 3: If we found NO valid JSON and NO build failures, check for error indicators + if not has_valid_json and not results: + error_indicators = [ + "go: cannot find main module", # Module not found + "can't load package", # Package loading errors + "pattern matches no packages", # No matching packages + "build constraints exclude all Go files", # Build constraints error + ] 
+ has_errors = any(indicator in json_output for indicator in error_indicators) + # Return None ONLY if: no JSON found AND errors present + # Return empty dict if: no JSON found AND no errors (rare but valid) + return None if has_errors else results + + return results + + +def parse_jest_vitest_json(json_output: str) -> Optional[Dict[str, str]]: + """Parse Jest/Vitest JSON output. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (TypeScript, npm, etc.), if we find valid + test results JSON, we parse and return it. We only return None if we're + certain the framework didn't run (no test results + error indicators). + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If Jest itself failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Try to parse as pure JSON (test results take precedence) + try: + data = json.loads(json_output.strip()) + # If we got JSON, check if it has test results (even if errors exist elsewhere in output) + except json.JSONDecodeError: + # PRIORITY 2: Search for JSON markers in mixed output + # Even with errors in output, tests might have run and produced JSON + json_start = json_output.find('{"numFailed') # Jest format + if json_start == -1: + json_start = json_output.find('{"numTotalTest') # Vitest format + if json_start == -1: + json_start = json_output.find('{"test') # Alternative format + if json_start == -1: + # PRIORITY 3: No JSON found - NOW check if there are error indicators + # Only return None if we're sure tests didn't run (no results + errors present) + # NOTE: error_indicators are a LAST RESORT - we prefer finding test results + error_indicators = [ + "error TS", # TypeScript compilation errors (e.g., error TS2307:) + "ELIFECYCLE", # npm script failures + "npm ERR!", # npm errors + "Error: Cannot find module", # Module loading errors (like Mocha) + "SyntaxError:", # 
JavaScript/TypeScript syntax 
errors + "Test suite failed to run", # Jest-specific: tests couldn't be loaded + "FAIL ", # Jest failure marker without JSON + ] + has_errors = any(indicator in json_output for indicator in error_indicators) + # Return None ONLY if: no JSON found AND errors present + # Return empty dict if: no JSON found AND no errors (rare but valid) + return None if has_errors else results + + # Try to extract JSON from mixed output + decoder = json.JSONDecoder() + try: + data, _ = decoder.raw_decode(json_output[json_start:]) + except json.JSONDecodeError: + # Could not parse JSON even after finding marker + return None + + # At this point, we have successfully parsed JSON + # Check if this is Jest's error response format (Jest itself failed, not the tests) + # Format: {"error": {"code": 2, "summary": "", "detail": ""}} + # This is a structured error response, NOT test results + if "error" in data and "code" in data.get("error", {}): + # This is an error response from Jest itself, not test results + return None + + # Check if we have the expected test results structure + # If we have testResults, parse it even if tests failed - those are legitimate test results + # Parse test results + if "testResults" in data: + for test_result in data.get("testResults", []): + file_path = test_result.get("name", "") + suite_status = test_result.get("status", "") + assertions = test_result.get("assertionResults", []) + + # Handle suite-level failures (no assertions ran) + if suite_status == "failed" and len(assertions) == 0: + test_id = f"{file_path}::suite" + results[test_id] = "FAILED" + continue + + # Handle individual test assertions + for assertion in assertions: + full_name = assertion.get("fullName", "") + title = assertion.get("title", "") + status = assertion.get("status", "") + test_id = f"{file_path}::{full_name}" if full_name else f"{file_path}::{title}" + + if status == "passed": + results[test_id] = "PASSED" + elif status == "failed": + results[test_id] = "FAILED" + elif status in 
["pending", "skipped"]: + results[test_id] = "SKIPPED" + + # If we successfully parsed JSON but found no testResults, that's unexpected + # Return None to indicate this isn't valid test output + # (Valid Jest output should have testResults array, even if empty) + if "testResults" not in data: + return None + + return results + + +def parse_mocha_json(json_output: str) -> Optional[Dict[str, str]]: + """Parse Mocha JSON output. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (module errors, syntax errors, etc.), if we find valid + test results JSON, we parse and return it. We only return None if we're certain + the framework didn't run (no test results + error indicators). + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If Mocha itself failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Try to parse as pure JSON (test results take precedence) + try: + data = json.loads(json_output.strip()) + # Validate this is Mocha JSON by checking for 'stats' key + if "stats" not in data: + data = None + except json.JSONDecodeError: + data = None + + # PRIORITY 2: If direct parse failed, search for JSON in mixed output + if data is None: + # Look for stats key in JSON + stats_pos = json_output.find('"stats"') + if stats_pos == -1: + # PRIORITY 3: No JSON found - NOW check if there are error indicators + error_indicators = [ + "Error: Cannot find module", # Module loading errors + "SyntaxError:", # JavaScript syntax errors + "TypeError:", # Type errors + "ReferenceError:", # Reference errors + "No test files found", # Mocha-specific: no tests found + ] + has_errors = any(indicator in json_output for indicator in error_indicators) + # Return None ONLY if: no JSON found AND errors present + # Return empty dict if: no JSON found AND no errors (rare but valid) + return None if has_errors else results + + # Find the opening brace before 
"stats" + json_start = json_output.rfind("{", 0, stats_pos) + if json_start == -1: + return None + + # Try parsing from this position + json_portion = json_output[json_start:] + + # Use json.JSONDecoder to find where the object ends + decoder = json.JSONDecoder() + try: + data, _ = decoder.raw_decode(json_portion) + except json.JSONDecodeError: + return None + + # Validate extracted JSON has 'stats' + if "stats" not in data: + return None + + # At this point, we have valid Mocha JSON with 'stats' + # Parse test results even if some tests failed - those are legitimate results + + # Process passed tests + for test in data.get("passes", []): + file_path = test.get("file", "") + full_title = test.get("fullTitle", "") + test_id = f"{file_path}::{full_title}" if full_title else file_path + results[test_id] = "PASSED" + + # Process failed tests + for test in data.get("failures", []): + file_path = test.get("file", "") + full_title = test.get("fullTitle", "") + test_id = f"{file_path}::{full_title}" if full_title else file_path + results[test_id] = "FAILED" + + # Process pending/skipped tests + for test in data.get("pending", []): + file_path = test.get("file", "") + full_title = test.get("fullTitle", "") + test_id = f"{file_path}::{full_title}" if full_title else file_path + results[test_id] = "SKIPPED" + + return results + + +def parse_gtest_json(json_output: str) -> Dict[str, str]: + """Parse Google Test JSON output. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (compilation errors, linking errors, etc.), if we find valid + test results JSON, we parse and return it. We only return None if we're certain + the tests didn't run (no test results + error indicators). 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If GTest itself failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Try to parse as pure JSON (test results take precedence) + try: + data = json.loads(json_output.strip()) + # Validate this is GTest JSON by checking for 'testsuites' key + if "testsuites" not in data: + data = None + except json.JSONDecodeError: + data = None + + # PRIORITY 2: If direct parse failed, search for JSON in mixed output + if data is None: + # Try to find JSON in mixed output + json_start = json_output.find('{"testsuites"') + if json_start == -1: + json_start = json_output.find('{\n "testsuites"') + if json_start == -1: + # PRIORITY 3: No JSON found - NOW check if there are error indicators + error_indicators = [ + "error:", # C++ compilation errors + "undefined reference to", # Linking errors + "fatal error:", # Fatal compilation errors + "cannot find -l", # Linking library errors (e.g., "cannot find -lgtest") + ": No such file or directory", # File not found errors + ] + has_errors = any(indicator in json_output for indicator in error_indicators) + # Return None ONLY if: no JSON found AND errors present + # Return empty dict if: no JSON found AND no errors (rare but valid) + return None if has_errors else results + + # Extract JSON object + json_portion = json_output[json_start:] + decoder = json.JSONDecoder() + try: + data, _ = decoder.raw_decode(json_portion) + except json.JSONDecodeError: + return None + + # Validate extracted JSON has 'testsuites' + if "testsuites" not in data: + return None + + # At this point, we have valid GTest JSON with 'testsuites' + # Parse test results even if some tests failed - those are legitimate results + + # Parse test results from testsuites + testsuites = data.get("testsuites", []) + if not isinstance(testsuites, list): + testsuites = [testsuites] if isinstance(testsuites, dict) else [] + + for testsuite in 
testsuites: + suite_name = testsuite.get("name", "") + + # Handle both 'testsuite' (array) and direct test cases + test_cases = testsuite.get("testsuite", []) + if not test_cases: + test_cases = testsuite.get("tests", []) + + for test_case in test_cases: + test_name = test_case.get("name", "") + classname = test_case.get("classname", suite_name) + + # Build test ID in format: SuiteName::TestName + test_id = f"{classname}::{test_name}" if classname else test_name + + # Determine test status + status = test_case.get("status", "RUN") + result = test_case.get("result", "COMPLETED") + + # Check for failures + failures = test_case.get("failures", []) + if failures and len(failures) > 0: + results[test_id] = "FAILED" + elif status == "NOTRUN" or result == "SKIPPED": + results[test_id] = "SKIPPED" + elif result == "COMPLETED" or status == "RUN": + results[test_id] = "PASSED" + else: + results[test_id] = "FAILED" + + return results + + +def parse_maven_text_output(text_output: str) -> Dict[str, str]: + """Parse Maven text output for test results.""" + results = {} + + # Look for test summary lines like: + # Tests run: 5, Failures: 1, Errors: 0, Skipped: 0 + summary_pattern = r"Tests run: (\d+),\s*Failures: (\d+),\s*Errors: (\d+),\s*Skipped: (\d+)" + + # Check for compilation errors - if tests can't compile, mark them as failed + compilation_error_pattern = r"\[ERROR\].*?testCompile.*?Compilation failure" + if re.search(compilation_error_pattern, text_output, re.DOTALL | re.IGNORECASE): + # Find test files mentioned in compilation errors + test_file_pattern = r"/workspace/repo/[^/]+/src/test/java/([\w/]+)\.java" + for match in re.finditer(test_file_pattern, text_output): + test_class = match.group(1).replace("/", ".") + # Mark as failed due to compilation + results[f"{test_class}::compile"] = "FAILED" + # If we found compilation errors, return early + if results: + return results + + # Check for BUILD FAILURE + if "BUILD FAILURE" in text_output: + # If build failed and we 
haven't found specific test failures, mark as generic failure + if not results: + results["maven::build"] = "FAILED" + return results + + # Parse test run summaries per module + lines = text_output.split("\n") + current_module = None + + for line in lines: + # Track which module we're in + if "Building" in line and "[" in line and "]" in line: + # Extract module name from lines like "[INFO] Building Docs Web 1.12-SNAPSHOT [4/4]" + parts = line.split("Building") + if len(parts) > 1: + module_parts = parts[1].strip().split() + if len(module_parts) > 0: + current_module = module_parts[0] + + # Look for test summary + summary_match = re.search(summary_pattern, line) + if summary_match: + total = int(summary_match.group(1)) + failures = int(summary_match.group(2)) + errors = int(summary_match.group(3)) + skipped = int(summary_match.group(4)) + + if total > 0: + # We have test counts but might not have individual test names + # Generate generic test IDs based on the current module + module_name = current_module or "unknown" + passed = total - failures - errors - skipped + + for j in range(passed): + results[f"{module_name}::test_{j + 1}"] = "PASSED" + for j in range(failures + errors): + results[f"{module_name}::test_failed_{j + 1}"] = "FAILED" + for j in range(skipped): + results[f"{module_name}::test_skipped_{j + 1}"] = "SKIPPED" + return results + + +def parse_cargo_nextest(output: str) -> Optional[Dict[str, str]]: + """Parse cargo-nextest text output. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (warnings, etc.), if we find valid test results, + we parse and return them. We only return None if we're certain the tests didn't + run (no test results + error indicators). 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If cargo-nextest failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Parse individual test result lines + # Format: PASS [ 1.588s] rusty::tests integration::linking::test_name + # FAIL [ 5.845s] rusty codegen::tests::parameters_tests::test_name + test_line_pattern = re.compile(r"^\s*(PASS|FAIL|SIGKILL|SKIP)\s+\[.*?\]\s+(.+)$", re.MULTILINE) + + for match in test_line_pattern.finditer(output): + status = match.group(1) + test_name = match.group(2).strip() + + if status == "PASS": + results[test_name] = "PASSED" + elif status in ("FAIL", "SIGKILL"): + results[test_name] = "FAILED" + elif status == "SKIP": + results[test_name] = "SKIPPED" + + # PRIORITY 2: If we found NO test results, check for error indicators + # Only return None if we're certain tests didn't run (compilation/linking errors) + if not results: + error_indicators = [ + "error[E", # Rust compiler errors (e.g., error[E0425]) + "error: could not compile", # Cargo compilation errors + "error: linking with", # Linking errors + "error: aborting due to", # Compilation aborted + ] + has_errors = any(indicator in output for indicator in error_indicators) + # Return None ONLY if: no results found AND errors present + # Return empty dict if: no results found AND no errors (rare but valid - no tests in project) + return None if has_errors else results + + return results + + +def parse_bun_text(text_output: str) -> Optional[Dict[str, str]]: + """ + Parse Bun test framework output. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (TypeScript, compilation, etc.), if we find valid + test results, we parse and return them. We only return None if we're + certain Bun didn't run (no test results + error indicators). 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If Bun itself failed to run (not the same as tests failing) + """ + results = {} + current_file = None + current_describe = None + + # PRIORITY 1: Try to parse test results (✓ and ✗ symbols) + for line in text_output.split("\n"): + # Track current file (lines ending with .ts: or .js:) + if re.match(r"^[^\s].*\.(ts|js|tsx|jsx):?\s*$", line.strip()): + current_file = line.strip().rstrip(":") + current_describe = None + continue + + # Track describe blocks (indented text followed by colon, but not test results) + describe_match = re.match(r"^\s+([^✓✗\n]+):\s*$", line) + if describe_match: + current_describe = describe_match.group(1).strip() + continue + + # Remove ANSI color codes + clean_line = re.sub(r"\x1b\[[0-9;]*m", "", line) + + # Match passed tests: ✓ test_name [time] + pass_match = re.match(r"^\s*✓\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line) + if pass_match: + test_name = pass_match.group(1).strip() + # Build test ID with file, describe block, and test name + test_id = test_name + if current_file: + test_id = f"{current_file}::{test_name}" + if current_describe: + test_id = f"{current_file}::{current_describe} > {test_name}" + results[test_id] = "PASSED" + continue + + # Match failed tests: ✗ test_name [time] + fail_match = re.match(r"^\s*✗\s+(.+?)(?:\s+\[[\d.]+m?s\])?\s*$", clean_line) + if fail_match: + test_name = fail_match.group(1).strip() + # Build test ID with file, describe block, and test name + test_id = test_name + if current_file: + test_id = f"{current_file}::{test_name}" + if current_describe: + test_id = f"{current_file}::{current_describe} > {test_name}" + results[test_id] = "FAILED" + continue + + # Alternative format: FAIL filepath > describe > test_name + alt_fail_match = re.match(r"^\s*FAIL\s+(.+?)\s+>\s+(.+?)\s*$", clean_line) + if alt_fail_match: + file_path = alt_fail_match.group(1).strip() + test_path = 
alt_fail_match.group(2).strip() + test_id = f"{file_path}::{test_path}" + results[test_id] = "FAILED" + continue + + # PRIORITY 2: If no individual test results found, try parsing summary + if not results: + # Look for summary like "5 pass, 2 fail" or "X passing (Yms)" + summary_match = re.search(r"(\d+)\s+pass(?:ing|ed)?.*?(\d+)\s+fail(?:ing|ed)?", text_output.lower()) + if summary_match: + passed = int(summary_match.group(1)) + failed = int(summary_match.group(2)) + + # Generate generic test IDs + for i in range(passed): + results[f"test_{i + 1}"] = "PASSED" + for i in range(failed): + results[f"test_failed_{i + 1}"] = "FAILED" + + # PRIORITY 3: No test results found - NOW check if there are error indicators + # Only return None if we're sure tests didn't run (no results + errors present) + # NOTE: error_indicators are a LAST RESORT - we prefer finding test results + if not results: + error_indicators = [ + "error TS", # TypeScript compilation errors (e.g., error TS2307:) + "Error: Cannot find module", # Module loading errors + "SyntaxError:", # JavaScript/TypeScript syntax errors + "error: ", # Generic Bun errors (lowercase 'error:') + "Error:", # Generic errors + "ModuleNotFoundError", # Module not found + "bun: command not found", # Bun not installed + "panicked at", # Bun runtime panics + "Segmentation fault", # Critical runtime errors + ] + has_errors = any(indicator in text_output for indicator in error_indicators) + # Return None ONLY if: no test results found AND errors present + # Return empty dict if: no test results found AND no errors (rare but valid) + return None if has_errors else results + + return results + + +def parse_cppunit_text(text_output: str) -> Optional[Dict[str, str]]: + """Parse CppUnit text output for test results. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + Even if output contains errors (warnings, etc.), if we find valid test results, + we parse and return them. 
We only return None if we're certain the tests didn't + run (no test results + error indicators). + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If CppUnit failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Parse individual test result lines + # Format: TestClassName::testMethodName : OK + # TestClassName::testMethodName : FAIL + test_line_pattern = re.compile( + r"^([A-Za-z_][A-Za-z0-9_]*::[A-Za-z_][A-Za-z0-9_]*)\s*:\s*(OK|FAIL|ERROR)$", re.MULTILINE + ) + + for match in test_line_pattern.finditer(text_output): + test_name = match.group(1).strip() + status = match.group(2).strip() + + if status == "OK": + results[test_name] = "PASSED" + elif status in ["FAIL", "ERROR"]: + results[test_name] = "FAILED" + + # PRIORITY 2: If we found NO test results, check for error indicators + # Only return None if we're certain tests didn't run (compilation/linking errors) + if not results: + error_indicators = [ + "error:", # C++ compilation errors + "undefined reference to", # Linking errors + "fatal error:", # Fatal compilation errors + "ld returned", # Linker errors + "cannot find -l", # Library linking errors + ] + has_errors = any(indicator in text_output for indicator in error_indicators) + # Return None ONLY if: no results found AND errors present + # Return empty dict if: no results found AND no errors (rare but valid - no tests) + return None if has_errors else results + + return results + + +def parse_minitest_text(text_output: str, test_metadata_path: str = None) -> Dict[str, str]: + """ + Parse mini.nvim (MiniTest) test framework output. + + MiniTest is used by Neovim plugins for testing. 
+ Example output: + Total number of cases: 5 + tests/test_treesitter.lua: ooooo + + Fails (0) and Notes (0) + + Or with failures: + FAIL in tests/test_treesitter.lua | wrap_cursor | normal: error message + FAIL in tests/test_treesitter.lua | enumerate: error message + + Fails (2) and Notes (0) + + IMPORTANT: MiniTest only outputs individual test names when they FAIL. + When all tests pass, only summary is shown - no individual test names. + + Solution: When all tests pass, read test_metadata.json to get expected test names + and return them as PASSED. This ensures real test names are used consistently. + """ + results = {} + + # Parse individual test results from FAIL/NOTE lines + # Format: FAIL in file.lua | group | test_name: error message + # Use [^|:]+ to stop at pipe OR colon (prevents capturing error message) + fail_pattern = re.compile( + r"^(?:\x1b\[\d+(?:;\d+)?m)?FAIL(?:\x1b\[0m)?\s+in\s+([^|]+)\s*\|\s*([^|:]+)(?:\s*\|\s*([^:]+))?:", re.MULTILINE + ) + + for match in fail_pattern.finditer(text_output): + file_path = match.group(1).strip() + group = match.group(2).strip() + test_name = match.group(3).strip() if match.group(3) else "" + + # Create test ID: file | group | test_name or file | group + if test_name: + test_id = f"{file_path} | {group} | {test_name}" + else: + test_id = f"{file_path} | {group}" + + results[test_id] = "FAILED" + + return results + + +def parse_telescope_text(text_output: str) -> Optional[Dict[str, str]]: + """ + Parse telescope test framework output. + + Telescope outputs lines like: + ✓ test_name + ✗ test_name + - test_name (skipped) + + Also handles PlenaryBusted output for Neovim plugins. + + IMPORTANT: We prioritize finding valid test results over detecting errors. + We only return None if we're certain the framework didn't run. 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If telescope failed to run (not the same as tests failing) + """ + results = {} + + for line in text_output.split("\n"): + line = line.strip() + + # Match passed tests: ✓ test_name or "Success: test_name" + if "✓" in line: + test_name = line.split("✓", 1)[1].strip() + if test_name: # Avoid empty test names + results[test_name] = "PASSED" + elif line.lower().startswith("success:"): + test_name = line.split(":", 1)[1].strip() + if test_name: + results[test_name] = "PASSED" + + # Match failed tests: ✗ test_name or "Failed: test_name" + elif "✗" in line: + test_name = line.split("✗", 1)[1].strip() + if test_name: + results[test_name] = "FAILED" + elif line.lower().startswith("failed:"): + test_name = line.split(":", 1)[1].strip() + if test_name: + results[test_name] = "FAILED" + + # Match skipped tests: - test_name or "Skipped: test_name" + elif line.startswith("- ") and "skip" in line.lower(): + test_name = line[2:].strip() + # Remove "(skipped)" suffix if present + test_name = re.sub(r"\s*\(skipped\)\s*$", "", test_name, flags=re.IGNORECASE) + if test_name: + results[test_name] = "SKIPPED" + elif line.lower().startswith("skipped:"): + test_name = line.split(":", 1)[1].strip() + if test_name: + results[test_name] = "SKIPPED" + + # If no results found, try parsing summary line + if not results: + # Look for summary like "5 passed, 2 failed, 1 skipped" + summary_pattern = r"(\d+)\s+passed.*?(\d+)\s+failed" + match = re.search(summary_pattern, text_output.lower()) + if match: + passed = int(match.group(1)) + failed = int(match.group(2)) + + # Generate generic test IDs + for i in range(passed): + results[f"test_{i + 1}"] = "PASSED" + for i in range(failed): + results[f"test_failed_{i + 1}"] = "FAILED" + + # PRIORITY 2: If we found NO test results, check for error indicators + # Only return None if we're certain tests didn't run (Lua/Neovim errors) + if not results: + 
error_indicators = [ + "Error:", # Generic Lua errors + "error loading module", # Lua module loading errors + "attempt to call", # Lua runtime errors + "bad argument", # Lua runtime errors + "stack traceback:", # Lua errors with traceback + ] + has_errors = any(indicator in text_output for indicator in error_indicators) + # Return None ONLY if: no results found AND errors present + # Return empty dict if: no results found AND no errors (rare but valid - no tests) + return None if has_errors else results + + return results + + +def parse_lust_text(text_output: str) -> Dict[str, str]: + """ + Parse lust test framework output. + + Lust outputs test results with dots (.) for pass, F for fail. + Example output: + ..F. + 4 tests, 1 failure + test/my_test.lua:15: Expected true but got false + + We parse individual test results when available, or fall back to summary. + """ + results = {} + + # Try to parse individual test results from verbose output + # Pattern: " test_name ... ok" or " test_name ... 
FAILED" + test_pattern = re.compile(r"^\s*(.+?)\s+\.\.\.\s+(ok|FAILED|ERROR)", re.MULTILINE) + matches = test_pattern.findall(text_output) + + if matches: + # Found individual test results + for test_name, status in matches: + test_name = test_name.strip() + if status == "ok": + results[test_name] = "PASSED" + else: + results[test_name] = "FAILED" + return results + + # Try to extract test descriptions from failure messages + # Pattern: "test_file.lua:line_number: test description" + failure_pattern = re.compile(r"^([^\s:]+\.lua):(\d+):\s*(.+)$", re.MULTILINE) + failures = failure_pattern.findall(text_output) + + if failures: + for filepath, _, description in failures: + test_id = f"{filepath}::{description.strip()}" + results[test_id] = "FAILED" + + # Parse summary line to get total count: "X tests, Y failures" + summary_match = re.search(r"(\d+)\s+tests?,\s+(\d+)\s+failures?", text_output.lower()) + if summary_match: + total_tests = int(summary_match.group(1)) + failures = int(summary_match.group(2)) + + # If we haven't parsed individual tests yet, generate generic ones + if not results: + passed = total_tests - failures + for i in range(passed): + results[f"test_{i + 1}"] = "PASSED" + for i in range(failures): + results[f"test_failed_{i + 1}"] = "FAILED" + return results + + # Fallback: if no detailed info, check for overall success/failure + if not results: + if "0 failures" in text_output.lower() or "0 errors" in text_output.lower(): + results["test_suite"] = "PASSED" + else: + results["test_suite"] = "FAILED" + + return results + + +def parse_bespoke_libgeos(text_output: str) -> Optional[Dict[str, str]]: + """Parse libgeos/GEOS test output format. + + Format: + capi::GEOSBoundary: . + capi::GEOSBuffer: ..................... + geos::operation::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F][5=F][6=F] + geos::operation::buffer::BufferOp: ..........................[27=X] + + Where: + - dots (.) 
= passing tests + - [N=F] = explicit failure markers + - [N=X] = exception markers (also failures) + - standalone F or X = failure/exception + + IMPORTANT: We prioritize finding valid test results over detecting errors. + We only return None if we're certain the framework didn't run. + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If libgeos tests failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Parse individual test result lines + # Pattern: TestSuite::TestName: followed by dots, Fs, Xs, or [N=F]/[N=X] markers + # Example: capi::GEOSBoundary: . + # Example: geos::OverlayNGEmptyCoordDim: [1=F][2=F].[4=F] + # Example: geos::operation::buffer::BufferOp: ..........................[27=X] + test_line_pattern = re.compile( + r"^([a-zA-Z_][a-zA-Z0-9_:]*::[a-zA-Z_][a-zA-Z0-9_]*)\s*:\s*(.+?)(?:\n|$)", re.MULTILINE + ) + + for match in test_line_pattern.finditer(text_output): + test_id = match.group(1) # Full name like "capi::GEOSBoundary" + test_output_line = match.group(2) # Everything after the colon + + # Check for failure markers: + # 1. [N=F] pattern (explicit failure notation) + # 2. [N=X] pattern (exception notation) + # 3. Standalone F or X characters + has_failure = bool(re.search(r"\[.*=[FX]\]|(? str: + """Normalize XCTest case identifiers from swift test console output.""" + name = raw_name.strip() + + # Typical format: -[Module.Class testMethod] + if name.startswith("-[") and name.endswith("]"): + inner = name[2:-1].strip() + if " " in inner: + class_name, method = inner.split(" ", 1) + return f"{class_name}::{method}" + return inner + + # Alternate format: Module.Class.testMethod + if "." 
in name: + parts = name.split(".", 1) + return f"{parts[0]}::{parts[1]}" + + return name + + +def parse_swift_test_text(text_output: str) -> Dict[str, str]: + """Parse vanilla `swift test` console output (without --xunit-output).""" + results = {} + test_case_pattern = re.compile(r"Test Case '([^']+)' (passed|failed|skipped)", re.IGNORECASE) + + for match in test_case_pattern.finditer(text_output): + raw_name, status = match.groups() + test_id = _normalize_swift_test_name(raw_name) + results[test_id] = status.upper() + + if results: + return results + + # Fallback: parse summary line to infer aggregate results if per-test lines missing + summary_match = re.search( + r"Executed\s+(\d+)\s+tests?,\s+with\s+(\d+)\s+failures?", + text_output, + re.IGNORECASE, + ) + if summary_match: + total_tests = int(summary_match.group(1)) + failures = int(summary_match.group(2)) + passes = max(total_tests - failures, 0) + + for i in range(passes): + results[f"swift_test_pass_{i + 1}"] = "PASSED" + for i in range(failures): + results[f"swift_test_fail_{i + 1}"] = "FAILED" + + return results + + +def parse_xctest_output(output: str) -> Dict[str, str]: + """Parse XCTest results, preferring XML when available.""" + xml_results = parse_junit_xml(output) + if xml_results: + return xml_results + return parse_swift_test_text(output) + + +def normalize_test_id(test_id: str, framework: str = "") -> str: + """Normalize test IDs for stable matching across different formats. + + This function performs several normalizations: + + 1. Removes unstable runtime prefixes that change between runs: + - (N/M) - Test execution order (e.g., "(2/5) test_name") + - [N/M] - Alternative bracket format + - #N - Test number prefix (e.g., "#42 test_name") + - N. - Numbered list format (e.g., "1. test_name") + + 2. Removes common file extensions (.py, .js, .ts, .go, etc.) from test paths + to allow matching between "test_file.py::test" and "test_file::test" + + 3. 
Normalizes delimiters (`.`, `::`, `/`) to a canonical form (`::`) + when they appear between alphanumeric characters, allowing matching + between "testa.testb::testc" and "testa/testb.testc" + + Examples: + "(2/5) test_name" -> "test_name" + "test_file.py::test_name" -> "test_file::test_name" + "testa.testb::testc" -> "testa::testb::testc" + "testa/testb.testc" -> "testa::testb::testc" + "tests/module.js::describe::it" -> "tests::module::describe::it" + + Args: + test_id: Original test ID from parser + framework: Test framework name (for future framework-specific rules if needed) + + Returns: + Normalized test ID + """ + # Step 1: Remove unstable runtime prefixes + + # Universal pattern: Remove (N/M) or [N/M] prefixes (test execution order) + # Matches: "(2/5) test", "[2/5] test", "(123/456) test", "( 1/75) test" (with internal space) + normalized = re.sub(r"^[\(\[]?\s*\d+/\d+[\)\]]?\s+", "", test_id) + + # Universal pattern: Remove #N prefix (test numbering) + # Matches: "#42 test", "# 42 test" + normalized = re.sub(r"^#\s*\d+\s+", "", normalized) + + # Universal pattern: Remove "N. " prefix (numbered list) + # Matches: "1. test", "42. test" + normalized = re.sub(r"^\d+\.\s+", "", normalized) + + # Step 2: Remove common file extensions before delimiters + # This prevents .py from becoming ::py after delimiter normalization + # Match extensions like .py, .js, .ts, etc. that appear before :: / . 
or end of string + extensions_pattern = ( + r"\.(py|pyw|js|mjs|cjs|ts|mts|cts|jsx|tsx|" + r"go|java|rb|rs|c|cpp|cc|cxx|h|hpp|hxx|" + r"swift|kt|kts|scala|php|cs|fs|" + r"ex|exs|erl|hrl|clj|cljs|cljc|" + r"lua|pl|pm|t|r|R|m|mm|" + r"f|f90|f95|for|vb|pas|pp|" + r"d|nim|zig|v|sv|vhd|vhdl|" + r"tcl|sh|bash|zsh|fish|ps1|psm1|psd1)" + r"(?=::|/|\.|$)" + ) + normalized = re.sub(extensions_pattern, "", normalized, flags=re.IGNORECASE) + + # Step 3: Normalize delimiters (., ::, /) to :: when between word characters + # This allows matching "testa.testb::testc" with "testa/testb.testc" + delimiter_pattern = r"(?<=\w)(::|\.|/)(?=\w)" + normalized = re.sub(delimiter_pattern, "::", normalized) + + return normalized + + +def parse_tap_text(text_output: str) -> Dict[str, str]: + """ + Parse TAP (Test Anything Protocol) output. + + TAP is used by tape, node-tap, and other JavaScript test frameworks. + + Format: + TAP version 13 + # Subtest: Test name + 1..N + ok 1 - assertion name + not ok 2 - assertion name + ok 1 - Test name # time=123ms + not ok 2 - Test name + 1..N + + IMPORTANT: We prioritize finding valid test results over detecting errors. + We only return None if we're certain the tests didn't run. 
+ + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If TAP tests failed to run (not the same as tests failing) + """ + results = {} + + # PRIORITY 1: Parse top-level test results (not indented subtests) + # Format: "ok N - Test name" or "not ok N - Test name" + # Skip lines starting with whitespace (subtests) + tap_test_pattern = re.compile( + r"^(not )?ok\s+(\d+)\s*(?:-\s*)?(.+?)(?:\s*#\s*(skip|todo|time=.*))?$", re.MULTILINE | re.IGNORECASE + ) + + for match in tap_test_pattern.finditer(text_output): + is_failure = match.group(1) is not None # "not ok" prefix + test_num = match.group(2) + test_name = match.group(3).strip() if match.group(3) else f"test_{test_num}" + directive = match.group(4) + + # Clean up test name (remove timing info like "# time=123ms") + test_name = re.sub(r"\s*#\s*time=[\d.]+m?s\s*$", "", test_name, flags=re.IGNORECASE) + + test_id = test_name if test_name else f"test_{test_num}" + + # Check for skip directive + if directive and directive.lower().startswith("skip"): + results[test_id] = "SKIPPED" + elif is_failure: + results[test_id] = "FAILED" + else: + results[test_id] = "PASSED" + + # PRIORITY 2: If no results found, try parsing summary line + # Format: "# tests N", "# pass N", "# fail N" + if not results: + pass_match = re.search(r"#\s*pass\s+(\d+)", text_output, re.IGNORECASE) + fail_match = re.search(r"#\s*fail\s+(\d+)", text_output, re.IGNORECASE) + + if pass_match or fail_match: + passed = int(pass_match.group(1)) if pass_match else 0 + failed = int(fail_match.group(1)) if fail_match else 0 + + for i in range(passed): + results[f"tap_test_passed_{i + 1}"] = "PASSED" + for i in range(failed): + results[f"tap_test_failed_{i + 1}"] = "FAILED" + + # PRIORITY 3: If no results, check for error indicators + if not results: + error_indicators = [ + "npm ERR!", # npm errors + "Error: Cannot find module", # Module loading errors + "SyntaxError:", # JavaScript syntax errors + "TypeError:", # 
Type errors + ] + has_errors = any(indicator in text_output for indicator in error_indicators) + # Return None ONLY if: no results found AND errors present + return None if has_errors else results + + return results + + +def parse_hardhat_mocha_text(text_output: str) -> Dict[str, str]: + """ + Parse Hardhat/Mocha console text output (non-JSON reporter). + + Hardhat uses Mocha under the hood and outputs text like: + Contract: FeeSharingProxy: + withdrawFees + ✓ Shouldn't be able to use zero token address + ✓ Shouldn't be able to withdraw second time in period + 1) Should fail with specific error + + 5 passing (1s) + 1 failing + + IMPORTANT: We prioritize finding valid test results over detecting errors. + + Returns: + Dict[str, str]: Test results mapping test IDs to status (PASSED/FAILED/SKIPPED) + None: If tests failed to run (not the same as tests failing) + """ + results = {} + + # Track current context (Contract/describe blocks) + current_context = [] + + # PRIORITY 1: Parse individual test results + for line in text_output.split("\n"): + stripped = line.strip() + + # Track Contract: or describe blocks + contract_match = re.match(r"^Contract:\s*(.+?):\s*$", stripped) + if contract_match: + current_context = [contract_match.group(1)] + continue + + # Track describe blocks (indented without checkmark/number) + if stripped and not stripped.startswith(("✓", "✗", "-")) and not re.match(r"^\d+\)", stripped): + # Check if this looks like a describe block (usually followed by test cases) + if ":" not in stripped and len(stripped) < 100: + # This might be a describe block, but we'll handle it dynamically + pass + + # Match passed tests: ✓ test_name or ✔ test_name + pass_match = re.match(r"^[✓✔]\s+(.+?)(?:\s+\(\d+m?s\))?$", stripped) + if pass_match: + test_name = pass_match.group(1).strip() + test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name + results[test_id] = "PASSED" + continue + + # Match failed tests: N) test_name or ✗ 
test_name + fail_match = re.match(r"^(?:\d+\)|[✗✘])\s*(.+?)$", stripped) + if fail_match: + test_name = fail_match.group(1).strip() + test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name + results[test_id] = "FAILED" + continue + + # Match skipped tests: - test_name + skip_match = re.match(r"^-\s+(.+?)$", stripped) + if skip_match: + test_name = skip_match.group(1).strip() + test_id = f"{' > '.join(current_context)} > {test_name}" if current_context else test_name + results[test_id] = "SKIPPED" + continue + + # PRIORITY 2: Parse summary if no individual results found + if not results: + # Look for "N passing" and "N failing" + pass_match = re.search(r"(\d+)\s+passing", text_output, re.IGNORECASE) + fail_match = re.search(r"(\d+)\s+failing", text_output, re.IGNORECASE) + + if pass_match or fail_match: + passed = int(pass_match.group(1)) if pass_match else 0 + failed = int(fail_match.group(1)) if fail_match else 0 + + for i in range(passed): + results[f"mocha_test_passed_{i + 1}"] = "PASSED" + for i in range(failed): + results[f"mocha_test_failed_{i + 1}"] = "FAILED" + + # PRIORITY 3: If no results, check for error indicators + if not results: + error_indicators = [ + "Error: Cannot find module", + "SyntaxError:", + "CompilerError:", # Solidity compilation errors + "Error: HH", # Hardhat errors + ] + has_errors = any(indicator in text_output for indicator in error_indicators) + return None if has_errors else results + + return results + + +def parse_pytest_text(text_output: str) -> Dict[str, str]: + """ + Parse pytest plain text output (-v flag). 
+ + Pytest outputs lines like: + tests/test_foo.py::test_one PASSED + tests/test_foo.py::test_two FAILED + tests/test_foo.py::test_three SKIPPED + + Or in short form: + tests/test_foo.py .F.s + + Also handles summary lines like: + ===== 3 passed, 1 failed, 1 skipped in 0.5s ===== + """ + results = {} + + # Pattern 1: Verbose output with test names + # e.g., "tests/test_foo.py::test_one PASSED" + verbose_pattern = re.compile( + r"^([\w./]+::\w+(?:::\w+)*)\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS)", re.MULTILINE + ) + + for match in verbose_pattern.finditer(text_output): + test_id = match.group(1).strip() + status = match.group(2).upper() + + if status in ("PASSED", "XPASS"): + results[test_id] = "PASSED" + elif status in ("FAILED", "ERROR", "XFAIL"): + results[test_id] = "FAILED" + elif status == "SKIPPED": + results[test_id] = "SKIPPED" + + if results: + return results + + # Pattern 2: Short form (. or X = pass, F, E, or x = fail, s = skip), mirroring Pattern 1 + # e.g., "tests/test_foo.py .F.s" + short_pattern = re.compile(r"^([\w./]+\.py)\s+([.FsExX]+)", re.MULTILINE) + + for match in short_pattern.finditer(text_output): + file_path = match.group(1) + outcomes = match.group(2) + + for i, char in enumerate(outcomes): + test_id = f"{file_path}::test_{i + 1}" + if char in (".", "X"): + results[test_id] = "PASSED" + elif char in ("F", "E", "x"): + results[test_id] = "FAILED" + elif char.lower() == "s": + results[test_id] = "SKIPPED" + + if results: + return results + + # Pattern 3: Summary line fallback + # e.g., "===== 3 passed, 1 failed, 1 skipped in 0.5s =====" + summary_pattern = re.compile(r"(\d+)\s+passed(?:,\s*(\d+)\s+failed)?(?:,\s*(\d+)\s+(?:skipped|deselected))?") + match = summary_pattern.search(text_output) + if match: + passed = int(match.group(1) or 0) + failed = int(match.group(2) or 0) + skipped = int(match.group(3) or 0) + + for i in range(passed): + results[f"pytest_test_passed_{i + 1}"] = "PASSED" + for i in range(failed): + results[f"pytest_test_failed_{i + 1}"] = "FAILED" + for i
in range(skipped): + results[f"pytest_test_skipped_{i + 1}"] = "SKIPPED" + + return results + + +def parse_test_output(output: str, framework: str) -> Dict[str, str]: + """ + Parse test output to extract individual test results. + + Returns: {'test_id': 'PASSED'|'FAILED'|'SKIPPED'} + """ + # Direct framework → parser mapping + parsers = { + "pytest": parse_junit_xml, + "unittest": parse_junit_xml, + "junit": parse_junit_xml, + "maven": parse_maven_text_output, + "gtest": parse_gtest_json, + "cargo-nextest": parse_cargo_nextest, + "go": parse_go_json, + "jest": parse_jest_vitest_json, + "vitest": parse_jest_vitest_json, + "mocha": parse_mocha_json, + "bun": parse_bun_text, + "ctest": parse_junit_xml, + "cppunit": parse_cppunit_text, + "bespoke_libgeos": parse_bespoke_libgeos, + # XCTest using hybrid approach + "xctest": parse_xctest_output, + "testing": parse_xctest_output, # New Swift Testing framework (Swift 6+) + # Lua frameworks + "busted": parse_junit_xml, # Uses JUnit XML output + "luaunit": parse_junit_xml, # Uses JUnit XML output + "telescope": parse_telescope_text, + "lust": parse_lust_text, + "minitest": parse_minitest_text, # Neovim mini.nvim test framework + # TAP (Test Anything Protocol) - used by tape, node-tap + "tap": parse_tap_text, + "tape": parse_tap_text, + # Hardhat (Solidity) - uses Mocha console output + "hardhat": parse_hardhat_mocha_text, + } + + parser = parsers.get(framework) + if parser: + result = parser(output) + # Fallback for common frameworks if their primary parser returns None/empty + if not result: + if framework in ["junit", "maven"]: + result = parse_maven_text_output(output) + elif framework == "pytest": + # Pytest often outputs plain text, not JUnit XML + result = parse_pytest_text(output) + elif framework == "mocha": + # Mocha might output text instead of JSON (console reporter) + result = parse_hardhat_mocha_text(output) + return result or {} + + # Try auto-detection for unknown frameworks + # Check for TAP output + if "TAP 
version" in output or re.search(r"^(?:not )?ok\s+\d+", output, re.MULTILINE): + return parse_tap_text(output) or {} + + # Check for Mocha/Hardhat console output + if "Contract:" in output or re.search(r"^\s*[✓✔]\s+", output, re.MULTILINE): + return parse_hardhat_mocha_text(output) or {} + + return {} diff --git a/resources_servers/swe_bench/parsing/utils.py b/resources_servers/swe_bench/parsing/utils.py new file mode 100644 index 0000000000..4ef21aad79 --- /dev/null +++ b/resources_servers/swe_bench/parsing/utils.py @@ -0,0 +1,194 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""SWE-Bench-Ext test output parsing utilities. + +Provides the high-level grading entry point that parses raw test output, +normalizes test IDs, fuzzy-matches the expected FAIL_TO_PASS / PASS_TO_PASS +tests, and reports whether the task was resolved. Example usage:: + + from resources_servers.swe_bench.parsing import parse_and_check_tests + + result = parse_and_check_tests( + test_output=log_text, + test_framework="pytest", + fail_to_pass=["test_a", "test_b"], + pass_to_pass=["test_c"], + instance_id="my-task-123", + ) + # result["resolved"] -> bool +""" + +from __future__ import annotations + +from typing import Any, Dict, List, Optional + +from resources_servers.swe_bench.parsing.parsing import ( + normalize_test_id, + parse_test_output, +) + + +# Marker strings used to delimit structured output in the raw test log. 
+_TEST_OUTPUT_START = "<<>>" +_TEST_OUTPUT_END = "<<>>" +_RESULT_FILE_START = "<<>>" +_RESULT_FILE_END = "<<>>" + + +def _extract_between_markers(text: str, start: str, end: str) -> Optional[str]: + """Extract the substring between two marker strings. + + Args: + text: Text to search within. + start: Opening marker; the result begins after it. + end: Closing marker; the result ends before it. + + Returns: + The stripped text between the markers, or None if either marker is + missing or they appear out of order. + """ + s = text.find(start) + e = text.find(end) + if s != -1 and e != -1 and s < e: + return text[s + len(start) : e].strip() + return None + + +def _match_test_with_fuzzy( + test_id: str, + parsed_results: Dict[str, str], + build_failed_packages: set, +) -> str: + """Resolve the status of a single test ID against parsed results. + + Tries, in order: a direct lookup, a check for membership in a package that + failed to build, a substring match, and a match on the final ``::`` + component. + + Args: + test_id: Normalized test identifier to look up. + parsed_results: Mapping of parsed test ID to status string. + build_failed_packages: Set of package names whose build failed; any + test ID prefixed by one of these is treated as failed. + + Returns: + The matched status string, ``"FAILED"`` if the test's package failed to + build, or ``"NOT_FOUND"`` if no match is found. 
+ """ + # Direct match + if test_id in parsed_results: + return parsed_results[test_id] + + # Check if this test belongs to a package that failed to build + for pkg in build_failed_packages: + if test_id.startswith(pkg): + return "FAILED" + + # Substring match (normalized IDs may differ in prefix) + for parsed_id, status in parsed_results.items(): + if test_id in parsed_id or parsed_id in test_id: + return status + + # Try matching by last component (after last ::) + if "::" in test_id: + suffix = test_id.rsplit("::", 1)[-1] + for parsed_id, status in parsed_results.items(): + if "::" in parsed_id and parsed_id.rsplit("::", 1)[-1] == suffix: + return status + + return "NOT_FOUND" + + +def parse_and_check_tests( + test_output: str, + test_framework: str, + fail_to_pass: List[str], + pass_to_pass: List[str], + instance_id: str = "", +) -> Dict[str, Any]: + """Parse test output and check FAIL_TO_PASS / PASS_TO_PASS resolution. + + The pipeline extracts structured output from the result-file markers (if + present), parses it with the framework dispatcher, normalizes both parsed + and expected test IDs, fuzzy-matches each expected test, and computes + ``resolved`` as all FAIL_TO_PASS passing and all PASS_TO_PASS passing. + + Args: + test_output: Raw test log to parse. + test_framework: Name of the test framework (e.g. ``"pytest"``) used to + select the parser and normalize IDs. + fail_to_pass: Test IDs expected to transition from failing to passing. + pass_to_pass: Test IDs expected to remain passing. + instance_id: Optional task identifier, accepted for caller convenience. + + Returns: + A report dict containing the overall ``resolved`` flag, per-test + FAIL_TO_PASS and PASS_TO_PASS results, pass/total counts for each + group, the number of parsed tests, and the framework name. + """ + # Try to extract result file content from the markers. 
+ result_file_content = _extract_between_markers(test_output, _RESULT_FILE_START, _RESULT_FILE_END) + + if result_file_content: + parsed = parse_test_output(result_file_content, test_framework) + if not parsed: + parsed = parse_test_output(test_output, test_framework) + else: + parsed = parse_test_output(test_output, test_framework) + + if parsed is None: + parsed = {} + + # Normalize parsed test IDs + parsed = {normalize_test_id(tid, test_framework): status for tid, status in parsed.items()} + + # Normalize expected test IDs + norm_f2p = [normalize_test_id(tid, test_framework) for tid in fail_to_pass] + norm_p2p = [normalize_test_id(tid, test_framework) for tid in pass_to_pass] + + # Handle synthetic build/compile tests + for tid in norm_f2p + norm_p2p: + if (tid.endswith("::build") or tid.endswith("::compile")) and tid not in parsed: + parsed[tid] = "PASSED" + + # Identify packages that failed to build + build_failed_packages = {pkg for pkg, status in parsed.items() if status == "FAILED" and "::" not in pkg} + + # Match FAIL_TO_PASS + f2p_results = {} + for tid in norm_f2p: + f2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages) + + # Match PASS_TO_PASS + p2p_results = {} + for tid in norm_p2p: + p2p_results[tid] = _match_test_with_fuzzy(tid, parsed, build_failed_packages) + + all_f2p_passed = all(v == "PASSED" for v in f2p_results.values()) if f2p_results else False + all_p2p_passed = all(v == "PASSED" for v in p2p_results.values()) + resolved = all_f2p_passed and all_p2p_passed + + return { + "resolved": resolved, + "patch_exists": True, + "patch_successfully_applied": True, + "fail_to_pass_results": f2p_results, + "pass_to_pass_results": p2p_results, + "f2p_passed": sum(1 for v in f2p_results.values() if v == "PASSED"), + "f2p_total": len(f2p_results), + "p2p_passed": sum(1 for v in p2p_results.values() if v == "PASSED"), + "p2p_total": len(p2p_results), + "parsed_count": len(parsed), + "framework": test_framework, + } diff --git 
a/resources_servers/swe_bench/prepare.py b/resources_servers/swe_bench/prepare.py new file mode 100644 index 0000000000..505ad263cc --- /dev/null +++ b/resources_servers/swe_bench/prepare.py @@ -0,0 +1,183 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Usage: + python prepare.py # full SWE-bench Verified + all SIFs + python prepare.py --limit 5 # 5 instances + their 5 SIFs (smoke test) + python prepare.py --instance-id django__django-13741 + python prepare.py --no-images # dataset only, skip image builds + python prepare.py --no-dataset --sif-dir PATH # build images only + +Schema that anyswe_agent expects: each JSONL line has +`responses_create_params.metadata` with `instance_id`, `dataset_name`, `split`, +`problem_statement`, and `instance_dict` (the full SWE-bench instance the eval +harness needs). Images are Apptainer SIFs named `{instance_id}.sif` so the +agent's container_formatter is simply `/{instance_id}.sif`. + +Prerequisites for image builds: `apptainer` on PATH and network access to the +SWE-bench image registry. Each SIF is multiple GB, so building all of SWE-bench +Verified (500 tasks) needs hundreds of GB of disk. Use --limit to start small and iterate.
+""" + +import argparse +import json +import subprocess +import sys +from concurrent.futures import ThreadPoolExecutor, as_completed +from pathlib import Path + + +HF_DATASET = "princeton-nlp/SWE-bench_Verified" +DEFAULT_SPLIT = "test" +# SWE-bench publishes eval images with `__` -> `_1776_` and lowercased. +DOCKER_IMAGE_TMPL = "docker://swebench/sweb.eval.x86_64.{tag}:latest" +DEFAULT_MODEL = "Qwen/Qwen3-Coder-30B-A3B-Instruct" + +_THIS_DIR = Path(__file__).parent + + +def _docker_tag(instance_id: str) -> str: + return instance_id.replace("__", "_1776_").lower() + + +def _to_gym_row(inst: dict, split: str, sampling: dict) -> dict: + swe_meta = { + "instance_id": inst["instance_id"], + "dataset_name": HF_DATASET, + "split": split, + "problem_statement": inst["problem_statement"], + "instance_dict": json.dumps(inst), + } + user_text = inst["problem_statement"] + return { + "responses_create_params": { + "input": [{"role": "user", "content": user_text}], + **sampling, + "metadata": swe_meta, + }, + "verifier_metadata": swe_meta, + } + + +def build_dataset(output: Path, split: str, limit: int | None, instance_id: str | None, sampling: dict) -> list[str]: + try: + from datasets import load_dataset + except ImportError: + sys.exit("`datasets` is required for dataset prep: pip install datasets") + + print(f"Loading {HF_DATASET} [{split}]...", flush=True) + rows = load_dataset(HF_DATASET, split=split) + + if instance_id: + rows = [r for r in rows if r["instance_id"] == instance_id] + if not rows: + sys.exit(f"instance_id {instance_id!r} not found in {HF_DATASET}") + elif limit: + rows = rows.select(range(min(limit, len(rows)))) + + output.parent.mkdir(parents=True, exist_ok=True) + ids: list[str] = [] + with output.open("w") as f: + for inst in rows: + inst = dict(inst) + f.write(json.dumps(_to_gym_row(inst, split, sampling)) + "\n") + ids.append(inst["instance_id"]) + print(f"Wrote {len(ids)} rows -> {output}", flush=True) + return ids + + +def 
_build_one_sif(instance_id: str, sif_dir: Path, force: bool) -> tuple[str, bool, str]: + sif_path = sif_dir / f"{instance_id}.sif" + if sif_path.exists() and not force: + return instance_id, True, "exists" + image = DOCKER_IMAGE_TMPL.format(tag=_docker_tag(instance_id)) + proc = subprocess.run( + ["apptainer", "build", "--force", str(sif_path), image], + capture_output=True, + text=True, + errors="replace", + ) + if proc.returncode != 0: + return instance_id, False, proc.stderr.strip()[-500:] + return instance_id, True, "built" + + +def build_images(instance_ids: list[str], sif_dir: Path, jobs: int, force: bool) -> None: + if not _which("apptainer"): + sys.exit("`apptainer` not found on PATH. Install it or pass --no-images") + sif_dir.mkdir(parents=True, exist_ok=True) + print(f"Building {len(instance_ids)} SIF(s) into {sif_dir} with {jobs} worker(s)...", flush=True) + failures: list[str] = [] + with ThreadPoolExecutor(max_workers=jobs) as pool: + futures = {pool.submit(_build_one_sif, iid, sif_dir, force): iid for iid in instance_ids} + for done in as_completed(futures): + iid, ok, detail = done.result() + print(f" [{'ok' if ok else 'FAIL'}] {iid}: {detail}", flush=True) + if not ok: + failures.append(iid) + if failures: + print(f"\n{len(failures)} image build(s) failed:", flush=True) + for iid in failures: + print(f" - {iid}", flush=True) + sys.exit(1) + print(f"All images ready. 
Use: container_formatter='{sif_dir}/{{instance_id}}.sif'", flush=True) + + +def _which(name: str) -> bool: + from shutil import which + + return which(name) is not None + + +def main() -> None: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--output", type=Path, default=_THIS_DIR / "data" / "swebench_verified.jsonl") + p.add_argument("--split", default=DEFAULT_SPLIT) + p.add_argument("--limit", type=int, default=None, help="Only the first N instances (default: all)") + p.add_argument("--instance-id", default=None, help="Only this instance") + p.add_argument("--sif-dir", type=Path, default=_THIS_DIR / "data" / "sifs") + p.add_argument("--no-dataset", action="store_true", help="Skip dataset build") + p.add_argument("--no-images", action="store_true", help="Skip image build") + p.add_argument("--jobs", type=int, default=4, help="Parallel image builds") + p.add_argument("--force", action="store_true", help="Rebuild SIFs that already exist") + p.add_argument("--model", default=DEFAULT_MODEL, help="Default model baked into each row") + p.add_argument("--temperature", type=float, default=0.7) + p.add_argument("--top-p", type=float, default=0.8) + p.add_argument("--max-output-tokens", type=int, default=12288) + args = p.parse_args() + + sampling = { + "model": args.model, + "temperature": args.temperature, + "top_p": args.top_p, + "max_output_tokens": args.max_output_tokens, + } + + instance_ids: list[str] + if args.no_dataset: + if not args.output.exists(): + sys.exit(f"--no-dataset but {args.output} does not exist") + instance_ids = [ + json.loads(line)["responses_create_params"]["metadata"]["instance_id"] + for line in args.output.read_text().splitlines() + if line.strip() + ] + else: + instance_ids = build_dataset(args.output, args.split, args.limit, args.instance_id, sampling) + + if not args.no_images: + build_images(instance_ids, args.sif_dir, args.jobs, args.force) + + +if __name__ == 
"__main__": + main() diff --git a/resources_servers/swe_bench/requirements.txt b/resources_servers/swe_bench/requirements.txt new file mode 100644 index 0000000000..cef7e1d96d --- /dev/null +++ b/resources_servers/swe_bench/requirements.txt @@ -0,0 +1,2 @@ +swebench +datasets>=2.14.0 diff --git a/resources_servers/swe_bench/sandbox.py b/resources_servers/swe_bench/sandbox.py new file mode 100644 index 0000000000..99dafcd50c --- /dev/null +++ b/resources_servers/swe_bench/sandbox.py @@ -0,0 +1,230 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Async SWE sandbox: an environment wrapper plus its acquire/teardown lifecycle. + +``AsyncSweEnvironment`` is a thin async wrapper around a started sandbox that any +agent or the verifier uses to run commands and move files in and out. +``acquire_sandbox`` starts a fresh sandbox and always tears it down on exit +(normal return, exception, or cancellation). +""" + +from __future__ import annotations + +import os +import tempfile +from contextlib import asynccontextmanager +from pathlib import Path +from typing import Any, AsyncIterator, Mapping + +from nemo_gym.sandbox import AsyncSandbox, SandboxProvider, SandboxSpec + + +class AsyncSweEnvironment: + """Thin async wrapper around a started ``AsyncSandbox``. + + Agents drive their own loop with ``execute``/``upload``/``download``; the + verifier uses the same surface to run eval recipes. 
The environment never + owns trajectory capture or grading logic — only sandbox I/O. + """ + + def __init__(self, sandbox: AsyncSandbox) -> None: + """Wrap an already-started sandbox. + + Args: + sandbox (AsyncSandbox): A started sandbox to drive I/O against. + """ + self._sandbox = sandbox + self._closed = False + + @classmethod + async def start( + cls, + provider: Mapping[str, Any] | SandboxProvider, + spec: SandboxSpec, + ) -> "AsyncSweEnvironment": + """Create and start a fresh sandbox and return the environment. + + Args: + provider (Mapping[str, Any] | SandboxProvider): The sandbox provider + config or instance to launch the sandbox with. + spec (SandboxSpec): The sandbox spec describing image, workdir, env, + and other launch options. + + Returns: + AsyncSweEnvironment: An environment wrapping the started sandbox. + """ + sandbox = AsyncSandbox(provider, spec) + await sandbox.start() + return cls(sandbox) + + @property + def sandbox(self) -> AsyncSandbox: + """The wrapped sandbox. + + Returns: + AsyncSandbox: The underlying sandbox instance. + """ + return self._sandbox + + @property + def sandbox_id(self) -> str | None: + """The provider-assigned sandbox identifier. + + Returns: + str | None: The sandbox id, or ``None`` if the sandbox has no handle. + """ + handle = getattr(self._sandbox, "_handle", None) + return handle.sandbox_id if handle is not None else None + + @property + def provider_name(self) -> str | None: + """The name of the provider backing the sandbox. + + Returns: + str | None: The provider name, or ``None`` if the sandbox has no handle. + """ + handle = getattr(self._sandbox, "_handle", None) + return handle.provider_name if handle is not None else None + + async def execute( + self, + command: str, + *, + cwd: str | None = None, + user: str | int | None = "root", + timeout_s: int | float | None = None, + is_eval: bool = False, + ) -> dict[str, Any]: + """Run a command in the sandbox and return a normalized result. 
+ + Args: + command (str): The shell command to execute. + cwd (str | None): Working directory for the command, or ``None`` to + use the sandbox default. + user (str | int | None): User to run the command as. Defaults to + ``"root"``. + timeout_s (int | float | None): Optional timeout in seconds. + is_eval (bool): Accepted for caller bookkeeping; it does not affect + how the command is executed. + + Returns: + dict[str, Any]: A dict with ``output`` (combined stdout and stderr), + ``returncode``, ``stdout``, ``stderr``, and ``error_type``. + """ + result = await self._sandbox.exec(command, cwd=cwd, env=None, timeout_s=timeout_s, user=user) + stdout = result.stdout or "" + stderr = result.stderr or "" + output = "\n".join(part for part in (stdout, stderr) if part) + return { + "output": output, + "returncode": result.return_code, + "stdout": stdout, + "stderr": stderr, + "error_type": result.error_type, + } + + async def upload(self, local_path: Path | str, remote_path: str) -> None: + """Upload a local file into the sandbox. + + Args: + local_path (Path | str): Path to the file on the host. + remote_path (str): Destination path inside the sandbox. + """ + await self._sandbox.upload(local_path, remote_path) + + async def download(self, remote_path: str, local_path: Path | str) -> None: + """Download a file from the sandbox to the host. + + Args: + remote_path (str): Source path inside the sandbox. + local_path (Path | str): Destination path on the host. + """ + await self._sandbox.download(remote_path, local_path) + + async def write_text(self, remote_path: str, content: str) -> None: + """Write a string to a file inside the sandbox via a temporary upload. + + Args: + remote_path (str): Destination path inside the sandbox. + content (str): The text content to write. 
+ """ + tmp = tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") + try: + tmp.write(content) + tmp.flush() + tmp.close() + await self._sandbox.upload(tmp.name, remote_path) + finally: + os.unlink(tmp.name) + + async def cleanup(self) -> None: + """Stop the sandbox. Idempotent: subsequent calls are no-ops.""" + if self._closed: + return + self._closed = True + await self._sandbox.stop() + + async def __aenter__(self) -> "AsyncSweEnvironment": + """Enter the async context manager. + + Returns: + AsyncSweEnvironment: This environment instance. + """ + return self + + async def __aexit__(self, exc_type: Any, exc_val: Any, exc_tb: Any) -> None: + """Exit the async context manager and stop the sandbox. + + Args: + exc_type (Any): The exception type, if one was raised. + exc_val (Any): The exception instance, if one was raised. + exc_tb (Any): The traceback, if an exception was raised. + """ + await self.cleanup() + + +# --- sandbox acquire/teardown lifecycle --------------- + + +@asynccontextmanager +async def acquire_sandbox( + provider: Mapping[str, Any] | SandboxProvider, + spec: SandboxSpec, + *, + instance_id: str = "", +) -> AsyncIterator[AsyncSweEnvironment]: + """Start a fresh sandbox, yield it, and always stop it on exit. + + Args: + provider: Either a ``SandboxProvider`` instance or a mapping describing + the provider configuration used to create the sandbox. + spec: The ``SandboxSpec`` describing how to provision the sandbox. + instance_id: Identifier accepted for logging/telemetry; it does not + affect behavior. + + Yields: + AsyncSweEnvironment: The started environment wrapping the sandbox, + which is cleaned up when the context manager exits. 
+ """ + env: AsyncSweEnvironment | None = None + try: + env = await AsyncSweEnvironment.start(provider, spec) + yield env + finally: + if env is not None: + try: + await env.cleanup() + except Exception: + pass diff --git a/resources_servers/swe_bench/self_drive.py b/resources_servers/swe_bench/self_drive.py new file mode 100644 index 0000000000..2e00cfac69 --- /dev/null +++ b/resources_servers/swe_bench/self_drive.py @@ -0,0 +1,392 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Provider-neutral self-driving scaffolding for SWE agents. + +Any agent that runs to completion inside a sandbox (editing the repo at the task's +working directory) can reuse these helpers: provision a working sandbox via a +``SandboxProvider``, inject a sandbox-reachable model endpoint and/or extra +environment for egress, run an opaque agent launch command, and extract the +resulting unified-diff patch. Grading is decoupled — callers grade the patch +in-process via :func:`run_self_driving` (or ``verify_task`` directly) in a fresh +sandbox. The agent launch command, staged files, and patch-output location are +caller-supplied, so nothing here is specific to any one agent harness. + +This module also defines the in-sandbox model-server egress primitive +(``ModelEndpoint`` / ``resolve``), used to inject a sandbox-reachable endpoint +into the agent's environment. 
+""" + +from __future__ import annotations + +import dataclasses +import json +import shlex +from collections.abc import Mapping +from dataclasses import dataclass +from typing import Any + +from nemo_gym.sandbox import SandboxProvider +from resources_servers.swe_bench.harness import SweTask, get_harness, reward_from_report +from resources_servers.swe_bench.sandbox import acquire_sandbox + + +def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str: + """Return the name of a sandbox provider. + + Args: + provider: Either a mapping keyed by provider name, or a ``SandboxProvider`` + instance with a ``name`` attribute. + + Returns: + The provider name, or ``"?"`` if it cannot be determined. + """ + if isinstance(provider, Mapping): + return next(iter(provider), "?") + return getattr(provider, "name", "?") + + +async def _read_output_jsonl_row(env, output_glob: str) -> dict[str, Any]: + """Return the last row of the newest matching ``output.jsonl`` (or ``{}`` if absent). + + Some self-driving harnesses write their result row to an ``output.jsonl`` file under an + output directory rather than to the working tree, so a plain ``git diff`` would miss the + patch. When several files match (e.g. a re-run left a stale one), the newest by mtime is + picked. ``find -printf "%T@ %p"`` emits ``<mtime> <path>`` per match; ``sort -n | tail -1`` + selects the most-recently-modified, and the leading float timestamp plus single space is + stripped back off (so paths containing spaces survive). + + Args: + env: The sandbox handle exposing ``execute`` for running shell commands. + output_glob: Path or glob under which to search for ``output.jsonl`` files. + + Returns: + The parsed last JSON row of the newest matching ``output.jsonl`` as a dict, or an + empty dict if no file or content is found.
+ """ + found = await env.execute( + f'find {shlex.quote(output_glob)} -name output.jsonl -printf "%T@ %p\\n" 2>/dev/null | sort -n | tail -1' + ) + newest = (found.get("stdout", "") or "").strip() + # newest is "<mtime> <path>"; the path may contain spaces, so split only on the first one. + path = newest.split(" ", 1)[1].strip() if " " in newest else "" + if not path: + return {} + catted = await env.execute(f"cat {shlex.quote(path)}") + raw = (catted.get("stdout", "") or "").strip() + if not raw: + return {} + return json.loads(raw.splitlines()[-1]) + + +async def _extract_patch_from_output_jsonl(env, output_glob: str) -> str: + """Read the unified-diff patch from the newest matching ``output.jsonl``. + + Args: + env: The sandbox handle exposing ``execute`` for running shell commands. + output_glob: Path or glob under which to search for ``output.jsonl`` files. + + Returns: + The patch string from ``row["test_result"]["git_patch"]``, or an empty string if + absent. + """ + row = await _read_output_jsonl_row(env, output_glob) + return (row.get("test_result") or {}).get("git_patch", "") or "" + + +def _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env): + """Build the agent sandbox spec, injecting egress env (model endpoint and/or extra env). + + Args: + task: The SWE task whose benchmark selects the harness and seeds the spec. + provider: The sandbox provider, used to resolve the model endpoint for egress. + model_server: Optional model-server config; when given, a sandbox-reachable endpoint + is resolved and merged into the spec's environment. + opensandbox_service_url: Optional OpenSandbox service URL used when resolving the + model endpoint. + extra_env: Optional environment variables merged verbatim into the spec. + + Returns: + The sandbox spec with egress environment variables applied.
+ """ + harness = get_harness(task.benchmark) + spec = harness.build_spec(task) + # Model-server egress: inject only a sandbox-reachable endpoint (never the global dict). + if model_server is not None: + endpoint = resolve(_provider_name(provider), model_server, opensandbox_service_url=opensandbox_service_url) + spec = dataclasses.replace(spec, env={**spec.env, **endpoint.to_sandbox_env()}) + # Any extra in-sandbox env (e.g. a NeMo-Gym ServerClient config dict, ANTHROPIC_* vars). + if extra_env: + spec = dataclasses.replace(spec, env={**spec.env, **dict(extra_env)}) + return spec + + +async def provision_and_collect( + task: SweTask, + *, + provider: Mapping[str, Any] | SandboxProvider, + agent_launch_command: str, + model_server: Mapping[str, Any] | None = None, + opensandbox_service_url: str | None = None, + extra_env: Mapping[str, str] | None = None, + stage_files: Mapping[str, str] | None = None, + patch_output_glob: str | None = None, + agent_timeout_s: int | float = 1800, +) -> dict[str, Any]: + """Provision and self-drive the agent, returning the patch and error signals. + + Provisions a writable sandbox from the task image, stages any caller-supplied files, + runs the opaque ``agent_launch_command`` at the repo working directory, then extracts the + unified-diff patch. No grading happens here. + + Two egress styles are supported and composable: + + * ``model_server`` -> a sandbox-reachable OpenAI ``base_url`` (via ``resolve``), + for agents that call the model via a standard OpenAI/litellm client. + * ``extra_env`` -> injected verbatim, for agents wired to NeMo Gym's ``ServerClient`` or to + a CLI that reads its endpoint from environment variables. + + ``env.execute`` does not raise on timeout; it returns an ``error_type`` instead, so the + caller must read the returned ``"error_type"`` to set ``agent_timed_out`` (otherwise a + timed-out agent would incorrectly go unmasked). + + Args: + task: The SWE task describing the instance, image, and working directory.
+ provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``). + agent_launch_command: The shell command that runs the agent inside the sandbox. + model_server: Optional model-server config; when given, a sandbox-reachable endpoint + is resolved and injected into the agent's environment. + opensandbox_service_url: Optional OpenSandbox service URL used when resolving the + model endpoint. + extra_env: Optional environment variables injected verbatim into the sandbox. + stage_files: Optional ``{remote_path: content}`` files written into the live sandbox + before launch. + patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this + path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``. + agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``. + + Returns: + A dict with keys ``"patch"`` (the unified-diff string), ``"agent_error"`` (the + harness error field or ``None``), and ``"error_type"`` (``"timeout"``, ``"sandbox"``, + or ``None``). 
+ """ + spec = _build_agent_spec(task, provider, model_server, opensandbox_service_url, extra_env) + async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env: + for remote_path, content in (stage_files or {}).items(): + await env.write_text(remote_path, content) + run = await env.execute(agent_launch_command, cwd=task.repo_workdir, timeout_s=agent_timeout_s) + error_type = run.get("error_type") + if patch_output_glob: + row = await _read_output_jsonl_row(env, patch_output_glob) + patch = (row.get("test_result") or {}).get("git_patch", "") or "" + return {"patch": patch, "agent_error": row.get("error"), "error_type": error_type} + diff = await env.execute(f"cd {shlex.quote(task.repo_workdir)} && git add -A && git diff --cached", cwd=task.repo_workdir) + return {"patch": diff.get("stdout", "") or "", "agent_error": None, "error_type": error_type} + + +async def provision_and_extract_patch( + task: SweTask, + *, + provider: Mapping[str, Any] | SandboxProvider, + agent_launch_command: str, + model_server: Mapping[str, Any] | None = None, + opensandbox_service_url: str | None = None, + extra_env: Mapping[str, str] | None = None, + stage_files: Mapping[str, str] | None = None, + patch_output_glob: str | None = None, + agent_timeout_s: int | float = 1800, +) -> str: + """Provision a working sandbox, self-drive the agent, and return the unified-diff patch. + + A thin wrapper over :func:`provision_and_collect` returning only the patch. No grading + happens here. + + Args: + task: The SWE task describing the instance, image, and working directory. + provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``). + agent_launch_command: The shell command that runs the agent inside the sandbox. + model_server: Optional model-server config; when given, a sandbox-reachable endpoint + is resolved and injected into the agent's environment. + opensandbox_service_url: Optional OpenSandbox service URL used when resolving the + model endpoint.
+ extra_env: Optional environment variables injected verbatim into the sandbox. + stage_files: Optional ``{remote_path: content}`` files written into the live sandbox + before launch. + patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this + path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``. + agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``. + + Returns: + The extracted unified-diff patch as a string (empty if none was produced). + """ + result = await provision_and_collect( + task, + provider=provider, + agent_launch_command=agent_launch_command, + model_server=model_server, + opensandbox_service_url=opensandbox_service_url, + extra_env=extra_env, + stage_files=stage_files, + patch_output_glob=patch_output_glob, + agent_timeout_s=agent_timeout_s, + ) + return result["patch"] + + +async def run_self_driving( + task: SweTask, + *, + provider: Mapping[str, Any] | SandboxProvider, + agent_launch_command: str, + model_server: Mapping[str, Any] | None = None, + opensandbox_service_url: str | None = None, + extra_env: Mapping[str, str] | None = None, + stage_files: Mapping[str, str] | None = None, + patch_output_glob: str | None = None, + agent_timeout_s: int | float = 1800, +) -> dict[str, Any]: + """Provision, self-drive, extract the patch, then grade it in-process in a fresh sandbox. + + Bundles provisioning and verification for standalone use and tests. The patch is graded by + ``verify_task`` in its OWN fresh sandbox (so grading is hermetic — never the agent's dirtied + tree). ``verify_task`` is imported lazily to avoid a circular import between this library and + the verifier module. + + Args: + task: The SWE task describing the instance, image, and working directory. + provider: The sandbox provider (mapping keyed by name, or a ``SandboxProvider``). + agent_launch_command: The shell command that runs the agent inside the sandbox. 
+ model_server: Optional model-server config; when given, a sandbox-reachable endpoint + is resolved and injected into the agent's environment. + opensandbox_service_url: Optional OpenSandbox service URL used when resolving the + model endpoint. + extra_env: Optional environment variables injected verbatim into the sandbox. + stage_files: Optional ``{remote_path: content}`` files written into the live sandbox + before launch. + patch_output_glob: When given, the patch is read from an ``output.jsonl`` under this + path; otherwise it comes from ``git diff --cached`` on ``repo_workdir``. + agent_timeout_s: Timeout in seconds for the agent run. Defaults to ``1800``. + + Returns: + A dict with the instance id, model patch, resolution status, reward, whether a patch + exists, whether the sample is masked, and the verifier's error kind. + """ + from resources_servers.swe_bench.verify_task import verify_task + + patch = await provision_and_extract_patch( + task, + provider=provider, + agent_launch_command=agent_launch_command, + model_server=model_server, + opensandbox_service_url=opensandbox_service_url, + extra_env=extra_env, + stage_files=stage_files, + patch_output_glob=patch_output_glob, + agent_timeout_s=agent_timeout_s, + ) + # Score the patch in the verifier's OWN fresh sandbox (decoupled, hermetic verification). + report = await verify_task(provider, dataclasses.replace(task, model_patch=patch)) + masked = report.error_kind is not None + return { + "instance_id": task.instance_id, + "model_patch": patch, + "resolved": report.resolved, + "reward": reward_from_report(report), + "patch_exists": bool(patch.strip()), + "mask_sample": masked, + "error_kind": report.error_kind, + } + + +# --- in-sandbox model-server egress -------------- + + +class ModelEgressUnavailable(RuntimeError): + """Raised when no sandbox-reachable model endpoint can be resolved for a provider.""" + + +@dataclass(frozen=True) +class ModelEndpoint: + """A sandbox-reachable model-server endpoint. 
+ + Attributes: + base_url: The base URL the in-sandbox agent uses to reach the model server. + api_key: Optional API key for authenticating to the model server. + model: Optional model name to use. + """ + + base_url: str + api_key: str = "" + model: str = "" + + def to_sandbox_env(self) -> dict[str, str]: + """Build the minimal set of environment variables to inject into the sandbox. + + Returns: + dict[str, str]: Environment variables carrying the base URL and, + when set, the API key and model name. The global config dict is + never included. + """ + env = {"OPENAI_BASE_URL": self.base_url, "NEMO_GYM_MODEL_BASE_URL": self.base_url} + if self.api_key: + env["OPENAI_API_KEY"] = self.api_key + if self.model: + env["NEMO_GYM_MODEL"] = self.model + return env + + +def resolve( + provider_name: str, + model_server: Mapping[str, Any], + *, + host_loopback_url: str = "http://127.0.0.1:8000/v1", + opensandbox_service_url: str | None = None, +) -> ModelEndpoint: + """Resolve a sandbox-reachable model endpoint for a sandbox provider. + + Args: + provider_name: The sandbox provider name (e.g. ``"apptainer"``, + ``"opensandbox"``, ``"docker"``). + model_server: Mapping describing the model server, read for the + ``api_key``, ``model``, and ``base_url`` keys. + host_loopback_url: Fallback URL used when the provider shares the host + network namespace and no base URL is configured. + opensandbox_service_url: Cluster-reachable Service/ingress URL used for + the opensandbox provider when no other base URL is configured. + + Returns: + ModelEndpoint: The resolved endpoint carrying the base URL, API key, + and model name. + + Raises: + ModelEgressUnavailable: If the opensandbox provider cannot resolve a + cluster-reachable model-server URL (e.g. only loopback is available). 
+ """ + api_key = str(model_server.get("api_key", "") or "") + model = str(model_server.get("model", "") or "") + configured_base = str(model_server.get("base_url", "") or "") + + if provider_name == "opensandbox": + base_url = opensandbox_service_url or configured_base + if not base_url or "127.0.0.1" in base_url or "localhost" in base_url: + raise ModelEgressUnavailable( + "opensandbox needs a cluster-reachable model-server URL (k8s Service/ingress); " + "loopback is unreachable from the pod. Configure 'opensandbox_service_url', or " + "run the agent with the docker provider instead." + ) + else: + # docker / local: shares host network by default (host loopback reachable). + base_url = configured_base or host_loopback_url + + return ModelEndpoint(base_url=base_url, api_key=api_key, model=model) diff --git a/resources_servers/swe_bench/session.py b/resources_servers/swe_bench/session.py new file mode 100644 index 0000000000..ca3112bc57 --- /dev/null +++ b/resources_servers/swe_bench/session.py @@ -0,0 +1,71 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""SessionDescriptor — Environment response after accepting a Task. + +The descriptor is **episode context**, not the Task itself: placement topology, +sandbox spec, egress hints, and a round-trip verifier payload for ``/verify``. 
+""" + +from __future__ import annotations + +from typing import Any, Literal, Optional + +from pydantic import BaseModel, ConfigDict, Field + +from nemo_gym.base_resources_server import ( + BaseSeedSessionRequest, + BaseSeedSessionResponse, + BaseVerifyRequest, + BaseVerifyResponse, +) +from resources_servers.swe_bench.task import ENVIRONMENT_NAME, TaskPublic + + +Topology = Literal["none", "env_sandboxed", "agent_in_env", "whole_interaction"] + + +class PlacementDescriptor(BaseModel): + topology: Topology + + +class SandboxDescriptor(BaseModel): + spec: dict[str, Any] + + +class EgressDescriptor(BaseModel): + env: dict[str, str] = Field(default_factory=dict) + + +class SessionDescriptor(BaseSeedSessionResponse): + """Environment-owned episode context returned from ``seed_session``.""" + + environment: str = ENVIRONMENT_NAME + task: TaskPublic + placement: PlacementDescriptor + sandbox: SandboxDescriptor + egress: EgressDescriptor + verifier_metadata: dict[str, Any] + + +class SweBenchSeedSessionRequest(BaseSeedSessionRequest): + model_config = ConfigDict(extra="allow") + verifier_metadata: Optional[dict[str, Any]] = None + + +class SweBenchVerifyRequest(BaseVerifyRequest): + model_config = ConfigDict(extra="allow") + verifier_metadata: Optional[dict[str, Any]] = None + + +class SweBenchVerifyResponse(BaseVerifyResponse): + model_config = ConfigDict(extra="allow") + task_id: str = "" + environment: str = ENVIRONMENT_NAME + resolved: bool = False + patch_exists: bool = False + mask_sample: bool = False + error_kind: Optional[str] = None + + +SweBenchSeedSessionResponse = SessionDescriptor diff --git a/resources_servers/swe_bench/task.py b/resources_servers/swe_bench/task.py new file mode 100644 index 0000000000..53d86b506e --- /dev/null +++ b/resources_servers/swe_bench/task.py @@ -0,0 +1,256 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +"""First-class Task model for the ``swe_bench`` Environment. + +A **Task** (τ) is one problem instance from a benchmark's task distribution — not the +Environment (``swe_bench`` resources server) and not the published benchmark name alone +(e.g. *SWE-bench Verified*). + +Terminology: + +* ``task_id`` / ``instance_id`` — unique instance key (``django__django-13741``) +* ``dataset_name`` — published benchmark product (HuggingFace id) +* ``harness_family`` / ``benchmark`` — harness registry key inside this Environment + (``swe-bench``, ``r2e-gym``, …) +* ``problem_statement`` — initial observation (user message) for the agent +* ``metadata`` — privileged grading fields (``instance_dict``, etc.); Environment-only +""" + +from __future__ import annotations + +import json +from dataclasses import dataclass, field, replace +from typing import Any, Protocol + +from pydantic import BaseModel, ConfigDict + +from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming + + +ENVIRONMENT_NAME = "swe_bench" + +_HARNESS_FAMILY_ALIASES: list[tuple[str, str]] = [ + ("R2E-Gym", "r2e-gym"), + ("SWE-bench_Multilingual", "swe-bench-multilingual"), + ("SWE-bench", "swe-bench"), +] + + +class TaskRunBody(Protocol): + """Minimal run/seed/verify request shape carrying task fields.""" + + responses_create_params: NeMoGymResponseCreateParamsNonStreaming | None + verifier_metadata: dict[str, Any] | None + + +class TaskPublic(BaseModel): + """Agent-visible task identity returned from ``seed_session``.""" + + model_config = ConfigDict(extra="forbid") + + task_id: str + environment: str = ENVIRONMENT_NAME + dataset_name: str = "" + harness_family: str = "" + split: str = "test" + + +class TaskSubmission(BaseModel): + """Agent-produced artifact graded at ``verify`` (Environment-owned scoring).""" + + model_config = ConfigDict(extra="forbid") + + model_patch: str = "" + + +@dataclass +class SweTask: + """One SWE Environment task instance — 
provisioning + grading input. + + This is the Environment-internal task value. Harnesses consume ``SweTask``; + HTTP callers supply dataset rows that parse into this type. + """ + + instance_id: str + image: str | None = None + base_commit: str | None = None + repo_workdir: str = "/testbed" + test_command: str = "" + test_framework: str = "" + model_patch: str = "" + test_patch: str = "" + fail_to_pass: list[str] = field(default_factory=list) + pass_to_pass: list[str] = field(default_factory=list) + benchmark: str = "swe-bench-ext" + split: str = "test" + dataset_name: str = "" + problem_statement: str = "" + metadata: dict[str, Any] = field(default_factory=dict) + + @property + def task_id(self) -> str: + return self.instance_id + + @property + def harness_family(self) -> str: + return self.benchmark + + def public_view(self, *, environment: str = ENVIRONMENT_NAME) -> TaskPublic: + """Return the agent-visible task identity (no privileged metadata).""" + return TaskPublic( + task_id=self.task_id, + environment=environment, + dataset_name=self.dataset_name, + harness_family=self.harness_family, + split=self.split, + ) + + def privileged_verifier_metadata(self, *, flat_eval: bool) -> dict[str, Any]: + """Privileged fields the Environment needs on verify (not for agent logic).""" + return { + "instance_id": self.instance_id, + "dataset_name": self.dataset_name, + "split": self.split, + "benchmark": self.benchmark, + "harness_family": self.harness_family, + "problem_statement": self.problem_statement, + "flat_eval": flat_eval, + "instance_dict": self.metadata.get("instance_dict"), + } + + def with_submission(self, submission: TaskSubmission | None) -> SweTask: + """Return a copy with the agent's graded submission applied.""" + patch = (submission.model_patch if submission else "") or "" + return replace(self, model_patch=patch) + + +def harness_family_key(dataset_name: str) -> str: + """Map a HuggingFace dataset name to a harness registry key.""" + for needle, key in 
_HARNESS_FAMILY_ALIASES: + if needle in dataset_name: + return key + return "swe-bench" + + +def instance_image(container_formatter: Any, instance_id: str) -> str: + fmt = container_formatter[0] if isinstance(container_formatter, list) else container_formatter + fmt = fmt or "swebench/sweb.eval.x86_64.{instance_id}" + if fmt.endswith(".sif") or fmt.startswith(("/", ".")): + return fmt.format(instance_id=instance_id) + if fmt.startswith("docker://"): + fmt = fmt[len("docker://") :] + tag = instance_id.replace("__", "_1776_").lower() + image = fmt.format(instance_id=tag) + if ":" not in image.rsplit("/", 1)[-1]: + image += ":latest" + return image + + +def _as_list(value: Any) -> list[str]: + if isinstance(value, str): + try: + return list(json.loads(value)) + except (json.JSONDecodeError, TypeError): + return [value] if value else [] + return list(value or []) + + +def merge_row_metadata( + verifier_metadata: dict[str, Any] | None, + responses_metadata: dict[str, Any] | None, +) -> dict[str, Any]: + """Merge dataset row fields from verifier and responses metadata.""" + return _merge_row_metadata(verifier_metadata, responses_metadata) + + +def _merge_row_metadata( + verifier_metadata: dict[str, Any] | None, + responses_metadata: dict[str, Any] | None, +) -> dict[str, Any]: + info: dict[str, Any] = {} + if responses_metadata: + info.update(responses_metadata) + if verifier_metadata: + info.update(verifier_metadata) + return info + + +def _initial_observation(row: dict[str, Any], responses_metadata: dict[str, Any] | None) -> str: + if row.get("problem_statement"): + return str(row["problem_statement"]) + params = row.get("responses_create_params") + if isinstance(params, dict): + raw_input = params.get("input") + elif responses_metadata is not None: + raw_input = None + else: + raw_input = None + if raw_input is None and hasattr(row.get("responses_create_params"), "input"): + raw_input = row["responses_create_params"].input # type: ignore[union-attr] + if 
isinstance(raw_input, str): + return raw_input + if isinstance(raw_input, list) and raw_input: + first = raw_input[0] + if isinstance(first, dict): + return str(first.get("content", "")) + return "" + + +def build_task( + row: dict[str, Any], + *, + container_formatter: str, + flat_eval: bool = True, + responses_metadata: dict[str, Any] | None = None, +) -> SweTask: + """Build a ``SweTask`` from merged dataset / verifier metadata.""" + inst_raw = row.get("instance_dict") + inst = json.loads(inst_raw) if isinstance(inst_raw, str) else dict(inst_raw or {}) + dataset_name = str(row.get("dataset_name", "")) + instance_id = row["instance_id"] + image = instance_image(row.get("container_formatter") or container_formatter, instance_id) + + return SweTask( + instance_id=instance_id, + image=image, + base_commit=inst.get("base_commit"), + repo_workdir="/testbed", + test_patch=inst.get("test_patch", ""), + fail_to_pass=_as_list(inst.get("FAIL_TO_PASS") or inst.get("fail_to_pass")), + pass_to_pass=_as_list(inst.get("PASS_TO_PASS") or inst.get("pass_to_pass")), + benchmark=harness_family_key(dataset_name), + split=str(row.get("split", "test")), + dataset_name=dataset_name, + problem_statement=_initial_observation(row, responses_metadata), + metadata={"instance_dict": inst, "flat_eval": flat_eval, "dataset_name": dataset_name}, + ) + + +def parse_task_from_request( + body: TaskRunBody, + *, + container_formatter: str, + flat_eval: bool = True, + environment: str = ENVIRONMENT_NAME, +) -> SweTask: + """Parse a first-class Task from an agent ``/run`` or Environment HTTP body.""" + responses_metadata = (body.responses_create_params.metadata or {}) if body.responses_create_params else {} + row = merge_row_metadata(body.verifier_metadata, responses_metadata) + if "instance_id" not in row: + raise ValueError( + "Task requires verifier_metadata.instance_id (or responses_create_params.metadata.instance_id)" + ) + return build_task( + row, + container_formatter=container_formatter, + 
flat_eval=flat_eval, + responses_metadata=responses_metadata, + ) + + +def parse_submission(verifier_metadata: dict[str, Any] | None) -> TaskSubmission: + """Extract the agent submission from verify request metadata.""" + meta = dict(verifier_metadata or {}) + patch = meta.get("model_patch") or meta.get("git_patch") or "" + return TaskSubmission(model_patch=patch if isinstance(patch, str) else str(patch)) diff --git a/resources_servers/swe_bench/task_builder.py b/resources_servers/swe_bench/task_builder.py new file mode 100644 index 0000000000..3c4df6d181 --- /dev/null +++ b/resources_servers/swe_bench/task_builder.py @@ -0,0 +1,20 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Backward-compatible re-exports — prefer ``resources_servers.swe_bench.task``.""" + +from resources_servers.swe_bench.task import ( + SweTask, +) +from resources_servers.swe_bench.task import ( + build_task as build_swetask, +) +from resources_servers.swe_bench.task import ( + harness_family_key as benchmark_key, +) +from resources_servers.swe_bench.task import ( + merge_row_metadata as problem_info_from_row, +) + + +__all__ = ["SweTask", "benchmark_key", "build_swetask", "problem_info_from_row"] diff --git a/resources_servers/swe_bench/tests/__init__.py b/resources_servers/swe_bench/tests/__init__.py new file mode 100644 index 0000000000..777f2341ac --- /dev/null +++ b/resources_servers/swe_bench/tests/__init__.py @@ -0,0 +1 @@ +"""Test suite for the swe_env agent harness.""" diff --git a/resources_servers/swe_bench/tests/conftest.py b/resources_servers/swe_bench/tests/conftest.py new file mode 100644 index 0000000000..5bc774e7c4 --- /dev/null +++ b/resources_servers/swe_bench/tests/conftest.py @@ -0,0 +1,26 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Pytest collection guard for the swe_env tests. + +The flat-eval parser fixtures are recorded eval logs whose lines begin with the +SWE-bench ``>>>>>`` sentinels. Under doctest collection those look like +(malformed) ``>>>`` prompts, so the fixtures directory is excluded from +collection entirely. It holds only data, never tests. +""" + +from __future__ import annotations + + +collect_ignore_glob = ["fixtures/*"] diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt new file mode 100644 index 0000000000..bb67958525 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/apply_patch_failed.txt @@ -0,0 +1,9 @@ ++ cd /testbed ++ git apply -v /tmp/patch.diff +Checking patch sphinx/ext/autodoc/__init__.py... 
+error: while searching for: + def format_signature(self): +error: patch failed: sphinx/ext/autodoc/__init__.py:120 +error: sphinx/ext/autodoc/__init__.py: patch does not apply +>>>>> Patch Apply Failed ++ git checkout abc123 tests/test_ext_autodoc.py diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt new file mode 100644 index 0000000000..bc8d678e61 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/fallback_outside_markers.txt @@ -0,0 +1,14 @@ ++ cd /testbed ++ git apply -v /tmp/patch.diff +Applied patch sphinx/ext/autodoc/__init__.py cleanly. +>>>>> Applied Patch ++ git apply -v /tmp/test_patch.diff +Applied patch tests/test_ext_autodoc.py cleanly. +>>>>> Start Test Output +============================= test session starts ============================== +collected 3 items +>>>>> End Test Output +PASSED tests/test_ext_autodoc.py::test_format_signature +PASSED tests/test_ext_autodoc.py::test_autodoc_inherited +PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members +=================== 3 passed in 1.92s ========================================= diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt new file mode 100644 index 0000000000..c4f0e56654 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/no_markers.txt @@ -0,0 +1,11 @@ ++ cd /testbed ++ git apply -v /tmp/patch.diff +Applied patch sphinx/ext/autodoc/__init__.py cleanly. 
+>>>>> Applied Patch ++ git checkout abc123 tests/test_ext_autodoc.py +Updated 1 path from the index ++ git apply -v /tmp/test_patch.diff +error: patch failed: tests/test_ext_autodoc.py:1 +error: tests/test_ext_autodoc.py: patch does not apply ++ python -m pytest tests/test_ext_autodoc.py +ERROR: file or directory not found: tests/test_ext_autodoc.py diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt new file mode 100644 index 0000000000..1d0ba6a53a --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/resolved_success.txt @@ -0,0 +1,25 @@ ++ source /opt/miniconda3/bin/activate ++ conda activate testbed ++ git config --global --add safe.directory /testbed ++ cd /testbed ++ git status ++ git restore . ++ git apply -v /tmp/patch.diff +Checking patch sphinx/ext/autodoc/__init__.py... +Applied patch sphinx/ext/autodoc/__init__.py cleanly. +>>>>> Applied Patch ++ git checkout abc123 tests/test_ext_autodoc.py +Updated 1 path from the index ++ git apply -v /tmp/test_patch.diff +Checking patch tests/test_ext_autodoc.py... +Applied patch tests/test_ext_autodoc.py cleanly. 
+>>>>> Start Test Output +============================= test session starts ============================== +PASSED tests/test_ext_autodoc.py::test_format_signature +PASSED tests/test_ext_autodoc.py::test_autodoc_inherited +PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members +SKIPPED tests/test_ext_autodoc.py::test_optional_feature +=================== 3 passed, 1 skipped in 2.41s =============================== +>>>>> End Test Output ++ git checkout abc123 tests/test_ext_autodoc.py +Updated 1 path from the index diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt new file mode 100644 index 0000000000..0a27e668e1 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/tests_timeout.txt @@ -0,0 +1,10 @@ ++ cd /testbed ++ git apply -v /tmp/patch.diff +Applied patch sphinx/ext/autodoc/__init__.py cleanly. +>>>>> Applied Patch ++ git apply -v /tmp/test_patch.diff +Applied patch tests/test_ext_autodoc.py cleanly. +>>>>> Start Test Output +============================= test session starts ============================== +PASSED tests/test_ext_autodoc.py::test_autodoc_inherited +>>>>> Tests Timed Out diff --git a/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt new file mode 100644 index 0000000000..59dc10159f --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/flat_eval/unresolved_failure.txt @@ -0,0 +1,16 @@ ++ cd /testbed ++ git apply -v /tmp/patch.diff +Checking patch sphinx/ext/autodoc/__init__.py... +Applied patch sphinx/ext/autodoc/__init__.py cleanly. +>>>>> Applied Patch ++ git apply -v /tmp/test_patch.diff +Checking patch tests/test_ext_autodoc.py... +Applied patch tests/test_ext_autodoc.py cleanly. 
+>>>>> Start Test Output +============================= test session starts ============================== +FAILED tests/test_ext_autodoc.py::test_format_signature - AssertionError: signature mismatch +PASSED tests/test_ext_autodoc.py::test_autodoc_inherited +PASSED tests/test_ext_autodoc.py::test_autodoc_exclude_members +=================== 2 passed, 1 failed in 2.10s ================================ +>>>>> End Test Output ++ git checkout abc123 tests/test_ext_autodoc.py diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt new file mode 100644 index 0000000000..5f1200be91 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/go_json.txt @@ -0,0 +1,6 @@ +{"Time":"2026-06-23T00:00:00Z","Action":"run","Package":"github.com/acme/widget","Test":"TestAlpha"} +{"Time":"2026-06-23T00:00:00Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestAlpha","Elapsed":0.01} +{"Time":"2026-06-23T00:00:01Z","Action":"run","Package":"github.com/acme/widget","Test":"TestBeta"} +{"Time":"2026-06-23T00:00:01Z","Action":"pass","Package":"github.com/acme/widget","Test":"TestBeta","Elapsed":0.02} +{"Time":"2026-06-23T00:00:02Z","Action":"run","Package":"github.com/acme/widget","Test":"TestGamma"} +{"Time":"2026-06-23T00:00:02Z","Action":"fail","Package":"github.com/acme/widget","Test":"TestGamma","Elapsed":0.01} diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml new file mode 100644 index 0000000000..028b436db3 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_junit.xml @@ -0,0 +1,10 @@ + + + + + + + boom + + + diff --git a/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt new file mode 100644 index 
0000000000..d566714983 --- /dev/null +++ b/resources_servers/swe_bench/tests/fixtures/swe_bench_ext/pytest_text_fuzzy.txt @@ -0,0 +1,15 @@ +============================= test session starts ============================== +platform linux -- Python 3.12.0, pytest-8.0.0 +collected 3 items + +src/pkg/tests/test_widget.py::test_alpha PASSED [ 33%] +src/pkg/tests/test_widget.py::test_beta PASSED [ 66%] +src/pkg/tests/test_widget.py::test_gamma FAILED [100%] + +=================================== FAILURES =================================== +________________________________ test_gamma ___________________________________ + assert 1 == 2 +E assert 1 == 2 +=========================== short test summary info ============================ +FAILED src/pkg/tests/test_widget.py::test_gamma - assert 1 == 2 +========================= 2 passed, 1 failed in 0.12s ========================== diff --git a/resources_servers/swe_bench/tests/test_app.py b/resources_servers/swe_bench/tests/test_app.py new file mode 100644 index 0000000000..3e50958ab7 --- /dev/null +++ b/resources_servers/swe_bench/tests/test_app.py @@ -0,0 +1,95 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +from __future__ import annotations + +import json +from unittest.mock import MagicMock + +import pytest + +import resources_servers.swe_bench.tests.test_swe_env # noqa: F401 — registers fake-swe provider +from nemo_gym.openai_utils import NeMoGymResponse, NeMoGymResponseCreateParamsNonStreaming +from nemo_gym.server_utils import ServerClient +from resources_servers.swe_bench.app import ( + SweBenchResourcesServer, + SweBenchResourcesServerConfig, + SweBenchSeedSessionRequest, + SweBenchVerifyRequest, +) + + +@pytest.fixture +def server() -> SweBenchResourcesServer: + return SweBenchResourcesServer( + config=SweBenchResourcesServerConfig( + host="127.0.0.1", + port=12346, + entrypoint="app.py", + name="swe_bench", + sandbox_provider={"fake-swe": {}}, + ), + server_client=MagicMock(spec=ServerClient), + ) + + +def _sample_row() -> dict: + inst = { + "instance_id": "astropy__astropy-12907", + "base_commit": "abc123", + "test_patch": "", + "FAIL_TO_PASS": '["tests/test_x.py::a"]', + "PASS_TO_PASS": '["tests/test_x.py::b"]', + } + meta = { + "instance_id": "astropy__astropy-12907", + "dataset_name": "princeton-nlp/SWE-bench_Verified", + "split": "test", + "problem_statement": "Fix the bug.", + "instance_dict": json.dumps(inst), + } + return { + "responses_create_params": NeMoGymResponseCreateParamsNonStreaming( + input=[{"role": "user", "content": "Fix the bug."}], + metadata=meta, + ), + "verifier_metadata": meta, + } + + +@pytest.mark.asyncio +async def test_seed_session_agent_in_env(server: SweBenchResourcesServer) -> None: + body = SweBenchSeedSessionRequest(**_sample_row()) + resp = await server.seed_session(body) + assert resp.environment == "swe_bench" + assert resp.placement.topology == "agent_in_env" + assert resp.sandbox.spec["image"].startswith("swebench/") + assert resp.task.task_id == "astropy__astropy-12907" + assert resp.task.harness_family == "swe-bench" + assert resp.task.dataset_name == 
"princeton-nlp/SWE-bench_Verified" + assert resp.verifier_metadata["instance_id"] == "astropy__astropy-12907" + + +@pytest.mark.asyncio +async def test_verify_empty_patch(server: SweBenchResourcesServer) -> None: + row = _sample_row() + row["verifier_metadata"] = {**row["verifier_metadata"], "model_patch": ""} + body = SweBenchVerifyRequest( + **row, + response=NeMoGymResponse( + id="r1", + created_at=0, + model="m", + object="response", + output=[], + parallel_tool_calls=False, + tool_choice="auto", + tools=[], + ), + ) + resp = await server.verify(body) + assert resp.task_id == "astropy__astropy-12907" + assert resp.environment == "swe_bench" + assert resp.reward == 0.0 + assert resp.patch_exists is False + assert resp.resolved is False diff --git a/resources_servers/swe_bench/tests/test_flat_eval.py b/resources_servers/swe_bench/tests/test_flat_eval.py new file mode 100644 index 0000000000..35a00fc580 --- /dev/null +++ b/resources_servers/swe_bench/tests/test_flat_eval.py @@ -0,0 +1,594 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the opt-in flat (host-graded) eval mode of the nested families. + +The suite has two layers: + +* Parser unit tests on recorded fixture logs cover the SWE-bench eval-script log + parser (``parse_eval_log``) on a success log, a failure log, the bad-code logs + (patch-apply-failed / timeout), a no-markers log, and the + output-outside-markers fallback. 
The fixtures use the + ``>>>>> Start/End Test Output`` shape the SWE-bench eval script emits. + +* Flat run_eval and grade via FakeSandbox drive the flat path of both nested + harnesses (``swe-bench``, ``r2e-gym``) end-to-end with a scripted provider that + returns a fixture log, asserting ``resolved`` is computed from ``FAIL_TO_PASS`` + / ``PASS_TO_PASS``. +""" + +from __future__ import annotations + +import asyncio +from pathlib import Path + +import pytest + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report +from resources_servers.swe_bench.harnesses import flat_eval +from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness +from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness + + +_FIXTURES = Path(__file__).parent / "fixtures" / "flat_eval" + + +def _fixture(name: str) -> str: + """Read a recorded fixture log by name. + + Fixtures are stored with a ``.txt`` suffix, so a caller may pass either the + ``.log`` stem name or the real ``.txt`` name. + + Args: + name: The fixture file name, with either a ``.log`` or ``.txt`` suffix. + + Returns: + The fixture file contents as text. 
+ """ + path = _FIXTURES / name + if not path.exists() and path.suffix == ".log": + path = path.with_suffix(".txt") + return path.read_text() + + +# ---- parser: recorded fixture logs (CI) ------------------------------------- + + +def test_parse_success_log_all_pass(): + """A success log parses to a status map with the expected passed and skipped tests.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("resolved_success.log")) + assert applied is True + assert status_map == { + "tests/test_ext_autodoc.py::test_format_signature": "PASSED", + "tests/test_ext_autodoc.py::test_autodoc_inherited": "PASSED", + "tests/test_ext_autodoc.py::test_autodoc_exclude_members": "PASSED", + "tests/test_ext_autodoc.py::test_optional_feature": "SKIPPED", + } + assert sorted(flat_eval.passed_tests(status_map)) == [ + "tests/test_ext_autodoc.py::test_autodoc_exclude_members", + "tests/test_ext_autodoc.py::test_autodoc_inherited", + "tests/test_ext_autodoc.py::test_format_signature", + ] + + +def test_parse_failure_log_strips_failed_reason(): + """A failure log parses with the failure reason stripped down to the node id.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("unresolved_failure.log")) + assert applied is True + # The "FAILED - " line keeps only the node id. 
+ assert status_map["tests/test_ext_autodoc.py::test_format_signature"] == "FAILED" + assert "tests/test_ext_autodoc.py::test_autodoc_inherited" in flat_eval.passed_tests(status_map) + + +def test_parse_apply_patch_failed_is_untrusted(): + """A patch-apply-failed log yields an empty status map and patch_applied False.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("apply_patch_failed.log")) + assert status_map == {} + assert applied is False + + +def test_parse_timeout_is_untrusted(): + """A timeout log yields an empty status map and patch_applied False.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("tests_timeout.log")) + assert status_map == {} + assert applied is False + + +def test_parse_no_markers_is_untrusted(): + """A log with no test-output markers yields an empty status map and patch_applied False.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("no_markers.log")) + assert status_map == {} + assert applied is False + + +def test_parse_fallback_outside_markers(): + """Per-test lines appearing after the End marker are recovered by the whole-log fallback.""" + status_map, applied = flat_eval.parse_eval_log(_fixture("fallback_outside_markers.log")) + assert applied is True + assert len(flat_eval.passed_tests(status_map)) == 3 + + +def test_parse_duplicate_node_last_status_wins(): + """For a duplicated node id the last reported status wins. + + A node first reported FAILED then re-reported PASSED (e.g. via a rerun plugin) + ends up PASSED, and vice versa. + """ + log = "\n".join( + [ + flat_eval.APPLY_PATCH_PASS, + flat_eval.START_TEST_OUTPUT, + "FAILED tests/test_x.py::test_flaky", + "PASSED tests/test_x.py::test_flaky", + "PASSED tests/test_x.py::test_regressed", + "FAILED tests/test_x.py::test_regressed", + flat_eval.END_TEST_OUTPUT, + ] + ) + status_map, applied = flat_eval.parse_eval_log(log) + assert applied is True + # Last line wins for each node, not the first. 
+ assert status_map["tests/test_x.py::test_flaky"] == "PASSED" + assert status_map["tests/test_x.py::test_regressed"] == "FAILED" + assert flat_eval.passed_tests(status_map) == ["tests/test_x.py::test_flaky"] + + +def test_parse_xfail_counts_as_pass(): + """An XFAIL node counts as a passed test.""" + log = "\n".join( + [ + flat_eval.APPLY_PATCH_PASS, + flat_eval.START_TEST_OUTPUT, + "XFAIL tests/test_x.py::test_known_bug", + "PASSED tests/test_x.py::test_ok", + flat_eval.END_TEST_OUTPUT, + ] + ) + status_map, applied = flat_eval.parse_eval_log(log) + assert applied is True + assert set(flat_eval.passed_tests(status_map)) == { + "tests/test_x.py::test_known_bug", + "tests/test_x.py::test_ok", + } + + +# ---- flat_grade over parsed fixtures (CI) ----------------------------------- + + +def _task(benchmark: str = "swe-bench", **overrides) -> SweTask: + """Build a SweTask with sensible defaults, overridable per keyword. + + Args: + benchmark: The benchmark name for the task. + **overrides: Field overrides merged onto the default task fields. + + Returns: + A SweTask configured for the given benchmark. + """ + base = dict( + instance_id="repo__inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + model_patch="diff --git a/x b/x\n", + fail_to_pass=["tests/test_ext_autodoc.py::test_format_signature"], + pass_to_pass=["tests/test_ext_autodoc.py::test_autodoc_inherited"], + benchmark=benchmark, + ) + base.update(overrides) + return SweTask(**base) + + +def _flat_artifacts(log: str) -> EvalArtifacts: + """Wrap an eval log in flat-eval EvalArtifacts. + + Args: + log: The eval-script log text. + + Returns: + EvalArtifacts carrying the log with a clean (non-error) flat raw payload. 
+ """ + return EvalArtifacts(test_output=log, return_code=0, patch_applied=True, raw={"error_type": None, "flat": True}) + + +def test_flat_grade_resolved_on_success(): + """Flat grading resolves a success log with reward 1.0.""" + report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("resolved_success.log"))) + assert report.resolved is True + assert report.patch_applied is True + assert report.patch_exists is True + assert reward_from_report(report) == 1.0 + + +def test_flat_grade_unresolved_on_failure(): + """Flat grading leaves a failure log unresolved with reward 0.0.""" + report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("unresolved_failure.log"))) + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_flat_grade_unresolved_on_apply_failed(): + """A failed patch apply grades as a legitimate unresolved, not an infra mask.""" + report = flat_eval.flat_grade(_task(), _flat_artifacts(_fixture("apply_patch_failed.log"))) + assert report.resolved is False + assert report.patch_applied is False + assert report.error_kind is None + assert reward_from_report(report) == 0.0 + + +# ---- consistency of flat grading -------------------------------------------- +# +# Flat grading takes ``resolved`` straight from the parser's verdict (all F2P + +# all P2P passed) and never re-gates it on ``patch_applied``. The parser's +# ``log_patch_applied`` flag never changes ``resolved`` relative to a pure +# ``compute_resolved`` verdict: whenever ``parse_eval_log`` reports +# ``patch_applied=False`` it also returns an empty status map, so +# ``compute_resolved`` already yields False. These tests lock in that invariant +# so a future edit cannot reintroduce a divergent gate. 
+ + +@pytest.mark.parametrize( + "fixture_name", + [ + "resolved_success.log", + "unresolved_failure.log", + "apply_patch_failed.log", + "tests_timeout.log", + "no_markers.log", + "fallback_outside_markers.log", + ], +) +def test_flat_grade_resolved_matches_ungated_compute_resolved(fixture_name): + """``flat_grade``'s resolved verdict agrees with a bare ``compute_resolved`` over the parsed passed-set. + + The patch-applied gate is redundant and never flips the verdict True<->False. + + Args: + fixture_name: The recorded fixture log to parse and grade. + """ + from resources_servers.swe_bench.harness import compute_resolved + + task = _task() + log = _fixture(fixture_name) + status_map, _applied = flat_eval.parse_eval_log(log) + ungated = compute_resolved( + fail_to_pass=task.fail_to_pass, + pass_to_pass=task.pass_to_pass, + passed=flat_eval.passed_tests(status_map), + ) + report = flat_eval.flat_grade(task, _flat_artifacts(log)) + assert report.resolved is ungated + + +@pytest.mark.parametrize( + "bad_code_attr", + ["APPLY_PATCH_FAIL", "RESET_FAILED", "TESTS_ERROR", "TESTS_TIMEOUT"], +) +def test_parse_eval_log_bad_code_empties_status_map_even_with_status_lines(bad_code_attr): + """A bad code forces an empty status map and patch_applied False even with per-test status lines. + + This is what makes the flat_grade patch-applied gate redundant: no path yields + patch_applied=False together with a non-empty status map. + + Args: + bad_code_attr: Name of the bad-code marker attribute on ``flat_eval``. 
+ """ + bad_code = getattr(flat_eval, bad_code_attr) + log = "\n".join( + [ + bad_code, + flat_eval.START_TEST_OUTPUT, + "PASSED tests/test_ext_autodoc.py::test_format_signature", + "PASSED tests/test_ext_autodoc.py::test_autodoc_inherited", + flat_eval.END_TEST_OUTPUT, + ] + ) + status_map, applied = flat_eval.parse_eval_log(log) + assert applied is False + assert status_map == {} + # And it grades as a legitimate unresolved (not an infra mask): error_kind + # stays None, resolved False -> reward 0.0, matching the flat families. + report = flat_eval.flat_grade(_task(), _flat_artifacts(log)) + assert report.resolved is False + assert report.error_kind is None + assert reward_from_report(report) == 0.0 + + +def test_flat_grade_resolved_does_not_gate_on_artifact_patch_applied(): + """Flat ``resolved`` is the parser's verdict only and ignores the artifact's patch_applied flag. + + Even if the EvalArtifacts carries patch_applied False (e.g. the model patch + did not cleanly apply), a passing eval log still resolves, since grading is + based on the tests rather than the apply status. + """ + artifacts = EvalArtifacts( + test_output=_fixture("resolved_success.log"), + return_code=0, + patch_applied=False, + raw={"error_type": None, "flat": True}, + ) + report = flat_eval.flat_grade(_task(), artifacts) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_flat_grade_neutral_skipped_required_test_is_not_a_failure(): + """A required test reported SKIPPED is neutral (excluded), not a failure. + + This mirrors swebench's ``get_eval_tests_report`` + ``get_resolution_status``: a + required test counts as a failure only when absent or FAILED/ERROR. A neutral + status (SKIPPED/XPASS) is excluded from both the success and failure tallies, so a + run whose only "non-pass" required test is SKIPPED still resolves. 
A bare + ``passed``-set membership check (the prior behavior) would have treated the + SKIPPED test as a failure and wrongly graded it unresolved. + """ + log = "\n".join( + [ + flat_eval.APPLY_PATCH_PASS, + flat_eval.START_TEST_OUTPUT, + "PASSED tests/test_ext_autodoc.py::test_format_signature", + "SKIPPED tests/test_ext_autodoc.py::test_autodoc_inherited", + flat_eval.END_TEST_OUTPUT, + ] + ) + report = flat_eval.flat_grade(_task(), _flat_artifacts(log)) + # F2P passed; the SKIPPED P2P is neutral (excluded) -> zero failures -> resolved. + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_flat_grade_absent_required_test_is_a_failure(): + """A required test absent from the status map is a failure (not neutral). + + Per swebench's ``test_failed`` (``case not in sm``), an absent required test counts + as a failure, so the run must grade unresolved. + """ + log = "\n".join( + [ + flat_eval.APPLY_PATCH_PASS, + flat_eval.START_TEST_OUTPUT, + "PASSED tests/test_ext_autodoc.py::test_format_signature", + flat_eval.END_TEST_OUTPUT, + ] + ) + # P2P (test_autodoc_inherited) is absent from the log -> failure -> unresolved. + report = flat_eval.flat_grade(_task(), _flat_artifacts(log)) + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_flat_grade_masks_infra_error(): + """Flat grading masks an infra timeout to reward 0.0 with a timeout error kind.""" + artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout", "flat": True}) + report = flat_eval.flat_grade(_task(), artifacts) + assert report.error_kind == "timeout" + assert reward_from_report(report) == 0.0 + + +def test_flat_grade_unbuildable_eval_script_is_unmasked_unresolved(): + """An unbuildable / missing eval script grades UNMASKED unresolved (reward 0), not eval_error. 
+ + Per main, only genuine sandbox/timeout infra failures are masked; an empty/unbuildable eval + spec produces no test markers and so grades as a legitimate unresolved (``error_kind`` None). + """ + artifacts = EvalArtifacts(test_output="", return_code=1, raw={"error_type": "eval_error", "flat": True}) + report = flat_eval.flat_grade(_task(), artifacts) + assert report.error_kind is None + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +# ---- gating (CI) ------------------------------------------------------------ + + +def test_flat_eval_enabled_harness_flag(): + """The harness-level flat-eval flag enables flat eval.""" + assert flat_eval.flat_eval_enabled(True, _task()) is True + + +def test_flat_eval_enabled_task_metadata(): + """Per-task ``flat_eval`` metadata enables flat eval.""" + assert flat_eval.flat_eval_enabled(False, _task(metadata={"flat_eval": True})) is True + + +def test_flat_eval_disabled_by_default(): + """Flat eval is disabled when neither the harness flag nor task metadata enables it.""" + assert flat_eval.flat_eval_enabled(False, _task()) is False + + +def test_swebench_supports_provider_gating(): + """The swe-bench harness is host-graded (flat), so it runs on any exec-capable provider.""" + harness = SweBenchHarness("swe-bench") + assert harness.supports_provider("docker") is True + assert harness.supports_provider("apptainer") is True + assert harness.supports_provider("opensandbox") is True + assert harness.grade_strategy == "flat-host-grade" + + +def test_r2egym_supports_provider_gating(): + """The r2e-gym harness is host-graded (flat), so it runs on any exec-capable provider.""" + harness = R2EGymHarness() + assert harness.supports_provider("docker") is True + assert harness.supports_provider("apptainer") is True + assert harness.supports_provider("opensandbox") is True + assert harness.grade_strategy == "flat-host-grade" + + +# ---- flat run_eval end-to-end via FakeSandbox (CI) -------------------------- + 
+ +class _FakeFlatProvider: + """Scripted provider: ``bash eval.sh ...`` streams a fixture log; ``cat`` echoes it.""" + + name = "fake-flat-eval" + + def __init__(self, *, log_text="", run_rc=0, error_type=None, stream_empty=False, **_): + """Configure the scripted flat-eval provider's responses. + + Args: + log_text: The eval-script log text returned by the run and ``cat``. + run_rc: Return code returned for the eval-script run. + error_type: Optional error type attached to the run result. + stream_empty: When True, the eval-script run streams empty stdout so + the harness falls back to reading the tee'd log file. + **_: Ignored extra keyword arguments. + """ + self._log_text = log_text + self._run_rc = run_rc + self._error_type = error_type + self._stream_empty = stream_empty + self.commands: list[str] = [] + self.uploaded: dict[str, str] = {} + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + self.commands.append(command) + if command.startswith("cat "): + return SandboxExecResult(stdout=self._log_text, stderr="", return_code=0) + # The eval script run. 
+ stdout = "" if self._stream_empty else self._log_text + return SandboxExecResult(stdout=stdout, stderr="", return_code=self._run_rc, error_type=self._error_type) + + async def upload_file(self, handle, local_path, remote_path): + try: + with open(local_path, encoding="utf-8") as fh: + self.uploaded[remote_path] = fh.read() + except OSError: + self.uploaded[remote_path] = "" + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-flat-eval", _FakeFlatProvider, override=True) + + +def _drive_flat(harness, task, *, log_text, run_rc=0, error_type=None, stream_empty=False): + """Drive materialize -> run_eval -> grade for a flat harness via the scripted provider. + + Args: + harness: The flat-capable harness under test. + task: The SweTask to evaluate. + log_text: The eval-script log text the provider returns. + run_rc: Return code returned for the eval-script run. + error_type: Optional error type attached to the run result. + stream_empty: When True, the run streams empty stdout so the harness falls + back to reading the tee'd log file. + + Returns: + A tuple of the graded report, the EvalArtifacts, and the provider instance. 
+ """ + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def _go(): + provider = { + "fake-flat-eval": { + "log_text": log_text, + "run_rc": run_rc, + "error_type": error_type, + "stream_empty": stream_empty, + } + } + env = await AsyncSweEnvironment.start(provider, harness.build_spec(task)) + try: + await harness.materialize(env, task) + artifacts = await harness.run_eval(env, task) + return harness.grade(task, artifacts), artifacts, env.sandbox._provider + finally: + await env.cleanup() + + return asyncio.run(_go()) + + +def test_swebench_flat_run_eval_resolved(): + """The swe-bench flat path resolves a success run and uploads the eval script.""" + harness = SweBenchHarness("swe-bench") + task = _task(metadata={"eval_script": "echo running", "flat_eval": True}) + report, artifacts, provider = _drive_flat(harness, task, log_text=_fixture("resolved_success.log")) + assert artifacts.raw["flat"] is True + assert report.resolved is True + assert reward_from_report(report) == 1.0 + # The eval script was uploaded into the sandbox. 
+ assert provider.uploaded.get(flat_eval.EVAL_SCRIPT_PATH, "").startswith("echo running") + + +def test_swebench_flat_run_eval_unresolved(): + """The swe-bench flat path leaves a failure run unresolved.""" + harness = SweBenchHarness("swe-bench") + task = _task(metadata={"eval_script": "echo running"}) + report, _artifacts, _ = _drive_flat(harness, task, log_text=_fixture("unresolved_failure.log")) + assert report.resolved is False + + +def test_swebench_flat_run_eval_stream_empty_uses_log_file(): + """When streamed output is empty, run_eval reads back the tee'd log file.""" + harness = SweBenchHarness("swe-bench") + task = _task(metadata={"eval_script": "echo running"}) + report, _artifacts, provider = _drive_flat( + harness, task, log_text=_fixture("resolved_success.log"), stream_empty=True + ) + assert any(cmd.startswith("cat ") for cmd in provider.commands) + assert report.resolved is True + + +def test_swebench_flat_run_eval_masks_sandbox_error(): + """The swe-bench flat path masks a sandbox error reported by the run.""" + harness = SweBenchHarness("swe-bench") + task = _task(metadata={"eval_script": "echo running"}) + report, artifacts, _ = _drive_flat(harness, task, log_text="", run_rc=1, error_type="sandbox") + assert artifacts.raw["error_type"] == "sandbox" + assert report.error_kind == "sandbox" + + +def test_swebench_flat_run_eval_missing_script_is_unmasked_unresolved(): + """A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error. + + ``flat_run_eval`` still tags the artifact ``error_type == "eval_error"`` (so callers can log + it), but grading no longer masks on it: per main only genuine sandbox/timeout infra failures + are masked, and an empty spec simply produces no test markers and grades unresolved. 
+ """ + harness = SweBenchHarness("swe-bench") + task = _task(metadata={}) # no eval_script + report, artifacts, _ = _drive_flat(harness, task, log_text="") + assert artifacts.raw["error_type"] == "eval_error" + assert report.error_kind is None + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_r2egym_flat_run_eval_resolved_via_task_metadata(): + """Per-task ``flat_eval`` metadata drives the r2e-gym flat path to a resolved run.""" + harness = R2EGymHarness() + task = _task(benchmark="r2e-gym", instance_id="r2e__pkg-1", metadata={"eval_script": "echo run"}) + report, artifacts, _ = _drive_flat(harness, task, log_text=_fixture("resolved_success.log")) + assert artifacts.raw["flat"] is True + assert report.resolved is True diff --git a/resources_servers/swe_bench/tests/test_lifecycle.py b/resources_servers/swe_bench/tests/test_lifecycle.py new file mode 100644 index 0000000000..dbde539ada --- /dev/null +++ b/resources_servers/swe_bench/tests/test_lifecycle.py @@ -0,0 +1,164 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Sandbox lifecycle (``acquire_sandbox``) and ``verify_task`` happy/timeout/empty paths. + +These tests cover always-teardown on context exit and the fresh-sandbox verify +sequence, including the resolved, empty-patch fast path, and eval-timeout cases. 
+""" + +from __future__ import annotations + +import asyncio + +import pytest + +import resources_servers.swe_bench.harnesses # noqa: F401 (register harnesses) +from nemo_gym.sandbox import SandboxExecResult, SandboxHandle, SandboxStatus +from resources_servers.swe_bench.harness import SweTask +from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness +from resources_servers.swe_bench.sandbox import acquire_sandbox +from resources_servers.swe_bench.verify_task import verify_task + + +class _CountingProvider: + """Provider instance passed directly so the test can count create/close calls. + + Args: + exec_sleep: Seconds to sleep inside each ``exec`` call, used to simulate a + slow evaluation that triggers the eval timeout. + test_output: Stdout returned for pytest commands. The trailing-status + pytest format is the shape the test parser recognizes, and the ``.py`` + path normalizes to the F2P id in ``_task``. + """ + + name = "fake-life" + + def __init__(self, *, exec_sleep=0.0, test_output="tests/test_x.py::a PASSED\n"): + self.create_count = 0 + self.close_count = 0 + self._exec_sleep = exec_sleep + self._test_output = test_output + + async def create(self, spec): + self.create_count += 1 + return SandboxHandle( + sandbox_id=f"sb-{self.create_count}", provider_name=self.name, raw={"workdir": spec.workdir} + ) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + if self._exec_sleep: + await asyncio.sleep(self._exec_sleep) + if "pytest" in command: + return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, *a, **k): + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + self.close_count += 1 + + async def aclose(self): + return None + + +def _task(**kw) -> SweTask: +
"""Build a SweTask with sensible defaults, overridable per keyword. + + Args: + **kw: Field overrides merged onto the default task fields. + + Returns: + A SweTask configured for the swe-bench-ext benchmark. + """ + base = dict( + instance_id="inst-1", + image="img:tag", + base_commit="HEAD", + test_command="python -m pytest -rA -q", + model_patch="diff --git a/x b/x\n", + test_framework="pytest", + fail_to_pass=["tests/test_x.py::a"], + benchmark="swe-bench-ext", + ) + base.update(kw) + return SweTask(**base) + + +# ---- acquire_sandbox: starts an env, ALWAYS stops it ------------------------ + + +def test_acquire_sandbox_starts_and_cleans_up(): + """``acquire_sandbox`` creates one sandbox and tears it down on normal exit.""" + provider = _CountingProvider() + + async def run(): + spec = SweBenchExtHarness().build_spec(_task()) + async with acquire_sandbox(provider, spec, instance_id="inst-1") as env: + assert env.sandbox_id is not None + return provider.create_count, provider.close_count + + created, closed = asyncio.run(run()) + assert created == 1 + assert closed == 1 # torn down on normal exit + + +def test_acquire_sandbox_cleans_up_on_exception(): + """``acquire_sandbox`` tears down the sandbox even when the body raises.""" + provider = _CountingProvider() + + async def run(): + spec = SweBenchExtHarness().build_spec(_task()) + with pytest.raises(RuntimeError): + async with acquire_sandbox(provider, spec) as env: + assert env.sandbox_id is not None + raise RuntimeError("boom") + + asyncio.run(run()) + assert provider.close_count == 1 # torn down even on exception + + +# ---- verify_task: resolved / empty-patch fast path / eval-timeout mask ------- + + +def test_verify_task_resolved_in_fresh_sandbox(): + """``verify_task`` resolves a passing task in a freshly created sandbox.""" + provider = _CountingProvider() + report = asyncio.run(verify_task(provider, _task())) + assert report.resolved is True + assert provider.create_count == 1 + assert 
provider.close_count == 1 + + +def test_verify_task_empty_patch_fast_path_no_create(): + """An empty model patch short-circuits to unresolved without creating a sandbox.""" + provider = _CountingProvider() + report = asyncio.run(verify_task(provider, _task(model_patch=""))) + assert report.patch_exists is False + assert report.resolved is False + assert provider.create_count == 0 # no sandbox spun up for an empty patch + + +def test_verify_task_eval_timeout_masks(): + """An evaluation that exceeds the eval timeout is masked as an eval_timeout error.""" + provider = _CountingProvider(exec_sleep=0.5) + report = asyncio.run(verify_task(provider, _task(), eval_timeout_s=0.05)) + assert report.error_kind == "eval_timeout" diff --git a/resources_servers/swe_bench/tests/test_model_endpoint.py b/resources_servers/swe_bench/tests/test_model_endpoint.py new file mode 100644 index 0000000000..c55232d863 --- /dev/null +++ b/resources_servers/swe_bench/tests/test_model_endpoint.py @@ -0,0 +1,57 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Tests for the model-server egress primitive that resolves a model endpoint per provider.""" + +from __future__ import annotations + +import pytest + +from resources_servers.swe_bench.self_drive import ModelEgressUnavailable, ModelEndpoint, resolve + + +def test_apptainer_uses_host_loopback_by_default(): + """Apptainer resolves to the host loopback base URL when none is configured.""" + ep = resolve("apptainer", {"model": "qwen"}) + assert ep.base_url == "http://127.0.0.1:8000/v1" + assert ep.model == "qwen" + + +def test_docker_uses_configured_base_when_present(): + """Docker uses the explicitly configured base URL.""" + ep = resolve("docker", {"base_url": "http://10.0.0.5:8000/v1"}) + assert ep.base_url == "http://10.0.0.5:8000/v1" + + +def test_opensandbox_requires_service_url(): + """Opensandbox raises when no reachable service URL is supplied.""" + with pytest.raises(ModelEgressUnavailable): + resolve("opensandbox", {"base_url": "http://127.0.0.1:8000/v1"}) + + +def test_opensandbox_with_service_url_ok(): + """Opensandbox resolves to the provided service URL.""" + ep = resolve("opensandbox", {"model": "m"}, opensandbox_service_url="http://gym-model.svc.cluster.local/v1") + assert ep.base_url == "http://gym-model.svc.cluster.local/v1" + + +def test_to_sandbox_env_is_minimal(): + """The sandbox env carries only the base URL, API key, and model name.""" + ak_value = "abc-test" + env = ModelEndpoint(base_url="http://h/v1", api_key=ak_value, model="m").to_sandbox_env() + assert env["OPENAI_BASE_URL"] == "http://h/v1" + assert env["OPENAI_API_KEY"] == ak_value + assert env["NEMO_GYM_MODEL"] == "m" + # never leaks a full global-config dict + assert "NEMO_GYM_CONFIG_DICT" not in env diff --git a/resources_servers/swe_bench/tests/test_nv_internal.py b/resources_servers/swe_bench/tests/test_nv_internal.py new file mode 100644 index 0000000000..3b1f049f2d --- /dev/null +++ b/resources_servers/swe_bench/tests/test_nv_internal.py @@ -0,0 +1,547 @@ +# Copyright (c) 
2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the nv-internal-1 harness, driven by a FakeSandbox provider. + +nv-internal-1 is flat + host-graded, so it runs on any exec-capable provider. +The scripted provider returns the parsing_script ``output.json`` report on the +``cat /root/output.json`` hop; grading is a pure host-side parse. +""" + +from __future__ import annotations + +import asyncio +import json + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask, reward_from_report +from resources_servers.swe_bench.harnesses.nv_internal import ( + NV_DEFAULT_WORKDIR, + NVInternalHarness, + _coerce_test_list, + _format_test_files, + _nv_workdir, + _parse_dockerfile_env, + _resolve_required_tests, + parse_passed_tests, +) +from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + +class _FakeProvider: + """Scripted provider: ``cat /root/output.json`` returns a canned report.""" + + name = "fake-nv" + + def __init__(self, *, report="", apply_rc=0, **_): + """Configure the scripted provider's responses. + + Args: + report: JSON report stdout returned for ``cat /root/output.json``. + apply_rc: Return code returned for ``git apply`` commands. + **_: Ignored extra keyword arguments. 
+ """ + self._report = report + self._apply_rc = apply_rc + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + if "cat /root/output.json" in command: + return SandboxExecResult(stdout=self._report, stderr="", return_code=0) + if "git apply" in command: + return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, *a, **k): + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-nv", _FakeProvider, override=True) + + +class _RecordingProvider: + """Provider that records exec ``cwd`` per command and captures uploads. + + Uploads are captured as ``{target_path: content}`` by reading the temp file + that ``write_text`` hands to ``upload_file``; execs are captured as a list of + ``(command, cwd)`` so tests can assert which directory each hop ran in. + """ + + name = "fake-nv-rec" + + def __init__(self, *, report="", **_): + """Configure the recording provider's canned report. + + Args: + report: JSON report stdout returned for ``cat /root/output.json``. + **_: Ignored extra keyword arguments. 
+ """ + self._report = report + self.execs: list[tuple[str, str | None]] = [] + self.uploads: dict[str, str] = {} + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + self.execs.append((command, cwd)) + if "cat /root/output.json" in command: + return SandboxExecResult(stdout=self._report, stderr="", return_code=0) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, handle, source_path, target_path): + with open(source_path, encoding="utf-8") as fh: + self.uploads[target_path] = fh.read() + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-nv-rec", _RecordingProvider, override=True) + + +def _task(**overrides) -> SweTask: + """Build an nv-internal-1 SweTask with sensible defaults, overridable per keyword. + + Args: + **overrides: Field overrides merged onto the default task fields. + + Returns: + A SweTask configured for the nv-internal-1 benchmark. + """ + base = dict( + instance_id="nv-inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/app", + model_patch="diff --git a/x b/x\n", + fail_to_pass=["pkg/test_x.py::a"], + pass_to_pass=["pkg/test_x.py::b"], + benchmark="nv-internal-1", + metadata={ + "run_script": "echo run\n", + "parsing_script": "import sys\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + base.update(overrides) + return SweTask(**base) + + +def _report(*passed, failed=()): + """Build a JSON test report with the given passed and failed test names. + + Args: + *passed: Names of tests reported as PASSED. + failed: Names of tests reported as FAILED. 
+ + Returns: + The report serialized as a JSON string under a ``tests`` key. + """ + tests = [{"name": name, "status": "PASSED"} for name in passed] + tests += [{"name": name, "status": "FAILED"} for name in failed] + return json.dumps({"tests": tests}) + + +async def _run(provider_cfg, task) -> SweEvalReport: + """Drive reset -> materialize -> run_eval -> grade against a scripted provider. + + Args: + provider_cfg: Provider configuration mapping for the ``fake-nv`` provider. + task: The SweTask to evaluate. + + Returns: + The graded SweEvalReport for the run. + """ + harness = NVInternalHarness() + env = await AsyncSweEnvironment.start({"fake-nv": provider_cfg}, harness.build_spec(task)) + try: + await harness.reset_repo(env, task) + await harness.materialize(env, task) + artifacts = await harness.run_eval(env, task) + finally: + await env.cleanup() + return harness.grade(task, artifacts) + + +# ---- pure helpers ----------------------------------------------------------- + + +def test_parse_passed_tests(): + """``parse_passed_tests`` returns only PASSED names and ignores malformed entries.""" + report = {"tests": [{"name": "a", "status": "PASSED"}, {"name": "b", "status": "FAILED"}]} + assert parse_passed_tests(report) == ["a"] + assert parse_passed_tests({}) == [] + # Malformed entries are ignored, not crashed on. + assert parse_passed_tests({"tests": ["junk", {"status": "PASSED"}]}) == [] + + +def test_format_test_files(): + """``_format_test_files`` joins list/JSON/CSV inputs into a comma-separated string.""" + assert _format_test_files(["a", "b"]) == "a,b" + assert _format_test_files('["a", "b"]') == "a,b" + assert _format_test_files("a,b") == "a,b" + assert _format_test_files(None) == "" + + +def test_format_test_files_single_quoted_list(): + """``_format_test_files`` parses repr-style single-quoted lists. + + Single-quoted lists are not valid JSON, so they are parsed with + ``ast.literal_eval``; unparseable bracketed text falls back to the raw string. 
+ """ + assert _format_test_files("['pkg/test_x.py', 'pkg/test_y.py']") == "pkg/test_x.py,pkg/test_y.py" + # A single-element single-quoted list. + assert _format_test_files("['only.py']") == "only.py" + # Unparseable bracketed text falls back to the raw string, not a crash. + assert _format_test_files("[not a list") == "[not a list" + + +def test_build_spec(): + """The nv-internal-1 harness builds a sandbox spec from a task.""" + harness = NVInternalHarness() + assert harness.name == "nv-internal-1" + assert harness.grade_strategy == "flat-host-grade" + spec = harness.build_spec(_task()) + assert spec.image == "img:tag" + assert spec.workdir == "/app" + assert spec.metadata["instance_id"] == "nv-inst-1" + + +def test_supports_any_provider(): + """The nv-internal-1 harness supports any exec-capable provider.""" + assert NVInternalHarness().supports_provider("docker") is True + assert NVInternalHarness().supports_provider("apptainer") is True + + +def test_grade_masks_on_infra_error(): + """Grading masks an infra timeout to reward 0.0 and records its error kind.""" + harness = NVInternalHarness() + report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"})) + assert report.error_kind == "timeout" + assert reward_from_report(report) == 0.0 + + +def test_grade_masks_on_sandbox_error(): + """Grading masks a sandbox error to reward 0.0 and records its error kind.""" + harness = NVInternalHarness() + report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "sandbox"})) + assert report.error_kind == "sandbox" + assert reward_from_report(report) == 0.0 + + +def test_grade_empty_report_is_unresolved(): + """An empty report grades as unresolved.""" + harness = NVInternalHarness() + report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=0, patch_applied=True)) + assert report.resolved is False + + +def test_grade_malformed_report_is_unresolved(): + """A malformed 
(non-JSON) report grades as unresolved.""" + harness = NVInternalHarness() + report = harness.grade(_task(), EvalArtifacts(test_output="not json", return_code=0, patch_applied=True)) + assert report.resolved is False + + +# ---- full reset -> materialize -> run_eval -> grade ------------------------- + + +def test_resolved(): + """A run with all required tests passing resolves with reward 1.0.""" + report = _report("pkg/test_x.py::a", "pkg/test_x.py::b") + result = asyncio.run(_run({"report": report}, _task())) + assert result.patch_applied is True + assert result.resolved is True + assert reward_from_report(result) == 1.0 + + +def test_unresolved_failing_required_test(): + """A failing fail-to-pass test leaves the run unresolved with reward 0.0.""" + report = _report("pkg/test_x.py::b", failed=["pkg/test_x.py::a"]) + result = asyncio.run(_run({"report": report}, _task())) + assert result.resolved is False + assert reward_from_report(result) == 0.0 + + +def test_unresolved_missing_required_test(): + """A required test missing from the report leaves the run unresolved.""" + report = _report("pkg/test_x.py::a") + result = asyncio.run(_run({"report": report}, _task())) + assert result.resolved is False + + +def test_patch_apply_rc_does_not_gate_resolved(): + """A non-zero patch-apply return code does not gate ``resolved``. + + Grading derives ``resolved`` from the tests alone, so a rejected patch + (apply_rc != 0) with all required tests passing is still resolved. 
+ """ + report = _report("pkg/test_x.py::a", "pkg/test_x.py::b") + result = asyncio.run(_run({"report": report, "apply_rc": 1}, _task())) + assert result.patch_applied is False + assert result.resolved is True + assert reward_from_report(result) == 1.0 + + +# ---- *_select precedence ---------------------------------------------------- + + +def test_resolve_required_tests_prefers_select_keys(): + """``fail_to_pass_select`` / ``pass_to_pass_select`` take precedence over the plain keys.""" + task = _task( + fail_to_pass=["plain::f2p"], + pass_to_pass=["plain::p2p"], + metadata={ + "fail_to_pass_select": ["sel::f2p"], + "pass_to_pass_select": ["sel::p2p"], + }, + ) + f2p, p2p = _resolve_required_tests(task) + assert f2p == ["sel::f2p"] + assert p2p == ["sel::p2p"] + + +def test_resolve_required_tests_falls_back_to_plain_keys(): + """Without ``*_select`` keys, the plain fail_to_pass / pass_to_pass keys are used.""" + task = _task(fail_to_pass=["plain::f2p"], pass_to_pass=["plain::p2p"], metadata={}) + f2p, p2p = _resolve_required_tests(task) + assert f2p == ["plain::f2p"] + assert p2p == ["plain::p2p"] + + +def test_resolve_required_tests_parses_stringified_select(): + """A ``*_select`` value given as a repr-style stringified list is parsed.""" + task = _task( + metadata={ + "fail_to_pass_select": "['sel::f2p']", + "pass_to_pass_select": "['sel::p2p']", + }, + ) + f2p, p2p = _resolve_required_tests(task) + assert f2p == ["sel::f2p"] + assert p2p == ["sel::p2p"] + + +def test_coerce_test_list(): + """``_coerce_test_list`` accepts lists and stringified lists, returning [] on bad input.""" + assert _coerce_test_list(["a", "b"]) == ["a", "b"] + assert _coerce_test_list("['a', 'b']") == ["a", "b"] + assert _coerce_test_list('["a", "b"]') == ["a", "b"] + assert _coerce_test_list("not a list") == [] + assert _coerce_test_list("[broken") == [] + + +def test_resolved_uses_select_tests_end_to_end(): + """End to end, ``*_select`` precedence resolves a run whose report has only 
the select tests.""" + # The report only contains the *_select tests; the plain keys would be unmet. + report = _report("sel::f2p", "sel::p2p") + task = _task( + fail_to_pass=["plain::f2p"], + pass_to_pass=["plain::p2p"], + metadata={ + "run_script": "echo run\n", + "parsing_script": "import sys\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + "fail_to_pass_select": ["sel::f2p"], + "pass_to_pass_select": ["sel::p2p"], + }, + ) + result = asyncio.run(_run({"report": report}, task)) + assert result.resolved is True + + +# ---- dockerfile ENV replay -------------------------------------------------- + + +def test_parse_dockerfile_env_equals_and_space_forms(): + """``_parse_dockerfile_env`` parses both ``ENV K=V`` and ``ENV K V`` forms, skipping non-ENV lines.""" + task = _task( + metadata={ + "base_dockerfile": "FROM ubuntu\nENV FOO=bar\nENV SPACED spaced_value\n", + "instance_dockerfile": "ENV BAZ = qux\nRUN echo hi\n", + }, + ) + env = _parse_dockerfile_env(task) + assert env["FOO"] == "bar" + assert env["SPACED"] == "spaced_value" + assert env["BAZ"] == "qux" + assert "RUN" not in env + + +def test_parse_dockerfile_env_absent_is_noop(): + """``_parse_dockerfile_env`` returns an empty mapping when no dockerfile is present.""" + assert _parse_dockerfile_env(_task(metadata={})) == {} + + +def test_build_spec_injects_dockerfile_env(): + """``build_spec`` injects dockerfile ENV entries while preserving the existing git env.""" + task = _task(metadata={"base_dockerfile": "ENV PATH=/custom/bin:$PATH\n"}) + spec = NVInternalHarness().build_spec(task) + # Existing git env preserved; dockerfile ENV injected. + assert spec.env["GIT_CONFIG_GLOBAL"] == "/dev/null" + assert spec.env["PATH"] == "/custom/bin:$PATH" + + +# ---- dotted script keys are uploaded ---------------------------------------- + + +async def _run_recording(task) -> _RecordingProvider: + """Drive reset -> materialize -> run_eval with a recording provider. + + Args: + task: The SweTask to evaluate. 
+ + Returns: + The recording provider, so tests can inspect captured execs and uploads. + """ + provider = _RecordingProvider(report=_report("pkg/test_x.py::a", "pkg/test_x.py::b")) + harness = NVInternalHarness() + env = await AsyncSweEnvironment.start(provider, harness.build_spec(task)) + try: + await harness.reset_repo(env, task) + await harness.materialize(env, task) + await harness.run_eval(env, task) + finally: + await env.cleanup() + return provider + + +def test_materialize_reads_dotted_script_keys(): + """``materialize`` uploads scripts stored under the dotted keys ``run_script.sh`` / ``parsing_script.py``.""" + task = _task( + repo_workdir="/app", + metadata={ + "run_script.sh": "echo DOTTED_RUN\n", + "parsing_script.py": "print('DOTTED_PARSE')\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + provider = asyncio.run(_run_recording(task)) + assert provider.uploads["/root/run_script.sh"] == "echo DOTTED_RUN\n" + assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED_PARSE')\n" + + +def test_materialize_dotted_keys_take_precedence_over_extensionless(): + """When both dotted and extensionless script keys are present, the dotted keys win.""" + task = _task( + repo_workdir="/app", + metadata={ + "run_script.sh": "echo DOTTED\n", + "run_script": "echo EXTLESS\n", + "parsing_script.py": "print('DOTTED')\n", + "parsing_script": "print('EXTLESS')\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + provider = asyncio.run(_run_recording(task)) + assert provider.uploads["/root/run_script.sh"] == "echo DOTTED\n" + assert provider.uploads["/root/parsing_script.py"] == "print('DOTTED')\n" + + +def test_materialize_falls_back_to_extensionless_keys(): + """When only the extensionless script keys are present, they are used.""" + task = _task( + repo_workdir="/app", + metadata={ + "run_script": "echo EXTLESS_RUN\n", + "parsing_script": "print('EXTLESS_PARSE')\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + 
provider = asyncio.run(_run_recording(task)) + assert provider.uploads["/root/run_script.sh"] == "echo EXTLESS_RUN\n" + assert provider.uploads["/root/parsing_script.py"] == "print('EXTLESS_PARSE')\n" + + +# ---- hops run in /app ------------------------------------------------------- + + +def test_nv_workdir_defaults_to_app(): + """``_nv_workdir`` maps the generic /testbed default (or empty) to /app, honoring pinned paths.""" + assert _nv_workdir(_task(repo_workdir="/testbed")) == NV_DEFAULT_WORKDIR + assert _nv_workdir(_task(repo_workdir="")) == NV_DEFAULT_WORKDIR + # A row that pins a non-default workdir is honored. + assert _nv_workdir(_task(repo_workdir="/srv/repo")) == "/srv/repo" + assert _nv_workdir(_task(repo_workdir="/app")) == "/app" + + +def test_build_spec_workdir_defaults_to_app_for_generic_default(): + """``build_spec`` rewrites the generic /testbed default workdir to /app.""" + spec = NVInternalHarness().build_spec(_task(repo_workdir="/testbed")) + assert spec.workdir == NV_DEFAULT_WORKDIR + + +def test_all_hops_run_in_app_for_generic_default(): + """With the generic /testbed default, every reset/apply/run/parse/cat hop runs in /app.""" + task = _task( + repo_workdir="/testbed", + metadata={ + "run_script.sh": "echo run\n", + "parsing_script.py": "import sys\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + provider = asyncio.run(_run_recording(task)) + cwds = {cwd for _, cwd in provider.execs} + assert cwds == {NV_DEFAULT_WORKDIR} + # Spot-check that the key hops were exercised in /app. 
+ by_cwd = {cmd: cwd for cmd, cwd in provider.execs} + assert any("git reset --hard" in cmd and cwd == "/app" for cmd, cwd in provider.execs) + assert any("git apply" in cmd and cwd == "/app" for cmd, cwd in provider.execs) + assert any("run_script.sh" in cmd and cwd == "/app" for cmd, cwd in provider.execs) + assert any("parsing_script.py" in cmd and cwd == "/app" for cmd, cwd in provider.execs) + assert by_cwd["cat /root/output.json"] == "/app" + + +def test_all_hops_honor_explicit_non_default_workdir(): + """A row that pins ``repo_workdir`` to a non-default path runs every hop there.""" + task = _task( + repo_workdir="/srv/repo", + metadata={ + "run_script.sh": "echo run\n", + "parsing_script.py": "import sys\n", + "selected_test_files_to_run": ["pkg/test_x.py"], + }, + ) + provider = asyncio.run(_run_recording(task)) + assert {cwd for _, cwd in provider.execs} == {"/srv/repo"} diff --git a/resources_servers/swe_bench/tests/test_r2egym.py b/resources_servers/swe_bench/tests/test_r2egym.py new file mode 100644 index 0000000000..b8a671b42a --- /dev/null +++ b/resources_servers/swe_bench/tests/test_r2egym.py @@ -0,0 +1,185 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the r2e-gym flat (host-graded) harness. 
+ +r2e-gym now grades host-side via the shared flat-eval path (the apptainer-only nested +``run_local_evaluation`` grader was removed when PR #1694 took over the apptainer provider; the +nested re-wiring is tracked for a follow-up PR). These tests cover provisioning, the agent-phase +test-hiding command shape, ``reset_repo``, and the flat ``run_eval`` + ``grade`` path against a +scripted ``_FakeProvider``. +""" + +from __future__ import annotations + +import asyncio + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import SweTask, reward_from_report +from resources_servers.swe_bench.harnesses.r2egym import R2EGymHarness + + +_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n" + + +class _FakeProvider: + """Scripted provider: returns a canned eval log for the eval-script run; records uploads.""" + + name = "fake-r2egym" + + def __init__(self, *, log_text="", exec_rc=0, **_): + self._log_text = log_text + self._exec_rc = exec_rc + self.uploaded: dict[str, str] = {} + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + rc = 0 if command.startswith("cat ") else self._exec_rc + return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc) + + async def upload_file(self, handle, local_path, remote_path): + try: + with open(local_path, encoding="utf-8") as fh: + self.uploaded[remote_path] = fh.read() + except OSError: + self.uploaded[remote_path] = "" + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-r2egym", _FakeProvider, override=True) + + 
+def _task(**overrides) -> SweTask: + """Build an r2e-gym ``SweTask`` with sensible defaults.""" + base = dict( + instance_id="repo__inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + model_patch="diff --git a/x b/x\n", + fail_to_pass=["t::a"], + pass_to_pass=["t::b"], + benchmark="r2e-gym", + split="test", + ) + base.update(overrides) + return SweTask(**base) + + +def test_harness_identity(): + harness = R2EGymHarness() + assert harness.name == "r2e-gym" + assert harness.grade_strategy == "flat-host-grade" + + +def test_build_spec_image_workdir_metadata(): + spec = R2EGymHarness().build_spec(_task()) + assert spec.image == "img:tag" + assert spec.workdir == "/testbed" + assert spec.metadata["harness"] == "r2e-gym" + + +def test_build_spec_truncates_long_instance_id(): + spec = R2EGymHarness().build_spec(_task(instance_id="x" * 100)) + assert len(spec.metadata["instance_id"]) == 63 + + +def test_supports_provider_any_exec_capable(): + harness = R2EGymHarness() + assert harness.supports_provider("docker") is True + assert harness.supports_provider("apptainer") is True + + +def test_hide_eval_tests_commands_shape(): + commands = R2EGymHarness().hide_eval_tests_commands() + assert len(commands) == 3 + assert all("r2e_tests" in c for c in commands) + + +def test_materialize_writes_patch_diff(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = R2EGymHarness() + task = _task() + env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task)) + await harness.materialize(env, task) + return env.sandbox._provider + + provider = asyncio.run(run()) + assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n" + + +def test_run_eval_then_grade_flat_resolved(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = R2EGymHarness() + task = _task(metadata={"eval_script": "echo run"}) + env = await 
AsyncSweEnvironment.start({"fake-r2egym": {"log_text": _PASSING_LOG}}, harness.build_spec(task)) + artifacts = await harness.run_eval(env, task) + return harness.grade(task, artifacts) + + report = asyncio.run(run()) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_run_eval_missing_eval_script_is_unmasked_unresolved(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + # A missing/unbuildable eval script grades UNMASKED unresolved (reward 0), not eval_error: + # only genuine sandbox/timeout infra failures are masked. + async def run(): + harness = R2EGymHarness() + task = _task() + env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task)) + artifacts = await harness.run_eval(env, task) + return harness.grade(task, artifacts) + + report = asyncio.run(run()) + assert report.error_kind is None + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_reset_repo_is_noop(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = R2EGymHarness() + task = _task() + env = await AsyncSweEnvironment.start({"fake-r2egym": {}}, harness.build_spec(task)) + await harness.reset_repo(env, task) # must not raise + + asyncio.run(run()) diff --git a/resources_servers/swe_bench/tests/test_swe_bench_ext.py b/resources_servers/swe_bench/tests/test_swe_bench_ext.py new file mode 100644 index 0000000000..673689ed6e --- /dev/null +++ b/resources_servers/swe_bench/tests/test_swe_bench_ext.py @@ -0,0 +1,491 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Tests for the swe-bench-ext harness grading. + +These cover two grading behaviors: + +* ``grade`` delegates to the vendored lighthouse parser + (``parse_and_check_tests``) — so junit-xml parsing, ``normalize_test_id`` plus + 4-stage fuzzy matching, the 20+ framework dispatch, and the + ``::build``/``::compile`` synthetic-PASS injection all drive ``resolved``. + Recorded fixture logs (one per parser path) anchor the expectation. +* ``resolved`` is the parser's verdict only; a failed ``git apply`` is recorded + in ``patch_applied`` but never gates ``resolved``. + +The harness is flat / host-graded (no nested container), so ``run_eval`` runs +against a scripted ``FakeSandbox`` rather than a real image. +""" + +from __future__ import annotations + +import asyncio +from pathlib import Path + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report +from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness + + +_FIXTURES = Path(__file__).parent / "fixtures" / "swe_bench_ext" + + +def _fixture(name: str) -> str: + """Read a recorded fixture log by file name. + + Args: + name: The fixture file name under the ``swe_bench_ext`` fixtures dir. + + Returns: + str: The fixture file contents. + """ + return (_FIXTURES / name).read_text() + + +def _task(**overrides) -> SweTask: + """Build a swe-bench-ext ``SweTask`` with sensible defaults. 
+ + Args: + **overrides: Field values overriding the defaults. + + Returns: + SweTask: A task populated from the defaults merged with overrides. + """ + base = dict( + instance_id="repo__inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + test_command="python -m pytest -rA -q", + test_framework="pytest", + model_patch="diff --git a/x b/x\n", + fail_to_pass=["tests/test_core.py::test_fix_applied"], + pass_to_pass=["tests/test_core.py::test_regression_guard"], + benchmark="swe-bench-ext", + ) + base.update(overrides) + return SweTask(**base) + + +def _artifacts(test_output: str, *, patch_applied: bool = True, error_type=None) -> EvalArtifacts: + """Build ``EvalArtifacts`` for a graded run. + + Args: + test_output: The captured test transcript handed to the parser. + patch_applied: Whether the model patch applied cleanly. + error_type: Infrastructure error kind, or None for a clean run. + + Returns: + EvalArtifacts: The artifacts passed to ``grade``. + """ + return EvalArtifacts( + test_output=test_output, + return_code=0, + patch_applied=patch_applied, + raw={"error_type": error_type}, + ) + + +# --- vendored parser drives resolved ---------------------------------------- + + +def test_grade_junit_xml_resolved(): + """junit-xml parsing + fuzzy id matching resolves a clean F2P/P2P pass.""" + harness = SweBenchExtHarness() + report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml"))) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + # The parser report is surfaced for inspection. 
+ assert report.tests_status["framework"] == "pytest" + assert report.tests_status["f2p_passed"] == 1 + assert report.tests_status["p2p_passed"] == 1 + + +def test_grade_junit_xml_unresolved_when_p2p_fails(): + harness = SweBenchExtHarness() + task = _task(pass_to_pass=["tests/test_core.py::test_unrelated_broken"]) + report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml"))) + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_grade_pytest_text_fuzzy_id_match(): + """Normalized/fuzzy id matching: ``src/pkg/...py::test`` log id resolves a + differently-delimited expected id via normalize_test_id.""" + harness = SweBenchExtHarness() + task = _task( + fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha"], + pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"], + ) + report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt"))) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_grade_pytest_text_unresolved_when_f2p_fails(): + harness = SweBenchExtHarness() + task = _task( + fail_to_pass=["src/pkg/tests/test_widget.py::test_gamma"], + pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"], + ) + report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt"))) + assert report.resolved is False + + +def test_grade_build_synthetic_pass_injection(): + """An F2P entry ending ``::build`` not present in the parsed output is + injected as PASSED (synthetic build/compile handling).""" + harness = SweBenchExtHarness() + task = _task( + fail_to_pass=["src/pkg/tests/test_widget.py::test_alpha", "mypkg::build"], + pass_to_pass=["src/pkg/tests/test_widget.py::test_beta"], + ) + report = harness.grade(task, _artifacts(_fixture("pytest_text_fuzzy.txt"))) + assert report.resolved is True + assert report.tests_status["fail_to_pass_results"]["mypkg::build"] == "PASSED" + + +def test_grade_non_pytest_framework_go_json(): + """A non-pytest framework (``go``) dispatches 
to the go-json parser.""" + harness = SweBenchExtHarness() + task = _task( + test_framework="go", + fail_to_pass=["github.com/acme/widget::TestAlpha"], + pass_to_pass=["github.com/acme/widget::TestBeta"], + ) + report = harness.grade(task, _artifacts(_fixture("go_json.txt"))) + assert report.resolved is True + assert report.tests_status["framework"] == "go" + + +def test_grade_non_pytest_framework_go_json_unresolved(): + harness = SweBenchExtHarness() + task = _task( + test_framework="go", + fail_to_pass=["github.com/acme/widget::TestGamma"], + pass_to_pass=["github.com/acme/widget::TestBeta"], + ) + report = harness.grade(task, _artifacts(_fixture("go_json.txt"))) + assert report.resolved is False + + +# --- empty framework is passed VERBATIM (NOT coerced to pytest) -------------- + + +def test_grade_empty_framework_passed_verbatim_not_coerced_to_pytest(): + """``test_framework`` is passed through UNCHANGED — an empty framework reaches + ``parse_and_check_tests`` as ``""`` and hits the parser's auto-detect path, NOT + the pytest junit-xml parser. + + Coercing ``""`` -> ``"pytest"`` would let junit-xml parse and report + ``resolved`` for an instance that should auto-detect. We assert the framework + reaches the parser verbatim (recorded in ``report.framework``) and that + junit-xml is therefore NOT parsed under an empty framework. + """ + harness = SweBenchExtHarness() + task = _task(test_framework="") + report = harness.grade(task, _artifacts(_fixture("pytest_junit.xml"))) + # Framework recorded verbatim — not silently rewritten to "pytest". + assert report.tests_status["framework"] == "" + # Auto-detect path does not understand junit-xml -> nothing parsed -> unresolved. + assert report.tests_status["parsed_count"] == 0 + assert report.resolved is False + + +def test_grade_empty_framework_uses_autodetect_path(): + """An empty framework grades via parse_test_output's auto-detect path (TAP / + Mocha-Hardhat console) when the instance ships no framework. 
Here a TAP + transcript resolves without any framework hint.""" + harness = SweBenchExtHarness() + tap_output = ( + "<<>>\n" + "TAP version 13\n" + "1..2\n" + "ok 1 - test_fix_applied\n" + "ok 2 - test_regression_guard\n" + "<<>>\n" + ) + task = _task( + test_framework="", + fail_to_pass=["test_fix_applied"], + pass_to_pass=["test_regression_guard"], + ) + report = harness.grade(task, _artifacts(tap_output)) + assert report.tests_status["framework"] == "" + assert report.tests_status["parsed_count"] >= 2 + assert report.resolved is True + + +def test_run_eval_and_grade_share_framework_value(): + """run_eval (flag/result-file selection) and grade (parsing) use the SAME + framework. With an empty framework, run_eval must NOT inject pytest's + ``--junitxml`` flag and must wrap the bare command, and grade must parse under + ``""`` — proving the two share ``_resolve_framework`` rather than diverging on a + pytest default.""" + task = _task(test_framework="", test_command="run-my-tests") + _, _, provider = _run_eval(task, test_output="", run_cmd="run-my-tests") + eval_cmds = [c for c in provider.commands if "run-my-tests" in c] + assert eval_cmds, "expected the bare framework command to be wrapped" + wrapped = eval_cmds[-1] + # Empty framework => default framework config => no output flag, no result file. + assert "--junitxml" not in wrapped + assert "<<>>" not in wrapped + # The mkdir parent-dir creation is present regardless. + assert "mkdir -p /workspace/test-results" in wrapped + + +def test_run_eval_command_less_row_injects_no_default_runner(): + """A command-less row runs NO test runner, matching main's SweBenchExtDatasetProcessor. + + Main uses ``inst.get("test_command", "")`` verbatim (empty when absent), so a row that + ships no command runs nothing and grades unresolved. The harness must not fabricate a + ``python -m pytest`` default that would diverge from main by manufacturing results. 
+ """ + task = _task(test_command="", test_framework="") + _, _, provider = _run_eval(task, test_output="", run_cmd="__never__") + eval_cmds = [c for c in provider.commands if "git apply" not in c and "cat " not in c] + wrapped = eval_cmds[-1] + assert "pytest" not in wrapped # no default runner injected + # The command slot between the START/END marker echoes is empty (no runner line). + assert 'echo "<<>>"\n\necho "<<>>"' in wrapped + + +# --- patch_applied does not gate resolved ----------------------------------- + + +def test_grade_resolved_even_when_patch_apply_failed(): + """Grading is on tests ONLY; a failed apply is recorded but never flips a + tests-passing run to unresolved.""" + harness = SweBenchExtHarness() + report = harness.grade(_task(), _artifacts(_fixture("pytest_junit.xml"), patch_applied=False)) + assert report.patch_applied is False + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +# --- infra masking ---------------------------------------------------------- + + +def test_grade_masks_on_infra_error(): + harness = SweBenchExtHarness() + report = harness.grade(_task(), _artifacts("", error_type="timeout")) + assert report.error_kind == "timeout" + assert reward_from_report(report) == 0.0 + + +# --- run_eval against a scripted FakeSandbox -------------------------------- + + +class _FakeExtProvider: + """Scripted provider that records git-apply attempts and returns a transcript. + + Args: + test_output: The transcript returned for the wrapped eval command. + apply_rc: Return code for ``git apply`` commands. + run_cmd: Substring identifying the wrapped eval command. + git_dir: Directory whose ``.git`` probe succeeds; ``None`` means every + probed dir reports a checkout (so the first ladder entry wins). 
+ """ + + name = "fake-ext" + + def __init__(self, *, test_output="", apply_rc=0, run_cmd="pytest", git_dir=None, **_): + self._test_output = test_output + self._apply_rc = apply_rc + # Marker that identifies the wrapped eval command (defaults to the pytest + # command); tests with a custom command pass run_cmd. + self._run_cmd = run_cmd + # Which directory holds the repo checkout: a ``test -d "/.git"`` probe + # succeeds only for this dir. None => every probed dir reports a checkout + # (so the first ladder entry, /testbed, wins). + self._git_dir = git_dir + self.commands: list[str] = [] + self.exec_cwds: list[str | None] = [] + self.uploaded: dict[str, str] = {} + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + self.commands.append(command) + self.exec_cwds.append(cwd) + if command.startswith("test -d "): + # The repo-workdir probe: succeed only for the configured git dir (or any + # dir when unconfigured). 
+ if self._git_dir is None or f'"{self._git_dir}/.git"' in command: + return SandboxExecResult(stdout="", stderr="", return_code=0) + return SandboxExecResult(stdout="", stderr="", return_code=1) + if "git apply" in command: + return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc) + if self._run_cmd in command: + return SandboxExecResult(stdout=self._test_output, stderr="", return_code=0) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, handle, local_path, remote_path): + try: + with open(local_path, encoding="utf-8") as fh: + self.uploaded[remote_path] = fh.read() + except OSError: + self.uploaded[remote_path] = "" + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-ext", _FakeExtProvider, override=True) + + +def _run_eval(task: SweTask, *, test_output: str, apply_rc: int = 0, run_cmd: str = "pytest", git_dir=None): + """Run the harness through a scripted provider and return the run outputs. + + Args: + task: The task to evaluate. + test_output: The transcript the provider returns for the eval command. + apply_rc: Return code for ``git apply`` commands. + run_cmd: Substring identifying the wrapped eval command. + git_dir: Directory whose ``.git`` probe succeeds (None => any dir). + + Returns: + tuple: The harness, the produced ``EvalArtifacts``, and the provider + instance (for command inspection). 
+ """ + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = SweBenchExtHarness() + env = await AsyncSweEnvironment.start( + {"fake-ext": {"test_output": test_output, "apply_rc": apply_rc, "run_cmd": run_cmd, "git_dir": git_dir}}, + harness.build_spec(task), + ) + await harness.materialize(env, task) + artifacts = await harness.run_eval(env, task) + return harness, artifacts, env.sandbox._provider + + return asyncio.run(run()) + + +def test_run_eval_uses_legacy_apply_flags_and_grades_resolved(): + task = _task() + harness, artifacts, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml")) + apply_cmds = [c for c in provider.commands if "git apply" in c] + assert apply_cmds, "expected a git-apply attempt" + # The git-apply flag set, with no --3way fallback. + assert all("--reject --recount --ignore-space-change --ignore-whitespace" in c for c in apply_cmds) + assert all("--3way" not in c for c in apply_cmds) + assert artifacts.patch_applied is True + report = harness.grade(task, artifacts) + assert report.resolved is True + + +def test_run_eval_apply_failure_still_resolves_on_tests(): + # End-to-end through run_eval -> grade: a failed apply records + # patch_applied=False but a tests-passing run still resolves. + task = _task() + harness, artifacts, _ = _run_eval(task, test_output=_fixture("pytest_junit.xml"), apply_rc=1) + assert artifacts.patch_applied is False + report = harness.grade(task, artifacts) + assert report.patch_applied is False + assert report.resolved is True + + +def test_run_eval_wraps_command_with_structured_output_and_markers(): + # run_eval wraps the command — add the structured-output flag (--junitxml) via + # get_test_command_with_output and run between the SWE_BENCH_EXT markers (plus + # result-file dump), so parse_and_check_tests receives junit-xml / marked + # output rather than raw "-rA" text it cannot parse. 
+ task = _task() + _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml")) + eval_cmds = [c for c in provider.commands if "pytest" in c and "git apply" not in c] + assert eval_cmds, "expected a wrapped pytest eval command" + wrapped = eval_cmds[-1] + assert "<<>>" in wrapped + assert "<<>>" in wrapped + assert "--junitxml=" in wrapped # structured-output flag from get_test_command_with_output + assert "<<>>" in wrapped # junit result-file dumped for the parser + # The result-file parent dir is created first. + assert "mkdir -p /workspace/test-results" in wrapped + + +# --- repo-workdir fallback ladder (matches main's cd /testbed||/workspace/repo||/app) ---- + + +def _eval_cwd(provider) -> str | None: + """Return the cwd of the wrapped eval command (the command holding the markers).""" + for command, cwd in zip(provider.commands, provider.exec_cwds): + if "<<>>" in command: + return cwd + return None + + +def test_run_eval_resolves_workdir_from_ladder_when_repo_not_at_testbed(): + """A repo at /workspace/repo (not /testbed) is found via main's fallback ladder. + + Main's eval script runs ``cd /testbed || cd /workspace/repo || cd /app``; the harness + must reproduce that so the patches and tests run in the real checkout rather than the + hardcoded /testbed default. + """ + task = _task() # default repo_workdir == /testbed + _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/workspace/repo") + # The patch-apply and the wrapped eval command run in the located checkout. 
+ apply_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if "git apply" in cmd] + assert apply_cwds and all(cwd == "/workspace/repo" for cwd in apply_cwds) + assert _eval_cwd(provider) == "/workspace/repo" + + +def test_run_eval_prefers_explicit_non_default_row_workdir(): + """An explicit, non-default ``repo_workdir`` holding a checkout wins over the ladder.""" + task = _task(repo_workdir="/srv/project") + _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/srv/project") + assert _eval_cwd(provider) == "/srv/project" + + +def test_run_eval_defaults_to_testbed_when_present(): + """When /testbed holds the checkout it wins (first ladder entry), preserving prior behavior.""" + task = _task() + _, _, provider = _run_eval(task, test_output=_fixture("pytest_junit.xml"), git_dir="/testbed") + assert _eval_cwd(provider) == "/testbed" + + +def test_reset_repo_resolves_workdir_from_ladder(): + """reset_repo runs ``git reset --hard`` in the located checkout, not a hardcoded /testbed.""" + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = SweBenchExtHarness() + task = _task() + env = await AsyncSweEnvironment.start( + {"fake-ext": {"git_dir": "/app"}}, + harness.build_spec(task), + ) + await harness.reset_repo(env, task) + return env.sandbox._provider + + provider = asyncio.run(run()) + reset_cwds = [cwd for cmd, cwd in zip(provider.commands, provider.exec_cwds) if cmd.startswith("git reset --hard")] + assert reset_cwds == ["/app"] diff --git a/resources_servers/swe_bench/tests/test_swe_env.py b/resources_servers/swe_bench/tests/test_swe_env.py new file mode 100644 index 0000000000..e6be92588a --- /dev/null +++ b/resources_servers/swe_bench/tests/test_swe_env.py @@ -0,0 +1,414 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the swe_env library, driven by a FakeSandbox provider.""" + +from __future__ import annotations + +import ast +import asyncio +from pathlib import Path + +import resources_servers.swe_bench.harnesses # noqa: F401 (registers harnesses) +from nemo_gym.sandbox import ( + SandboxCreateError, + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench import ( + compute_resolved, + get_harness, + list_harnesses, + reward_from_report, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweEvalReport, SweTask +from resources_servers.swe_bench.harnesses.swe_bench_ext import SweBenchExtHarness +from resources_servers.swe_bench.verify_task import ProviderCapabilityError, verify_task + + +# Trailing-status pytest text (`` PASSED``) is the format the test +# parser recognizes; node ids carry a ``.py`` path so they normalize to the +# F2P/P2P ids below. +_PASS_OUTPUT = "tests/test_x.py::a PASSED\ntests/test_x.py::b PASSED\n" +_F2P_FAIL_OUTPUT = "tests/test_x.py::a FAILED\ntests/test_x.py::b PASSED\n" + + +class _FakeProvider: + """Scripted provider: pytest commands return a canned transcript.""" + + name = "fake-swe" + + def __init__(self, *, test_output="", test_rc=0, apply_rc=0, create_error=False, sink=None, **_): + """Configure the scripted provider's responses. + + Args: + test_output: Stdout returned for pytest commands. 
+ test_rc: Return code returned for pytest commands. + apply_rc: Return code returned for ``git apply`` commands. + create_error: When True, ``create`` raises a SandboxCreateError. + sink: Optional list each created spec is appended to, for asserting on what + ``verify_task`` passed the provider (e.g. the stamped ``ttl_s``). + **_: Ignored extra keyword arguments. + """ + self._test_output = test_output + self._test_rc = test_rc + self._apply_rc = apply_rc + self._create_error = create_error + self._sink = sink + + async def create(self, spec): + if self._sink is not None: + self._sink.append(spec) + if self._create_error: + raise SandboxCreateError("simulated create failure") + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + if "pytest" in command: + return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc) + if "git apply" in command: + return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, *a, **k): + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-swe", _FakeProvider, override=True) + + +def _task(**overrides) -> SweTask: + """Build a SweTask with sensible defaults, overridable per keyword. + + Args: + **overrides: Field overrides merged onto the default task fields. + + Returns: + A SweTask configured for the swe-bench-ext benchmark. 
+ """ + base = dict( + instance_id="inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + test_command="python -m pytest -rA -q", + model_patch="diff --git a/x b/x\n", + test_framework="pytest", + fail_to_pass=["tests/test_x.py::a"], + pass_to_pass=["tests/test_x.py::b"], + benchmark="swe-bench-ext", + ) + base.update(overrides) + return SweTask(**base) + + +# ---- pure helpers ----------------------------------------------------------- + + +def test_compute_resolved(): + """``compute_resolved`` is True only when all required tests are in the passed set.""" + assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a", "b"]) is True + assert compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=["a"]) is False + assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=["a"]) is False + + +def test_compute_resolved_fail_only(): + """The ``fail_only`` eval type mirrors swebench's ``check_fail_only``. + + A required test is success UNLESS it is present in the status map AND ==FAILED, so an + absent test (silent success) still resolves; a present-and-FAILED test does not. + """ + # Required test absent from the status map -> success (silent) -> resolved. + assert ( + compute_resolved(fail_to_pass=["a"], pass_to_pass=["b"], passed=[], eval_type="fail_only", status_map={}) + is True + ) + # A present-and-FAILED required test -> failure -> unresolved. + assert ( + compute_resolved( + fail_to_pass=["a"], + pass_to_pass=["b"], + passed=["b"], + eval_type="fail_only", + status_map={"a": "FAILED", "b": "PASSED"}, + ) + is False + ) + # Present but not FAILED (e.g. SKIPPED/ERROR) -> success under fail_only -> resolved. + assert ( + compute_resolved( + fail_to_pass=["a"], + pass_to_pass=["b"], + passed=[], + eval_type="fail_only", + status_map={"a": "SKIPPED", "b": "ERROR"}, + ) + is True + ) + # Empty required set is still unresolved under fail_only (the validated edge). 
+ assert compute_resolved(fail_to_pass=[], pass_to_pass=[], passed=[], eval_type="fail_only") is False + + +def test_compute_resolved_pass_and_fail_status_map(): + """The default ``pass_and_fail`` rule with a populated status_map mirrors swebench. + + This is the path that runs for SWE-bench Verified: a required test is a failure only when it + is absent or its status is FAILED/ERROR; PASSED/XFAIL pass and any other status (SKIPPED/XPASS) + is neutral (excluded, not a failure). Locking it in guards the swebench-equivalence this PR + depends on. + """ + f2p, p2p = ["a"], ["b"] + # All required tests PASSED -> resolved. + assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED", "b": "PASSED"}) + # A required test FAILED -> unresolved. + assert not compute_resolved( + fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "FAILED", "b": "PASSED"} + ) + # A required test ERROR -> unresolved. + assert not compute_resolved( + fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "ERROR", "b": "PASSED"} + ) + # A required test absent from the status_map -> unresolved. + assert not compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "PASSED"}) + # XFAIL passes; SKIPPED/XPASS are neutral (not failures) -> resolved. + assert compute_resolved(fail_to_pass=f2p, pass_to_pass=p2p, passed=[], status_map={"a": "XFAIL", "b": "SKIPPED"}) + + +def test_agent_adapters_do_not_call_grading_methods(): + """Agent-facing swe_env modules never call the grader-only harness methods. + + ``harness.py`` documents a trust boundary: ``reset_repo`` / ``run_eval`` / ``grade`` are used + ONLY by the grader (``verify_task``). This AST guard enforces it — the agent adapters + (``self_drive``, ``sandbox``) must reach grading through ``verify_task``, never by calling + those methods directly — so the boundary the docstring promises cannot silently regress. 
+ """ + grading_only = {"reset_repo", "run_eval", "grade"} + adapter_dir = Path(__file__).resolve().parent.parent + for module in ("self_drive.py", "sandbox.py"): + tree = ast.parse((adapter_dir / module).read_text()) + referenced = sorted( + node.attr for node in ast.walk(tree) if isinstance(node, ast.Attribute) and node.attr in grading_only + ) + assert not referenced, f"{module} calls grader-only methods {referenced}; route grading via verify_task" + + +def test_reward_from_report(): + """``reward_from_report`` is 1.0 for a resolved report and 0.0 otherwise or when masked.""" + assert reward_from_report(SweEvalReport(instance_id="i", resolved=True)) == 1.0 + assert reward_from_report(SweEvalReport(instance_id="i", resolved=False)) == 0.0 + assert reward_from_report(SweEvalReport(instance_id="i", resolved=True, error_kind="sandbox")) == 0.0 + + +def test_registry_and_build_spec(): + """The swe-bench-ext harness is registered and builds the expected sandbox spec.""" + assert "swe-bench-ext" in list_harnesses() + harness = get_harness("swe-bench-ext") + assert isinstance(harness, SweBenchExtHarness) + spec = harness.build_spec(_task()) + assert spec.image == "img:tag" + assert spec.workdir == "/testbed" + assert spec.metadata["instance_id"] == "inst-1" + + +def test_grade_masks_on_infra_error(): + """Grading masks an infra error to reward 0.0 and records its error kind.""" + harness = get_harness("swe-bench-ext") + report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"})) + assert report.error_kind == "timeout" + assert reward_from_report(report) == 0.0 + + +# ---- verify_task orchestrator (fresh-sandbox, FakeProvider) ----------------- + + +def test_verify_task_resolved(): + """``verify_task`` resolves a task whose required tests all pass.""" + provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}} + report = asyncio.run(verify_task(provider, _task())) + assert report.resolved is True + assert 
report.patch_applied is True + assert reward_from_report(report) == 1.0 + + +def test_verify_task_unresolved(): + """``verify_task`` leaves a task unresolved when a required test fails.""" + provider = {"fake-swe": {"test_output": _F2P_FAIL_OUTPUT, "test_rc": 1}} + report = asyncio.run(verify_task(provider, _task())) + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_verify_task_empty_patch_fast_path(): + """An empty model patch short-circuits to an unresolved report.""" + report = asyncio.run(verify_task({"fake-swe": {}}, _task(model_patch=""))) + assert report.patch_exists is False + assert report.resolved is False + + +def test_verify_task_non_timeout_eval_failure_unmasked(): + """A non-timeout eval-stage failure is unmasked: resolved=False, reward 0.0. + + Mirrors main's app.py, which catches any eval exception, returns no report file + (resolved=False) and leaves eval_timed_out False (so mask_sample stays False). + Only a genuine wall-clock eval timeout is masked. + """ + report = asyncio.run(verify_task({"fake-swe": {"create_error": True}}, _task())) + assert report.error_kind is None + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_verify_task_golden(): + """Running with ``run_golden`` applies the golden patch and resolves the task.""" + provider = {"fake-swe": {"test_output": _PASS_OUTPUT}} + task = _task(model_patch="", metadata={"golden_patch": "diff --git a/x b/x\n"}) + report = asyncio.run(verify_task(provider, task, run_golden=True)) + assert report.resolved is True + + +def test_verify_task_patch_apply_failure_does_not_gate_resolved(): + """A failed patch apply is recorded but does not gate ``resolved``. + + The patch is applied best-effort and grading is based on the tests only, so a + failed apply (patch_applied=False) does not flip a tests-passing run to + unresolved. 
+ """ + provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "apply_rc": 1}} + report = asyncio.run(verify_task(provider, _task())) + assert report.patch_applied is False + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_unsupported_provider_raises(): + """``verify_task`` raises when the harness does not support the given provider.""" + + class _NestedOnly(SweBenchExtHarness): + name = "nested-only-test" + + def supports_provider(self, provider_name: str) -> bool: + """Report support for every provider except ``fake-swe``. + + Args: + provider_name: The provider name being checked. + + Returns: + True for any provider other than ``fake-swe``. + """ + return provider_name != "fake-swe" + + from resources_servers.swe_bench.harness import register_harness + + register_harness(_NestedOnly(), override=True) + task = _task(benchmark="nested-only-test") + try: + asyncio.run(verify_task({"fake-swe": {}}, task)) + except ProviderCapabilityError: + return + raise AssertionError("expected ProviderCapabilityError") + + +def test_verify_task_propagates_grader_dependency_error(): + """``verify_task`` propagates ``GraderDependencyError`` instead of swallowing it to reward-0. + + A missing grading dependency (e.g. swebench for a SWE-bench instance) must fail loud rather + than silently degrade the resolve rate, so it is re-raised, not caught by the unmasked + eval-stage handler. + """ + from resources_servers.swe_bench.harness import GraderDependencyError, register_harness + + class _MissingGrader(SweBenchExtHarness): + name = "missing-grader-test" + + def grade(self, task, artifacts): + """Simulate a harness whose required grading dependency is unavailable. + + Args: + task: The task being graded. + artifacts: The eval artifacts (unused). + + Raises: + GraderDependencyError: Always, to exercise the propagation path. 
+ """ + raise GraderDependencyError("grading dependency missing") + + register_harness(_MissingGrader(), override=True) + try: + asyncio.run(verify_task({"fake-swe": {"test_output": _PASS_OUTPUT}}, _task(benchmark="missing-grader-test"))) + except GraderDependencyError: + return + raise AssertionError("expected GraderDependencyError to propagate") + + +def test_verify_task_flat_eval_metadata(): + """``metadata['flat_eval']`` routes grading through the harness's flat variant.""" + provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "test_rc": 0}} + report = asyncio.run(verify_task(provider, _task(metadata={"flat_eval": True}))) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_verify_task_stamps_ttl_when_unset(): + """``verify_task`` stamps ``ttl_s = eval_timeout_s + slack`` when the harness leaves it unset. + + The stamp lets TTL-honoring backends (opensandbox) self-expire an eval sandbox orphaned by a + hard crash; harnesses that already set ``ttl_s`` (e.g. swe-bench-ext) keep their own value. + """ + import dataclasses + + from resources_servers.swe_bench.harness import register_harness + from resources_servers.swe_bench.verify_task import _TTL_SLACK_S + + class _NoTtl(SweBenchExtHarness): + name = "no-ttl-test" + + def build_spec(self, task): + """Build the swe-bench-ext spec but clear ``ttl_s`` so verify_task must stamp it. + + Args: + task: The task to build a spec for. + + Returns: + The base spec with ``ttl_s`` reset to None. 
+ """ + return dataclasses.replace(super().build_spec(task), ttl_s=None) + + register_harness(_NoTtl(), override=True) + captured: list = [] + provider = {"fake-swe": {"test_output": _PASS_OUTPUT, "sink": captured}} + asyncio.run(verify_task(provider, _task(benchmark="no-ttl-test"), eval_timeout_s=120)) + assert captured, "expected create() to be called with a stamped spec" + assert captured[-1].ttl_s == 120 + _TTL_SLACK_S + + +def test_report_to_reward_wrapper(): + """``report_to_reward`` is a thin wrapper that scores a report like ``reward_from_report``.""" + from resources_servers.swe_bench.verify_task import report_to_reward + + assert report_to_reward(SweEvalReport(instance_id="i", resolved=True)) == 1.0 + assert report_to_reward(SweEvalReport(instance_id="i", resolved=False)) == 0.0 diff --git a/resources_servers/swe_bench/tests/test_swe_rebench.py b/resources_servers/swe_bench/tests/test_swe_rebench.py new file mode 100644 index 0000000000..9d6faa1cbf --- /dev/null +++ b/resources_servers/swe_bench/tests/test_swe_rebench.py @@ -0,0 +1,483 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the swe-rebench harness (FakeSandbox provider). + +A tiny fake ``agent/log_parsers.py`` is written to a tmp dir so the real +``_load_rebench_log_parsers`` import and ``NAME_TO_PARSER`` resolution path is +exercised end to end, then the resolved / unresolved / masked grade paths are +driven. 
+""" + +from __future__ import annotations + +import asyncio +import textwrap +from pathlib import Path + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweTask +from resources_servers.swe_bench.harnesses.swe_rebench import ( + SweRebenchHarness, + _normalize_test_name, +) + + +class _FakeProvider: + """Scripted provider: test command returns a canned transcript.""" + + name = "fake-rebench" + + def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_): + """Initialize the scripted provider. + + Args: + test_output: Transcript returned for the test command. + test_rc: Return code for the test command. + apply_rc: Return code for ``git apply`` commands. + """ + self._test_output = test_output + self._test_rc = test_rc + self._apply_rc = apply_rc + + async def create(self, spec): + raw = {"workdir": spec.workdir, "env": spec.env} + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw=raw) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + if "git apply" in command: + return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc) + if "pytest" in command or "test" in command: + return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, *a, **k): + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-rebench", _FakeProvider, override=True) + + +class _RecordingProvider: + """Scripted provider that records every exec command, in order.""" + + name = "recording-rebench" + commands: list[str] = [] + # (command, timeout_s) for every exec, so 
tests can assert the eval timeout + # is threaded into the test exec. + exec_calls: list[tuple[str, object]] = [] + + def __init__(self, *, test_output="", test_rc=0, apply_rc=0, **_): + """Initialize the recording provider. + + Args: + test_output: Transcript returned for the test command. + test_rc: Return code for the test command. + apply_rc: Return code for ``git apply`` commands. + """ + self._test_output = test_output + self._test_rc = test_rc + self._apply_rc = apply_rc + + async def create(self, spec): + return SandboxHandle(sandbox_id="rec", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + type(self).commands.append(command) + type(self).exec_calls.append((command, timeout_s)) + if "git apply" in command: + return SandboxExecResult(stdout="", stderr="", return_code=self._apply_rc) + if "pytest" in command or "test" in command: + return SandboxExecResult(stdout=self._test_output, stderr="", return_code=self._test_rc) + return SandboxExecResult(stdout="", stderr="", return_code=0) + + async def upload_file(self, *a, **k): + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("recording-rebench", _RecordingProvider, override=True) + + +# A standalone log_parsers module the harness imports dynamically. The parser +# splits "<node> <STATUS>" lines into {node: STATUS} and exposes a +# NAME_TO_PARSER registry of callables, matching the shape the harness expects.
+_FAKE_LOG_PARSERS = textwrap.dedent( + """ + def parse_simple(log): + results = {} + for line in log.splitlines(): + line = line.strip() + if not line: + continue + node, _, status = line.rpartition(" ") + if node and status: + results[node] = status + return results + + NAME_TO_PARSER = {"simple": parse_simple} + """ +) + + +def _write_fake_parsers(tmp_path: Path) -> Path: + """Write the fake ``agent/log_parsers.py`` module under a tmp repo dir. + + Args: + tmp_path: The pytest tmp dir to create the repo under. + + Returns: + Path: The created ``SWE-rebench-V2`` repo directory. + """ + repo_dir = tmp_path / "SWE-rebench-V2" + (repo_dir / "agent").mkdir(parents=True) + (repo_dir / "agent" / "log_parsers.py").write_text(_FAKE_LOG_PARSERS) + return repo_dir + + +def _task(**overrides) -> SweTask: + """Build a swe-rebench ``SweTask`` with sensible defaults. + + Args: + **overrides: Field values overriding the defaults. + + Returns: + SweTask: A task populated from the defaults merged with overrides. + """ + base = dict( + instance_id="rebench-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + test_command="python -m pytest -rA -q", + model_patch="diff --git a/x b/x\n", + test_patch="diff --git a/t b/t\n", + fail_to_pass=["t::a"], + pass_to_pass=["t::b"], + benchmark="swe-rebench", + ) + base.update(overrides) + return SweTask(**base) + + +# ---- pure helpers ----------------------------------------------------------- + + +def test_normalize_test_name_strips_timing(): + assert _normalize_test_name("t::a [ 12 ms ]") == "t::a" + assert _normalize_test_name("t::a [0.3s]") == "t::a" + assert _normalize_test_name("t::a in 1.2 sec") == "t::a" + assert _normalize_test_name("t::a (5 ms)") == "t::a" + assert _normalize_test_name(" t::a ") == "t::a" + # No timing suffix -> unchanged. 
+ assert _normalize_test_name("pkg::mod::test_x") == "pkg::mod::test_x" + + +def test_build_spec_sets_java_env(): + harness = SweRebenchHarness() + spec = harness.build_spec(_task()) + assert spec.env["_JAVA_OPTIONS"] == "-Djava.net.preferIPv6Addresses=false" + assert spec.metadata["harness"] == "swe-rebench" + assert spec.image == "img:tag" + + +# ---- grade paths (real dynamic-import of the fake parser) -------------------- + + +def test_grade_resolved(tmp_path): + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}, + ) + # Both required tests pass; timing suffix on one exercises normalization. + artifacts = EvalArtifacts(test_output="t::a [ 12 ms ] PASSED\nt::b PASSED\n", patch_applied=True) + report = harness.grade(task, artifacts) + assert report.resolved is True + assert report.error_kind is None + assert set(report.tests_status["passed"]) == {"t::a", "t::b"} + + +def test_grade_unresolved_missing_pass_to_pass(tmp_path): + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}, + ) + artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b FAILED\n", patch_applied=True) + report = harness.grade(task, artifacts) + assert report.resolved is False + assert report.error_kind is None + + +def test_grade_no_patch_applied_gate(tmp_path): + """``resolved`` is the test verdict ONLY and does not gate on patch_applied. 
+ So even when the model patch failed to apply (``patch_applied=False``), a run + where every F2P/P2P test passes scores resolved=True.""" + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}, + ) + artifacts = EvalArtifacts(test_output="t::a PASSED\nt::b PASSED\n", patch_applied=False) + report = harness.grade(task, artifacts) + assert report.resolved is True + assert report.error_kind is None + + +def test_grade_masks_missing_clone(): + harness = SweRebenchHarness() + # No rebench_repo_dir in metadata -> the clone is not provisioned. + report = harness.grade(_task(), EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True)) + assert report.error_kind == "eval_error" + assert report.resolved is False + + +def test_grade_masks_unknown_parser(tmp_path): + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "does_not_exist"}}, + ) + report = harness.grade(task, EvalArtifacts(test_output="t::a PASSED\n", patch_applied=True)) + assert report.error_kind == "eval_error" + + +def test_grade_masks_on_infra_error(): + harness = SweRebenchHarness() + report = harness.grade(_task(), EvalArtifacts(test_output="", return_code=1, raw={"error_type": "timeout"})) + assert report.error_kind == "timeout" + + +# ---- run_eval (FakeSandbox) ------------------------------------------------- + + +def test_run_eval_then_grade_resolved(tmp_path): + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={ + "rebench_repo_dir": str(repo_dir), + "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"}, + }, + ) + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "test_rc": 0}} + + 
async def _run(): + spec = harness.build_spec(task) + env = await AsyncSweEnvironment.start(provider, spec) + try: + await harness.reset_repo(env, task) + await harness.materialize(env, task) + artifacts = await harness.run_eval(env, task) + finally: + await env.cleanup() + return artifacts + + artifacts = asyncio.run(_run()) + assert artifacts.patch_applied is True + report = harness.grade(task, artifacts) + assert report.resolved is True + + +def test_run_eval_patch_not_applied_still_grades_on_tests(tmp_path): + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}) + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + # apply_rc=1 -> model patch fails to apply -> patch_applied False, but grading + # is on the tests only (no patch_applied gate), so a run where every F2P/P2P + # test passes is still resolved=True. + provider = {"fake-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n", "apply_rc": 1}} + + async def _run(): + spec = harness.build_spec(task) + env = await AsyncSweEnvironment.start(provider, spec) + try: + await harness.run_eval(env, task) + return await harness.run_eval(env, task) + finally: + await env.cleanup() + + artifacts = asyncio.run(_run()) + assert artifacts.patch_applied is False + assert harness.grade(task, artifacts).resolved is True + + +# ---- apply order ------------------------------------------------------------ + + +def test_run_eval_applies_model_patch_before_test_patch(tmp_path): + """The model patch (/root/patch.diff) is applied BEFORE the test patch + (/root/test_patch.diff).""" + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task(metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}) + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + _RecordingProvider.commands = [] + 
_RecordingProvider.exec_calls = [] + provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}} + + async def _run(): + spec = harness.build_spec(task) + env = await AsyncSweEnvironment.start(provider, spec) + try: + await harness.run_eval(env, task) + finally: + await env.cleanup() + + asyncio.run(_run()) + applies = [c for c in _RecordingProvider.commands if "git apply" in c] + assert len(applies) == 2 + assert "/root/patch.diff" in applies[0], applies + assert "/root/test_patch.diff" in applies[1], applies + + +# ---- eval timeout threaded into the test exec ------------------------------- + + +def _rebench_test_exec_timeout(commands_and_timeouts): + """Return the timeout_s passed to the test exec (the one running the tests). + + The test block is the only exec that is neither a ``git apply`` nor an + install command; in these tests the test command always contains ``pytest``. + + Args: + commands_and_timeouts: An iterable of ``(command, timeout_s)`` pairs. + + Returns: + The ``timeout_s`` value recorded for the test exec. + + Raises: + AssertionError: If no test exec is found in the recorded calls. + """ + for command, timeout_s in commands_and_timeouts: + if "git apply" not in command and ("pytest" in command or "test" in command): + return timeout_s + raise AssertionError(f"no test exec found in {commands_and_timeouts!r}") + + +def test_run_eval_threads_tests_timeout_into_test_exec(tmp_path): + """The test exec receives timeout_s = task.metadata['tests_timeout'] when + present so a stuck run is bounded instead of hanging the verifier. 
Uses a + non-default value (600) so this distinguishes an explicit override from the + 1800 default.""" + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={ + "rebench_repo_dir": str(repo_dir), + "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"}, + "tests_timeout": 600, + }, + ) + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + _RecordingProvider.commands = [] + _RecordingProvider.exec_calls = [] + provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}} + + async def _run(): + spec = harness.build_spec(task) + env = await AsyncSweEnvironment.start(provider, spec) + try: + await harness.run_eval(env, task) + finally: + await env.cleanup() + + asyncio.run(_run()) + assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 600 + + +def test_run_eval_tests_timeout_absent_defaults_to_1800(tmp_path): + """The timeout (default 30*60) is applied to every swe-rebench run. 
Rows that + carry no tests_timeout (including SWE-bench-Verified) still get the 1800s + bound rather than an unbounded (None) run.""" + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + metadata={ + "rebench_repo_dir": str(repo_dir), + "install_config": {"log_parser": "simple", "test_cmd": "python -m pytest -rA -q"}, + }, + ) + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + _RecordingProvider.commands = [] + _RecordingProvider.exec_calls = [] + provider = {"recording-rebench": {"test_output": "t::a PASSED\nt::b PASSED\n"}} + + async def _run(): + spec = harness.build_spec(task) + env = await AsyncSweEnvironment.start(provider, spec) + try: + await harness.run_eval(env, task) + finally: + await env.cleanup() + + asyncio.run(_run()) + assert _rebench_test_exec_timeout(_RecordingProvider.exec_calls) == 1800 + + +# ---- grading parity / empty-required ---------------------------------------- + + +def test_grade_empty_required_resolves_true(tmp_path): + """``resolved`` is purely (fail_to_pass_set <= passed) and + (pass_to_pass_set <= passed). With no required tests, both empty sets are + subsets of any passed set, so resolved=True — there is no bool(required) + requirement.""" + repo_dir = _write_fake_parsers(tmp_path) + harness = SweRebenchHarness() + task = _task( + fail_to_pass=[], + pass_to_pass=[], + metadata={"rebench_repo_dir": str(repo_dir), "install_config": {"log_parser": "simple"}}, + ) + artifacts = EvalArtifacts(test_output="something PASSED\n", patch_applied=True) + report = harness.grade(task, artifacts) + assert report.resolved is True + assert report.error_kind is None diff --git a/resources_servers/swe_bench/tests/test_swebench.py b/resources_servers/swe_bench/tests/test_swebench.py new file mode 100644 index 0000000000..282049630a --- /dev/null +++ b/resources_servers/swe_bench/tests/test_swebench.py @@ -0,0 +1,234 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Unit tests for the swe-bench / swe-bench-multilingual flat (host-graded) harness. + +The harness runs the instance's eval script in the sandbox and grades the produced log +host-side (swebench's per-repo parser, falling back to the generic flat parser), so it runs on +any exec-capable provider. These tests validate provisioning (``build_spec`` / ``materialize``), +the flat ``run_eval`` + ``grade`` path, and family validation, against a scripted ``_FakeProvider``. +""" + +from __future__ import annotations + +import asyncio + +import pytest + +from nemo_gym.sandbox import ( + SandboxExecResult, + SandboxHandle, + SandboxStatus, + register_provider, +) +from resources_servers.swe_bench.harness import EvalArtifacts, SweTask, reward_from_report +from resources_servers.swe_bench.harnesses.swebench import SweBenchHarness + + +# Canned eval-script log with the SWE-bench sentinels + pytest-style passing lines. +_PASSING_LOG = ">>>>> Start Test Output\nPASSED t::a\nPASSED t::b\n>>>>> End Test Output\n" + + +class _FakeProvider: + """Scripted provider: returns a canned eval log for the eval-script run; records uploads. + + Args: + log_text: Text returned by the eval-script (``bash``) and ``cat`` commands. + exec_rc: Return code for the eval-script command. 
+ """ + + name = "fake-swebench" + + def __init__(self, *, log_text="", exec_rc=0, **_): + self._log_text = log_text + self._exec_rc = exec_rc + self.uploaded: dict[str, str] = {} + self.commands: list[str] = [] + + async def create(self, spec): + return SandboxHandle(sandbox_id="fake", provider_name=self.name, raw={"workdir": spec.workdir}) + + async def exec(self, handle, command, *, cwd=None, env=None, timeout_s=None, user=None): + self.commands.append(command) + rc = 0 if command.startswith("cat ") else self._exec_rc + return SandboxExecResult(stdout=self._log_text, stderr="", return_code=rc) + + async def upload_file(self, handle, local_path, remote_path): + try: + with open(local_path, encoding="utf-8") as fh: + self.uploaded[remote_path] = fh.read() + except OSError: + self.uploaded[remote_path] = "" + return None + + async def download_file(self, *a, **k): + return None + + async def status(self, handle): + return SandboxStatus.RUNNING + + async def close(self, handle): + return None + + async def aclose(self): + return None + + +register_provider("fake-swebench", _FakeProvider, override=True) + + +def _task(**overrides) -> SweTask: + """Build a swe-bench ``SweTask`` with sensible defaults.""" + base = dict( + instance_id="repo__inst-1", + image="img:tag", + base_commit="abc123", + repo_workdir="/testbed", + model_patch="diff --git a/x b/x\n", + fail_to_pass=["t::a"], + pass_to_pass=["t::b"], + benchmark="swe-bench", + split="test", + ) + base.update(overrides) + return SweTask(**base) + + +def test_grade_strategy_is_flat(): + assert SweBenchHarness("swe-bench").grade_strategy == "flat-host-grade" + assert SweBenchHarness("swe-bench-multilingual").grade_strategy == "flat-host-grade" + + +def test_unknown_family_rejected(): + with pytest.raises(ValueError): + SweBenchHarness("not-a-family") + + +def test_build_spec_image_workdir_metadata(): + spec = SweBenchHarness("swe-bench").build_spec(_task()) + assert spec.image == "img:tag" + assert spec.workdir == 
"/testbed" + assert spec.metadata["instance_id"] == "repo__inst-1" + assert spec.metadata["harness"] == "swe-bench" + + +def test_build_spec_preserves_task_provider_options(): + spec = SweBenchHarness("swe-bench").build_spec(_task(metadata={"provider_options": {"network": "host"}})) + assert spec.provider_options.get("network") == "host" + + +def test_supports_provider_any_exec_capable(): + harness = SweBenchHarness("swe-bench") + assert harness.supports_provider("docker") is True + assert harness.supports_provider("apptainer") is True + assert harness.supports_provider("opensandbox") is True + + +def test_with_flat_eval_is_self(): + harness = SweBenchHarness("swe-bench") + assert harness.with_flat_eval() is harness + + +def test_materialize_writes_patch_diff(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = SweBenchHarness("swe-bench") + task = _task() + env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task)) + await harness.materialize(env, task) + return env.sandbox._provider + + provider = asyncio.run(run()) + assert provider.uploaded.get("/root/patch.diff") == "diff --git a/x b/x\n" + + +def test_materialize_empty_patch_writes_nothing(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + async def run(): + harness = SweBenchHarness("swe-bench") + task = _task(model_patch="") + env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task)) + await harness.materialize(env, task) + return env.sandbox._provider + + provider = asyncio.run(run()) + assert "/root/patch.diff" not in provider.uploaded + + +def test_run_eval_then_grade_flat_resolved(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + # eval_script preset so flat_run_eval executes it; no instance_dict -> grade falls back to + # the generic flat parser over the canned passing log. 
+ async def run(): + harness = SweBenchHarness("swe-bench") + task = _task(metadata={"eval_script": "echo run"}) + env = await AsyncSweEnvironment.start({"fake-swebench": {"log_text": _PASSING_LOG}}, harness.build_spec(task)) + artifacts = await harness.run_eval(env, task) + return harness.grade(task, artifacts) + + report = asyncio.run(run()) + assert report.resolved is True + assert reward_from_report(report) == 1.0 + + +def test_run_eval_missing_eval_script_is_unmasked_unresolved(): + from resources_servers.swe_bench.sandbox import AsyncSweEnvironment + + # No instance_dict + no preset eval_script -> _flat_eval_script returns "" -> the run tags an + # eval_error, but grading no longer masks it: per main an unbuildable/empty spec grades as a + # legitimate unmasked unresolved (reward 0), not an eval_error mask. + async def run(): + harness = SweBenchHarness("swe-bench") + task = _task() + env = await AsyncSweEnvironment.start({"fake-swebench": {}}, harness.build_spec(task)) + artifacts = await harness.run_eval(env, task) + return harness.grade(task, artifacts) + + report = asyncio.run(run()) + assert report.error_kind is None + assert report.resolved is False + assert reward_from_report(report) == 0.0 + + +def test_grade_masks_on_infra_error(): + report = SweBenchHarness("swe-bench").grade(_task(), EvalArtifacts(raw={"error_type": "timeout"})) + assert report.error_kind == "timeout" + assert reward_from_report(report) == 0.0 + + +def test_flat_eval_script_empty_without_instance_dict(): + assert SweBenchHarness("swe-bench")._flat_eval_script(_task()) == "" + + +def test_grade_fails_loud_when_swebench_unavailable(monkeypatch): + """A SWE-bench instance whose ``swebench`` install is missing fails loud, not silent-degrade. + + Degrading to the generic pytest-only parser would mis-score non-pytest repos (e.g. django) as + unresolved, silently skewing the resolve rate. Instead grading raises ``GraderDependencyError`` + so the misconfiguration surfaces. 
+    """
+    import sys
+
+    from resources_servers.swe_bench.harness import GraderDependencyError
+
+    # Simulate a missing / broken swebench install for the import inside _swebench_flat_grade.
+    monkeypatch.setitem(sys.modules, "swebench.harness.constants", None)
+    harness = SweBenchHarness("swe-bench")
+    task = _task(metadata={"instance_dict": {"instance_id": "repo__inst-1", "repo": "x/y"}})
+    artifacts = EvalArtifacts(test_output=_PASSING_LOG, return_code=0, raw={})
+    with pytest.raises(GraderDependencyError):
+        harness.grade(task, artifacts)
diff --git a/resources_servers/swe_bench/tests/test_task.py b/resources_servers/swe_bench/tests/test_task.py
new file mode 100644
index 0000000000..4888746d86
--- /dev/null
+++ b/resources_servers/swe_bench/tests/test_task.py
@@ -0,0 +1,81 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import json
+
+import pytest
+
+from nemo_gym.openai_utils import NeMoGymResponseCreateParamsNonStreaming
+from resources_servers.swe_bench.task import (
+    ENVIRONMENT_NAME,
+    SweTask,
+    TaskSubmission,
+    build_task,
+    harness_family_key,
+    parse_submission,
+    parse_task_from_request,
+)
+
+
+def _sample_row() -> dict:
+    inst = {
+        "instance_id": "astropy__astropy-12907",
+        "base_commit": "abc123",
+        "test_patch": "",
+        "FAIL_TO_PASS": '["tests/test_x.py::a"]',
+        "PASS_TO_PASS": '["tests/test_x.py::b"]',
+    }
+    return {
+        "instance_id": "astropy__astropy-12907",
+        "dataset_name": "princeton-nlp/SWE-bench_Verified",
+        "split": "test",
+        "problem_statement": "Fix the bug.",
+        "instance_dict": json.dumps(inst),
+        "responses_create_params": NeMoGymResponseCreateParamsNonStreaming(
+            input=[{"role": "user", "content": "Fix the bug."}],
+        ),
+    }
+
+
+def test_harness_family_key_from_dataset_name() -> None:
+    assert harness_family_key("princeton-nlp/SWE-bench_Verified") == "swe-bench"
+    assert harness_family_key("something/R2E-Gym/foo") == "r2e-gym"
+
+
+def test_build_task_sets_benchmark_fields() -> None:
+    task = build_task(_sample_row(), container_formatter="swebench/sweb.eval.x86_64.{instance_id}")
+    assert task.task_id == "astropy__astropy-12907"
+    assert task.harness_family == "swe-bench"
+    assert task.dataset_name == "princeton-nlp/SWE-bench_Verified"
+    assert task.problem_statement == "Fix the bug."
+    assert task.metadata["instance_dict"]["base_commit"] == "abc123"
+
+
+def test_public_view_excludes_privileged_metadata() -> None:
+    task = build_task(_sample_row(), container_formatter="x.{instance_id}")
+    public = task.public_view()
+    assert public.task_id == task.task_id
+    assert public.environment == ENVIRONMENT_NAME
+    assert public.harness_family == "swe-bench"
+    assert not hasattr(public, "instance_dict")
+
+
+def test_parse_task_from_request_requires_instance_id() -> None:
+    class Body:
+        responses_create_params = None
+        verifier_metadata = {}
+
+    with pytest.raises(ValueError, match="instance_id"):
+        parse_task_from_request(Body(), container_formatter="x.{instance_id}")
+
+
+def test_with_submission() -> None:
+    task = SweTask(instance_id="x", benchmark="swe-bench")
+    updated = task.with_submission(TaskSubmission(model_patch="diff"))
+    assert updated.model_patch == "diff"
+
+
+def test_parse_submission_accepts_git_patch_alias() -> None:
+    assert parse_submission({"git_patch": "p"}).model_patch == "p"
diff --git a/resources_servers/swe_bench/verify_task.py b/resources_servers/swe_bench/verify_task.py
new file mode 100644
index 0000000000..3c08c5a3cf
--- /dev/null
+++ b/resources_servers/swe_bench/verify_task.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Verification orchestrator for the SWE environment.
+
+Grades an agent patch via the ``swe_bench`` resources server ``/verify`` endpoint.
+Runs a fresh-only sequence via ``acquire_sandbox`` (always-teardown), bounded by a
+per-call eval timeout.
+
+Every eval spec is stamped with a ``ttl_s`` so TTL-honoring backends (such as
+opensandbox) self-expire orphaned sandboxes.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import dataclasses
+from collections.abc import Mapping
+from typing import Any
+
+# Importing this package registers the swe_bench harnesses; the docker/apptainer
+# providers are built into nemo_gym.sandbox and resolve lazily (no import needed).
+import resources_servers.swe_bench.harnesses  # noqa: F401
+from nemo_gym.sandbox import SandboxProvider
+from resources_servers.swe_bench.harness import (
+    GraderDependencyError,
+    SweEvalReport,
+    SweTask,
+    get_harness,
+    reward_from_report,
+)
+from resources_servers.swe_bench.sandbox import acquire_sandbox
+
+
+#: Slack added to the eval timeout when stamping a sandbox TTL (covers spin-up +
+#: teardown so a TTL-honoring backend does not expire a still-running eval).
+_TTL_SLACK_S = 600.0
+
+
+class ProviderCapabilityError(RuntimeError):
+    """Raised when a task's harness does not support the configured provider."""
+
+
+def _provider_name(provider: Mapping[str, Any] | SandboxProvider) -> str:
+    """Return the provider's name.
+
+    Args:
+        provider: Either a single-key provider mapping or a ``SandboxProvider``
+            instance.
+
+    Returns:
+        str: The provider name, or ``"?"`` if it cannot be determined.
+    """
+    if isinstance(provider, Mapping):
+        return next(iter(provider), "?")
+    return getattr(provider, "name", "?")
+
+
+async def verify_task(
+    provider: Mapping[str, Any] | SandboxProvider,
+    task: SweTask,
+    *,
+    run_golden: bool = False,
+    eval_timeout_s: float | None = None,
+) -> SweEvalReport:
+    """Grade a task's patch in a fresh sandbox and return a report.
+
+    Selects the harness for the task's benchmark, optionally substitutes the
+    golden patch, then resets the repo, materializes the patch, runs the eval,
+    and grades the artifacts. An empty patch short-circuits without spinning up
+    a sandbox. A genuine wall-clock eval timeout is returned as a report carrying
+    ``error_kind="eval_timeout"``; other non-timeout eval-stage failures are
+    returned unmasked (``resolved=False``, ``error_kind=None``) to mirror main,
+    rather than raised.
+
+    Args:
+        provider: Single-key provider mapping or ``SandboxProvider`` selecting
+            the sandbox backend.
+        task: The task whose patch is graded.
+        run_golden: When True, grade the task's golden patch instead of the
+            model patch.
+        eval_timeout_s: Optional override for the per-call eval timeout in
+            seconds; falls back to the task metadata or a default.
+
+    Returns:
+        SweEvalReport: The grading outcome, with ``error_kind="eval_timeout"`` set
+            only on a genuine wall-clock eval timeout; non-timeout eval-stage
+            failures are reported unmasked (``resolved=False``, ``error_kind=None``).
+
+    Raises:
+        ProviderCapabilityError: If the task's harness does not support the provider.
+        GraderDependencyError: If a required grading dependency is unavailable for an
+            instance the harness must grade exactly (propagated, not swallowed).
+    """
+    harness = get_harness(task.benchmark)
+    if task.metadata.get("flat_eval"):
+        # Grade host-side (flat) so nested families (swe-bench / r2e-gym) can be graded on
+        # exec-only providers like docker; a no-op for already-flat families.
+        harness = harness.with_flat_eval()
+
+    if run_golden:
+        task = dataclasses.replace(task, model_patch=task.metadata.get("golden_patch", ""))
+
+    # Empty/falsy-patch fast path: skip eval spin-up entirely.
+    if not (task.model_patch or "").strip():
+        return SweEvalReport(instance_id=task.instance_id, patch_exists=False, resolved=False)
+
+    provider_name = _provider_name(provider)
+    if not harness.supports_provider(provider_name):
+        raise ProviderCapabilityError(
+            f"Harness {harness.name!r} does not support provider {provider_name!r} "
+            f"(grade_strategy={harness.grade_strategy})"
+        )
+
+    spec = harness.build_spec(task)
+    timeout = eval_timeout_s if eval_timeout_s is not None else float(task.metadata.get("eval_timeout_s", 1800))
+    # Stamp a TTL so backends that honor it (opensandbox) self-expire an eval sandbox
+    # orphaned by a hard crash. docker ignores ttl_s; its finally-teardown covers it.
+    if spec.ttl_s is None:
+        spec = dataclasses.replace(spec, ttl_s=timeout + _TTL_SLACK_S)
+
+    try:
+        async with acquire_sandbox(provider, spec, instance_id=task.instance_id) as env:
+
+            async def _sequence() -> SweEvalReport:
+                await harness.reset_repo(env, task)
+                await harness.materialize(env, task)
+                artifacts = await harness.run_eval(env, task)
+                return harness.grade(task, artifacts)
+
+            return await asyncio.wait_for(_sequence(), timeout=timeout)
+    except GraderDependencyError:
+        # A required grader dependency is missing (e.g. swebench for a SWE-bench instance).
+        # Propagate rather than degrading to an unmasked reward-0 so the misconfiguration is
+        # loud (a crash in the standalone path; every sample masked in the anyswe path) instead
+        # of silently skewing the resolve rate.
+        raise
+    except (asyncio.TimeoutError, TimeoutError):
+        # Genuine wall-clock eval timeout: mask via error_kind. This mirrors main's
+        # app.py, which sets eval_timed_out (-> mask_sample) only when the final eval
+        # elapsed time reaches the configured tests timeout.
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            error_kind="eval_timeout",
+            tests_status={"timeout_s": timeout},
+        )
+    except Exception as exc:  # non-timeout eval-stage failure -> unmasked reward 0
+        # A non-timeout eval-stage crash is NOT masked: main's app.py catches any eval
+        # exception, returns no report file (resolved=False) and leaves eval_timed_out
+        # False, so the sample stays in the gradient at reward 0. Returning
+        # error_kind=None here keeps mask_sample aligned with main rather than masking
+        # the infra crash (which main does not do).
+        return SweEvalReport(
+            instance_id=task.instance_id,
+            patch_exists=bool(task.model_patch),
+            resolved=False,
+            error_kind=None,
+            tests_status={"exception": repr(exc)},
+        )
+
+
+def report_to_reward(report: SweEvalReport) -> float:
+    """Convert an eval report into a scalar reward.
+
+    Args:
+        report: The grading outcome to score.
+
+    Returns:
+        float: The reward derived from the report.
+    """
+    return reward_from_report(report)
diff --git a/responses_api_agents/claude_code_agent/app.py b/responses_api_agents/claude_code_agent/app.py
index 6970d92ce1..9a399a5094 100644
--- a/responses_api_agents/claude_code_agent/app.py
+++ b/responses_api_agents/claude_code_agent/app.py
@@ -15,9 +15,11 @@
 
 import asyncio
 import copy
+import dataclasses
 import json
 import logging
 import os
+import shlex
 import shutil
 import subprocess
 import tempfile
@@ -50,6 +52,7 @@
     NeMoGymResponseOutputTokensDetails,
     NeMoGymResponseUsage,
 )
+from nemo_gym.sandbox import AsyncSandbox, SandboxResources, SandboxSpec
 from nemo_gym.server_utils import get_response_json, raise_for_status
 from nemo_gym.skills import stage_skills
 from responses_api_agents.claude_code_agent.setup_claude_code import ensure_claude_code
@@ -237,10 +240,13 @@ class ClaudeCodeAgentConfig(BaseResponsesAPIAgentConfig):
     bare: bool = True
     mcp_config: Optional[str] = None
     settings: Optional[str] = None
+    sandbox_provider: Optional[dict[str, Any]] = None
+    in_box_timeout_s: int = 1800
 
 
 class ClaudeCodeAgentRunRequest(BaseRunRequest):
     model_config = ConfigDict(extra="allow")
+    verifier_metadata: Optional[dict[str, Any]] = None
 
 
 class ClaudeCodeAgentVerifyResponse(BaseVerifyResponse):
@@ -490,6 +496,131 @@ def _write_rollout_mcp_config(self, seed_response_json: dict[str, Any], output_d
         config_path.write_text(json.dumps(config, indent=2, sort_keys=True))
         return str(config_path)
 
+    @staticmethod
+    def _sandbox_spec_from_descriptor(spec_dict: dict[str, Any]) -> SandboxSpec:
+        payload = dict(spec_dict)
+        resources = payload.pop("resources", None)
+        if resources is None:
+            resources = SandboxResources()
+        elif not isinstance(resources, SandboxResources):
+            resources = SandboxResources.from_mapping(resources)
+        return SandboxSpec(**payload, resources=resources)
+
+    def _anthropic_env(self) -> tuple[dict[str, str], str]:
+        base_url = self._resolve_base_url()
+        model = self.config.model if base_url else self.config.model.split("/")[-1]
+        api_key = self.config.anthropic_api_key
+        env = {
+            "ANTHROPIC_API_KEY": api_key,  # pragma: allowlist secret
+            "ANTHROPIC_MODEL": model,
+            "ANTHROPIC_DEFAULT_HAIKU_MODEL": model,
+            "ANTHROPIC_DEFAULT_SONNET_MODEL": model,
+            "ANTHROPIC_DEFAULT_OPUS_MODEL": model,
+            "CLAUDE_CODE_SUBAGENT_MODEL": model,
+            "IS_SANDBOX": "1",
+        }
+        if base_url:
+            env["ANTHROPIC_BASE_URL"] = base_url
+            env["ANTHROPIC_AUTH_TOKEN"] = api_key or "local"
+        return env, model
+
+    async def _run_in_box(
+        self,
+        body: ClaudeCodeAgentRunRequest,
+        seed_resp_json: dict[str, Any],
+        *,
+        skills_path: Optional[str] = None,
+    ) -> tuple[dict[str, Any], str]:
+        spec_dict = (seed_resp_json.get("sandbox") or {}).get("spec") or {}
+        workdir = spec_dict.get("workdir") or "/testbed"
+        spec = self._sandbox_spec_from_descriptor(spec_dict)
+        egress_env = (seed_resp_json.get("egress") or {}).get("env") or {}
+        anthropic_env, model = self._anthropic_env()
+        spec = dataclasses.replace(spec, env={**spec.env, **egress_env, **anthropic_env})
+
+        provider = self.config.sandbox_provider or {"docker": {}}
+        sandbox = AsyncSandbox(provider, spec)
+        await sandbox.start()
+        claude_config_dir: Path | None = None
+        try:
+            claude_config_dir = self._setup_config_dir(skills_path=skills_path)
+            remote_cfg = "/tmp/nemo_gym_claude"
+            await sandbox.exec(f"mkdir -p {shlex.quote(remote_cfg)}", cwd=workdir, timeout_s=60)
+            await sandbox.upload(str(claude_config_dir / "settings.json"), f"{remote_cfg}/settings.json")
+
+            params = body.responses_create_params.model_copy(deep=True)
+            if isinstance(params.input, str):
+                params.input = [NeMoGymEasyInputMessage(role="user", content=params.input)]
+            user_message, input_system = _extract_instruction(params.input)
+            system_parts = [p for p in [self.config.system_prompt, input_system] if p]
+            system_prompt = "\n\n".join(system_parts) if system_parts else None
+
+            cmd_parts = self._build_command(
+                model,
+                user_message,
+                system_prompt=system_prompt,
+                skills_active=bool(skills_path),
+            )
+            env_prefix = " ".join(f"{shlex.quote(k)}={shlex.quote(v)}" for k, v in spec.env.items())
+            remote_cmd = f"{env_prefix} CLAUDE_CONFIG_DIR={shlex.quote(remote_cfg)} {shlex.join(cmd_parts)}"
+            result = await sandbox.exec(remote_cmd, cwd=workdir, timeout_s=self.config.in_box_timeout_s)
+            stdout = result.stdout or ""
+            if result.error_type == "timeout":
+                LOG.warning("claude-code in-box timed out after %ss", self.config.in_box_timeout_s)
+            elif result.return_code not in (0, None) and stdout.strip() == "":
+                LOG.warning(
+                    "claude-code in-box exited %s: %s",
+                    result.return_code,
+                    (result.stderr or "")[:500],
+                )
+
+            output_items, usage = parse_stream_json(stdout)
+            if not any(
+                getattr(item, "type", None) == "message" and getattr(item, "role", None) == "assistant"
+                for item in output_items
+            ):
+                output_items.append(
+                    NeMoGymResponseOutputMessage(
+                        id=f"msg_{uuid4().hex}",
+                        content=[NeMoGymResponseOutputText(text="", annotations=[])],
+                        role="assistant",
+                        status="completed",
+                        type="message",
+                    )
+                )
+
+            input_tokens = usage.get("input_tokens", 0)
+            output_tokens = usage.get("output_tokens", 0)
+            agent_resp = NeMoGymResponse(
+                id=f"resp_{uuid4().hex}",
+                created_at=int(time()),
+                model=model,
+                object="response",
+                output=output_items,
+                tool_choice=params.tool_choice,
+                tools=params.tools,
+                parallel_tool_calls=params.parallel_tool_calls,
+                usage=NeMoGymResponseUsage(
+                    input_tokens=input_tokens,
+                    input_tokens_details=NeMoGymResponseInputTokensDetails(cached_tokens=0),
+                    output_tokens=output_tokens,
+                    output_tokens_details=NeMoGymResponseOutputTokensDetails(reasoning_tokens=0),
+                    total_tokens=input_tokens + output_tokens,
+                ),
+            )
+
+            patch_result = await sandbox.exec(
+                f"cd {shlex.quote(workdir)} && git add -A && git diff --cached",
+                cwd=workdir,
+                timeout_s=120,
+            )
+            patch = patch_result.stdout or ""
+            return agent_resp.model_dump(mode="json"), patch
+        finally:
+            if claude_config_dir is not None:
+                shutil.rmtree(claude_config_dir, ignore_errors=True)
+            await sandbox.stop()
+
     async def _create_response(
         self,
         body: NeMoGymResponseCreateParamsNonStreaming,
@@ -569,23 +700,32 @@ async def run(self, request: Request, body: ClaudeCodeAgentRunRequest) -> Claude
         cookies = seed_resp.cookies
         seed_resp_json = await get_response_json(seed_resp)
 
-        # The run-level skills_ref (stamped by rollout collection) rides on the request body
-        # (extra="allow"). Pass its path straight into _create_response so the CLI invocation
-        # can stage the skills into its per-request CLAUDE_CONFIG_DIR. run() calls _create_response
-        # in-process, so no metadata side-channel is needed (unlike the schema-forbidden HTTP path).
         skills_path = ((body.model_extra or {}).get(SKILLS_REF_KEY_NAME) or {}).get("path")
-
-        with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
-            mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
-            agent_resp = await self._create_response(
-                body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
-            )
-        agent_resp_json = agent_resp.model_dump(mode="json")
+        topology = (seed_resp_json.get("placement") or {}).get("topology") or "none"
+
+        if topology == "agent_in_env":
+            agent_resp_json, model_patch = await self._run_in_box(body, seed_resp_json, skills_path=skills_path)
+            verifier_metadata = {
+                **(body.verifier_metadata or {}),
+                **(seed_resp_json.get("verifier_metadata") or {}),
+                "model_patch": model_patch,
+            }
+        else:
+            with tempfile.TemporaryDirectory(prefix="nemo_gym_claude_mcp_") as mcp_config_dir:
+                mcp_config = self._write_rollout_mcp_config(seed_resp_json, Path(mcp_config_dir))
+                agent_resp = await self._create_response(
+                    body.responses_create_params, mcp_config=mcp_config, skills_path=skills_path
+                )
+            agent_resp_json = agent_resp.model_dump(mode="json")
+            verifier_metadata = {
+                **(body.verifier_metadata or {}),
+                **(seed_resp_json.get("verifier_metadata") or {}),
+            }
 
         verify_resp = await self.server_client.post(
             server_name=self.config.resources_server.name,
             url_path="/verify",
-            json=body.model_dump() | {"response": agent_resp_json},
+            json=body.model_dump() | {"response": agent_resp_json, "verifier_metadata": verifier_metadata},
             cookies=cookies,
         )
         await raise_for_status(verify_resp)
diff --git a/tests/unit_tests/test_docker_provider.py b/tests/unit_tests/test_docker_provider.py
new file mode 100644
index 0000000000..46062a60d6
--- /dev/null
+++ b/tests/unit_tests/test_docker_provider.py
@@ -0,0 +1,213 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the local Docker ``SandboxProvider`` (CLI mocked, no docker required)."""
+
+import asyncio
+from pathlib import Path
+from typing import Any, Callable
+
+import pytest
+
+from nemo_gym.sandbox.providers.base import (
+    SandboxCreateError,
+    SandboxResources,
+    SandboxSpec,
+    SandboxStatus,
+)
+from nemo_gym.sandbox.providers.docker.provider import DockerSandboxProvider
+
+
+class RunRecorder:
+    """Stand-in for ``DockerSandboxProvider._run`` that records argv and returns canned output.
+
+    The responder maps the captured ``docker`` args to a ``(rc, stdout, stderr)`` tuple, and may
+    raise (e.g. ``TimeoutError``) to simulate a CLI failure.
+    """
+
+    def __init__(self, responder: Callable[[list[str]], tuple[int, str, str]]) -> None:
+        self.calls: list[dict[str, Any]] = []
+        self._responder = responder
+
+    async def __call__(self, *args: str, timeout_s: float | None = None) -> tuple[int, str, str]:
+        self.calls.append({"args": list(args), "timeout_s": timeout_s})
+        return self._responder(list(args))
+
+
+def _make_provider(
+    monkeypatch: pytest.MonkeyPatch, responder: Callable[[list[str]], tuple[int, str, str]], **kwargs: Any
+) -> tuple[DockerSandboxProvider, RunRecorder]:
+    provider = DockerSandboxProvider(**kwargs)
+    rec = RunRecorder(responder)
+    monkeypatch.setattr(provider, "_run", rec)
+    return provider, rec
+
+
+def _ran(rec: RunRecorder, *prefix: str) -> bool:
+    """True if any recorded call's args start with ``prefix`` (e.g. ``"rm", "-f"``)."""
+    return any(call["args"][: len(prefix)] == list(prefix) for call in rec.calls)
+
+
+# --------------------------------------------------------------------------- #
+# Construction
+# --------------------------------------------------------------------------- #
+def test_concurrency_must_be_positive() -> None:
+    """A non-positive concurrency is rejected up front."""
+    with pytest.raises(ValueError):
+        DockerSandboxProvider(concurrency=0)
+
+
+def test_concurrency_bounds_the_semaphore() -> None:
+    """The provider's shared semaphore is sized to the configured concurrency."""
+    assert DockerSandboxProvider(concurrency=4)._semaphore._value == 4
+
+
+# --------------------------------------------------------------------------- #
+# create()
+# --------------------------------------------------------------------------- #
+def test_create_returns_handle_with_last_line_id(monkeypatch: pytest.MonkeyPatch) -> None:
+    """create() uses the LAST stdout line as the container id and pre-assigns a unique name."""
+    provider, rec = _make_provider(
+        monkeypatch, lambda args: (0, "WARNING: noise\ncontainer-abc\n", ""), network="host"
+    )
+    handle = asyncio.run(provider.create(SandboxSpec(image="img:tag", workdir="/testbed", env={"A": "1"})))
+    assert handle.sandbox_id == "container-abc"
+    run_args = rec.calls[0]["args"]
+    assert run_args[:3] == ["run", "-d", "--init"]
+    assert "--name" in run_args and run_args[run_args.index("--name") + 1].startswith("nemo-gym-")
+    assert ["--network", "host"] == run_args[run_args.index("--network") : run_args.index("--network") + 2]
+    assert "img:tag" in run_args
+
+
+def test_create_requires_image() -> None:
+    """A spec without an image is rejected before any docker call."""
+    with pytest.raises(SandboxCreateError):
+        asyncio.run(DockerSandboxProvider().create(SandboxSpec(image=None)))
+
+
+def test_create_empty_stdout_guard_and_reap(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 0 with empty stdout raises (no IndexError) and reaps the pre-assigned name."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, " \n", "") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="did not return a container id"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_nonzero_rc_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A non-zero ``docker run`` reaps the orphan and raises with the stderr."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (125, "", "boom") if args[0] == "run" else (0, "", ""))
+    with pytest.raises(SandboxCreateError, match="boom"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_timeout_reaps(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out ``docker run`` reaps the (possibly daemon-started) orphan by name."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        if args[0] == "run":
+            raise asyncio.TimeoutError
+        return (0, "", "")
+
+    provider, rec = _make_provider(monkeypatch, responder)
+    with pytest.raises(SandboxCreateError, match="timed out"):
+        asyncio.run(provider.create(SandboxSpec(image="img:tag")))
+    assert _ran(rec, "rm", "-f")
+
+
+def test_create_applies_resource_limits(monkeypatch: pytest.MonkeyPatch) -> None:
+    """Resource requests become ``--memory``/``--cpus``/``--gpus`` run args."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "cid\n", ""))
+    spec = SandboxSpec(image="img:tag", resources=SandboxResources(cpu=2, memory_mib=512, gpu=1))
+    asyncio.run(provider.create(spec))
+    run_args = rec.calls[0]["args"]
+    assert "--memory=512m" in run_args
+    assert "--cpus=2" in run_args
+    assert "--gpus=all" in run_args
+
+
+# --------------------------------------------------------------------------- #
+# exec()
+# --------------------------------------------------------------------------- #
+def test_exec_classifies_docker_level_failure(monkeypatch: pytest.MonkeyPatch) -> None:
+    """rc 125/126/127 with no stdout is a docker-level (``sandbox``) failure."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (125, "", "no such container"))
+    res = asyncio.run(provider.exec(_handle(), "echo hi"))
+    assert res.return_code == 125
+    assert res.error_type == "sandbox"
+
+
+def test_exec_success_has_no_error_type(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A successful exec carries stdout and no error type."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "ok", ""))
+    res = asyncio.run(provider.exec(_handle(), "true"))
+    assert res.return_code == 0 and res.stdout == "ok" and res.error_type is None
+
+
+def test_exec_timeout_returns_124(monkeypatch: pytest.MonkeyPatch) -> None:
+    """A timed-out exec returns rc 124 + ``timeout`` error type rather than raising."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise asyncio.TimeoutError
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    res = asyncio.run(provider.exec(_handle(), "sleep 1", timeout_s=0.01))
+    assert res.return_code == 124 and res.error_type == "timeout"
+
+
+# --------------------------------------------------------------------------- #
+# status / close / file transfer
+# --------------------------------------------------------------------------- #
+def test_status_running_and_stopped(monkeypatch: pytest.MonkeyPatch) -> None:
+    """status() maps docker inspect output to RUNNING/STOPPED/UNKNOWN."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "true\n", ""))
+    assert asyncio.run(provider.status(_handle())) is SandboxStatus.RUNNING
+    provider2, _ = _make_provider(monkeypatch, lambda args: (0, "false\n", ""))
+    assert asyncio.run(provider2.status(_handle())) is SandboxStatus.STOPPED
+    provider3, _ = _make_provider(monkeypatch, lambda args: (1, "", "gone"))
+    assert asyncio.run(provider3.status(_handle())) is SandboxStatus.UNKNOWN
+
+
+def test_close_force_removes(monkeypatch: pytest.MonkeyPatch) -> None:
+    """close() force-removes the container by id."""
+    provider, rec = _make_provider(monkeypatch, lambda args: (0, "", ""))
+    asyncio.run(provider.close(_handle()))
+    assert _ran(rec, "rm", "-f", "cid")
+
+
+def test_upload_failure_raises(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> None:
+    """A failed ``docker cp`` upload raises a clear RuntimeError."""
+    provider, _ = _make_provider(monkeypatch, lambda args: (0, "", "") if args[0] == "exec" else (1, "", "nope"))
+    src = tmp_path / "f.txt"
+    src.write_text("x")
+    with pytest.raises(RuntimeError, match="upload failed"):
+        asyncio.run(provider.upload_file(_handle(), src, "/dst/f.txt"))
+
+
+def test_reap_orphan_swallows_errors(monkeypatch: pytest.MonkeyPatch) -> None:
+    """_reap_orphan never raises, even when the ``docker rm`` itself fails/raises."""
+
+    def responder(args: list[str]) -> tuple[int, str, str]:
+        raise RuntimeError("docker daemon down")
+
+    provider, _ = _make_provider(monkeypatch, responder)
+    asyncio.run(provider._reap_orphan("nemo-gym-x"))  # must not raise
+
+
+def _handle():
+    """A minimal docker SandboxHandle for exec/status/close tests."""
+    from nemo_gym.sandbox.providers.base import SandboxHandle
+
+    return SandboxHandle(sandbox_id="cid", provider_name="docker", raw={"workdir": "/testbed"})