3 changes: 3 additions & 0 deletions .gitignore
@@ -237,3 +237,6 @@ env.yaml

# Backup files
*.backup

# Claude Code agent runtime artifacts
**/.claude_node/
1 change: 0 additions & 1 deletion README.md
@@ -196,7 +196,6 @@ The Dataset column links to publicly available datasets (e.g., on HuggingFace).
| Code Gen | coding | Model must submit the right code to solve a problem | Improve competitive coding capabilities | ✓ | ✓ | Apache 2.0 | <a href='resources_servers/code_gen/configs/code_gen.yaml'>code_gen.yaml</a> | <a href='https://huggingface.co/datasets/nvidia/nemotron-RL-coding-competitive_coding'>nemotron-RL-coding-competitive_coding</a> |
| Competitive Coding Challenges | coding | Execution of competitive programming competition questions | Improve competitive coding capabilities on contest-style problems | - | - | - | <a href='resources_servers/competitive_coding_challenges/configs/competitive_coding_challenges.yaml'>competitive_coding_challenges.yaml</a> | - |
| Critpt | other | Research-level physics problems scored by the Artificial Analysis API | Evaluate model performance on research-level physics reasoning | - | - | - | <a href='resources_servers/critpt/configs/critpt.yaml'>critpt.yaml</a> | - |
| Cvdp | coding | CVDP benchmark dataset for code generation | Evaluate RTL code generation capabilities | - | ✓ | - | <a href='resources_servers/cvdp/configs/cvdp.yaml'>cvdp.yaml</a> | - |
| Equivalence Llm Judge | agent | Short bash command generation questions with LLM-as-a-judge | Improve foundational bash and IF capabilities | ✓ | ✓ | GNU General Public License v3.0 | <a href='resources_servers/equivalence_llm_judge/configs/nl2bash-equivalency.yaml'>nl2bash-equivalency.yaml</a> | - |
| Equivalence Llm Judge | knowledge | Short answer questions with LLM-as-a-judge | Improve knowledge-related benchmarks like GPQA / HLE | - | - | - | <a href='resources_servers/equivalence_llm_judge/configs/equivalence_llm_judge.yaml'>equivalence_llm_judge.yaml</a> | - |
| Equivalence Rule | knowledge | Question - Answering with rule-based reward | Improve retrieval and counting capabilities | - | - | - | <a href='resources_servers/equivalence_rule/configs/lc.yaml'>lc.yaml</a> | - |
110 changes: 98 additions & 12 deletions resources_servers/cvdp/README.md
@@ -1,4 +1,4 @@
# CVDP Benchmark
# CVDP Benchmark

This resources server is for model evaluation. It reproduces the [CVDP](https://github.com/NVlabs/cvdp_benchmark) benchmark.

@@ -14,31 +14,116 @@ JSONL entry so the server is self-contained.

Mirrors `repository.py` in the [CVDP source](https://github.com/NVlabs/cvdp_benchmark):

1. Parse model response via `ModelHelpers.parse_model_response()`
1. Obtain the candidate RTL: when the verify request carries `rtl_files` (the files the agent wrote on disk in the agentic flow), grade those; otherwise parse the model's text response via `ModelHelpers.parse_model_response()`
2. Write harness files to temp workspace — applies image placeholder substitutions
3. Write extracted RTL to `workdir/rtl/`
4. For each service in `docker-compose.yml`, pull the Docker image as a cached SIF file and run via `apptainer exec` with `--bind` mounts for `rtl/`, `verif/`, `docs/`, `src/`, `rundir/`
4. For each service in `docker-compose.yml`, pull the Docker image as a cached SIF file and run it through the Apptainer sandbox provider (`instance start` + `exec`) with `--bind` mounts for `rtl/`, `verif/`, `docs/`, `src/`, `rundir/`
5. Exit code `0` across all services → reward `1.0`; any failure → reward `0.0`
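
Steps 1 and 5 of the flow above can be sketched as two small functions. This is a minimal illustration, not the code in `app.py`: the function names and the dictionary shape (file path mapped to file contents) are hypothetical, and treating an empty list of exit codes as a failure is an assumption.

```python
from typing import Mapping, Optional, Sequence


def select_candidate_rtl(
    rtl_files: Optional[Mapping[str, str]],
    parsed_response: Mapping[str, str],
) -> Mapping[str, str]:
    """Prefer files reported by the agentic flow; otherwise use the parsed text response.

    Both arguments map a file path to its contents (hypothetical shape).
    """
    if rtl_files:
        return rtl_files
    return parsed_response


def compute_reward(exit_codes: Sequence[int]) -> float:
    """Return 1.0 only if every harness service exited with code 0.

    Assumption: no services at all counts as a failure (reward 0.0).
    """
    if exit_codes and all(code == 0 for code in exit_codes):
        return 1.0
    return 0.0


if __name__ == "__main__":
    parsed = {"rtl/adder.sv": "module adder; endmodule"}
    print(select_candidate_rtl(None, parsed))  # falls back to the parsed response
    print(compute_reward([0, 0]))              # all services passed
    print(compute_reward([0, 1]))              # one service failed
```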

Code layout: `app.py` owns the HTTP `verify` contract and reward scoring; the sandbox execution (docker-compose → Apptainer translation, SIF cache, provider lifecycle) lives in `harness.py`'s `HarnessRunner`.
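
As a rough illustration of the docker-compose to Apptainer translation, the sketch below builds the two argument lists for an `instance start` followed by an `exec`. It is not the `HarnessRunner` implementation: the function name, the instance name, and the `/code` container root are assumptions made for the example. Only the bound directory names come from the flow described above.

```python
from pathlib import PurePosixPath
from typing import List, Sequence, Tuple

# Directories the harness binds into the container (from the verify flow).
BIND_DIRS = ("rtl", "verif", "docs", "src", "rundir")


def build_instance_commands(
    sif_path: str,
    workdir: str,
    instance_name: str,
    command: Sequence[str],
    container_root: str = "/code",  # assumption: mount point inside the container
) -> Tuple[List[str], List[str]]:
    """Return (start_argv, exec_argv) for a bind-mounted Apptainer instance."""
    start = ["apptainer", "instance", "start"]
    for name in BIND_DIRS:
        host = PurePosixPath(workdir) / name
        target = PurePosixPath(container_root) / name
        start += ["--bind", f"{host}:{target}"]
    start += [sif_path, instance_name]

    # Commands run against the already started instance via the instance:// URI.
    exec_argv = ["apptainer", "exec", f"instance://{instance_name}", *command]
    return start, exec_argv


if __name__ == "__main__":
    start_argv, exec_argv = build_instance_commands(
        "/cache/sim.sif", "/tmp/work", "cvdp-0", ["pytest", "-q"]
    )
    print(" ".join(start_argv))
    print(" ".join(exec_argv))
```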

> **Note:** Both the verification harness here and the agentic agent share the same `ApptainerProvider`. Because `apptainer instance start` launches a long-lived instance, the provider starts it in "daemonize" mode (captures output to temp files and waits only for the foreground process) so the call returns immediately instead of blocking until the instance exits. This is internal to `create()` — nothing to configure here. See the [provider README](../../nemo_gym/sandbox/providers/apptainer/README.md#why-create-runs-instance-start-in-daemonize-mode).

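The "daemonize" behaviour described in the note can be illustrated with the standard library alone. The sketch below is an assumption about the general technique, not the `ApptainerProvider` code: output goes to temporary files rather than pipes, and the caller waits only for the foreground process, so a background process that keeps the output handles open cannot block the call. The function name is hypothetical.

```python
import subprocess
import sys
import tempfile
from typing import Sequence, Tuple


def run_without_pipe_blocking(cmd: Sequence[str], timeout: float = 60.0) -> Tuple[int, str, str]:
    """Run cmd, capture output via temp files, and wait only for the foreground process."""
    with tempfile.TemporaryFile() as out, tempfile.TemporaryFile() as err:
        proc = subprocess.Popen(cmd, stdout=out, stderr=err, stdin=subprocess.DEVNULL)
        # wait() returns when the launched process exits, even if a daemonized
        # child still holds the output file descriptors open.
        returncode = proc.wait(timeout=timeout)
        out.seek(0)
        err.seek(0)
        return (
            returncode,
            out.read().decode(errors="replace"),
            err.read().decode(errors="replace"),
        )


if __name__ == "__main__":
    code, stdout, stderr = run_without_pipe_blocking([sys.executable, "-c", "print('ready')"])
    print(code, stdout.strip())
```

With pipes, reading captured output can block until every process holding the write end has closed it, which is why redirecting to files is a common choice for commands that leave a background process running.
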
## Configuration


| Field | Default | Description |
| ------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------- |
| `oss_sim_image` | `ghcr.io/hdl/sim/osvb` | Container image for open-source simulation (Icarus) |
| `oss_pnr_image` | `""` | Container image for place-and-route problems |
| `eda_sim_image` | `""` | Commercial EDA image (Cadence Xcelium etc.) |
| `container_timeout` | `600` | Seconds before an Apptainer run is killed |
| `num_processes` | `4` | Max concurrent Apptainer jobs |
| `sif_cache_dir` | `~/.cache/nemo-gym/sif` | Directory for cached SIF images pulled from Docker registries |
| `harness_workspace_dir` | `""` | Optional host directory where per-rollout temp workspaces are created (default: system temp) |
| Field | Default | Description |
| ------------------------- | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `oss_sim_image` | `ghcr.io/hdl/sim/osvb` | Container image for open-source simulation (Icarus) |
| `oss_pnr_image` | `""` | Container image for place-and-route problems |
| `eda_sim_image` | `""` | Commercial EDA image (Cadence Xcelium etc.) |
| `container_timeout` | `600` | Seconds before an Apptainer run is killed |
| `num_processes` | `4` | Max concurrent Apptainer jobs |
| `sif_cache_dir` | `~/.cache/nemo-gym/sif` | Directory for cached SIF images pulled from Docker registries |
| `harness_workspace_dir` | `""` | Optional host directory where per-rollout temp workspaces are created (default: system temp) |
| `container_tmp_bind_path` | `""` | If set, redirects in-container temp (e.g. `/tmp`) to per-rollout host storage and forces temp env vars (`TMPDIR`, `XCELIUM_TMPDIR`, `CDS_LOCK`, `JAVA_TOOL_OPTIONS`) — useful when default `/tmp` is too small or tools (Cadence/Java) write large temp/lock artifacts |
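
For readers who prefer code to tables, the fields above can be pictured as a small dataclass. This is a hypothetical sketch: the class name, the field types, and the two helper methods are assumptions. Only the field names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass


@dataclass
class CvdpServerConfigSketch:
    """Hypothetical container mirroring the configuration table."""

    oss_sim_image: str = "ghcr.io/hdl/sim/osvb"
    oss_pnr_image: str = ""
    eda_sim_image: str = ""
    container_timeout: int = 600  # seconds
    num_processes: int = 4
    sif_cache_dir: str = "~/.cache/nemo-gym/sif"
    harness_workspace_dir: str = ""
    container_tmp_bind_path: str = ""

    def resolved_sif_cache_dir(self) -> str:
        """Expand a leading ~ so the cache path can be used on the host."""
        return os.path.expanduser(self.sif_cache_dir)

    def has_commercial_image(self) -> bool:
        """True when an EDA image has been configured (assumed to gate the commercial subset)."""
        return bool(self.eda_sim_image)


if __name__ == "__main__":
    cfg = CvdpServerConfigSketch(eda_sim_image="cvdp-cadence-verif:latest")
    print(cfg.resolved_sif_cache_dir())
    print(cfg.has_commercial_image())
```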


**Note**: To run the commercial subset, set the EDA image name in the YAML config file (`resources_servers/cvdp/configs/cvdp.yaml`):

```yaml
eda_sim_image: cvdp-cadence-verif:latest
```

## Agents

There are two ways to drive this resources server:

- **Non-agentic** (`cvdp_agent`, `responses_api_agents/cvdp_agent/app.py`, config `configs/cvdp_agent.yaml`): the model emits the RTL directly in its text response; the server parses it out and runs the harness.

Review comment (Contributor): It's a bit confusing to describe this path as non-agentic, considering that `cvdp_agent` is itself an agent that we are using in the first scenario.

- **Agentic** (`cvdp_agent_agentic`, `responses_api_agents/cvdp_agent/agentic_app.py`, config `configs/cvdp_agent_agentic.yaml`): runs Claude Code **inside** the EDA sim container so it can edit files on disk and self-test with the in-container EDA tools, then reports the files it wrote back to the server as `rtl_files` for grading. See [`responses_api_agents/cvdp_agent/`](../../responses_api_agents/cvdp_agent/).

Review comment (Contributor): I'd recommend treating the harness as a first-class composable unit: describe this as an illustration that uses Claude Code, and note that other harnesses could be swapped in as well.


### Agentic agent settings (`configs/cvdp_agent_agentic.yaml`)


| Field | Default | Description |
| -------------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
| `model` | `${anthropic_model_name}` | Claude model used inside the container |
| `anthropic_api_key` | `${anthropic_api_key}` | API key for Claude (set via env.yaml) |
| `anthropic_base_url` | `${anthropic_base_url}` | Anthropic-compatible endpoint |
| `sim_image` | `nvidia/cvdp-sim:v1.0.0` | EDA sim image Claude runs inside (pulled/converted to a cached `.sif`) |
| `sif_path` | `null` | Explicit `.sif` to use instead of pulling `sim_image` |
| `sif_cache_dir` | `""` | SIF cache dir (defaults to `~/.cache/nemo-gym/sif`) |
| `claude_node_dir` | `""` | Host Node+Claude prefix to bind into the container (defaults to a built-in self-contained one) |
| `container_workdir` | `/code` | Workspace mount point + cwd + `HOME` inside the container |
| `max_turns` | `30` | Max Claude Code turns |
| `timeout` | `900` | Per-task wall-clock budget (seconds) |
| `concurrency` | `4` | Max concurrent agent runs |
| `max_context_tokens` | `1000000` | Sets `CLAUDE_CODE_MAX_CONTEXT_TOKENS` inside the container |


`system_prompt`, `allowed_tools`, `disallowed_tools`, and `claude_code_version` are inherited Claude Code knobs (leave `null` for defaults).

Add the Claude settings to your repo-root `env.yaml`:

```yaml
anthropic_model_name: <claude-model>
anthropic_api_key: <your-api-key>
anthropic_base_url: https://api.anthropic.com
```

To run the agentic variant, swap the agent config in and target the agent by name (no separate model server — the agent calls Claude itself):

```bash
ng_run "+config_paths=[resources_servers/cvdp/configs/cvdp.yaml,responses_api_agents/cvdp_agent/configs/cvdp_agent_agentic.yaml]"

ng_collect_rollouts \
+agent_name=cvdp_agent_agentic \
+input_jsonl_fpath=resources_servers/cvdp/data/<dataset>.jsonl \
+output_jsonl_fpath=results/rollouts.jsonl \
+num_repeats=5 \
+num_samples_in_parallel=4 \
"+config_paths=[resources_servers/cvdp/configs/cvdp.yaml,responses_api_agents/cvdp_agent/configs/cvdp_agent_agentic.yaml]"
```

## Build the Open-Source Simulation Image

If you're using the CVDP v1.1.0 data (e.g. `data/example_agentic.jsonl`), build the open-source
simulation image **once** before collecting rollouts. CVDP v1.1.0 uses a dedicated open-source
simulation image for non-commercial simulation tasks:

```bash
cd /path/to/cvdp_benchmark
docker build -f docker/Dockerfile.sim -t nvidia/cvdp-sim:v1.0.0 .
```

This image provides the default `OSS_SIM_IMAGE` environment used by dataset harnesses via
`__OSS_SIM_IMAGE__`. CVDP v1.1.0 no longer uses the legacy third-party simulation images for this
default open-source simulation flow. The build includes cocotb 2.0.1, pytest 8.3.2, Icarus Verilog
v13_0, Yosys yosys-0.40, and Verilator v5.038.
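
The `__OSS_SIM_IMAGE__` placeholder mentioned above can be pictured as a plain string substitution over the harness files. The sketch below is an illustration under assumptions: the function name and the sample compose text are hypothetical, and the real substitution logic in the harness may differ.

```python
from typing import Mapping


def substitute_image_placeholders(text: str, images: Mapping[str, str]) -> str:
    """Replace each placeholder with its configured image, skipping unset images."""
    for placeholder, image in images.items():
        if image:
            text = text.replace(placeholder, image)
    return text


if __name__ == "__main__":
    # Hypothetical docker-compose fragment from a dataset harness.
    compose = "services:\n  sim:\n    image: __OSS_SIM_IMAGE__\n"
    rendered = substitute_image_placeholders(
        compose, {"__OSS_SIM_IMAGE__": "nvidia/cvdp-sim:v1.0.0"}
    )
    print(rendered)
```

Leaving a placeholder untouched when its image is unset is a design choice made for this example, so a missing setting stays visible in the rendered file instead of producing an empty `image:` value.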

If you tag the image differently, set the matching value in `.env`:

```bash
OSS_SIM_IMAGE=nvidia/cvdp-sim:v1.0.0
```

Open-source place-and-route tasks still use the separate `OSS_PNR_IMAGE` setting, but in CVDP
v1.1.0 its default points at the same `nvidia/cvdp-sim:v1.0.0` image:

```bash
OSS_PNR_IMAGE=nvidia/cvdp-sim:v1.0.0
```

## Download Dataset

The data can be found [on Hugging Face](https://huggingface.co/datasets/nvidia/cvdp-benchmark-dataset).
@@ -113,6 +198,7 @@ pre-commit install
```

To install Apptainer:

```bash
wget https://github.com/apptainer/apptainer/releases/download/v1.3.1/apptainer_1.3.1_amd64.deb
apt install -y ./apptainer_1.3.1_amd64.deb