Skip to content

Sdd/dev#1

Open
sdevare-nv wants to merge 21 commits into
devfrom
sdd/dev
Open

Sdd/dev#1
sdevare-nv wants to merge 21 commits into
devfrom
sdd/dev

Conversation

@sdevare-nv

Copy link
Copy Markdown
Owner

Issue for this PR

Closes #

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Please provide a description of the issue, the changes you made to fix it, and why they work. It is expected that you understand why your changes work and if you do not understand why at least say as much so a maintainer knows how much to value the PR.

If you paste a large clearly AI generated description here your PR may be IGNORED or CLOSED!

How did you verify your code works?

Screenshots / recordings

If this is a UI change, please include a screenshot or recording.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

If you do not follow this template your PR will be automatically rejected.

sdevare-nv and others added 21 commits May 6, 2026 15:58
The opencode CLI registers TUI subcommands (Attach, TuiThread) via eager
imports in index.ts. Loading them transitively imports cli/cmd/tui/app.tsx,
whose JSX is meant to compile against @opentui/solid (per the package
tsconfig's jsxImportSource) but is incorrectly resolved against
react/jsx-dev-runtime when Bun runs the un-bundled .ts file at runtime —
react isn't a dep of packages/opencode, so the bench harness crashes at
startup before the run command ever executes.

The bench harness (packages/opencode/src/bench/cli.ts) only ever spawns
\`bun src/index.ts run\` as a subprocess; it never invokes the TUI. Remove
the imports + command registrations entirely so the cli/cmd/tui/ subtree
is unreachable from the bench code path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running opencode's CLI un-bundled (`bun src/index.ts run`) triggers
cascading runtime-resolution failures: tsconfig's `jsxImportSource`
isn't honored for inline `.tsx` JIT compilation (we already removed the
TUI commands to dodge that), and bun's isolated install layout under
`node_modules/.bun/<pkg>@<ver>/...` breaks `..`-relative `.mjs` imports
inside packages like @anthropic-ai/sdk (the `internal/to-file.mjs`
import from `core/uploads.mjs` fails to traverse the symlink the way
Node does at runtime).

opencode is meant to be shipped as a pre-bundled single file (their own
`bin/opencode` is built the same way) — `bun build` resolves every
transitive import statically and inlines the result, so runtime never
has to. The companion gym `setup_scripts/opencode.sh` now invokes
`bun build packages/opencode/src/index.ts --outdir .bench-build
--entry-naming opencode.js`. This commit teaches `bench/cli.ts` to
prefer that bundle when it exists, and falls back to `src/index.ts` for
dev / when the build step hasn't run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`provider/models.ts:137` does `import("./models-snapshot.js")` for a
static cache of models.dev metadata. Upstream generates this file at
build time via `script/generate.ts` (network fetch from models.dev),
and `.gitignore` correctly excludes it as a build artifact.

For the bench harness we register a single custom provider
(`@opencode-ai/nemo-gym`) in the per-instance opencode config and
never consult the snapshot. But `bun build --target=bun
packages/opencode/src/index.ts ...` still needs the import target to
exist at static-analysis time, otherwise it errors with:

    error: Could not resolve: "./models-snapshot.js"

Force-add empty stubs so the bundle build succeeds without running
`script/generate.ts` (which needs network + adds 5+ seconds). The
runtime `try:` lambda in models.ts handles an empty snapshot fine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfig

`PermissionActionConfig` accepts a glob-keyed map only for file/shell-style
permissions (edit, bash). webfetch / websearch use a different shape (a
single literal action), so emitting `{"*": "deny"}` for them trips the
config validator at startup:

    Error: Configuration is invalid at .../opencode.jsonc
    ↳ Expected PermissionActionConfig | undefined, got {"*":"deny"}
      agent.swe-bench.permission.webfetch
    ↳ Expected PermissionActionConfig | undefined, got {"*":"deny"}
      agent.swe-bench.permission.websearch

Both tools are already disabled via the agent's `tools:` block
(webfetch: false, websearch: false), so the permission entries are
redundant. Removing them lets the per-instance config load cleanly and
the run command actually start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
opencode's bundled-provider loader invokes the factory with
\`{ name, ...options }\` where `options` is the merged user + custom
provider defaults from `provider.ts`. For some providers opencode
injects an empty/static `headers: {}` into that merged options dict
(distinct from upstream openai-compatible's schema where `headers` is a
function `() => Record<string, string | undefined>`).

We were unconditionally invoking `this.cfg.headers?.()`. The optional
chain only short-circuits on null/undefined, so when opencode passed a
plain object it threw at session-start:

    Error: this.cfg.headers is not a function.
    (In 'this.cfg.headers?.()', 'this.cfg.headers' is an instance of Object)

Branch on `typeof headers` to support both shapes: call it if it's a
function, spread it directly if it's a plain object. Either way the
resulting kv pairs are merged into the per-request HTTP headers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The constructor merged caller config over defaults via:

    this.cfg = { requestTimeoutMs: 600_000, retries: 3, ...cfg }

opencode's provider loader invokes the factory with options that
include explicit `undefined` for optional fields. Our factory in
`provider/sdk/nemo-gym/index.ts` faithfully passes those through:

    new NemoGymLanguageModel(modelId, {
      ...,
      requestTimeoutMs: opts.requestTimeoutMs,  // undefined
      retries: opts.retries,                    // undefined
    })

The `...cfg` spread then overwrites the 600_000 / 3 defaults with
`undefined`. At runtime `setTimeout(fn, undefined)` is treated as
`setTimeout(fn, 0)` — so the abort timer fires before `fetch` can
even hand off the request. Result:

    NeMoGym chat completions failed after 3 attempts:
        AbortError: The operation was aborted.

with ~10 ms total elapsed between step_start and the error event.

Swap to the spread-first, coalesce-with-`??` pattern so undefineds
fall through to the defaults instead of clobbering them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the input row's responses_create_params.model is unset, the gym
side hands opencode an empty model string. opencode's session/provider
resolver then falls back to its sentinel `"default"` (or threads the
empty string straight through). We were faithfully POSTing
\`model: "default"\` to the gym openai_model server, which forwarded it
to OpenAI:

    The model `default` does not exist or you do not have access to it.

The gym openai_model server already has the right pattern for this:

    body_dict.setdefault("model", self.config.openai_model)

so omitting `model` from the outbound body lets the server fill in the
policy-configured model name (e.g. gpt-4.1-2025-04-14) without any
client-side knowledge of what the policy actually points to. Guard the
model assignment in `_buildRequestParams` to do exactly that.

The companion gym-side change is in OpenCodeHarnessProcessor.get_run_command:
when body.model is empty, fall back to the resolved
model_server_cfg.openai_model (or .model for vllm) before writing the
per-instance config, so opencode also sees a real string when it's
populating the agent's model field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The todowrite tool is just an in-session planning aid — explicit
multi-step task lists help the model decompose non-trivial fixes
without affecting token-ID contiguity or subagent dispatch. Was
disabled defensively; flipping it on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Strip git history past base_commit so the agent can't reach future
commits via `git checkout <future-sha>` / `git log` exploration.
Two-pass: careful per-ref iteration first; nuclear batch-delete
fallback for monorepos with thousands of refs (e.g. datadog-agent
with 5k+ release tags would time out the careful pass).

Runs once in cli.ts before spawning the opencode session, scoped to
the per-instance workspace, using the base_commit field from the
instance JSONL. `|| true` at the tail so a busted git state can't
kill the rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Mark each phase + each ref delete/reset with a `[deep_reset:careful]`
or `[deep_reset:nuclear]` line so the agent log shows what happened.
Mirrors the per-step `echo` markers in nv-OpenHands' run_infer.py
deep-reset path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
- language-model.ts: read sessionID/parentSessionID from `x-session-affinity`
  / `x-parent-session-id` request headers (set by opencode's session/llm.ts
  for non-opencode providers). Per-session turn counter prevents subagent
  dumps from clobbering the main session's. Filename includes sessionID
  so each session's per-turn JSONs sit side-by-side; payload now carries
  session_id, parent_session_id, and turn for downstream reconstruction
  of the agent tree.
- cli.ts: new --enable-subagents flag; toggles `task: <bool>` in the
  per-instance opencode.jsonc. Default off.
- run_infer.sh: forwards ENABLE_SUBAGENTS env -> --enable-subagents flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Replaces the 12-line bench-local DEFAULT_SYSTEM_PROMPT with the real
anthropic system prompt opencode ships (`session/prompt/anthropic.txt`,
~105 lines), plus a short SWE-bench addendum (don't commit/format the
diff, harness captures git diff as the patch, don't edit tests).

`--system-prompt <path>` override still wins when supplied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
The env block in session/system.ts (cwd, worktree, platform, today's
date) is appended to the system prompt every turn. The `Today's date`
field uses `new Date().toDateString()` — across a midnight rollover
the system message shifts, which breaks the RL prompt-token-prefix
invariant (turn[N+1].prompt_token_ids must extend turn[N]'s).

Bench mode sets OPENCODE_DISABLE_ENV_PROMPT=1 in the child env so the
environment() effect short-circuits to []. Normal `opencode run` is
unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Some apptainer images ENOENT on bare program names through Bun's
posix_spawn (libuv falls back to PATH-less execv on some kernels /
musl builds). Symptoms in the wild:

  [bench] deep_reset spawn error: ENOENT posix_spawn 'bash'
  [bench] wrote ... error=opencode_exit_999     # bun spawn also ENOENT

Fixes:
  - deep_reset.ts: probe /bin/bash, /usr/bin/bash, /bin/sh, /usr/bin/sh
    and spawn the first that exists. The script is POSIX-compatible so
    /bin/sh works fine when bash is absent (minimal/distroless SIFs).
    Skip deep_reset with a warning if no shell is found.
  - cli.ts runOpencode: use process.execPath (the currently-running bun
    binary) instead of bare "bun".
  - cli.ts captureGitDiff: probe /usr/bin/git, /bin/git, /usr/local/bin/git
    with bare "git" as last-ditch fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
The previous fix moved to absolute binary paths but `/bin/bash` still
ENOENTs through Bun's posix_spawn whenever the spawn passes a `cwd`
option on minimal apptainer images (e.g., SWE-bench's astropy SIF).
Root cause: libc lacks `posix_spawn_file_actions_addchdir_np`, so
libuv's cwd-handling fallback fails the spawn outright instead of
falling back to fork+chdir+exec.

Workaround: don't pass `cwd` to spawn at all.

  - deep_reset.ts: prepend `cd <workspaceRoot> && ` to the shell script
    itself (shellQuote handles paths with spaces / special chars).
  - cli.ts runOpencode: drop `cwd: workspaceRoot` from the bun spawn.
    Opencode's existing `--dir <workspaceRoot>` flag changes its own
    working directory, so spawn-level cwd was redundant anyway.
  - captureGitDiff already uses `git -C <workspaceRoot>` (no cwd).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Plain `git -C <workspaceRoot> diff` only reports changes to tracked
files. Every new file the agent creates via the `write` tool stays
untracked, so it was silently dropped from the captured patch and the
rollout was recorded as `patch_exists=false` / reward 0 even though the
agent ran to completion (`reason: "stop"`, bench `exit=0`).

In a recent SWE-bench batch (swebench_results_1780596228320_bcd81ba7),
8 of the 21 zero-byte patches were exactly this case — 7/8 had `write`
calls to brand-new paths and never staged. Example trajectory:
denoland-deno-8408-agentic wrote a 6499-byte std/encoding/csv_stringify.ts
from scratch, type-checked, ran a functional test, and stopped. git_patch
came back "" because the file was untracked.

Fix: mark untracked files as intent-to-add (`git add -AN`) so they
appear in `git diff` without being committed. Worktree-style diff is
preserved so the SWE-bench evaluator's `git apply` still works. Also
pass `--binary` so any binary artifacts the agent produces aren't
silently truncated by the textual diff path.

The system-prompt addendum already tells the model not to commit or
push, which is still correct after this fix — the harness now picks up
unstaged new files on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Some dataset SIFs (notably `swe-bench-ext`, and certain SWE-rebench
variants) copy the repo into the workspace as a flat source tree with
no `.git` directory. Without one:

  * runDeepReset's first `git rev-parse` fails, the outer `|| true`
    swallows the error, and deep_reset is a silent no-op.
  * captureGitDiff runs `git -C <workspace> diff` against a non-repo;
    git emits `fatal: not a git repository` to stderr and exits with
    empty stdout, so the function returns "".

Net effect: every rollout for those SIFs lands as `patch=0 bytes`
regardless of how many files the agent edited. Recent symptom — in
swebench_results_1781123430109_686771d1 (swe-bench-ext, 37 instances),
every patch was zero bytes despite trajectories with up to 30 successful
`edit` calls on the workspace. `cd /workspace/repo && git status` in
the same SIF returns `fatal: not a git repository`.

Port of nv-OpenHands' run_infer.py:1142-1156 swe-bench-ext baseline:
detect the missing `.git`, init a local repo, snapshot the pristine
tree as `opencode_bench_baseline`, and skip deep_reset (the upstream
base_commit SHA doesn't exist in the fresh repo, so deep_reset would
just fall through to its nuclear pass and noisily no-op anyway).

This composes with 52efa89 (`git add -AN && git diff --binary`) — the
prior commit handled `.git`-present-but-untracked-write-files; this
commit handles `.git`-missing-entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Falling off the end of main() with no explicit process.exit() left Bun
to drain the event loop on its own, which produced a flaky exit=1 even
on successful runs that wrote a valid output.jsonl with `error=none`.
The residual handles seem to come from the spawned opencode subprocess
(piped stdio that hasn't been explicitly closed) and from opencode's
sqlite migration code — both stay registered long enough that Bun
treats the runtime drain as a non-clean exit.

Symptom in DeepSeek-V4-Flash_opencode_0-0-12711152.out (SWE-rebench-V2
train shard): three instances logged

  [bench] wrote .../bench_run/output.jsonl (patch=N bytes, error=none)

with N=2001, 9046, ... — `error=none` means `result.exitCode === 0`,
so opencode itself exited cleanly and the patch was captured. But
apptainer exited 1, gym's runner raised
`RuntimeError("Command failed with return code 1")`, and the rollout
was discarded as a failure even though the bench succeeded.

Fix: explicitly `process.exit(result.exitCode === 0 ? 0 : 1)` after the
[bench] wrote log line. Deterministic, mirrors opencode's exit code,
and doesn't depend on the event-loop drain heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant