feat(runner): OpenShell sandbox integration with POD_IP injection and OpenShift RBAC #1698
- Fix source layout: add model.py, observability files, fixtures/, remove duplicate workspace.py
- Document AGUI_TOKEN session auth middleware and SDK_OPTIONS env var
- Document runtime model switching via POST /model
- Add 'Desired State: OpenShell Credential Isolation' section with migration path

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…hell sandbox integration

Update specs and docs to reflect the actual implemented state of the OpenShell sandbox integration, replacing the original "desired state" proposal with detailed implementation records including:
- Runner spec: replace proposed desired state with implemented architecture, verified isolation layers, required capabilities (7, not 1), policy format, CP integration, known limitations, and design decisions
- New security spec (openshell-sandbox.spec.md): formal RFC 2119 requirements for sandbox activation, network namespace isolation, TLS proxy, Landlock filesystem restrictions, privilege drop, seccomp-BPF, and ConfigMap propagation
- Adaptation doc: rewrite from proposal to implementation record with full debugging journey (7-error progression, EINVAL misdiagnosis, ptrace tracing), verified results, OpenShift SCC reference, and future work phases
- Security analysis: add implementation status, integration point status table, and lessons learned (file mode, 7 caps, Landlock ABI compat)
- Bookmarks: add OpenShell sandbox spec entry

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…dboxing

Add NVIDIA OpenShell Supervisor (v0.0.56, file mode) to the runner image, wrapping the Claude Code subprocess in five isolation layers: network namespace, TLS L7 proxy, Landlock filesystem sandbox, seccomp-BPF, and privilege drop to unprivileged sandbox user.

Dockerfile changes:
- Pin openshell-sandbox v0.0.56 from ghcr.io/nvidia/openshell/supervisor
- Add iproute package for network namespace management (ip netns)
- Create sandbox user/group for privilege drop target
- Pre-create /workspace owned by sandbox, /var/run/netns for mount points
- Symlink bundled Claude CLI to /usr/local/bin/claude for stable policy path
- Set /home/sandbox permissions to 755

New files:
- openshell-claude-wrapper.sh: dispatches to supervisor or direct claude based on OPENSHELL_ENABLED env var
- .openshell-ref/policy.rego: official OPA Rego from OpenShell repository
- .openshell-ref/policy.yaml: filesystem, network, process policy data with endpoint ACLs for Anthropic, Vertex AI, GitHub, GitLab, npm, PyPI

bridge.py: 1-line change sets cli_path to wrapper when OPENSHELL_ENABLED=true

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…iler

Add conditional OpenShell sandbox support to the CP reconciler, activated by OPENSHELL_ENABLED=true environment variable.

Reconciler changes (kube_reconciler.go):
- buildRunnerSecurityContext: grant 7 capabilities (NET_ADMIN, SYS_ADMIN, SYS_PTRACE, SETUID, SETGID, CHOWN, DAC_OVERRIDE), allowPrivilegeEscalation, runAsUser:0 when OpenShell enabled
- ensurePod: set pod-level seccompProfile to Unconfined
- buildVolumes/buildVolumeMounts: mount openshell-policy ConfigMap at /etc/openshell
- buildEnv: inject OPENSHELL_ENABLED, OPENSHELL_POLICY_RULES, OPENSHELL_POLICY_DATA
- ensureOpenShellPolicy: propagate policy ConfigMap from CP namespace to runner namespace

Config changes (config.go):
- OpenShellEnabled (from OPENSHELL_ENABLED env var)
- OpenShellPolicyName (from OPENSHELL_POLICY_CONFIGMAP, default: openshell-policy)

KubeClient changes (kubeclient.go):
- Add ConfigMapGVR, GetConfigMap, CreateConfigMap methods

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
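As an illustration of the conditional security context this commit describes, here is a minimal sketch. The `SecurityContext` struct is a simplified, hypothetical stand-in for the Kubernetes type, and the non-OpenShell default is an assumption; only the seven capability names, the privilege-escalation flag, and `runAsUser: 0` come from the commit message.

```go
package main

import "fmt"

// SecurityContext is a simplified, hypothetical stand-in for the Kubernetes
// container security context; only the fields named in the commit are modeled.
type SecurityContext struct {
	AddCapabilities          []string
	AllowPrivilegeEscalation bool
	RunAsUser                *int64
}

// openShellCapabilities lists the seven capabilities named in the commit message.
var openShellCapabilities = []string{
	"NET_ADMIN", "SYS_ADMIN", "SYS_PTRACE", "SETUID", "SETGID", "CHOWN", "DAC_OVERRIDE",
}

// buildRunnerSecurityContext returns the elevated context only when the
// sandbox is enabled. The restrictive default is an assumption for illustration.
func buildRunnerSecurityContext(openShellEnabled bool) SecurityContext {
	if !openShellEnabled {
		return SecurityContext{AllowPrivilegeEscalation: false}
	}
	root := int64(0)
	return SecurityContext{
		// Copy the slice so callers cannot mutate the shared list.
		AddCapabilities:          append([]string(nil), openShellCapabilities...),
		AllowPrivilegeEscalation: true,
		RunAsUser:                &root,
	}
}

func main() {
	sc := buildRunnerSecurityContext(true)
	fmt.Println(len(sc.AddCapabilities), sc.AllowPrivilegeEscalation, *sc.RunAsUser)
}
```

Keeping the elevated settings behind a single boolean makes it easy to assert in tests that a non-sandboxed runner never receives the extra capabilities.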
When a runner pod is terminated, the OpenShell supervisor's Drop cleanup for NetworkNamespace doesn't fire, leaving stale /var/run/netns/sandbox-* files and veth interfaces. On the next session, the supervisor creates a new sandbox namespace, but the old veth's 10.200.0.0/24 route persists, causing duplicate routes that prevent proxy connectivity (SYN-SENT).

The wrapper script now:
- Cleans stale sandbox-* netns and their veth interfaces before launch
- Remounts /proc/sys rw and disables rp_filter on default/all so ARP resolves on dynamically-created veth interfaces

Also updates policy.yaml to allow the bundled Claude CLI path and node-22 binary, and passes OPENSHELL_LOG_LEVEL from the reconciler.

Verified: openshell-v5 image deployed to ROSA, both simple and complex (multi-tool-use) sessions complete successfully through the sandbox.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…CLI hotkey indicators Inject POD_IP via Kubernetes Downward API so MCP sidecars are reachable from OpenShell sandboxes (which have isolated network namespaces). Add OpenShift build/image/route RBAC rules to project reconciler and ClusterRole manifests. Update CLI TUI with hotkey indicator rendering and sensible defaults for wrap and timestamp modes. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Include build, deploy, and implementation plan docs used for bootstrapping the OpenShell sandbox integration on ROSA clusters. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
📝 Walkthrough
Integrates NVIDIA OpenShell Supervisor as an optional sandbox for the Claude Code agent subprocess. Controlled by
Changes
OpenShell Sandbox Integration
CLI TUI Message Stream Defaults
Config and Tooling Housekeeping
Sequence Diagram(s)
sequenceDiagram
participant ControlPlane as KubeReconciler
participant KubeClient
participant CPNamespace as CP Namespace
participant SessionNS as Session Namespace
participant Runner as Runner Pod
ControlPlane->>KubeClient: ensureOpenShellPolicy(sessionNS)
KubeClient->>CPNamespace: GetConfigMap("openshell-policy")
CPNamespace-->>KubeClient: policy.rego + policy.yaml data
KubeClient->>SessionNS: CreateConfigMap("openshell-policy")
ControlPlane->>ControlPlane: buildRunnerSecurityContext(openShellEnabled=true)<br/>→ NET_ADMIN caps, allowPrivilegeEscalation
ControlPlane->>ControlPlane: seccompProfile: Unconfined (pod level)
ControlPlane->>ControlPlane: append openshell-policy volume → mount /etc/openshell
ControlPlane->>ControlPlane: buildCredentialSidecars(openShellEnabled=true)<br/>→ AMBIENT_MCP_URL uses $(POD_IP)
ControlPlane->>SessionNS: create runner Pod
Runner->>Runner: openshell-claude-wrapper.sh<br/>(OPENSHELL_ENABLED=true)
Runner->>Runner: cleanup stale netns, disable rp_filter
Runner->>Runner: exec /openshell-sandbox --rules /etc/openshell/policy.rego<br/>--data /etc/openshell/policy.yaml -- /usr/local/bin/claude
Runner->>Runner: OPA evaluates allow_network / allow_request per connection
Possibly related PRs
Important
Pre-merge checks failed
Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (5 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 851-858: The POD_IP environment variable injection using
envVarFromFieldRef is currently nested inside the useMCPSidecar condition block,
but it needs to be injected whenever OpenShellEnabled is true, regardless of
whether the MCP sidecar is enabled. Move the POD_IP injection outside of the
useMCPSidecar condition and gate it directly on r.cfg.OpenShellEnabled so that
credential sidecar URLs can properly resolve the $(POD_IP) placeholder in all
cases where OpenShell is enabled.
- Around line 777-783: The GetConfigMap call at line 777 on the target namespace
treats any error as "not present" and continues to copy from the source. This
masks non-NotFound errors like RBAC or transport issues. Instead of checking err
== nil, explicitly check if the error is a NotFound error using the appropriate
error type checking (such as k8s apierrors.IsNotFound). Only proceed to copy
from the source if the ConfigMap is truly not found; for any other error from
r.nsKube().GetConfigMap on the target namespace, propagate the error up
immediately rather than silently continuing.
- Around line 310-312: The ensureImageBuildAccess method call currently logs a
warning when it fails but continues execution, which can leave the namespace in
a partially provisioned state. Instead of just logging the error, you need to
return the error from this code path so that namespace provisioning fails
completely rather than continuing and potentially failing later at build time.
Remove the warning-only logging pattern around the ensureImageBuildAccess call
and either directly return the error or aggregate it with other errors as needed
to ensure the failure is properly propagated up the call chain.
In `@components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh`:
- Around line 144-146: The ClusterRole definition in the deploy-openshell.sh
script includes pods/exec in the resources array with dangerous verbs like
create, delete, and patch, which grants arbitrary command execution in any pod.
Either remove pods/exec from the resources list in the ClusterRole to restrict
this privilege, or if it is truly required for OpenShell functionality, add
explicit documentation above the ClusterRole definition (in comments within the
script) explaining the security rationale and why this elevated privilege is
necessary for production use.
📒 Files selected for processing (24)
- .mcp.json
- BOOKMARKS.md
- components/ambient-cli/cmd/acpctl/ambient/tui/views/messages.go
- components/ambient-control-plane/cmd/ambient-control-plane/main.go
- components/ambient-control-plane/internal/config/config.go
- components/ambient-control-plane/internal/kubeclient/kubeclient.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler.go
- components/ambient-control-plane/internal/reconciler/kube_reconciler_test.go
- components/ambient-control-plane/internal/reconciler/project_reconciler.go
- components/ambient-control-plane/internal/reconciler/project_reconciler_test.go
- components/manifests/base/rbac/control-plane-clusterrole.yaml
- components/pr-test/install-standard.sh
- components/runners/ambient-runner/.openshell-ref/PLAN.md
- components/runners/ambient-runner/.openshell-ref/build-openshell.sh
- components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh
- components/runners/ambient-runner/.openshell-ref/policy.rego
- components/runners/ambient-runner/.openshell-ref/policy.yaml
- components/runners/ambient-runner/Dockerfile
- components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
- components/runners/ambient-runner/openshell-claude-wrapper.sh
- docs/internal/agents/openshell-runner-adaptation.md
- docs/internal/agents/openshell-security-analysis.md
- specs/agents/runner.spec.md
- specs/security/openshell-sandbox.spec.md
    if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
        r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image build access")
    }
Do not continue after failing to grant image build RBAC.
Line 310 logs the error and continues. That leaves the namespace partially provisioned and can fail later at build time while session provisioning still reports success. Return an error from this path (or aggregate it explicitly) instead of warning-only continuation.
Proposed fix
      if err := r.ensureImagePullAccess(ctx, namespace); err != nil {
          r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image pull access")
      }
    - if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
    -     r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image build access")
    - }
    + if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
    +     return fmt.Errorf("granting image build access in %s: %w", namespace, err)
    + }
      }

As per coding guidelines, "Never silently swallow partial failures; every error path must propagate or be explicitly collected, never discarded."
Source: Coding guidelines
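The review allows either returning the error directly or aggregating it explicitly. A minimal sketch of the aggregation option follows, using `errors.Join` from the Go standard library (Go 1.20+). The `step` type, the `provisionNamespace` helper, and the `errForbidden` sentinel are hypothetical names introduced for illustration; they are not taken from the reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for a provisioning call such as
// ensureImagePullAccess or ensureImageBuildAccess.
type step struct {
	name string
	run  func(namespace string) error
}

// errForbidden is a hypothetical failure used by the demo in main.
var errForbidden = errors.New("forbidden")

// provisionNamespace runs every step and collects failures instead of
// logging and continuing. It returns nil only when all steps succeed.
func provisionNamespace(namespace string, steps []step) error {
	var errs []error
	for _, s := range steps {
		if err := s.run(namespace); err != nil {
			errs = append(errs, fmt.Errorf("%s in %s: %w", s.name, namespace, err))
		}
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	steps := []step{
		{"granting image pull access", func(string) error { return nil }},
		{"granting image build access", func(string) error { return errForbidden }},
	}
	err := provisionNamespace("demo", steps)
	fmt.Println(err)
	// The joined error still unwraps to the original cause.
	fmt.Println(errors.Is(err, errForbidden))
}
```

Aggregation lets every step run while still failing the overall provisioning, which may suit cases where later steps are independent of earlier ones; returning early, as in the proposed fix, is simpler when they are not.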
    if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
        return nil
    }

    src, err := r.nsKube().GetConfigMap(ctx, r.cfg.CPRuntimeNamespace, policyName)
    if err != nil {
        return fmt.Errorf("reading openshell policy configmap %s/%s: %w", r.cfg.CPRuntimeNamespace, policyName, err)
Propagate non-NotFound errors when checking target policy ConfigMap.
Line 777 treats any GetConfigMap error as “not present” and proceeds to copy from source. This masks root-cause failures (e.g., RBAC/transport errors) and can lead to confusing follow-on behavior.
Proposed fix
    - if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
    -     return nil
    - }
    + if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
    +     return nil
    + } else if !k8serrors.IsNotFound(err) {
    +     return fmt.Errorf("checking openshell policy configmap %s/%s: %w", namespace, policyName, err)
    + }

As per coding guidelines, "Never silently swallow partial failures; every error path must propagate or be explicitly collected, never discarded."
Source: Coding guidelines
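The distinction between "not found" and every other error can be exercised without a cluster. The sketch below assumes an in-memory store in place of the real KubeClient; `memStore`, `ensurePolicy`, `errNotFound`, and `errForbidden` are hypothetical names, and `errors.Is` against a sentinel stands in for the Kubernetes NotFound check.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for Kubernetes API errors.
var (
	errNotFound  = errors.New("configmap not found")
	errForbidden = errors.New("forbidden")
)

// memStore is a hypothetical in-memory stand-in for the ConfigMap client.
type memStore struct {
	items   map[string]map[string]string // keyed by "namespace/name"
	getErrs map[string]error             // injected Get failures, same key format
}

func (s *memStore) Get(namespace, name string) (map[string]string, error) {
	key := namespace + "/" + name
	if err, ok := s.getErrs[key]; ok {
		return nil, err
	}
	if data, ok := s.items[key]; ok {
		return data, nil
	}
	return nil, errNotFound
}

func (s *memStore) Create(namespace, name string, data map[string]string) {
	s.items[namespace+"/"+name] = data
}

// ensurePolicy copies the policy from srcNS to dstNS only when the target is
// truly absent; any other lookup error is propagated to the caller.
func ensurePolicy(s *memStore, srcNS, dstNS, name string) error {
	_, err := s.Get(dstNS, name)
	switch {
	case err == nil:
		return nil // already propagated
	case !errors.Is(err, errNotFound):
		return fmt.Errorf("checking policy configmap %s/%s: %w", dstNS, name, err)
	}
	src, err := s.Get(srcNS, name)
	if err != nil {
		return fmt.Errorf("reading policy configmap %s/%s: %w", srcNS, name, err)
	}
	s.Create(dstNS, name, src)
	return nil
}

func main() {
	s := &memStore{
		items:   map[string]map[string]string{"cp/openshell-policy": {"policy.rego": "..."}},
		getErrs: map[string]error{"locked/openshell-policy": errForbidden},
	}
	fmt.Println(ensurePolicy(s, "cp", "session", "openshell-policy"))
	fmt.Println(ensurePolicy(s, "cp", "locked", "openshell-policy"))
}
```

With the original `err == nil` check alone, the second call would have fallen through to the copy path and hidden the permission failure.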
      if useMCPSidecar {
    -     env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
    +     if r.cfg.OpenShellEnabled {
    +         env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
    +         env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
    +     } else {
    +         env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
    +     }
      }
Inject POD_IP whenever OpenShell is enabled, not only when MCP sidecar is enabled.
Line 851 currently gates POD_IP injection on useMCPSidecar, but Line 1225 builds credential sidecar URLs with $(POD_IP) whenever OpenShell is enabled. If MCP sidecar is disabled and credential sidecars are enabled, those URLs won’t resolve correctly.
Proposed fix
    - if useMCPSidecar {
    -     if r.cfg.OpenShellEnabled {
    -         env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
    -         env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
    -     } else {
    -         env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
    -     }
    - }
    + if r.cfg.OpenShellEnabled {
    +     env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
    + }
    + if useMCPSidecar {
    +     if r.cfg.OpenShellEnabled {
    +         env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
    +     } else {
    +         env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
    +     }
    + }
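The gating described in this comment can be checked in isolation. The sketch below uses a simplified, hypothetical `EnvVar` struct rather than the Kubernetes type, and the URL and port in `main` are made-up values. It also relies on the Kubernetes rule that a `$(VAR)` reference only expands against variables defined earlier in the same container's env list, which is why `POD_IP` is appended first.

```go
package main

import "fmt"

// EnvVar is a simplified, hypothetical stand-in for a Kubernetes env entry.
// FieldPath is set when the value comes from the Downward API.
type EnvVar struct {
	Name      string
	Value     string
	FieldPath string
}

// buildEnv injects POD_IP whenever OpenShell is enabled, independent of the
// MCP sidecar, and only then decides which MCP URL form to use.
func buildEnv(openShellEnabled, useMCPSidecar bool, mcpSidecarURL string, mcpSidecarPort int) []EnvVar {
	var env []EnvVar
	if openShellEnabled {
		// Declared first so later $(POD_IP) references can expand.
		env = append(env, EnvVar{Name: "POD_IP", FieldPath: "status.podIP"})
	}
	if useMCPSidecar {
		url := mcpSidecarURL
		if openShellEnabled {
			url = fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)
		}
		env = append(env, EnvVar{Name: "AMBIENT_MCP_URL", Value: url})
	}
	return env
}

func main() {
	// Hypothetical URL and port, for demonstration only.
	for _, e := range buildEnv(true, false, "http://localhost:8090", 8090) {
		fmt.Printf("%s value=%q fieldPath=%q\n", e.Name, e.Value, e.FieldPath)
	}
}
```

The case exercised in `main` (OpenShell on, MCP sidecar off) is the one the review flags: `POD_IP` is still present for any credential sidecar URL that references it.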
    - apiGroups: [""]
      resources: ["secrets", "serviceaccounts", "services", "pods", "pods/log", "pods/exec", "configmaps"]
      verbs: ["get", "list", "watch", "create", "update", "patch", "delete", "deletecollection"]
pods/exec grants arbitrary command execution in any pod.
This ClusterRole includes pods/exec which is not present in the base control-plane-clusterrole.yaml. While this may be intentional for OpenShell debugging/admin scenarios, it's a significant privilege escalation that allows executing arbitrary commands inside any pod in namespaces the control-plane manages.
If this is required for OpenShell functionality, consider documenting the security rationale. If it's for debugging only, consider removing it from the production reference script.
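One way to keep this privilege from slipping in unnoticed would be a small check over the role's rules. The sketch below is an assumption-laden illustration: `PolicyRule` is a hypothetical mirror of the RBAC rule fields, `grantsPodExec` is not part of the PR, and the check assumes that exec is granted by the `create` verb on `pods/exec` in the core API group (with `*` treated as a wildcard).

```go
package main

import "fmt"

// PolicyRule is a hypothetical, simplified mirror of an RBAC rule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether list holds want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want || v == "*" {
			return true
		}
	}
	return false
}

// grantsPodExec reports whether any rule allows creating pods/exec in the
// core API group, which is assumed here to be what permits exec into pods.
func grantsPodExec(rules []PolicyRule) bool {
	for _, r := range rules {
		if contains(r.APIGroups, "") && contains(r.Resources, "pods/exec") && contains(r.Verbs, "create") {
			return true
		}
	}
	return false
}

func main() {
	// The rule quoted in the review comment.
	rules := []PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"secrets", "serviceaccounts", "services", "pods", "pods/log", "pods/exec", "configmaps"},
		Verbs:     []string{"get", "list", "watch", "create", "update", "patch", "delete", "deletecollection"},
	}}
	fmt.Println(grantsPodExec(rules))
}
```

A check like this could run in CI against both the base ClusterRole and the reference script so that any divergence is deliberate and documented.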
Summary
Changes
- status.podIP fieldRef, $(POD_IP) expansion in MCP URLs when OpenShell enabled
- build.openshift.io, image.openshift.io, route.openshift.io API group rules; reconcile-not-skip pattern for role updates
- iproute package, sandbox user, wrapper script
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
POST /model)
Configuration
Security & RBAC
Documentation