Skip to content

feat(runner): OpenShell sandbox integration with POD_IP injection and OpenShift RBAC#1698

Draft
markturansky wants to merge 7 commits into
ambient-code:mainfrom
markturansky:spec/runner-openshell-desired-state
Draft

feat(runner): OpenShell sandbox integration with POD_IP injection and OpenShift RBAC#1698
markturansky wants to merge 7 commits into
ambient-code:mainfrom
markturansky:spec/runner-openshell-desired-state

Conversation

@markturansky

@markturansky markturansky commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Integrate OpenShell supervisor for agent subprocess sandboxing with network namespace isolation
  • Inject POD_IP via Kubernetes Downward API for MCP sidecar connectivity across network namespaces
  • Add OpenShift RBAC rules for build, image, and route API groups in control-plane reconciler
  • Include operational reference scripts for OpenShell deployment

Changes

  • Control plane: POD_IP injection via status.podIP fieldRef, $(POD_IP) expansion in MCP URLs when OpenShell enabled
  • RBAC: Added build.openshift.io, image.openshift.io, route.openshift.io API group rules; reconcile-not-skip pattern for role updates
  • Runner Dockerfile: Added OpenShell supervisor binary, iproute package, sandbox user, wrapper script
  • Policy: CIDR-based network policy for MCP sidecar ports 8090-8094
  • CLI: TUI hotkey indicator rendering improvements
  • Reference scripts: PLAN.md, build-openshell.sh, deploy-openshell.sh for operational use

Test plan

  • gofmt passes on all changed Go files
  • go vet passes
  • No panic() in production code
  • Secret scan passes (no hardcoded credentials)
  • Verify session pod provisioning with OpenShell enabled on vteam-uat cluster
  • Verify MCP sidecar connectivity via POD_IP from sandbox namespace

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added OpenShell sandbox isolation for runner subprocess execution
    • Added runtime model-switching endpoint (POST /model)
    • Enhanced message stream UI with toggle indicators and hotkey support
  • Configuration

    • Added OpenShell sandbox activation controls via environment variables
    • Updated policy configuration format for MCP servers
  • Security & RBAC

    • Updated control-plane permissions with image build access
    • Added OpenShell-specific security constraints and namespace isolation
    • Enhanced Kubernetes ConfigMap management
  • Documentation

    • Added comprehensive OpenShell security and integration specifications
    • Updated runner specification with isolation architecture details

user and others added 7 commits June 3, 2026 18:02
- Fix source layout: add model.py, observability files, fixtures/, remove duplicate workspace.py
- Document AGUI_TOKEN session auth middleware and SDK_OPTIONS env var
- Document runtime model switching via POST /model
- Add 'Desired State: OpenShell Credential Isolation' section with migration path

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…hell sandbox integration

Update specs and docs to reflect the actual implemented state of the OpenShell
sandbox integration, replacing the original "desired state" proposal with
detailed implementation records including:

- Runner spec: replace proposed desired state with implemented architecture,
  verified isolation layers, required capabilities (7, not 1), policy format,
  CP integration, known limitations, and design decisions
- New security spec (openshell-sandbox.spec.md): formal RFC 2119 requirements
  for sandbox activation, network namespace isolation, TLS proxy, Landlock
  filesystem restrictions, privilege drop, seccomp-BPF, and ConfigMap propagation
- Adaptation doc: rewrite from proposal to implementation record with full
  debugging journey (7-error progression, EINVAL misdiagnosis, ptrace tracing),
  verified results, OpenShift SCC reference, and future work phases
- Security analysis: add implementation status, integration point status table,
  and lessons learned (file mode, 7 caps, Landlock ABI compat)
- Bookmarks: add OpenShell sandbox spec entry

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…dboxing

Add NVIDIA OpenShell Supervisor (v0.0.56, file mode) to the runner image,
wrapping the Claude Code subprocess in five isolation layers: network
namespace, TLS L7 proxy, Landlock filesystem sandbox, seccomp-BPF, and
privilege drop to unprivileged sandbox user.

Dockerfile changes:
- Pin openshell-sandbox v0.0.56 from ghcr.io/nvidia/openshell/supervisor
- Add iproute package for network namespace management (ip netns)
- Create sandbox user/group for privilege drop target
- Pre-create /workspace owned by sandbox, /var/run/netns for mount points
- Symlink bundled Claude CLI to /usr/local/bin/claude for stable policy path
- Set /home/sandbox permissions to 755

New files:
- openshell-claude-wrapper.sh: dispatches to supervisor or direct claude
  based on OPENSHELL_ENABLED env var
- .openshell-ref/policy.rego: official OPA Rego from OpenShell repository
- .openshell-ref/policy.yaml: filesystem, network, process policy data
  with endpoint ACLs for Anthropic, Vertex AI, GitHub, GitLab, npm, PyPI

bridge.py: 1-line change sets cli_path to wrapper when OPENSHELL_ENABLED=true

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…iler

Add conditional OpenShell sandbox support to the CP reconciler, activated
by OPENSHELL_ENABLED=true environment variable.

Reconciler changes (kube_reconciler.go):
- buildRunnerSecurityContext: grant 7 capabilities (NET_ADMIN, SYS_ADMIN,
  SYS_PTRACE, SETUID, SETGID, CHOWN, DAC_OVERRIDE), allowPrivilegeEscalation,
  runAsUser:0 when OpenShell enabled
- ensurePod: set pod-level seccompProfile to Unconfined
- buildVolumes/buildVolumeMounts: mount openshell-policy ConfigMap at /etc/openshell
- buildEnv: inject OPENSHELL_ENABLED, OPENSHELL_POLICY_RULES, OPENSHELL_POLICY_DATA
- ensureOpenShellPolicy: propagate policy ConfigMap from CP namespace to runner namespace

Config changes (config.go):
- OpenShellEnabled (from OPENSHELL_ENABLED env var)
- OpenShellPolicyName (from OPENSHELL_POLICY_CONFIGMAP, default: openshell-policy)

KubeClient changes (kubeclient.go):
- Add ConfigMapGVR, GetConfigMap, CreateConfigMap methods

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
When a runner pod is terminated, the OpenShell supervisor's Drop cleanup
for NetworkNamespace doesn't fire, leaving stale /var/run/netns/sandbox-*
files and veth interfaces. On the next session, the supervisor creates a
new sandbox namespace, but the old veth's 10.200.0.0/24 route persists,
causing duplicate routes that prevent proxy connectivity (SYN-SENT).

The wrapper script now:
- Cleans stale sandbox-* netns and their veth interfaces before launch
- Remounts /proc/sys rw and disables rp_filter on default/all so ARP
  resolves on dynamically-created veth interfaces

Also updates policy.yaml to allow the bundled Claude CLI path and
node-22 binary, and passes OPENSHELL_LOG_LEVEL from the reconciler.

Verified: openshell-v5 image deployed to ROSA, both simple and complex
(multi-tool-use) sessions complete successfully through the sandbox.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…CLI hotkey indicators

Inject POD_IP via Kubernetes Downward API so MCP sidecars are reachable
from OpenShell sandboxes (which have isolated network namespaces).
Add OpenShift build/image/route RBAC rules to project reconciler and
ClusterRole manifests. Update CLI TUI with hotkey indicator rendering
and sensible defaults for wrap and timestamp modes.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Include build, deploy, and implementation plan docs used for
bootstrapping the OpenShell sandbox integration on ROSA clusters.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@netlify

netlify Bot commented Jun 17, 2026

Copy link
Copy Markdown

Deploy Preview for cheerful-kitten-f556a0 canceled.

Name Link
🔨 Latest commit bc462c8
🔍 Latest deploy log https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/6a32e16c0f1a0000085429c9

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Integrates NVIDIA OpenShell Supervisor as an optional sandbox for the Claude Code agent subprocess. Controlled by OPENSHELL_ENABLED, the runner image gains the supervisor binary, a wrapper script, and a sandbox user. The control-plane reconciler propagates an OPA policy ConfigMap, adjusts pod security contexts, and switches sidecar MCP URLs to use POD_IP. RBAC manifests and project reconciler update logic are tightened, and a full reference deployment script is included.

Changes

OpenShell Sandbox Integration

Layer / File(s) Summary
OPA sandbox policy (Rego + YAML)
components/runners/ambient-runner/.openshell-ref/policy.rego, components/runners/ambient-runner/.openshell-ref/policy.yaml
Introduces a complete Rego policy implementing L4 + L7 network access control (REST, SQL, GraphQL) and a YAML data file specifying filesystem allowlists, Landlock settings, and per-service endpoint/binary allowlists.
Runner Dockerfile and wrapper script
components/runners/ambient-runner/Dockerfile, components/runners/ambient-runner/openshell-claude-wrapper.sh, components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
Adds iproute, copies the pinned openshell-sandbox binary, creates the sandbox user/group and netns directory, adds /usr/local/bin/claude symlink, and conditionally routes Claude CLI execution through the supervisor when OPENSHELL_ENABLED=true.
Control-plane config and kubeclient extensions
components/ambient-control-plane/internal/config/config.go, components/ambient-control-plane/internal/kubeclient/kubeclient.go
Adds OpenShellEnabled/OpenShellPolicyName to ControlPlaneConfig with env-var loading; extends KubeClient with ConfigMapGVR, GetConfigMap, CreateConfigMap, and UpdateRole.
Kube reconciler: OpenShell pod spec and policy propagation
components/ambient-control-plane/internal/reconciler/kube_reconciler.go, components/ambient-control-plane/cmd/ambient-control-plane/main.go, components/ambient-control-plane/internal/reconciler/kube_reconciler_test.go
Extends KubeReconcilerConfig with OpenShell fields; adds ensureOpenShellPolicy (ConfigMap copy), buildRunnerSecurityContext (conditional capabilities), openshell-policy volume/mount, pod-level seccompProfile: Unconfined, POD_IP fieldRef env var, and MCP URL switching in buildCredentialSidecars.
Project reconciler RBAC update logic and tests
components/ambient-control-plane/internal/reconciler/project_reconciler.go, components/ambient-control-plane/internal/reconciler/project_reconciler_test.go
Centralizes rules in controlPlaneRBACRules(), refactors ensureControlPlaneRBAC to update existing Roles via SetNestedSlice+UpdateRole, and adds a test suite covering rule contents and create/update behavior.
ClusterRole and OpenShift RBAC manifests
components/manifests/base/rbac/control-plane-clusterrole.yaml, components/pr-test/install-standard.sh
Adds delete verb on namespaces, replaces broad pod/secret/service permissions with a narrow pods/log get rule, and extends the OpenShift ClusterRole with build.openshift.io, image.openshift.io, and route.openshift.io rules.
OpenShell build and deploy reference scripts
components/runners/ambient-runner/.openshell-ref/build-openshell.sh, components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh
Adds build-openshell.sh (podman build+push for runner and control-plane images) and deploy-openshell.sh (full OpenShift provisioning: namespace, secrets, RBAC, SCC binding, policy ConfigMap, PostgreSQL, API server, control-plane, Route, and health check).
Specs, internal docs, and reference plan
specs/security/openshell-sandbox.spec.md, specs/agents/runner.spec.md, docs/internal/agents/openshell-runner-adaptation.md, docs/internal/agents/openshell-security-analysis.md, components/runners/ambient-runner/.openshell-ref/PLAN.md, BOOKMARKS.md
Adds the sandbox security spec, updates the runner spec with OpenShell/AGUI_TOKEN/model-switch sections, adds implementation and security analysis internal docs, preserves the reference plan, and updates the bookmarks index.

CLI TUI Message Stream Defaults

Layer / File(s) Summary
MessageStream default state and toggle hotkey rendering
components/ambient-cli/cmd/acpctl/ambient/tui/views/messages.go
NewMessageStream initializes wrapMode=true and timestampMode=1; renderToggle now accepts and visually highlights a hotkey parameter in the status bar label.

Config and Tooling Housekeeping

Layer / File(s) Summary
.mcp.json args array formatting
.mcp.json
Reformats context7 and deepwiki mcpServers args arrays from inline to multi-line JSON; no behavioral change.

Sequence Diagram(s)

sequenceDiagram
    participant ControlPlane as KubeReconciler
    participant KubeClient
    participant CPNamespace as CP Namespace
    participant SessionNS as Session Namespace
    participant Runner as Runner Pod

    ControlPlane->>KubeClient: ensureOpenShellPolicy(sessionNS)
    KubeClient->>CPNamespace: GetConfigMap("openshell-policy")
    CPNamespace-->>KubeClient: policy.rego + policy.yaml data
    KubeClient->>SessionNS: CreateConfigMap("openshell-policy")
    ControlPlane->>ControlPlane: buildRunnerSecurityContext(openShellEnabled=true)<br/>→ NET_ADMIN caps, allowPrivilegeEscalation
    ControlPlane->>ControlPlane: seccompProfile: Unconfined (pod level)
    ControlPlane->>ControlPlane: append openshell-policy volume → mount /etc/openshell
    ControlPlane->>ControlPlane: buildCredentialSidecars(openShellEnabled=true)<br/>→ AMBIENT_MCP_URL uses $(POD_IP)
    ControlPlane->>SessionNS: create runner Pod

    Runner->>Runner: openshell-claude-wrapper.sh<br/>(OPENSHELL_ENABLED=true)
    Runner->>Runner: cleanup stale netns, disable rp_filter
    Runner->>Runner: exec /openshell-sandbox --rules /etc/openshell/policy.rego<br/>--data /etc/openshell/policy.yaml -- /usr/local/bin/claude
    Runner->>Runner: OPA evaluates allow_network / allow_request per connection
Loading

Possibly related PRs

  • ambient-code/platform#1599: Introduced the credential sidecar isolation model with CREDENTIAL_MCP_URLS; this PR extends that path by switching MCP sidecar URLs between localhost and $(POD_IP) based on OpenShellEnabled.

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

Check name Status Explanation Resolution
Security And Secret Handling ❌ Error Kubernetes Secrets created in project_reconciler.go, kube_reconciler.go, and keypair/bootstrap.go lack OwnerReferences, violating the blocking check requirement #6. Add ownerReferences field to Secret metadata with controller=true and blockOwnerDeletion=true to enable automatic cleanup.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Kubernetes Resource Safety ⚠️ Warning Child resources (Pods, Secrets, ConfigMaps) created in kube_reconciler.go lack OwnerReferences, violating K8s cleanup best practices. Additionally, deploy-openshell.sh reference script grants pods/... Add ownerReferences to Pod, Secret, and ConfigMap metadata in kube_reconciler.go; remove pods/exec from deploy-openshell.sh RBAC unless explicitly required for operations.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title follows Conventional Commits format with feat type, runner scope, and clearly describes the main OpenShell sandbox integration with POD_IP injection and RBAC changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Performance And Algorithmic Complexity ✅ Passed No meaningful performance regressions identified. Loop complexity remains bounded: wrapper script iterates stale namespaces (~single digits per pod), fixture 2-iteration rp_filter setup. K8s List o...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch spec/runner-openshell-desired-state
✨ Simplify code
  • Create PR with simplified code

Comment @coderabbitai help to get the list of available commands and usage tips.

@markturansky markturansky marked this pull request as draft June 17, 2026 18:06

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`:
- Around line 851-858: The POD_IP environment variable injection using
envVarFromFieldRef is currently nested inside the useMCPSidecar condition block,
but it needs to be injected whenever OpenShellEnabled is true, regardless of
whether the MCP sidecar is enabled. Move the POD_IP injection outside of the
useMCPSidecar condition and gate it directly on r.cfg.OpenShellEnabled so that
credential sidecar URLs can properly resolve the $(POD_IP) placeholder in all
cases where OpenShell is enabled.
- Around line 777-783: The GetConfigMap call at line 777 on the target namespace
treats any error as "not present" and continues to copy from the source. This
masks non-NotFound errors like RBAC or transport issues. Instead of checking err
== nil, explicitly check if the error is a NotFound error using the appropriate
error type checking (such as k8s apierrors.IsNotFound). Only proceed to copy
from the source if the ConfigMap is truly not found; for any other error from
r.nsKube().GetConfigMap on the target namespace, propagate the error up
immediately rather than silently continuing.
- Around line 310-312: The ensureImageBuildAccess method call currently logs a
warning when it fails but continues execution, which can leave the namespace in
a partially provisioned state. Instead of just logging the error, you need to
return the error from this code path so that namespace provisioning fails
completely rather than continuing and potentially failing later at build time.
Remove the warning-only logging pattern around the ensureImageBuildAccess call
and either directly return the error or aggregate it with other errors as needed
to ensure the failure is properly propagated up the call chain.

In `@components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh`:
- Around line 144-146: The ClusterRole definition in the deploy-openshell.sh
script includes pods/exec in the resources array with dangerous verbs like
create, delete, and patch, which grants arbitrary command execution in any pod.
Either remove pods/exec from the resources list in the ClusterRole to restrict
this privilege, or if it is truly required for OpenShell functionality, add
explicit documentation above the ClusterRole definition (in comments within the
script) explaining the security rationale and why this elevated privilege is
necessary for production use.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d1dd49b0-b257-4fd1-947d-f87a99b37c76

📥 Commits

Reviewing files that changed from the base of the PR and between edaaa9a and bc462c8.

📒 Files selected for processing (24)
  • .mcp.json
  • BOOKMARKS.md
  • components/ambient-cli/cmd/acpctl/ambient/tui/views/messages.go
  • components/ambient-control-plane/cmd/ambient-control-plane/main.go
  • components/ambient-control-plane/internal/config/config.go
  • components/ambient-control-plane/internal/kubeclient/kubeclient.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler.go
  • components/ambient-control-plane/internal/reconciler/kube_reconciler_test.go
  • components/ambient-control-plane/internal/reconciler/project_reconciler.go
  • components/ambient-control-plane/internal/reconciler/project_reconciler_test.go
  • components/manifests/base/rbac/control-plane-clusterrole.yaml
  • components/pr-test/install-standard.sh
  • components/runners/ambient-runner/.openshell-ref/PLAN.md
  • components/runners/ambient-runner/.openshell-ref/build-openshell.sh
  • components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh
  • components/runners/ambient-runner/.openshell-ref/policy.rego
  • components/runners/ambient-runner/.openshell-ref/policy.yaml
  • components/runners/ambient-runner/Dockerfile
  • components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
  • components/runners/ambient-runner/openshell-claude-wrapper.sh
  • docs/internal/agents/openshell-runner-adaptation.md
  • docs/internal/agents/openshell-security-analysis.md
  • specs/agents/runner.spec.md
  • specs/security/openshell-sandbox.spec.md

Comment on lines +310 to +312
if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image build access")
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not continue after failing to grant image build RBAC.

Line 310 logs the error and continues. That leaves the namespace partially provisioned and can fail later at build time while session provisioning still reports success. Return an error from this path (or aggregate it explicitly) instead of warning-only continuation.

Proposed fix
 		if err := r.ensureImagePullAccess(ctx, namespace); err != nil {
 			r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image pull access")
 		}
-		if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
-			r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image build access")
-		}
+		if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
+			return fmt.Errorf("granting image build access in %s: %w", namespace, err)
+		}
 	}

As per coding guidelines, "Never silently swallow partial failures; every error path must propagate or be explicitly collected, never discarded."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
r.logger.Warn().Err(err).Str("namespace", namespace).Msg("failed to grant image build access")
}
if err := r.ensureImageBuildAccess(ctx, namespace); err != nil {
return fmt.Errorf("granting image build access in %s: %w", namespace, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 310 - 312, The ensureImageBuildAccess method call currently logs a
warning when it fails but continues execution, which can leave the namespace in
a partially provisioned state. Instead of just logging the error, you need to
return the error from this code path so that namespace provisioning fails
completely rather than continuing and potentially failing later at build time.
Remove the warning-only logging pattern around the ensureImageBuildAccess call
and either directly return the error or aggregate it with other errors as needed
to ensure the failure is properly propagated up the call chain.

Source: Coding guidelines

Comment on lines +777 to +783
if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
return nil
}

src, err := r.nsKube().GetConfigMap(ctx, r.cfg.CPRuntimeNamespace, policyName)
if err != nil {
return fmt.Errorf("reading openshell policy configmap %s/%s: %w", r.cfg.CPRuntimeNamespace, policyName, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate non-NotFound errors when checking target policy ConfigMap.

Line 777 treats any GetConfigMap error as “not present” and proceeds to copy from source. This masks root-cause failures (e.g., RBAC/transport errors) and can lead to confusing follow-on behavior.

Proposed fix
-	if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
-		return nil
-	}
+	if _, err := r.nsKube().GetConfigMap(ctx, namespace, policyName); err == nil {
+		return nil
+	} else if !k8serrors.IsNotFound(err) {
+		return fmt.Errorf("checking openshell policy configmap %s/%s: %w", namespace, policyName, err)
+	}

As per coding guidelines, "Never silently swallow partial failures; every error path must propagate or be explicitly collected, never discarded."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 777 - 783, The GetConfigMap call at line 777 on the target
namespace treats any error as "not present" and continues to copy from the
source. This masks non-NotFound errors like RBAC or transport issues. Instead of
checking err == nil, explicitly check if the error is a NotFound error using the
appropriate error type checking (such as k8s apierrors.IsNotFound). Only proceed
to copy from the source if the ConfigMap is truly not found; for any other error
from r.nsKube().GetConfigMap on the target namespace, propagate the error up
immediately rather than silently continuing.

Source: Coding guidelines

Comment on lines 851 to 858
if useMCPSidecar {
env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
if r.cfg.OpenShellEnabled {
env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
} else {
env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Inject POD_IP whenever OpenShell is enabled, not only when MCP sidecar is enabled.

Line 851 currently gates POD_IP injection on useMCPSidecar, but Line 1225 builds credential sidecar URLs with $(POD_IP) whenever OpenShell is enabled. If MCP sidecar is disabled and credential sidecars are enabled, those URLs won’t resolve correctly.

Proposed fix
-	if useMCPSidecar {
-		if r.cfg.OpenShellEnabled {
-			env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
-			env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
-		} else {
-			env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
-		}
-	}
+	if r.cfg.OpenShellEnabled {
+		env = append(env, envVarFromFieldRef("POD_IP", "status.podIP"))
+	}
+	if useMCPSidecar {
+		if r.cfg.OpenShellEnabled {
+			env = append(env, envVar("AMBIENT_MCP_URL", fmt.Sprintf("http://$(POD_IP):%d", mcpSidecarPort)))
+		} else {
+			env = append(env, envVar("AMBIENT_MCP_URL", mcpSidecarURL))
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/ambient-control-plane/internal/reconciler/kube_reconciler.go`
around lines 851 - 858, The POD_IP environment variable injection using
envVarFromFieldRef is currently nested inside the useMCPSidecar condition block,
but it needs to be injected whenever OpenShellEnabled is true, regardless of
whether the MCP sidecar is enabled. Move the POD_IP injection outside of the
useMCPSidecar condition and gate it directly on r.cfg.OpenShellEnabled so that
credential sidecar URLs can properly resolve the $(POD_IP) placeholder in all
cases where OpenShell is enabled.

Comment on lines +144 to +146
- apiGroups: [""]
resources: ["secrets", "serviceaccounts", "services", "pods", "pods/log", "pods/exec", "configmaps"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete", "deletecollection"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

pods/exec grants arbitrary command execution in any pod.

This ClusterRole includes pods/exec which is not present in the base control-plane-clusterrole.yaml. While this may be intentional for OpenShell debugging/admin scenarios, it's a significant privilege escalation that allows executing arbitrary commands inside any pod in namespaces the control-plane manages.

If this is required for OpenShell functionality, consider documenting the security rationale. If it's for debugging only, consider removing it from the production reference script.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/runners/ambient-runner/.openshell-ref/deploy-openshell.sh` around
lines 144 - 146, The ClusterRole definition in the deploy-openshell.sh script
includes pods/exec in the resources array with dangerous verbs like create,
delete, and patch, which grants arbitrary command execution in any pod. Either
remove pods/exec from the resources list in the ClusterRole to restrict this
privilege, or if it is truly required for OpenShell functionality, add explicit
documentation above the ClusterRole definition (in comments within the script)
explaining the security rationale and why this elevated privilege is necessary
for production use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant