feat: add dataSource PVC mount and per-proposal timeoutMinutes by sakshipatels98-byte · Pull Request #17 · openshift/lightspeed-agentic-operator

sakshipatels98-byte · 2026-05-28T16:22:57Z

Summary

Enable IntelliAide-style deep RCA by letting Proposals reference pre-loaded
diagnostic data and run long analysis steps in the sandbox.

PVC data source (spec.dataSource) — optional, immutable reference to a
pre-existing PVC in the Proposal namespace. The operator mounts it read-only
at /data/input in the sandbox pod (via sandbox template patching).
Configurable sandbox timeout (spec.timeoutMinutes) — optional 1–60
minutes per step (default 5). Used for analysis, execution, verification,
and escalation so long-running RCA can override the default.
Analysis prompt updates — analysis_query.tmpl requires skill discovery
from /app/skills before generic analysis; when dataSource is set, tells
the agent diagnostic data is available at /data/input.
RBAC — regenerate ClusterRole: PVC get/list/watch; result CR
*/status sub-resources.
Examples — examples/setup/09-intelliaide-proposals.yaml with
dataSource, timeoutMinutes: 30, and analysis-only IntelliAide workflows.

Companion PR: agentic-skills adds the intelliaide/ skill pipeline that
reads from /data/input.

API Changes

Field	Type	Validation	Notes
`spec.dataSource.claimName`	string	required, DNS subdomain, immutable	PVC name in same namespace
`spec.timeoutMinutes`	int32	min=1, max=60, mutable	Per-step sandbox timeout

Test Plan

make test passes (unit tests updated for new signatures)
make manifests produces clean CRD with both new fields
Deploy on dev cluster: create PVC + Proposal with dataSource → verify volume mounted at /data/input
Verify timeout override: set timeoutMinutes: 30 → operator waits 30m
IntelliAide routing: submit RCA request → analysis agent reads SKILL.md and invokes pipeline

Summary by CodeRabbit

New Features
- Added per-step timeout configuration (1-60 minutes) for proposal workflow steps.
- Introduced DataSource support to reference pre-existing PVCs, enabling input data mounting at /data/input in sandbox pods.
- Enhanced tools configuration to include optional dataSource PVC references.
Documentation
- Added must-gather mode example demonstrating RCA analysis with mounted input data bundles.

openshift-ci · 2026-05-28T16:23:02Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-28T16:23:06Z

📝 Walkthrough

Walkthrough

This PR introduces per-step sandbox timeout support and optional PVC mounting for input data across the proposal workflow. It extends the API types with ProposalStep.timeoutMinutes and a new DataSource type, updates the AgentCaller interface to accept timeouts, propagates timeouts through proposal resolution and workflow handlers, implements timeout handling in the HTTP client and sandbox agent, and adds template patching to mount DataSource PVCs into sandbox containers. New tests validate timeout propagation and DataSource mounting behavior.

Changes

Per-Step Timeout and DataSource Implementation

Layer / File(s)	Summary
API Types - Timeout and DataSource Contracts `api/v1alpha1/proposal_types.go`, `api/v1alpha1/tools_types.go`	`ProposalStep.timeoutMinutes` (optional, 1-60 min), `DataSource` type with validated `claimName`, and `ToolsSpec.DataSource` field added with updated `IsZero()` logic. Documentation describes per-step and default PVC mounting behavior.
AgentCaller Interface - Timeout Parameter `controller/proposal/agent.go`	`AgentCaller` interface methods (`Analyze`, `Execute`, `Verify`, `Escalate`) and `StubAgentCaller` implementations extended with trailing `timeout time.Duration` parameter.
HTTP Client - Timeout Serialization `controller/proposal/client.go`, `controller/proposal/client_test.go`	`NewAgentHTTPClient` accepts explicit timeout, defaults non-positive to `defaultSandboxTimeout`, and sets `http.Client.Timeout`. `Run` serializes timeout to `TimeoutMs` in request. Tests updated and new timeout serialization tests added.
Proposal Resolver - Timeout Propagation `controller/proposal/resolve.go`	`resolvedStep` extended with `TimeoutMinutes` field; `resolveProposal` now assigns timeouts from `proposal.Spec.*.TimeoutMinutes` to resolved steps.
SandboxAgentCaller - Timeout Implementation `controller/proposal/sandbox_agent.go`	`ClientFactory` signature extended to accept timeout. `Analyze`, `Execute`, `Verify`, `Escalate` methods accept and forward timeout to `callWithSandbox`. `callWithSandbox` defaults timeout, applies to HTTP client creation, and keeps `WaitReady` capped at `defaultSandboxTimeout`. New `stepTimeout()` helper extracts per-step timeout with fallback.
Workflow Handlers - Apply Per-Step Timeouts `controller/proposal/handlers.go`	`handleAnalysis`, `handleRevision`, `handleExecution`, `handleVerification`, and `handleEscalation` compute `stepTimeout()` and pass to respective agent invocations.
SandboxAgentCaller Tests - Timeout Behavior `controller/proposal/sandbox_agent_test.go`	`mockSandboxProvider` tracks `lastWaitReadyTimeout`. Test helpers updated for timeout-aware `ClientFactory`. All method calls updated to pass `defaultSandboxTimeout`. New `TestSandboxAgentCaller_TimeoutPropagation` verifies timeout flow: `WaitReady` receives `defaultSandboxTimeout`, `ClientFactory` receives caller-supplied timeout.
Reconciler Tests - Timeout Integration `controller/proposal/reconciler_test.go`, `controller/proposal/reconciler.go`	`testAgentCaller` tracks per-method timeouts; `mockSandboxProvider.ClientFactory` updated to accept timeout. RBAC annotation added for `persistentvolumeclaims` read permissions. New `TestReconcile_PropagatesStepTimeout` validates minutes-to-duration conversion and forwarding to `Analyze`.
Sandbox Templates - DataSource Mounting `controller/proposal/sandbox_templates.go`	Constants added for `dataSourceMountPath` and `dataSourceVolumeName`. `templateHashInput` extended with `DataSource` field. `computeTemplateHash` accepts and incorporates `dataSource` into hash. `EnsureAgentTemplate` extended with `dataSource` parameter, extracts from `tools.DataSource`, and conditionally patches template with PVC volume and read-only mount via new `addPVCVolume` and `patchDataSource` helpers.
Sandbox Template Tests - DataSource Validation `controller/proposal/sandbox_templates_test.go`	Hash tests updated to new `computeTemplateHash` signature. New `TestComputeTemplateHash_DataSource` asserts hash changes for nil vs non-nil `DataSource` and different `ClaimName` values. New `TestPatchDataSource_MountsReadOnly` verifies PVC volume upsert and read-only mount at expected path.
Examples and Documentation `examples/setup/09-intelliaide-proposals.yaml`, `.ai/spec/how/reconciler.md`	Five IntelliAide RCA Proposal examples added (etcd, update-failure, router-latency, general degraded, must-gather-mode) with `analysis.agent: smart` and `timeoutMinutes: 30`. Must-gather proposal includes `tools.dataSource.claimName` reference. Reconciler.md updated to document `dataSource` extraction and optional PVC mount.
Repository Configuration `.gitignore`	Added `/oc-agentic` ignore pattern at repository root.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 24.53% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the two main features added: dataSource PVC mount and per-proposal timeoutMinutes support, matching the primary changes across the codebase.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

controller/proposal/sandbox_agent.go (1)

177-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use one shared deadline for the whole step.

timeout is spent once in WaitReady and then a second time in the HTTP client, so timeoutMinutes: 5 can block for roughly 10 minutes before failing. If this field is meant to cap a step end-to-end, the later phases need to use the remaining budget, not the original budget again.

⏱️ Possible fix

 func (s *SandboxAgentCaller) callWithSandbox(
 	ctx context.Context,
 	proposal *agenticv1alpha1.Proposal,
 	stepName string,
@@
 ) (json.RawMessage, error) {
 	if timeout <= 0 {
 		timeout = defaultSandboxTimeout
 	}
+	deadline := time.Now().Add(timeout)
 
 	templateName, err := EnsureAgentTemplate(ctx, s.K8sClient, defaultBaseTemplateName, s.Namespace, stepName, step.Agent, step.LLM, step.Tools, proposal.Spec.DataSource)
 	if err != nil {
@@
-	endpoint, err := s.Sandbox.WaitReady(ctx, claimName, timeout)
+	endpoint, err := s.Sandbox.WaitReady(ctx, claimName, time.Until(deadline))
 	if err != nil {
 		return nil, fmt.Errorf("wait for sandbox: %w", err)
 	}
@@
-	client := s.ClientFactory(agentURL, timeout)
+	remaining := time.Until(deadline)
+	if remaining <= 0 {
+		return nil, fmt.Errorf("step timeout exceeded before agent call")
+	}
+	client := s.ClientFactory(agentURL, remaining)
 	resp, err := client.Run(ctx, "", query, schema, agentCtx)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_agent.go` around lines 177 - 210, The step
currently reuses the full original timeout for WaitReady and then again for the
HTTP client, doubling the effective wait; fix by creating a single
deadline-based context and deriving the remaining budget for the client: wrap
the incoming ctx with context.WithTimeout/context.WithDeadline using the
provided timeout (store the returned cancel and deadline), pass that ctx to
s.Sandbox.WaitReady, then compute remaining := time.Until(deadline) (clamp to a
minimum and fallback if non-positive) and pass remaining to
s.ClientFactory(agentURL, remaining) (and ensure to call cancel()). Reference
EnsureAgentTemplate, s.Sandbox.Claim, s.Sandbox.WaitReady,
s.ClientFactory(...).Run and the timeout variable when applying this change.

🧹 Nitpick comments (4)

controller/proposal/reconciler_test.go (1)
39-64: ⚡ Quick win

Capture the timeout passed by the reconciler.

testAgentCaller drops the new argument everywhere, so nothing in this file proves that proposalTimeout(proposal) is actually passed into Analyze/Execute/Verify/Escalate. A small field like lastAnalyzeTimeout plus one case with Spec.TimeoutMinutes = 30 would cover the new handler plumbing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/reconciler_test.go` around lines 39 - 64, The
testAgentCaller currently ignores the timeout parameter so tests don't verify
that proposalTimeout(proposal) is propagated; add fields on testAgentCaller
(e.g., lastAnalyzeTimeout, lastExecuteTimeout, lastVerifyTimeout,
lastEscalateTimeout) and assign the incoming timeout argument in the respective
methods Analyze, Execute, Verify, Escalate; then add/update a test that sets
Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler path, and asserts
each last*Timeout equals the expected duration returned by
proposalTimeout(proposal) to prove the plumbing forwards the timeout.
controller/proposal/sandbox_agent_test.go (1)
57-82: ⚡ Quick win

Record and assert the propagated timeout in these helpers.

The new time.Duration parameter is ignored here, so the tests only prove the signatures compile. Capturing the value passed into ClientFactory (and similarly into WaitReady) would let this suite verify that non-default proposal timeouts actually reach the sandbox layer.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_agent_test.go` around lines 57 - 82, Update
newTestSandboxAgentCaller and newTestSandboxAgentCallerWithProposal so the
ClientFactory closure captures and records the time.Duration it receives and the
mockSandboxProvider's WaitReady call records its timeout; specifically, modify
the ClientFactory func(_ string, d time.Duration) to store d onto the provided
mockHTTPClient (e.g., a LastTimeout field) before returning it, and ensure the
mockSandboxProvider exposes a WaitReady(timeout time.Duration) recorder that
tests can assert against; this will let tests verify that non-default proposal
timeouts are propagated to both ClientFactory and WaitReady.
controller/proposal/client_test.go (1)
38-172: ⚡ Quick win

Add assertions for the new timeout contract.

These updates only make the tests compile against the new constructor. There’s still no coverage that a custom timeout is serialized into timeout_ms, or that timeout <= 0 falls back to defaultSandboxTimeout, so the new behavior can regress silently.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/client_test.go` around lines 38 - 172, Tests for
NewAgentHTTPClient lack assertions that the serialized request contains the
timeout_ms field and that timeout <= 0 falls back to defaultSandboxTimeout;
update relevant tests that decode agentRunRequest (e.g., inside the handlers in
TestAgentHTTPClient_RunWithExecutionResult,
TestAgentHTTPClient_RunWithoutExecutionResult,
TestAgentHTTPClient_RunWithContext and the basic Run test) to assert
req.TimeoutMs equals the provided custom timeout when
NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs
equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative
timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient
constructor to locate where to add these checks.
controller/proposal/sandbox_templates_test.go (1)
68-70: ⚡ Quick win

Add coverage for non-nil dataSource here.

All updated hash calls still pass nil for dataSource, so this file never asserts that changing claimName changes the derived template name. A focused test for that — plus one for the read-only /data/input PVC mount — would cover the new behavior directly.

Also applies to: 490-524
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_templates_test.go` around lines 68 - 70, Update
tests to exercise a non-nil dataSource when calling computeTemplateHash: modify
mustHash (and the test cases around it) to pass a populated
agenticv1alpha1.DataSource with a distinct ClaimName and verify that changing
ClaimName alters the returned hash; additionally add an assertion that the
computed template includes the expected PVC mount mode (readOnly) for
/data/input by constructing a DataSource whose MountPath is /data/input and
verifying the template metadata or generated mount flags reflect read-only.
Target the helper mustHash and calls to computeTemplateHash so the new
assertions cover the branch where dataSource is non-nil.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/v1alpha1/proposal_types.go`:
- Line 297: The XValidation rule for the dataSource field currently allows
nil->non-nil updates; replace the rule so only two cases are permitted: both old
and new lack dataSource, or both have it and it is equal. Update the kubebuilder
annotation on the dataSource validation (the comment containing XValidation) to
use an expression such as "(has(oldSelf.dataSource) && has(self.dataSource) &&
self.dataSource == oldSelf.dataSource) || (!has(oldSelf.dataSource) &&
!has(self.dataSource))" so adding a dataSource after creation is disallowed.

In `@controller/proposal/sandbox_templates.go`:
- Around line 523-551: The current patchDataSource unconditionally uses the name
"data-source" and addPVCVolume will overwrite any existing volume with that
name; modify addPVCVolume and patchDataSource so they do not clobber an
unrelated pre-existing volume: in addPVCVolume, when finding an existing volume
with the same name check whether it already has a persistentVolumeClaim with the
same claimName and only replace if it matches; otherwise return a sentinel (or
error) to signal a name conflict. In patchDataSource, detect that conflict and
instead pick a non-colliding name (e.g., "data-source-<uniqueSuffix>") and use
that name for both addPVCVolume and addVolumeMount, or surface the error to the
caller; reference the functions addPVCVolume and patchDataSource and ensure the
chosen name is threaded through to addVolumeMount so mounts reference the
correct volume.

In `@controller/proposal/templates/analysis_query.tmpl`:
- Around line 3-27: The template fails to tell the agent that PVC-backed input
may already be mounted at /data/input when spec.dataSource is set, so update the
contract and template: modify buildAnalysisQuery in
controller/proposal/helpers.go to accept a flag/field (e.g., hasDataSource or
includeMountedInput) derived from the Request/spec.dataSource (so callers can
indicate PVC-backed input is present), and change
controller/proposal/templates/analysis_query.tmpl to conditionally include a
short directive like "Input data is mounted read-only at /data/input; prefer
analyzing that bundle before collecting live data" when that flag is true;
ensure the helpers.go change sets the flag when spec.dataSource is present so
PVC-backed RCA is handled as bundle-first analysis.
- Around line 19-25: The skill-loading branch is inconsistent: "cat
/app/skills/intelliaide/SKILL.md" cannot "return one or more paths." Update the
logic around the atomic command (the literal string "cat
/app/skills/intelliaide/SKILL.md" and references to SKILL.md) so it is
internally consistent — either change the command to a listing/find command that
can return paths (e.g., use a glob/find/ls conceptual step) or change the prose
to state that if the cat command fails because the file does not exist you must
stop and return a JSON error, and otherwise read that SKILL.md and follow its
workflow exactly; ensure the branch no longer claims cat can return multiple
paths.

In `@examples/setup/09-intelliaide-proposals.yaml`:
- Around line 21-31: Add one concrete example of the PVC-backed input to the
IntelliAide proposals by updating a Proposal's spec to include a dataSource
block using the exact shape spec.dataSource.claimName and a placeholder PVC name
(e.g., <pre-existing-pvc>); locate any Proposal objects in this file that
currently run in live mode and add a dataSource: claimName: <pre-existing-pvc>
under their spec so the new /data/input workflow and PVC-backed input are
discoverable (ensure the YAML key is spec.dataSource.claimName and that only the
example uses a placeholder to avoid changing runtime behavior).
- Around line 33-36: The docs instruct building/pushing
quay.io/<org>/intelliaide-skills:latest but all Proposal manifests reference
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest,
so examples fail; either (A) update each Proposal image field to
quay.io/<org>/intelliaide-skills:latest (search for the string
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
in the Proposal resources and replace) or (B) add the missing retag/push steps
after the build: retag quay image to the in-cluster registry name and push it
(i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest
image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
and push), and apply this fix for all occurrences noted (the other blocks at the
same pattern: lines ~70-72, 101-103, 147-149, 183-185).

---

Outside diff comments:
In `@controller/proposal/sandbox_agent.go`:
- Around line 177-210: The step currently reuses the full original timeout for
WaitReady and then again for the HTTP client, doubling the effective wait; fix
by creating a single deadline-based context and deriving the remaining budget
for the client: wrap the incoming ctx with
context.WithTimeout/context.WithDeadline using the provided timeout (store the
returned cancel and deadline), pass that ctx to s.Sandbox.WaitReady, then
compute remaining := time.Until(deadline) (clamp to a minimum and fallback if
non-positive) and pass remaining to s.ClientFactory(agentURL, remaining) (and
ensure to call cancel()). Reference EnsureAgentTemplate, s.Sandbox.Claim,
s.Sandbox.WaitReady, s.ClientFactory(...).Run and the timeout variable when
applying this change.

---

Nitpick comments:
In `@controller/proposal/client_test.go`:
- Around line 38-172: Tests for NewAgentHTTPClient lack assertions that the
serialized request contains the timeout_ms field and that timeout <= 0 falls
back to defaultSandboxTimeout; update relevant tests that decode agentRunRequest
(e.g., inside the handlers in TestAgentHTTPClient_RunWithExecutionResult,
TestAgentHTTPClient_RunWithoutExecutionResult,
TestAgentHTTPClient_RunWithContext and the basic Run test) to assert
req.TimeoutMs equals the provided custom timeout when
NewAgentHTTPClient(server.URL, X) is called with X>0, and assert req.TimeoutMs
equals defaultSandboxTimeout when NewAgentHTTPClient(..., 0) or a negative
timeout is used; reference the agentRunRequest struct and NewAgentHTTPClient
constructor to locate where to add these checks.

In `@controller/proposal/reconciler_test.go`:
- Around line 39-64: The testAgentCaller currently ignores the timeout parameter
so tests don't verify that proposalTimeout(proposal) is propagated; add fields
on testAgentCaller (e.g., lastAnalyzeTimeout, lastExecuteTimeout,
lastVerifyTimeout, lastEscalateTimeout) and assign the incoming timeout argument
in the respective methods Analyze, Execute, Verify, Escalate; then add/update a
test that sets Spec.TimeoutMinutes = 30 on a Proposal, calls the reconciler
path, and asserts each last*Timeout equals the expected duration returned by
proposalTimeout(proposal) to prove the plumbing forwards the timeout.

In `@controller/proposal/sandbox_agent_test.go`:
- Around line 57-82: Update newTestSandboxAgentCaller and
newTestSandboxAgentCallerWithProposal so the ClientFactory closure captures and
records the time.Duration it receives and the mockSandboxProvider's WaitReady
call records its timeout; specifically, modify the ClientFactory func(_ string,
d time.Duration) to store d onto the provided mockHTTPClient (e.g., a
LastTimeout field) before returning it, and ensure the mockSandboxProvider
exposes a WaitReady(timeout time.Duration) recorder that tests can assert
against; this will let tests verify that non-default proposal timeouts are
propagated to both ClientFactory and WaitReady.

In `@controller/proposal/sandbox_templates_test.go`:
- Around line 68-70: Update tests to exercise a non-nil dataSource when calling
computeTemplateHash: modify mustHash (and the test cases around it) to pass a
populated agenticv1alpha1.DataSource with a distinct ClaimName and verify that
changing ClaimName alters the returned hash; additionally add an assertion that
the computed template includes the expected PVC mount mode (readOnly) for
/data/input by constructing a DataSource whose MountPath is /data/input and
verifying the template metadata or generated mount flags reflect read-only.
Target the helper mustHash and calls to computeTemplateHash so the new
assertions cover the branch where dataSource is non-nil.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4659329-5901-4895-8a4c-9eb649291e97

📥 Commits

Reviewing files that changed from the base of the PR and between 44bd61b and 31261be.

⛔ Files ignored due to path filters (3)

api/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated.deepcopy.go
config/crd/bases/agentic.openshift.io_proposals.yaml is excluded by !config/crd/bases/**
config/rbac/role.yaml is excluded by !config/rbac/role.yaml

📒 Files selected for processing (14)

.gitignore
api/v1alpha1/proposal_types.go
controller/proposal/agent.go
controller/proposal/client.go
controller/proposal/client_test.go
controller/proposal/handlers.go
controller/proposal/reconciler.go
controller/proposal/reconciler_test.go
controller/proposal/sandbox_agent.go
controller/proposal/sandbox_agent_test.go
controller/proposal/sandbox_templates.go
controller/proposal/sandbox_templates_test.go
controller/proposal/templates/analysis_query.tmpl
examples/setup/09-intelliaide-proposals.yaml

coderabbitai · 2026-05-29T09:34:23Z

 // +kubebuilder:validation:XValidation:rule="!has(oldSelf.analysis) || (has(self.analysis) && self.analysis == oldSelf.analysis)",message="analysis is immutable once set"
 // +kubebuilder:validation:XValidation:rule="!has(oldSelf.execution) || (has(self.execution) && self.execution == oldSelf.execution)",message="execution is immutable once set"
 // +kubebuilder:validation:XValidation:rule="!has(oldSelf.verification) || (has(self.verification) && self.verification == oldSelf.verification)",message="verification is immutable once set"
+// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disallow adding dataSource after creation.

This rule still permits nil -> non-nil updates because the first branch is true whenever oldSelf.dataSource is unset. That breaks the “fixed at creation” contract and allows an in-flight Proposal to change which PVC gets mounted.

🛠️ Possible fix

-// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set" +// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) || (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"

// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/v1alpha1/proposal_types.go` at line 297, The XValidation rule for the dataSource field currently allows nil->non-nil updates; replace the rule so only two cases are permitted: both old and new lack dataSource, or both have it and it is equal. Update the kubebuilder annotation on the dataSource validation (the comment containing XValidation) to use an expression such as "(has(oldSelf.dataSource) && has(self.dataSource) && self.dataSource == oldSelf.dataSource) || (!has(oldSelf.dataSource) && !has(self.dataSource))" so adding a dataSource after creation is disallowed.

coderabbitai · 2026-05-29T09:34:23Z

+## Skills
+
+A specialist deep-RCA pipeline is available at `/app/skills/intelliaide/SKILL.md`.
+
+Use it ONLY when the request calls for:
+- Root cause analysis (RCA) or deeper / ML-assisted troubleshooting of a cluster issue
+- Must-gather collection or analysis
+- Investigating pod failures, etcd degradation, networking problems, storage issues, etc.
+- Any request that explicitly mentions "deeper analysis", "deeper troubleshooting", "root cause", "RCA", "must-gather", or "IntelliAide"
+
+For routine inspection (checking pod/node status, listing events, summarising resource state,
+describing objects), use `kubectl`/`oc` commands directly — do NOT invoke the IntelliAide pipeline.
+
+**Decision rule — apply before doing anything else:**
+1. Read the `## Request` section below.
+2. If it is a routine inspection query → proceed with `kubectl`/`oc` directly.
+3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
+   ```
+   cat /app/skills/intelliaide/SKILL.md
+   ```
+   If the command returns one or more paths, read the most relevant SKILL.md with `cat`
+   and follow its workflow **exactly** instead of the instructions below.
+   If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.
+
+## Analysis requirements


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Surface /data/input in the prompt when spec.dataSource is set.

This template never tells the agent that PVC-backed input may already be mounted at /data/input. The operator now mounts spec.dataSource there, but buildAnalysisQuery(...) only passes Request, HasExecution, and HasVerification, so PVC-backed RCA will still look like live-mode analysis and may ignore the supplied bundle entirely.

This needs a small contract change in controller/proposal/helpers.go plus a conditional instruction here, e.g. “input data is mounted read-only at /data/input; prefer analyzing that before collecting new live data.”

Also applies to: 27-42

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/templates/analysis_query.tmpl` around lines 3 - 27, The template fails to tell the agent that PVC-backed input may already be mounted at /data/input when spec.dataSource is set, so update the contract and template: modify buildAnalysisQuery in controller/proposal/helpers.go to accept a flag/field (e.g., hasDataSource or includeMountedInput) derived from the Request/spec.dataSource (so callers can indicate PVC-backed input is present), and change controller/proposal/templates/analysis_query.tmpl to conditionally include a short directive like "Input data is mounted read-only at /data/input; prefer analyzing that bundle before collecting live data" when that flag is true; ensure the helpers.go change sets the flag when spec.dataSource is present so PVC-backed RCA is handled as bundle-first analysis.

coderabbitai · 2026-05-29T09:34:23Z

+3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:
+   ```
+   cat /app/skills/intelliaide/SKILL.md
+   ```
+   If the command returns one or more paths, read the most relevant SKILL.md with `cat`
+   and follow its workflow **exactly** instead of the instructions below.
+   If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the skill-loading branch internally consistent.

Line 21 uses cat /app/skills/intelliaide/SKILL.md, but Lines 23-25 talk about that command “returning one or more paths.” cat returns file contents or an error, so this branch is impossible as written and can derail the RCA routing logic.

Suggested prompt fix

3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command: ``` cat /app/skills/intelliaide/SKILL.md ``` - If the command returns one or more paths, read the most relevant SKILL.md with `cat` - and follow its workflow **exactly** instead of the instructions below. - If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration. + If that command fails because the file does not exist, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration. + Follow its workflow **exactly** instead of the instructions below.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:

```

cat /app/skills/intelliaide/SKILL.md

```

If the command returns one or more paths, read the most relevant SKILL.md with `cat`

and follow its workflow **exactly** instead of the instructions below.

If no SKILL.md files are found, stop immediately and return a JSON error response — skills are required and their absence is a fatal misconfiguration.

3. If it is a deep-RCA or troubleshooting request → read the skill file with ONE atomic command:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/templates/analysis_query.tmpl` around lines 19 - 25, The skill-loading branch is inconsistent: "cat /app/skills/intelliaide/SKILL.md" cannot "return one or more paths." Update the logic around the atomic command (the literal string "cat /app/skills/intelliaide/SKILL.md" and references to SKILL.md) so it is internally consistent — either change the command to a listing/find command that can return paths (e.g., use a glob/find/ls conceptual step) or change the prose to state that if the cat command fails because the file does not exist you must stop and return a JSON error, and otherwise read that SKILL.md and follow its workflow exactly; ensure the branch no longer claims cat can return multiple paths.

coderabbitai · 2026-05-29T09:34:23Z

+# IntelliAide triggering:
+#   The analysis prompt (analysis_query.tmpl) reads the request text and invokes
+#   the IntelliAide pipeline when the request contains RCA keywords such as
+#   "root cause analysis", "RCA", "deeper analysis", "must-gather", or "IntelliAide".
+#   All requests in this file are phrased to trigger IntelliAide automatically.
+#
+# Modes:
+#   Live mode (default) — IntelliAide reads live cluster state via kubectl/oc.
+#   Must-gather mode    — set RUN_MUST_GATHER=1 env var, or phrase the request
+#                         with "using must-gather" / "collect must-gather".
+#


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add one concrete dataSource example to this manifest.

This PR introduces PVC-backed input, but every Proposal here is live-mode only and none show the required spec.dataSource.claimName shape. That makes the new /data/input workflow hard to discover from the only example file shipped with the feature.

A single example with:

dataSource: claimName: <pre-existing-pvc>

would make the new API immediately usable.

Also applies to: 46-188

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/setup/09-intelliaide-proposals.yaml` around lines 21 - 31, Add one concrete example of the PVC-backed input to the IntelliAide proposals by updating a Proposal's spec to include a dataSource block using the exact shape spec.dataSource.claimName and a placeholder PVC name (e.g., <pre-existing-pvc>); locate any Proposal objects in this file that currently run in live mode and add a dataSource: claimName: <pre-existing-pvc> under their spec so the new /data/input workflow and PVC-backed input are discoverable (ensure the YAML key is spec.dataSource.claimName and that only the example uses a placeholder to avoid changing runtime behavior).

coderabbitai · 2026-05-29T09:34:23Z

+#   1. Build and push skills image:
+#        podman build -f Dockerfile.skills \
+#          -t quay.io/<org>/intelliaide-skills:latest .
+#        podman push quay.io/<org>/intelliaide-skills:latest


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the documented image build/push steps with the image each Proposal references.

Lines 33-36 tell users to publish quay.io/<org>/intelliaide-skills:latest, but every Proposal points at image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest. Following this file verbatim leaves the referenced image unavailable, so the examples do not work out-of-the-box.

Please either update the Proposal manifests to use the documented image, or add the missing retag/push step for the in-cluster registry.

Also applies to: 70-72, 101-103, 147-149, 183-185

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/setup/09-intelliaide-proposals.yaml` around lines 33 - 36, The docs instruct building/pushing quay.io/<org>/intelliaide-skills:latest but all Proposal manifests reference image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest, so examples fail; either (A) update each Proposal image field to quay.io/<org>/intelliaide-skills:latest (search for the string image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest in the Proposal resources and replace) or (B) add the missing retag/push steps after the build: retag quay image to the in-cluster registry name and push it (i.e., docker/podman tag quay.io/<org>/intelliaide-skills:latest image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest and push), and apply this fix for all occurrences noted (the other blocks at the same pattern: lines ~70-72, 101-103, 147-149, 183-185).

harche · 2026-05-29T12:57:17Z

+	// +optional
+	// +kubebuilder:validation:Minimum=1
+	// +kubebuilder:validation:Maximum=60
+	TimeoutMinutes *int32 `json:"timeoutMinutes,omitempty"`


This should be on ProposalStep, not on ProposalSpec. Right now all four steps (analysis, execution, verification, escalation) get the same timeout — but IntelliAide RCA analysis might need 30 minutes while a verification step that runs kubectl get pods needs 2. A single proposal-level knob forces users to over-provision the timeout for every step just to accommodate the slowest one.

Would you consider moving it to ProposalStep? That would also align with how agent and tools are already scoped — per-step, not per-proposal.

harche · 2026-05-29T13:02:05Z

@@ -1,5 +1,31 @@
 You are an analysis agent. Your job is to diagnose the problem, determine the root cause, and propose one or more remediation options. Do NOT attempt to fix, patch, or execute any changes — only analyze and propose.

+## Skills


The operator is skill-agnostic by design — it orchestrates Proposal lifecycles without knowing which skills are mounted. Hardcoding intelliaide references and keyword-routing here breaks that separation, and making skill absence fatal would break any Proposal that doesn't mount skills.

No changes to the analysis template are needed. spec.request already carries the user's intent to the agent — the adapter creating the Proposal can embed whatever context it needs there (e.g., "perform deep RCA on diagnostic data at /data/input"). The agent sees the request, sees the skills in its filesystem, and decides what to use. The operator shouldn't be involved in that decision.

yeah this makes sense - so if we are triggering the proposal manually, this is not needed and if we want it to be automatic we need an adapter. either way this is not necessary.

harche · 2026-05-29T13:05:20Z

+// pre-populated with data before the Proposal is created. The operator
+// mounts it read-only at a well-known path (/data/input) accessible to
+// all skills in the sandbox pod.
+type DataSource struct {


Do we need PVC support here? The sandbox pods are ephemeral — if the agent needs must-gather data, it can collect it during its run (e.g., oc adm must-gather -o /tmp/must-gather) and work with it locally. The skill or spec.request can instruct the agent to do that.

This avoids adding new API surface (DataSource type, immutability rules, PVC RBAC, volume mount patching in sandbox templates) for something the agent can already do with its existing tools and a writable /tmp.

@harche while that can be done, it restricts the ability to bring a must-gather from outside and analyze it, i.e customer cases.

harche · 2026-06-01T11:41:54Z

+	//
+	// Immutable: input data source is fixed at creation.
+	// +optional
+	DataSource *DataSource `json:"dataSource,omitzero"`


If you set add the DataSource here then all the steps (i.e. analyse, execution etc) would need mount that data source, but that may not be necessary in all use cases. Consider moving this to per-step, so data source is attached only to the steps that need it.

Or maybe consider making it part of ToolsSpec above, that way it will be available at the proposal as well as step level.

openshift-ci · 2026-06-03T12:02:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xrajesh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

examples/setup/09-intelliaide-proposals.yaml (1)

37-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the documented image push target with the image referenced in Proposal specs.

Line 37–40 tells users to publish quay.io/<org>/intelliaide-skills:latest, but Lines 73, 104, 149, 184, and 220 reference a different in-cluster image path. Following this file verbatim can leave the referenced Proposal image unavailable.

Proposed minimal fix (make Proposal images match docs)

@@
-      - image: image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
+      - image: quay.io/<org>/intelliaide-skills:latest
@@
-      - image: image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
+      - image: quay.io/<org>/intelliaide-skills:latest
@@
-      - image: image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
+      - image: quay.io/<org>/intelliaide-skills:latest
@@
-      - image: image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
+      - image: quay.io/<org>/intelliaide-skills:latest
@@
-      - image: image-registry.openshift-image-registry.svc:5000/openshift-lightspeed/lightspeed-skills:latest
+      - image: quay.io/<org>/intelliaide-skills:latest

Also applies to: 73-74, 104-105, 149-150, 184-185, 220-220

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/setup/09-intelliaide-proposals.yaml` around lines 37 - 40, The
documented push target quay.io/<org>/intelliaide-skills:latest is inconsistent
with the image path used in the Proposal specs; update either the push command
or the Proposal YAMLs so they match. Specifically, change the podman/push target
in the build instructions (the quay.io/<org>/intelliaide-skills:latest line) to
the exact in-cluster image path referenced by the Proposal entries (the image
string used in the Proposal specs at lines referencing the Proposal images) or
vice versa so all references use the same image name and tag across the file.

🧹 Nitpick comments (1)

controller/proposal/sandbox_templates.go (1)

282-287: ⚡ Quick win

Use constant error identifiers for the newly added wrapped errors.

Several new wraps (for example Line 282 and Line 573) use inline string literals in fmt.Errorf. Please move these messages to const error identifiers and keep %w wrapping.

Proposed pattern

+const (
+    ErrSetLightspeedProvider   = "set LIGHTSPEED_PROVIDER"
+    ErrAddDataSourcePVCVolume  = "add data source PVC volume"
+)
...
- return fmt.Errorf("set LIGHTSPEED_PROVIDER: %w", err)
+ return fmt.Errorf("%s: %w", ErrSetLightspeedProvider, err)
...
- return fmt.Errorf("add data source PVC volume: %w", err)
+ return fmt.Errorf("%s: %w", ErrAddDataSourcePVCVolume, err)

As per coding guidelines, **/*.go: Define errors as constants using const ErrFoo = "…" and wrap them with fmt.Errorf("%s: %w", …) for context.

Also applies to: 301-303, 307-315, 317-319, 323-325, 333-339, 343-349, 573-575

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/proposal/sandbox_templates.go` around lines 282 - 287, Replace
inline error message strings in fmt.Errorf wraps with file-level const error
identifiers and preserve %w wrapping; for each occurrence (e.g., the calls
wrapping errors from setEnvVar in the block using LIGHTSPEED_PROVIDER and
LIGHTSPEED_MODEL and the other listed lines) define a const like
ErrSetLightspeedProvider = "set LIGHTSPEED_PROVIDER" and ErrSetLightspeedModel =
"set LIGHTSPEED_MODEL" (and similar names for the other cases), then change the
wraps to fmt.Errorf("%s: %w", ErrSetLightspeedProvider, err) etc., keeping the
same wrapped original error and use the existing symbols (setEnvVar, tmpl,
model, providerTypeString, llm.Spec.Type) to locate the sites to update.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/v1alpha1/proposal_types.go`:
- Around line 262-272: The step-level immutability check currently compares
entire objects (spec.analysis, spec.execution, spec.verification) so changing
TimeoutMinutes (*TimeoutMinutes) gets rejected; update the immutability
validation to exclude timeoutMinutes from the equality check (i.e., allow
changes where only spec.*.timeoutMinutes hasChanged). Modify the webhook/CEL
rule or validation logic that enforces immutability so it compares individual
fields (or uses hasChanged(...) to whitelist spec.analysis.timeoutMinutes,
spec.execution.timeoutMinutes, spec.verification.timeoutMinutes) instead of
comparing the whole object, referencing the TimeoutMinutes field in
proposal_types.go and the spec.analysis/spec.execution/spec.verification
immutability checks.

In `@controller/proposal/handlers.go`:
- Around line 510-512: The escalation path loses the configured timeout because
the temporary resolvedStep used for override agents omits TimeoutMinutes;
compute the timeout once from stepTimeout(step) (or copy step.TimeoutMinutes)
before building resolvedStep and either set resolvedStep.TimeoutMinutes =
step.TimeoutMinutes or pass the precomputed timeout into r.Agent.Escalate so the
override path uses the same timeout value; update the logic around
stepTimeout(step), resolvedStep, and the r.Agent.Escalate call to preserve the
original timeout.

In `@controller/proposal/sandbox_agent.go`:
- Around line 197-201: Replace the hard-coded defaultSandboxTimeout passed to
s.Sandbox.WaitReady so sandbox readiness respects the step's configured timeout:
compute the timeout for WaitReady from the step's timeout (e.g., the step's
TimeoutMinutes or the remaining time on ctx's deadline), then pass that derived
duration instead of defaultSandboxTimeout to s.Sandbox.WaitReady (use the same
claimName and ctx). Ensure you handle zero/negative remaining time and fall back
to a sensible minimum if needed.

---

Duplicate comments:
In `@examples/setup/09-intelliaide-proposals.yaml`:
- Around line 37-40: The documented push target
quay.io/<org>/intelliaide-skills:latest is inconsistent with the image path used
in the Proposal specs; update either the push command or the Proposal YAMLs so
they match. Specifically, change the podman/push target in the build
instructions (the quay.io/<org>/intelliaide-skills:latest line) to the exact
in-cluster image path referenced by the Proposal entries (the image string used
in the Proposal specs at lines referencing the Proposal images) or vice versa so
all references use the same image name and tag across the file.

---

Nitpick comments:
In `@controller/proposal/sandbox_templates.go`:
- Around line 282-287: Replace inline error message strings in fmt.Errorf wraps
with file-level const error identifiers and preserve %w wrapping; for each
occurrence (e.g., the calls wrapping errors from setEnvVar in the block using
LIGHTSPEED_PROVIDER and LIGHTSPEED_MODEL and the other listed lines) define a
const like ErrSetLightspeedProvider = "set LIGHTSPEED_PROVIDER" and
ErrSetLightspeedModel = "set LIGHTSPEED_MODEL" (and similar names for the other
cases), then change the wraps to fmt.Errorf("%s: %w", ErrSetLightspeedProvider,
err) etc., keeping the same wrapped original error and use the existing symbols
(setEnvVar, tmpl, model, providerTypeString, llm.Spec.Type) to locate the sites
to update.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea29bc9e-57fb-4b1c-869c-c5572b79c808

📥 Commits

Reviewing files that changed from the base of the PR and between 31261be and 89cd2a4.

⛔ Files ignored due to path filters (4)

api/v1alpha1/zz_generated.deepcopy.go is excluded by !**/zz_generated.deepcopy.go
config/crd/bases/agentic.openshift.io_proposals.yaml is excluded by !config/crd/bases/**
config/rbac/role.yaml is excluded by !config/rbac/role.yaml
go.sum is excluded by !**/*.sum, !go.sum

📒 Files selected for processing (16)

.ai/spec/how/reconciler.md
.gitignore
api/v1alpha1/proposal_types.go
api/v1alpha1/tools_types.go
controller/proposal/agent.go
controller/proposal/client.go
controller/proposal/client_test.go
controller/proposal/handlers.go
controller/proposal/reconciler.go
controller/proposal/reconciler_test.go
controller/proposal/resolve.go
controller/proposal/sandbox_agent.go
controller/proposal/sandbox_agent_test.go
controller/proposal/sandbox_templates.go
controller/proposal/sandbox_templates_test.go
examples/setup/09-intelliaide-proposals.yaml

✅ Files skipped from review due to trivial changes (2)

.gitignore
.ai/spec/how/reconciler.md

coderabbitai · 2026-06-03T12:05:35Z

+	// timeoutMinutes sets the timeout for this step's sandbox agent call.
+	// This controls how long the operator waits for the sandbox pod to
+	// become ready and for the agent to complete its work. Increase this
+	// for long-running tools (e.g., IntelliAide RCA takes 10-30 minutes).
+	// Defaults to 5 minutes when omitted.
+	//
+	// Mutable: can be adjusted before approving a step.
+	// +optional
+	// +kubebuilder:validation:Minimum=1
+	// +kubebuilder:validation:Maximum=60
+	TimeoutMinutes *int32 `json:"timeoutMinutes,omitempty"`


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

timeoutMinutes is documented as mutable but blocked by current CEL immutability.

Line 262 introduces a mutable field, but the existing step-level immutability rules on spec.analysis, spec.execution, and spec.verification compare whole objects, so changing only timeoutMinutes will still be rejected on update.

💡 Suggested direction

- // +kubebuilder:validation:XValidation:rule="!has(oldSelf.analysis) || (has(self.analysis) && self.analysis == oldSelf.analysis)",message="analysis is immutable once set" - // +kubebuilder:validation:XValidation:rule="!has(oldSelf.execution) || (has(self.execution) && self.execution == oldSelf.execution)",message="execution is immutable once set" - // +kubebuilder:validation:XValidation:rule="!has(oldSelf.verification) || (has(self.verification) && self.verification == oldSelf.verification)",message="verification is immutable once set" + // Keep agent/tools immutable, but allow timeoutMinutes updates. + // (e.g., compare only agent/tools in these rules or move immutability checks to ProposalStep fields)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/v1alpha1/proposal_types.go` around lines 262 - 272, The step-level immutability check currently compares entire objects (spec.analysis, spec.execution, spec.verification) so changing TimeoutMinutes (*TimeoutMinutes) gets rejected; update the immutability validation to exclude timeoutMinutes from the equality check (i.e., allow changes where only spec.*.timeoutMinutes hasChanged). Modify the webhook/CEL rule or validation logic that enforces immutability so it compares individual fields (or uses hasChanged(...) to whitelist spec.analysis.timeoutMinutes, spec.execution.timeoutMinutes, spec.verification.timeoutMinutes) instead of comparing the whole object, referencing the TimeoutMinutes field in proposal_types.go and the spec.analysis/spec.execution/spec.verification immutability checks.

coderabbitai · 2026-06-03T12:05:35Z

+	timeout := stepTimeout(step)
+	escalationResult, err := r.Agent.Escalate(ctx, proposal, step, escalationText, timeout)
 	if err != nil {


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Escalation override path drops configured timeout.

stepTimeout(step) at Line 510 uses step.TimeoutMinutes, but when an override agent is set the temporary resolvedStep omits TimeoutMinutes, so escalation falls back to default 5 minutes.

💡 Suggested fix

- step = resolvedStep{Agent: &agent, LLM: &llm, Tools: step.Tools} + step = resolvedStep{Agent: &agent, LLM: &llm, Tools: step.Tools, TimeoutMinutes: step.TimeoutMinutes}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/handlers.go` around lines 510 - 512, The escalation path loses the configured timeout because the temporary resolvedStep used for override agents omits TimeoutMinutes; compute the timeout once from stepTimeout(step) (or copy step.TimeoutMinutes) before building resolvedStep and either set resolvedStep.TimeoutMinutes = step.TimeoutMinutes or pass the precomputed timeout into r.Agent.Escalate so the override path uses the same timeout value; update the logic around stepTimeout(step), resolvedStep, and the r.Agent.Escalate call to preserve the original timeout.

coderabbitai · 2026-06-03T12:05:35Z

+	// Pod startup is an infrastructure concern unrelated to the agent's work
+	// budget. Using a fixed ceiling here ensures that the full user-configured
+	// timeout is available for the agent call itself, and avoids the effective
+	// wall-clock time being 2× the configured value.
+	endpoint, err := s.Sandbox.WaitReady(ctx, claimName, defaultSandboxTimeout)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Step timeout is not applied to sandbox readiness.

Line 201 hard-caps WaitReady at 5 minutes, so a proposal with timeoutMinutes: 30 can still fail during pod startup at 5 minutes before the agent call begins.

💡 Suggested fix

- endpoint, err := s.Sandbox.WaitReady(ctx, claimName, defaultSandboxTimeout) + endpoint, err := s.Sandbox.WaitReady(ctx, claimName, timeout)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Pod startup is an infrastructure concern unrelated to the agent's work

// budget. Using a fixed ceiling here ensures that the full user-configured

// timeout is available for the agent call itself, and avoids the effective

// wall-clock time being 2× the configured value.

endpoint, err := s.Sandbox.WaitReady(ctx, claimName, defaultSandboxTimeout)

// Pod startup is an infrastructure concern unrelated to the agent's work

// budget. Using a fixed ceiling here ensures that the full user-configured

// timeout is available for the agent call itself, and avoids the effective

// wall-clock time being 2× the configured value.

endpoint, err := s.Sandbox.WaitReady(ctx, claimName, timeout)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controller/proposal/sandbox_agent.go` around lines 197 - 201, Replace the hard-coded defaultSandboxTimeout passed to s.Sandbox.WaitReady so sandbox readiness respects the step's configured timeout: compute the timeout for WaitReady from the step's timeout (e.g., the step's TimeoutMinutes or the remaining time on ctx's deadline), then pass that derived duration instead of defaultSandboxTimeout to s.Sandbox.WaitReady (use the same claimName and ctx). Ensure you handle zero/negative remaining time and fall back to a sensible minimum if needed.

openshift-ci · 2026-06-03T12:08:18Z

@sakshipatels98-byte: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/api-lint	`89cd2a4`	link	true	`/test api-lint`
ci/prow/generate	`89cd2a4`	link	true	`/test generate`
ci/prow/unit	`89cd2a4`	link	true	`/test unit`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2026

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2026

sakshipatels98-byte force-pushed the must-gather-version branch from 2a89d44 to f71f525 Compare May 29, 2026 08:40

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2026

sakshipatels98-byte force-pushed the must-gather-version branch from f71f525 to 31261be Compare May 29, 2026 09:09

sakshipatels98-byte marked this pull request as ready for review May 29, 2026 09:25

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 29, 2026

openshift-ci Bot requested review from harche and joshuawilson May 29, 2026 09:26

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

harche reviewed May 29, 2026

View reviewed changes

harche mentioned this pull request May 29, 2026

intelliaide: add skill metadata, OWNERS, config, and README openshift/agentic-skills#28

Open

8 tasks

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2026

harche reviewed Jun 1, 2026

View reviewed changes

sakshipatels98-byte added 3 commits June 3, 2026 17:20

feat: add per-proposal timeoutMinutes and configurable sandbox timeout

d81e987

feat: add dataSource PVC mount and IntelliAide proposal templates

c718372

fix: per-step timeoutMinutes, tools.dataSource, and review follow-ups

89cd2a4

sakshipatels98-byte force-pushed the must-gather-version branch from 31261be to 89cd2a4 Compare June 3, 2026 11:59

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

	// +kubebuilder:validation:XValidation:rule="!has(oldSelf.dataSource) \|\| (has(self.dataSource) && self.dataSource == oldSelf.dataSource)",message="dataSource is immutable once set"
	// +kubebuilder:validation:XValidation:rule="has(oldSelf.dataSource) ? (has(self.dataSource) && self.dataSource == oldSelf.dataSource) : !has(self.dataSource)",message="dataSource is immutable after creation"

		@@ -1,5 +1,31 @@
		You are an analysis agent. Your job is to diagnose the problem, determine the root cause, and propose one or more remediation options. Do NOT attempt to fix, patch, or execute any changes — only analyze and propose.

		## Skills

Conversation

sakshipatels98-byte commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

API Changes

Test Plan

Summary by CodeRabbit

Uh oh!

openshift-ci Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

harche May 29, 2026

Choose a reason for hiding this comment

Uh oh!

harche May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Prashanth684 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

harche May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Prashanth684 May 29, 2026

Choose a reason for hiding this comment

Uh oh!

harche Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

harche Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 3, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

sakshipatels98-byte commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading

harche Jun 1, 2026 •

edited

Loading