Add bounded async runner for long running ops #876

Open
sujeet01 wants to merge 1 commit into ironcore-dev:main from opensovereigncloud:osc/enh/async-long-ops

@sujeet01 sujeet01 commented Jun 22, 2026

Contributor

Proposed Changes

This PR moves long-running snapshot and image work (IronCore populate, flatten-on-delete, child-flatten) to a bounded async runner so reconcile workers are not blocked. A shared runner runs these long operations with capped concurrency (--async-max-workers, default 10), deduplicates by key, and requeues reconcilers when async work finishes.

Fixes #822
Ref #837 (comment)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added asynchronous deletion processing for images by deferring child flattening, with a new “FlatteningChildren” state (shown as pending where applicable).
    • Added asynchronous snapshot lifecycle support, including “Populating” and “Flattening/Flattened” states.
    • Introduced --async-max-workers (default: 10) to control async concurrency.
  • Chores

    • Updated Kubernetes/controller-runtime, OpenTelemetry, Prometheus, and other dependencies.
  • Tests

    • Adjusted integration test suite async worker configuration to align with the new default.
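
As a rough illustration of the `--async-max-workers` option listed above, the sketch below parses and validates the flag with the standard library `flag` package. The flag name and default of 10 come from the PR description; the `options` type, the `parseOptions` helper, the help text, and the "at least 1" rule are assumptions, and the real application may use a different flag library.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultAsyncMaxWorkers mirrors the default stated in the PR description.
const defaultAsyncMaxWorkers = 10

// options is a hypothetical container for the parsed setting.
type options struct {
	AsyncMaxWorkers int
}

// parseOptions parses args and rejects worker counts below 1.
// The exact validation rule in the PR is not shown; "at least 1" is an assumption.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("volumeprovider", flag.ContinueOnError)
	fs.IntVar(&o.AsyncMaxWorkers, "async-max-workers", defaultAsyncMaxWorkers,
		"maximum number of concurrently running async operations")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if o.AsyncMaxWorkers < 1 {
		return options{}, fmt.Errorf("--async-max-workers must be at least 1, got %d", o.AsyncMaxWorkers)
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"--async-max-workers=4"})
	fmt.Println(o.AsyncMaxWorkers, err)

	_, err = parseOptions([]string{"--async-max-workers=0"})
	fmt.Println(err)
}
```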

sujeet01 self-assigned this Jun 22, 2026
sujeet01 requested a review from a team as a code owner June 22, 2026 08:09
github-actions Bot added the size/XXL and enhancement labels Jun 22, 2026
sujeet01 requested a review from gonzolino June 22, 2026 08:10

coderabbitai Bot commented Jun 22, 2026

Walkthrough

Adds an async runner with bounded concurrency and rewires image and snapshot reconciliation to submit child flattening and snapshot population asynchronously. New API states, runner wiring, app configuration, and volume-server mappings are added, along with dependency updates.
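
The volume-server mapping mentioned above might look roughly like the following sketch. The state constant names come from the change summary; their string values, the local `IRISnapshotState` enum, and the `toIRIState` helper are hypothetical stand-ins, and the assumption that all three new snapshot states map to pending is based only on the summary's wording.

```go
package main

import "fmt"

// SnapshotState mirrors the names in the change summary; the string values are assumed.
type SnapshotState string

const (
	SnapshotStateReady      SnapshotState = "Ready"
	SnapshotStatePopulating SnapshotState = "Populating"
	SnapshotStateFlattening SnapshotState = "Flattening"
	SnapshotStateFlattened  SnapshotState = "Flattened"
)

// IRISnapshotState is a hypothetical stand-in for the IRI state enum.
type IRISnapshotState int

const (
	IRISnapshotPending IRISnapshotState = iota
	IRISnapshotReady
)

// toIRIState reports the in-progress async states as pending so that
// clients keep waiting until the snapshot reaches Ready.
func toIRIState(s SnapshotState) (IRISnapshotState, error) {
	switch s {
	case SnapshotStateReady:
		return IRISnapshotReady, nil
	case SnapshotStatePopulating, SnapshotStateFlattening, SnapshotStateFlattened:
		return IRISnapshotPending, nil
	default:
		return IRISnapshotPending, fmt.Errorf("unknown snapshot state %q", s)
	}
}

func main() {
	for _, s := range []SnapshotState{SnapshotStatePopulating, SnapshotStateReady} {
		iri, err := toIRIState(s)
		fmt.Println(s, iri == IRISnapshotPending, err)
	}
}
```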

Changes

Async runner and controller integration

| Layer / File(s) | Summary |
|---|---|
| API state constants (api/image.go, api/snapshot.go, internal/volumeserver/snapshot.go, internal/volumeserver/volume.go) | Adds ImageStateFlatteningChildren, SnapshotStatePopulating, SnapshotStateFlattening, and SnapshotStateFlattened, and maps the new states to pending IRI states. |
| Async runner (internal/async/runner.go) | Adds the async runner types, sentinel errors, lifecycle methods, dispatcher loop, worker execution, listener notification, and context-aware submission. |
| Async key helpers (internal/controllers/common.go) | Adds async key prefixes and key builder/parser helpers, and removes the synchronous child-flattening helper. |
| Image async flattening (internal/controllers/image_async.go, internal/controllers/image_controller.go) | Adds async image flattening, wires AsyncRunner into the image reconciler, registers completion handling, and switches image deletion to async child flattening. |
| Snapshot async populate and flattening (internal/controllers/snapshot_async.go) | Adds async snapshot populate and deletion-time flattening logic, retry handling, store resync gating, and state update helpers. |
| Snapshot controller async wiring (internal/controllers/snapshot_controller.go) | Injects the async runner, registers completion handling, changes deleted-at reconciliation, resubmits in-progress work, routes IronCore-backed snapshots through async populate, and persists Ready for volume-backed snapshots. |
| App runner wiring (cmd/volumeprovider/app/app.go, tests/integration/integration_suite_test.go) | Adds async worker configuration, validates it, constructs and starts the shared runner, injects it into reconcilers, and sets the integration test worker limit. |
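
The async key helpers listed in the table could follow a prefix-plus-ID scheme along these lines. This is a sketch only: the prefix strings and the `buildKey` and `parseKey` names are hypothetical, since the summary states that prefixes and builder/parser helpers exist but does not show them.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical key prefixes, one per kind of async operation.
const (
	snapshotPopulatePrefix = "snapshot-populate/"
	snapshotFlattenPrefix  = "snapshot-flatten/"
	imageFlattenPrefix     = "image-flatten-children/"
)

// buildKey joins an operation prefix and an object ID into a runner key.
func buildKey(prefix, id string) string {
	return prefix + id
}

// parseKey returns the object ID if key carries the given prefix and a non-empty ID.
func parseKey(key, prefix string) (string, bool) {
	if !strings.HasPrefix(key, prefix) {
		return "", false
	}
	id := strings.TrimPrefix(key, prefix)
	return id, id != ""
}

func main() {
	key := buildKey(snapshotPopulatePrefix, "snap-123")
	fmt.Println(key)

	id, ok := parseKey(key, snapshotPopulatePrefix)
	fmt.Println(id, ok)

	// A reconciler can ignore keys that belong to another operation kind.
	_, ok = parseKey(key, imageFlattenPrefix)
	fmt.Println(ok)

	_, ok = parseKey(buildKey(snapshotFlattenPrefix, ""), snapshotFlattenPrefix)
	fmt.Println(ok)
}
```

Distinct prefixes keep a populate and a flatten for the same object from colliding in the runner's deduplication map.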

Dependency updates

| Layer / File(s) | Summary |
|---|---|
| Module version bumps (go.mod) | Bumps controller-utils, Kubernetes modules, controller-runtime, golang.org/x/sync, grpc-gateway, and indirect observability, protobuf, and Kubernetes-related dependencies. |

Sequence Diagram(s)

sequenceDiagram
  participant App as volumeprovider app
  participant Runner as async.Runner
  participant Reconciler as ImageReconciler / SnapshotReconciler
  participant Store as store
  participant Ceph as Ceph RBD

  App->>Runner: New(maxWorkers)
  App->>Runner: Start(ctx)
  Reconciler->>Runner: Submit(ctx, key, op)
  Runner->>Store: accept work state
  Runner->>Ceph: run flatten / populate operation
  Ceph-->>Runner: completion
  Runner->>Reconciler: HandleDone(evt)
  Reconciler->>Store: requeue / persist state changes
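
The last two steps of the diagram (the runner notifying a reconciler, which then requeues the object) can be sketched as follows. `HandleDone` and `evt.Key` appear in the diagram and review diff; the `Err` field, the `requeuer` interface, the `sliceQueue` fake, and the key prefixes are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// DoneEvent is what the runner hands to listeners; the Err field is assumed.
type DoneEvent struct {
	Key string
	Err error
}

// requeuer is a hypothetical minimal view of a reconcile work queue.
type requeuer interface {
	Add(id string)
}

// sliceQueue is an in-memory fake used only for this sketch.
type sliceQueue struct{ ids []string }

func (q *sliceQueue) Add(id string) { q.ids = append(q.ids, id) }

// SnapshotReconciler holds only what the sketch needs.
type SnapshotReconciler struct {
	queue     requeuer
	keyPrefix string // hypothetical prefix identifying snapshot operations
}

// HandleDone requeues the snapshot named by the event key. Because the runner
// is shared, events for other controllers arrive too and are skipped.
// Success and failure are both requeued so that reconcile decides what to do.
func (r *SnapshotReconciler) HandleDone(evt DoneEvent) {
	if !strings.HasPrefix(evt.Key, r.keyPrefix) {
		return
	}
	r.queue.Add(strings.TrimPrefix(evt.Key, r.keyPrefix))
}

func main() {
	q := &sliceQueue{}
	rec := &SnapshotReconciler{queue: q, keyPrefix: "snapshot/"}

	rec.HandleDone(DoneEvent{Key: "snapshot/snap-1"})
	rec.HandleDone(DoneEvent{Key: "image/img-7"}) // belongs to the image reconciler
	fmt.Println(q.ids)
}
```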

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • ironcore-dev/ceph-provider#813: Introduced the synchronous child-flattening path that this PR replaces with async flattening and new lifecycle states.

Suggested labels

size/XL, area/storage, integration-tests

Suggested reviewers

  • gonzolino
  • balpert89
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding a bounded async runner for long-running operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR includes the required Proposed Changes section and a Fixes reference; only the template's bullet formatting is missing. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/async/runner.go (2)

137-139: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider logging a warning if active underflows.

The defensive check if active > 0 before decrementing prevents underflow, but silently ignores the case where active is already zero when a done event arrives. Since each spawned operation should send exactly one done event, active reaching zero prematurely would indicate a bug in the runner logic.

Adding a warning log when active == 0 would aid debugging without changing behavior:

📊 Suggested observability improvement
 case evt := <-r.doneCh:
 	delete(inFlight, evt.Key)
 	if active > 0 {
 		active--
+	} else {
+		r.log.V(1).Info("Received done event with no active operations", "key", evt.Key)
 	}
 	r.notify(evt)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/async/runner.go` around lines 137 - 139, In the runner's event
handling logic where the active counter is decremented, the current code
silently skips the decrement when active is already zero. Add a warning log
statement in an else clause or separate condition that triggers when active is
zero to help detect premature done events, which could indicate a bug in the
runner logic. This preserves the existing underflow prevention behavior while
adding observability for debugging.

142-154: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider simplifying the switch statement to if-else for clarity.

The switch statement at line 144 uses an inline function func() bool { _, ok := inFlight[req.key]; return ok }() to check if a key exists, which is unnecessarily complex compared to a straightforward if-else chain.

♻️ Suggested refactor for readability
 case req := <-r.submitCh:
-	switch {
-	case func() bool { _, ok := inFlight[req.key]; return ok }():
+	if _, ok := inFlight[req.key]; ok {
 		req.result <- ErrInProgress
-	case active >= r.maxWorkers:
+	} else if active >= r.maxWorkers {
 		req.result <- ErrAtCapacity
-	default:
+	} else {
 		inFlight[req.key] = struct{}{}
 		active++
 		go r.run(ctx, req.key, req.op)
 		req.result <- nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/async/runner.go` around lines 142 - 154, Replace the switch
statement in the case handling for req from r.submitCh with a series of if-else
statements to improve readability. Instead of using an inline function func()
bool { _, ok := inFlight[req.key]; return ok }() to check if req.key exists in
the inFlight map, use a simple if condition directly. Check for the
inFlight[req.key] existence first, then check if active >= r.maxWorkers, and
finally handle the default case. This will eliminate the unnecessary function
wrapper and make the control flow clearer and more maintainable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/async/runner.go`:
- Around line 196-216: Add a godoc comment to the Submit method that documents
its context lifetime requirements. The comment should clarify that callers must
use a context that will be cancelled when the runner stops (or use a timeout),
and explain that passing a context with a different lifetime could cause
blocking due to the time-of-check-time-of-use race between the running flag
check and the submitCh send. This documentation will prevent accidental misuse
of the API by future developers.

In `@internal/controllers/snapshot_controller.go`:
- Around line 217-240: The defer statement for closeImage(log, img) on line 223
will execute when the function returns, but the image is explicitly closed on
line 235 when snapshot.Source.IronCoreImage is not empty, causing a
double-close. Replace the defer closeImage call with a closure that checks if
img is not nil before closing: defer func() { if img != nil { closeImage(log,
img) } }(), and after the explicit closeImage call on line 235, set img to nil
so the deferred function will skip the close when it eventually executes.
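
The nil-guard pattern described in the snapshot_controller.go item can be sketched in isolation. The `image` type and the single-argument `closeImage` below are simplified stand-ins (the real helper also takes a logger), and `reconcile` is a hypothetical function that only models the two code paths.

```go
package main

import "fmt"

// image is a stand-in for an open RBD image handle; it counts close calls.
type image struct{ closeCalls int }

// closeImage is a simplified stand-in for the controller's helper.
func closeImage(img *image) { img.closeCalls++ }

// reconcile closes img exactly once on both paths. The deferred closure reads
// the variable at return time, so setting it to nil disarms the deferred close.
func reconcile(img *image, handOffToAsync bool) {
	open := img
	defer func() {
		if open != nil {
			closeImage(open)
		}
	}()

	if handOffToAsync {
		// Close early so the async operation can reopen the image itself.
		closeImage(open)
		open = nil
		return
	}
	// The synchronous path keeps using the image until the function returns.
}

func main() {
	a, b := &image{}, &image{}
	reconcile(a, true)
	reconcile(b, false)
	fmt.Println(a.closeCalls, b.closeCalls) // prints "1 1"
}
```

Note that a plain `defer closeImage(img)` evaluates its argument when the defer statement runs, which is why the closure form is needed for the nil assignment to take effect.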

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e30b12fc-02fd-4815-9be2-69924e5e3717

📥 Commits

Reviewing files that changed from the base of the PR and between f763581 and 43b7b11.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (13)
  • api/image.go
  • api/snapshot.go
  • cmd/volumeprovider/app/app.go
  • go.mod
  • internal/async/runner.go
  • internal/controllers/common.go
  • internal/controllers/image_async.go
  • internal/controllers/image_controller.go
  • internal/controllers/snapshot_async.go
  • internal/controllers/snapshot_controller.go
  • internal/volumeserver/snapshot.go
  • internal/volumeserver/volume.go
  • tests/integration/integration_suite_test.go

opensovereigncloud-user force-pushed the osc/enh/async-long-ops branch 2 times, most recently from 34d6aec to 35317f8 June 24, 2026 12:32
Signed-off-by: sujeet01 <phadtare.sujeet@gmail.com>

Labels

enhancement, size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Move long-running ceph operations out of controllers

1 participant