Skip to content

[Feature/Bug] Managed reconciler needs a mechanism to express "async operation in progress" without triggering ReconcileSuccess or Update() #941

Description

@jacob-hudson

This issue is filed against crossplane/crossplane-runtime because the root constraint exists here, not in upjet or any provider.

The managed reconciler in pkg/reconciler/managed unconditionally calls status.MarkConditions(xpv1.ReconcileSuccess()) whenever Observe() returns ResourceUpToDate: true with no error. There is currently no way for an ExternalClient to signal "I am deliberately returning ResourceUpToDate: true to suppress Update(), but reconciliation has not actually succeeded — an async operation is in progress." The Synced condition is therefore always overwritten to True in this path, regardless of what the provider set before returning.

Upjet exploits ResourceUpToDate: true as the mechanism to suppress Update() during async operations, but pays the cost of a permanently false Synced=True signal. This is not a bug in upjet's logic — it is an absence of a contract in crossplane-runtime that would allow the two concerns (suppress Update, signal pending) to be expressed independently.

A corresponding issue has been filed at crossplane/upjet: [link — fill in once filed].


What happened?

When an upjet-based provider (e.g. upbound/provider-aws-kafka) is performing a long-running async operation — such as an MSK cluster broker instance type change taking 60+ minutes — the managed resource's Synced condition remains True for the entire duration.

The AsyncOperation=False (Reason: Ongoing) condition is correctly set by upjet throughout. But Synced continues to read True with Reason: ReconcileSuccess, actively misleading operators, alerting systems, and automation that polls Synced to determine whether a resource has converged.

Expected: Synced=False while an async operation is in progress, transitioning back to True when it completes.

Actual: Synced=True/ReconcileSuccess is actively written on every reconcile pass for the full duration of the operation.

Root cause

In pkg/reconciler/managed/reconciler.go, the reconcile loop unconditionally sets ReconcileSuccess when Observe() returns {ResourceExists: true, ResourceUpToDate: true} with no error:

// Pseudocode — see pkg/reconciler/managed/reconciler.go for the full implementation
observation, err := external.Observe(ctx, managed)
if err != nil {
    // ...error path sets ReconcileError...
}
if observation.ResourceUpToDate {
    status.MarkConditions(xpv1.ReconcileSuccess())
    return reconcile.Result{RequeueAfter: r.pollInterval}, nil
}
// ... else calls Update()

There is no branch for "ResourceUpToDate=true but not actually reconciled successfully." The signal is binary: either the resource is up to date (success) or it needs updating (call Update()).

Upjet's ASyncInProgress branch in pkg/controller/external.go is forced to return ResourceUpToDate: true to prevent the reconciler from calling Update() — which could trigger a redundant or harmful second async operation. The consequence is that crossplane-runtime then overwrites any pending/in-progress condition upjet attempted to set, with ReconcileSuccess.

Why this cannot be fixed in upjet alone

The two naive fixes are both broken:

Returning ResourceUpToDate: false: Causes the reconciler to call Update(), potentially queuing a second async operation against an already in-flight AWS change.

Setting ReconcileError before returning: Fails for two independent reasons. First, crossplane-runtime overwrites it with ReconcileSuccess unconditionally on the same pass (since ResourceUpToDate: true is returned). Second, ReconcileError triggers exponential backoff, which is the opposite of what is needed — frequent requeues are necessary so that async completion is detected promptly.

There is also no ReconcilePending constructor in crossplane-runtime v1.x. The only Synced-related constructors available are ReconcileSuccess, ReconcileError, and ReconcilePaused — none model "deferred pending an in-flight async operation."

Proposed solutions

Option 1: Add ReconcilePending condition constructor

Add a new constructor to crossplane-runtime:

// ReconcilePending returns a condition indicating reconciliation is deferred
// pending an in-flight async operation. Unlike ReconcileError, this does not
// trigger exponential backoff.
func ReconcilePending(msg string) xpv1.Condition {
    return xpv1.Condition{
        Type:    xpv1.TypeSynced,
        Status:  corev1.ConditionFalse,
        Reason:  "ReconcilePending",
        Message: msg,
    }
}

The managed reconciler would need a corresponding hook — for example, checking whether the provider set a ReconcilePending condition after Observe() returns but before the ResourceUpToDate branch overwrites it, and if so, honouring it without calling status.MarkConditions(ReconcileSuccess()).

This correctly models the semantic distinction: "reconciliation failed" vs "reconciliation is deferred pending an in-flight operation." However, it requires condition inspection inside the reconciler, which is slightly awkward — see Option 2 for a cleaner alternative.

Option 2 (preferred): Extend ExternalObservation with AsyncOperationInProgress bool

ExternalObservation is defined in crossplane-runtime. A new field would allow providers to express both signals simultaneously — suppress Update() and suppress ReconcileSuccess — without the reconciler needing to inspect conditions written as side effects during Observe():

type ExternalObservation struct {
    ResourceExists           bool
    ResourceUpToDate         bool
    AsyncOperationInProgress bool // suppress Update() AND suppress ReconcileSuccess
    // ...
}

The managed reconciler would handle this in the ResourceUpToDate branch:

if observation.ResourceUpToDate {
    if observation.AsyncOperationInProgress {
        status.MarkConditions(xpv1.ReconcilePending("Async operation in progress"))
        // r.pollInterval is illustrative; a shorter interval would be preferable
        // to detect async completion promptly
        return reconcile.Result{RequeueAfter: r.pollInterval}, nil
    }
    status.MarkConditions(xpv1.ReconcileSuccess())
    return reconcile.Result{RequeueAfter: r.pollInterval}, nil
}

Upjet would then set AsyncOperationInProgress: true in its ASyncInProgress branch instead of relying on ResourceUpToDate: true alone. This keeps semantics within the observation contract and avoids condition inspection logic in the reconciler.

Option 3 (upjet-only, partial): Dedicated async condition type

Accept that Synced semantics are constrained by crossplane-runtime and document the AsyncOperation=False (Reason: Ongoing) condition as the canonical signal for async-in-progress state. This is the least invasive option but leaves Synced incorrect and does not address the root issue. Not recommended as a resolution.

Operational impact

For operators running alerting on Synced=False persisting beyond a threshold, the current behaviour silently swallows the alert: Synced=True is written on every reconcile pass during a genuine in-progress mutation, so the alert condition never triggers. Any tooling or automation that polls Synced to determine whether a resource has converged will receive a false positive for the full duration of the async update.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions