Skip to content

Fix RetryError not matched through ChainError wrapping#387

Open
databus23 wants to merge 1 commit into
masterfrom
fix/retry-error-matching
Open

Fix RetryError not matched through ChainError wrapping#387
databus23 wants to merge 1 commit into
masterfrom
fix/retry-error-matching

Conversation

@databus23

@databus23 databus23 commented May 22, 2026

Copy link
Copy Markdown
Member

Summary

  • Use errors.As instead of errors.Is when checking for *plugin.RetryError in state.Apply() and controllers.ApplyProfiles(). errors.Is with pointer types uses == comparison, which never matches different instances of the same type through an error chain.
  • Change the eviction plugin to return RetryError for drain failures instead of hard errors. Hard errors abort the reconcile and skip the node patch, preventing cordon from taking effect on the API server.

Root cause

When a trigger plugin (e.g. eviction/drain) encounters a PDB violation, it returned a hard error wrapped in ChainError. The callers tried errors.Is(err, &plugin.RetryError{}) which always returned false (pointer identity), causing the error to be treated as fatal. This aborted the reconcile loop before the node patch, so cordon was never applied — pods got rescheduled back, creating an infinite eviction loop.

Test plan

  • Added unit tests verifying errors.As extracts RetryError through ChainError
  • Added unit test confirming errors.Is does NOT match (documenting the pitfall)
  • All existing state, plugin, and plugin/impl tests pass
  • Controller and ESX integration tests pass (via gmake build/cover.out)

errors.Is with pointer types (without an Is() method) uses pointer
identity (==), so two different *RetryError instances never match.
This caused RetryErrors from trigger plugins (wrapped in ChainError)
to be treated as hard errors, which aborted the reconcile and skipped
the node patch — preventing cordon from taking effect during drain.

Fix by using errors.As which matches by type through the error chain.
Also change the eviction plugin to return RetryError for drain failures
instead of hard errors, allowing the reconcile patch to succeed so that
the cordon is applied to the API server.
@github-actions

Copy link
Copy Markdown

Merging this branch changes the coverage (1 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/sapcc/maintenance-controller/controllers 85.42% (ø)
github.com/sapcc/maintenance-controller/plugin 91.16% (ø)
github.com/sapcc/maintenance-controller/plugin/impl 81.03% (-0.11%) 👎
github.com/sapcc/maintenance-controller/state 93.33% (+0.04%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/sapcc/maintenance-controller/controllers/node_handler.go 91.18% (ø) 68 62 6
github.com/sapcc/maintenance-controller/plugin/impl/eviction.go 72.73% (-3.46%) 22 (+1) 16 6 (+1) 👎
github.com/sapcc/maintenance-controller/state/state.go 92.54% (+0.06%) 134 (+1) 124 (+1) 10 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/sapcc/maintenance-controller/plugin/plugin_test.go

continue
}
if errors.Is(err, retryErr) {
if errors.As(err, &retryErr) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw since go 1.26.0 we have new errors.AsType() func

if retryErr, ok := errors.AsType[*plugin.RetryError](err); ok {
    ...
}

Maybe you find it pretty :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants