Isolate per-connection failures in graph reconciler #422
Merged
A single connection's runtime error (e.g. a live query against a workload) aborted the whole graph build for an object, so unrelated edges (offshoot, auth_secret, catalog, ...) were dropped and never persisted. ListConnectedObjectIDs now skips a failing connection and returns the edges that did resolve along with an aggregated error. The reconciler persists the partial result, retries discovery errors fast, and returns other errors for exponential-backoff retry.

Signed-off-by: Arnob kumar saha <arnob@appscode.com>
Problem
The graph adjacency cache (objGraph) is built per-object by the reconciler in pkg/graph/reconciler.go. ListConnectedObjectIDs evaluated every connection in rd.Spec.Connections, and on the first error other than NotFound/NoMatch it returned (nil, err), aborting the whole object. The reconciler then returned without calling objGraph.Update, so the object got zero edges.

Observed in the wild: a PostgreSQL 11 instance whose ui.kubedb.com query view runs a pg_stat_statements query using total_exec_time (a column that only exists in PG ≥ 13). The query fails with the error pq: column "total_exec_time" does not exist, which poisoned the entire graph build: offshoot, auth_secret, catalog, and the working views were all dropped, even though they resolve purely from the K8s API and have nothing to do with the SQL view.

Fix
- ListConnectedObjectIDs: a failing connection is now skipped (its error is wrapped with src -> target context and accumulated), the remaining connections still resolve, and the function returns (edges, utilerrors.NewAggregate(errs)).
- Reconcile:
  - Discovery errors are detected with an anyDiscoveryError helper (the k8s aggregate type implements Is but not As, so IsDiscoveryError can't see through it) and retried fast.
  - For other errors, the partial result is persisted with objGraph.Update first, then the error is returned for exponential-backoff retry.

Net: one bad connection no longer wipes an object's whole graph.
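The skip-and-aggregate behaviour described above can be sketched with the standard library alone. This is not the PR's code: edge, connection, aggregate, discoveryError, resolveAll, and the object names in main are hypothetical stand-ins. The aggregate type only mimics the relevant shape of the k8s aggregate (Is and Errors, but no As or Unwrap), and this anyDiscoveryError shares only its name with the PR's helper; its body is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// edge and connection are hypothetical stand-ins for a resolved graph edge
// and one entry of rd.Spec.Connections.
type edge struct{ Label, Target string }

type connection struct {
	Label   string
	Resolve func() ([]edge, error)
}

// aggregate mimics the shape described above: members are reachable through
// Errors(), and Is is implemented, but there is no As or Unwrap method.
type aggregate []error

func (a aggregate) Error() string {
	msgs := make([]string, 0, len(a))
	for _, e := range a {
		msgs = append(msgs, e.Error())
	}
	return "[" + strings.Join(msgs, ", ") + "]"
}

func (a aggregate) Errors() []error { return a }

func (a aggregate) Is(target error) bool {
	for _, e := range a {
		if errors.Is(e, target) {
			return true
		}
	}
	return false
}

// discoveryError is a hypothetical typed error for an API discovery failure.
type discoveryError struct{ Group string }

func (e *discoveryError) Error() string { return "discovery failed for " + e.Group }

// resolveAll skips a failing connection, records its error with
// "src -> target" context, and keeps resolving the rest.
func resolveAll(src string, conns []connection) ([]edge, error) {
	var edges []edge
	var errs []error
	for _, c := range conns {
		found, err := c.Resolve()
		if err != nil {
			errs = append(errs, fmt.Errorf("%s -> %s: %w", src, c.Label, err))
			continue
		}
		edges = append(edges, found...)
	}
	if len(errs) == 0 {
		return edges, nil
	}
	return edges, aggregate(errs)
}

// anyDiscoveryError walks the aggregate's members by hand, because
// errors.As cannot traverse a type that lacks As and Unwrap.
func anyDiscoveryError(err error) bool {
	if err == nil {
		return false
	}
	if agg, ok := err.(interface{ Errors() []error }); ok {
		for _, e := range agg.Errors() {
			if anyDiscoveryError(e) {
				return true
			}
		}
		return false
	}
	var de *discoveryError
	return errors.As(err, &de)
}

func main() {
	conns := []connection{
		{Label: "offshoot", Resolve: func() ([]edge, error) {
			return []edge{{Label: "offshoot", Target: "pod/pg-0"}}, nil
		}},
		{Label: "view", Resolve: func() ([]edge, error) {
			return nil, errors.New(`pq: column "total_exec_time" does not exist`)
		}},
	}
	edges, err := resolveAll("postgres/pg", conns)
	fmt.Println("edges resolved:", len(edges))
	fmt.Println("aggregated error:", err)
	fmt.Println("discovery error:", anyDiscoveryError(err))
}
```

A caller following this pattern would persist the returned edges before inspecting the error, so a partial graph survives even when the error is non-nil.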
Out of scope
The kubedb-side query bug (total_exec_time vs total_time for PG < 13) lives in a different repo and is unchanged.

Test
- go build ./pkg/graph/..., go vet, gofmt clean
- go test ./pkg/graph/... passes