Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics#13911
Merged
…& Elvis falsy semantics
* BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod.
* v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values, fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name.
* v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated. Fixes BanyanDB liaison node_type="" stored instead of "n/a".
* banyandb otel-rules: PT15S -> PT1M rate window.
* Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest.
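The Elvis change can be illustrated with a minimal sketch. It is not the project's `MalRuntimeHelper`; the class and method names below are hypothetical, and the truthiness rules only cover the cases this PR lists (null, false, numeric zero, empty string/collection/map/array).

```java
import java.lang.reflect.Array;
import java.util.Collection;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; not the real MalRuntimeHelper.
final class ElvisSketch {
    // Old behaviour: the fallback applies only when the primary is null.
    static Object nullOnlyElvis(Object primary, Object fallback) {
        return Optional.ofNullable(primary).orElse(fallback);
    }

    // Groovy-style truthiness for the cases named in the PR description.
    static boolean isTruthy(Object v) {
        if (v == null) return false;
        if (v instanceof Boolean) return (Boolean) v;
        if (v instanceof Number) return ((Number) v).doubleValue() != 0.0;
        if (v instanceof CharSequence) return ((CharSequence) v).length() > 0;
        if (v instanceof Collection) return !((Collection<?>) v).isEmpty();
        if (v instanceof Map) return !((Map<?, ?>) v).isEmpty();
        if (v.getClass().isArray()) return Array.getLength(v) > 0;
        return true;
    }

    // New behaviour: the caller evaluates the primary once and passes the value in;
    // the fallback is also passed as a value, so it is evaluated eagerly.
    static Object elvis(Object primary, Object fallback) {
        return isTruthy(primary) ? primary : fallback;
    }
}
```

With an empty-string `node_type`, `nullOnlyElvis("", "n/a")` keeps `""`, while `elvis("", "n/a")` yields `"n/a"`.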
ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for these actions, so the stale pins were rejected and the CI workflow failed with startup_failure. Updated each to the newest approved v4 SHA:
* docker/login-action v3.7.0 -> v4.2.0 (650006c6)
* docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5)
* docker/setup-qemu-action v3.6.0 -> v4.1.0 (06116385)
* dorny/paths-filter v3.0.2 -> v4.0.1 (fbd0ab8f)
The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows on each counter's own sample name instead of the rule-level context.metricName. MALExpressionExecutionTest relied on context.metricName (set to a unique sourceFile/metricName) to keep each rule's prime/real pair isolated in the process-wide CounterWindow.INSTANCE singleton. The new keying ignores that field, so leftover samples from one rule leaked into the next across the ~1350 sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 = 50/6, a lower bound pulled from an earlier rule). Reset CounterWindow.INSTANCE per rule (the pattern BanyanDBErrorRateReproTest already uses via @BeforeEach) and drop the now-dead setMetricName scaffolding (context.metricName has no readers after the keying change). No production code or expected values changed; 1350/1350 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
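A minimal sketch of why the window key matters, and why a process-wide window needs a reset between tests. The class, the key format, and the sample values (100 and 40) are hypothetical; the real CounterWindow keeps timestamps and label sets that are omitted here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-point window: remembers the previous sample per key.
final class CounterWindowSketch {
    private final Map<String, Double> previous = new HashMap<>();

    // Returns the delta against the previous sample stored under windowKey
    // (0.0 when this is the first sample for that key).
    double increase(String windowKey, double value) {
        Double prev = previous.put(windowKey, value);
        return prev == null ? 0.0 : value - prev;
    }

    // Clears all slots; a shared instance would call this between test cases.
    void reset() {
        previous.clear();
    }

    // Assumed key shape: metric name plus the reduced label set.
    static String key(String metricName, String labels) {
        return metricName + "{" + labels + "}";
    }
}
```

If two frozen counters (100 and 40) are both stored under the rule's output name, the third sample computes `100 - 40 = 60` even though neither counter moved. Keyed by each counter's own name, every delta stays at zero.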
wankai123
previously approved these changes
Jun 15, 2026
The runtime-rule schema-cache self-heal only filled a MISSING MetadataRegistry entry, never refreshed a STALE one. A cluster peer applies schema changes with withoutSchemaChange (inspectBackend=false), whose contract says the installer "must populate local caches from the declared model" — but whenCreating gated ahead of isExists and skipped that populate. Combined with an insert-only registry that never evicts, a reshape (remove+add) left the peer translating writes with the old shape; a drop left a stale entry behind. C-1: ModelInstaller.whenCreating now calls a new RPC-free populateLocalCacheOnly hook on the inspectBackend=false branch (honoring the flag contract). BanyanDBIndexInstaller overrides it -> registerLocallyByKind, a blind overwrite, so a reshape's re-add refreshes the entry. No-op for ES/JDBC. C-2: ModelInstaller.whenRemoving now calls a new evictLocalCache hook on both the peer (skip-drop) and main (post-dropTable) paths. BanyanDBIndexInstaller overrides it -> MetadataRegistry.evict(model), keyed exactly as findMetadata, so a dropped/reshaped model leaves no stale translation. The read-side self-heal stays as a defensive backstop. Also refresh CLAUDE.md tip #16: etcd was removed; schema now lives in BanyanDB's _schema property store, mod_revision is a client-stamped UnixNano timestamp, and data-node propagation is async (WatchSchemas + 30s reconcile), so the fence is still required. API names unchanged; SchemaWatcher lives in OAP's in-tree client. Tests: ModelInstallerNoInitTest +3 (populate-on-peer-create, evict-on-peer-remove, drop-then-evict-on-main); MetadataRegistryTest +1 (evict across all key branches). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
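The stale-entry problem and the two hooks can be sketched with a plain map. This is an assumption-laden illustration: the class below is hypothetical, and it models a schema "shape" as a string rather than the real registry's metadata objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a local schema cache.
final class SchemaCacheSketch {
    private final Map<String, String> shapeByModel = new ConcurrentHashMap<>();

    // Insert-only registration: a reshape's re-add cannot replace the old shape.
    void registerIfAbsent(String model, String shape) {
        shapeByModel.putIfAbsent(model, shape);
    }

    // Blind overwrite, in the spirit of the populate-on-peer-create hook.
    void register(String model, String shape) {
        shapeByModel.put(model, shape);
    }

    // Eviction, in the spirit of the evict-on-remove hook.
    void evict(String model) {
        shapeByModel.remove(model);
    }

    String find(String model) {
        return shapeByModel.get(model);
    }
}
```

With insert-only registration, a remove+add reshape leaves `find` returning the old shape; the overwrite and evict paths are what keep the translation current.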
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (SchemaWatcher.awaitRevisionApplied) ran once per metric/downsampling — a large file did K×M sequential <=2s fences, overrunning the apply's REST budget on a laggy cluster. Add StorageManipulationOpt.withSchemaChangeDeferredFence(): same flags as withSchemaChange() plus a deferFence toggle + a DeferredFence callback holder. Under deferFence, BanyanDBIndexInstaller.fenceOnRevision records each resource's mod_revision and registers a single flush instead of fencing inline; the apply (MalFileApplier, after MetricConvert) runs StorageManipulationOpt.runDeferredFence() ONCE on the cumulative max revision — collapsing the whole file to one barrier. The main apply paths (DSLManager picker + applyNowForRuleFile, DSLRuntimeDelete revert) switch to the deferred-fence opt. Drops keep fencing inline (doFenceOnRevision) — a deletion's visibility is per-key and must not ride a batched revision flush. Peer / withoutSchemaChange applies are unaffected (no revision recorded -> runDeferredFence is a no-op). Tests: StorageManipulationOptTest (5) covers the deferred-fence mechanics (same-flags, run-once, no-op-when-empty, exception propagation, latest-wins). Verified: full -Pall build + javadoc, checkstyle, license; runtime-rule suite (MalFileApplierTest etc.) + MetadataRegistryTest + ModelInstallerNoInitTest green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
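A minimal sketch of the batching idea, assuming the fence can be modelled as a callback that waits for a revision. The class and method shapes are hypothetical and are not the real `StorageManipulationOpt` API.

```java
import java.util.function.LongConsumer;

// Hypothetical holder: records revisions during DDL, fences once at the end.
final class DeferredFenceSketch {
    private long maxRevision;
    private LongConsumer fence; // latest registration wins

    // Called once per created resource instead of fencing inline.
    void record(long modRevision, LongConsumer awaitRevisionApplied) {
        maxRevision = Math.max(maxRevision, modRevision);
        fence = awaitRevisionApplied;
    }

    // Called once per file: a single barrier on the cumulative max revision.
    // No-op when nothing recorded a revision (for example a peer apply).
    void runDeferredFence() {
        if (fence != null) {
            fence.accept(maxRevision);
        }
    }
}
```

Recording revisions 3, 9, and 5 and then flushing results in one wait on revision 9, rather than three sequential waits.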
In-memory owner of runtime-rule apply progress on the cluster main: each apply opens a status entry keyed by a generated applyId and advances through ApplyPhase (PENDING→VALIDATING→DDL→FENCING→ROLLING_OUT→APPLIED) with DEGRADED (committed but fence-unconfirmed) and FAILED (pre-commit error) off-ramps. Two indexes back the two query shapes: by applyId (live handle) and by (catalog,name)→latest applyId (content-based path, resolved against the durable content hash for when the apply-id is gone after a refresh / main restart). Immutable ApplyStatus snapshots in a ConcurrentHashMap — single-writer per apply (apply orchestration serializes per file), lock-free concurrent reads. Clock is injectable for deterministic tests and the timed watch added later. Self-contained building block; the apply-lifecycle wiring, DSLRuntimeState failureReason + per-node breakdown, the GetApplyStatus query surface, and the background convergence watch with TTL eviction land in Phase 3. State is in-memory by design — the durable content hash reconstructs truth after restart. Tests: SchemaApplyCoordinatorTest (8) — begin/index, phase transitions + updatedAt, terminal markApplied/markDegraded/markFailed + reason, forward-transition clears stale reason, unknown-id no-ops, content-hash-gated content lookup, latest-wins. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
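The two-index layout described above might look roughly like the following. It is a sketch under assumptions: the class, field names, and id format are hypothetical, and the phase list follows this commit message (later commits in this PR revise it).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical coordinator: immutable snapshots, lookup by id or by file.
final class ApplyCoordinatorSketch {
    enum Phase { PENDING, VALIDATING, DDL, FENCING, ROLLING_OUT, APPLIED, DEGRADED, FAILED }

    static final class Status {
        final String applyId;
        final String fileKey;
        final String contentHash;
        final Phase phase;
        final long updatedAtMs;

        Status(String applyId, String fileKey, String contentHash, Phase phase, long updatedAtMs) {
            this.applyId = applyId;
            this.fileKey = fileKey;
            this.contentHash = contentHash;
            this.phase = phase;
            this.updatedAtMs = updatedAtMs;
        }
    }

    private final Map<String, Status> statusByApplyId = new ConcurrentHashMap<>();
    private final Map<String, String> latestApplyByFile = new ConcurrentHashMap<>();
    private final AtomicLong sequence = new AtomicLong();
    private final LongSupplier clock; // injectable for deterministic tests

    ApplyCoordinatorSketch(LongSupplier clock) {
        this.clock = clock;
    }

    String begin(String catalog, String name, String contentHash) {
        String applyId = "apply-" + sequence.incrementAndGet();
        String fileKey = catalog + "/" + name;
        statusByApplyId.put(applyId, new Status(applyId, fileKey, contentHash, Phase.PENDING, clock.getAsLong()));
        latestApplyByFile.put(fileKey, applyId);
        return applyId;
    }

    // Replaces the snapshot; an unknown id is a no-op.
    void transition(String applyId, Phase next) {
        statusByApplyId.computeIfPresent(applyId,
            (id, s) -> new Status(id, s.fileKey, s.contentHash, next, clock.getAsLong()));
    }

    Status find(String applyId) {
        return statusByApplyId.get(applyId);
    }

    // Content-based path: only answers when the tracked hash matches.
    Status findByFile(String catalog, String name, String contentHash) {
        String applyId = latestApplyByFile.get(catalog + "/" + name);
        Status s = applyId == null ? null : statusByApplyId.get(applyId);
        return s != null && s.contentHash.equals(contentHash) ? s : null;
    }
}
```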
Add SchemaApplyCoordinator.INSTANCE (process-wide, mirrors MetadataRegistry.INSTANCE) so the apply / cluster-RPC / reconcile paths share one coordinator without constructor threading; tests keep using new SchemaApplyCoordinator(clock). RuntimeRuleService.applyStructural now begins an apply (keyed by content hash) right before the apply attempt and marks a terminal phase on every exit: APPLIED on success, FAILED (with the specific reason — layer conflict, apply threw, getLastApplyError, persist-failed) on the pre-commit/failure paths, and DEGRADED on commit-deferred (DB persisted but this node's commit tail threw — durable, peers converge, not a revert). A missed branch leaves only a stale PENDING the background watch reaps (Phase 3c) — not a correctness bug. Filter-only applies do no DDL/fence, so they are not tracked here. The query surface (GetApplyStatus RPC + REST progress endpoint), the async apply-id response, and the background convergence watch land in 3b/3c. Verified: RuntimeRuleRestHandlerTest (20) unchanged, SchemaApplyCoordinatorTest (8), MalFileApplierTest (13) green; checkstyle + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose the apply-status the main tracks (Phase 3a) to the UI/operator: - proto: GetApplyStatus(ApplyStatusRequest) -> ApplyStatusResponse on RuntimeRuleClusterService, with ApplyStatusPhase mirroring the Java ApplyPhase. - RuntimeRuleClusterServiceImpl.getApplyStatus: main-served; reads SchemaApplyCoordinator.INSTANCE by apply_id, else by (catalog,name) gated on content_hash; maps to the response (found=false / UNKNOWN when nothing matches). - RuntimeRuleClusterClient.getApplyStatus: routes the read to the deterministic main (MainRouter.mainPeer); null on unreachable -> caller degrades. - RuntimeRuleService.queryApplyStatus + GET /runtime/rule/status: self-main (or single-process) reads the local coordinator; otherwise routes to the main. Query by applyId (live handle) or catalog+name(+contentHash) once it's gone (page refresh / main restart). Always 200 JSON; found=false for no match. Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20) unchanged, SchemaApplyCoordinatorTest (8), MainRouterTest green; checkstyle + license clean. Background convergence watch + TTL eviction land in 3c. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SchemaApplyCoordinator.evictExpired(ttlMs) reaps tracked applies whose last update is older than the retention window — terminal ones linger long enough for a post-apply UI poll / post-refresh content query, then are dropped to bound memory; a stale PENDING left by a missed apply branch is reaped too (a later query then returns UNKNOWN and the caller falls back to the durable content hash). The (catalog,name)->latest index entry is cleared only when it still points at an evicted apply, so a newer apply for the same file keeps its mapping. RuntimeRuleModuleProvider schedules the sweep on the existing reconciler executor (every 5 min, 1 h TTL). The live DEGRADED->APPLIED re-fence is intentionally NOT done here: it needs BanyanDB-client access runtime-rule does not hold, and BanyanDB's own 30s reconcile converges the actual schema regardless — an operator re-query reflects it. Tests: SchemaApplyCoordinatorTest +2 (evict reaps old + clears its file index; newer apply keeps the index when the older is evicted). checkstyle + license clean. Completes Phase 3 (3a wiring + 3b query surface + 3c eviction). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
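The eviction rule for the file index (clear it only when it still points at the evicted apply) maps naturally onto `Map.remove(key, value)`. A minimal sketch with hypothetical names, taking the current time as a parameter instead of reading a clock:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical TTL sweep over tracked applies.
final class ApplyEvictionSketch {
    private static final class Tracked {
        final String fileKey;
        final long updatedAtMs;

        Tracked(String fileKey, long updatedAtMs) {
            this.fileKey = fileKey;
            this.updatedAtMs = updatedAtMs;
        }
    }

    private final Map<String, Tracked> byApplyId = new ConcurrentHashMap<>();
    private final Map<String, String> latestByFile = new ConcurrentHashMap<>();

    void track(String applyId, String fileKey, long updatedAtMs) {
        byApplyId.put(applyId, new Tracked(fileKey, updatedAtMs));
        latestByFile.put(fileKey, applyId);
    }

    // Drops applies older than ttlMs; returns how many were evicted.
    int evictExpired(long nowMs, long ttlMs) {
        int[] evicted = {0};
        byApplyId.entrySet().removeIf(e -> {
            if (nowMs - e.getValue().updatedAtMs <= ttlMs) {
                return false;
            }
            // Conditional remove: a newer apply for the same file keeps its mapping.
            latestByFile.remove(e.getValue().fileKey, e.getKey());
            evicted[0]++;
            return true;
        });
        return evicted[0];
    }

    String latestFor(String fileKey) {
        return latestByFile.get(fileKey);
    }

    boolean isTracked(String applyId) {
        return byApplyId.containsKey(applyId);
    }
}
```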
After a successful structural commit the main broadcasts a new NotifyApplied
admin-internal RPC so peers converge NOW instead of waiting up to one ~30s
refresh tick to notice the new DB row.
- proto: NotifyApplied(NotifyAppliedRequest{catalog,name,content_hash,sender,ts})
-> NotifyAppliedAck.
- RuntimeRuleClusterServiceImpl.notifyApplied: self-broadcast suppressed; else
submits a full reconcile (dslManager.tick) to a single daemon executor off the
gRPC thread and acks immediately. The reconcile is per-file-locked + idempotent
(unchanged files short-circuit on hash); a lost/failed notify is non-fatal — the
peer self-converges on its own tick.
- RuntimeRuleClusterClient.broadcastNotifyApplied: best-effort fan-out to non-self
peers, same sequential-with-deadline transport as Suspend/Resume.
- RuntimeRuleService: on the drained success path, broadcastNotifyApplied with the
committed content hash (the !drained force-no-change path keeps its Resume).
Design note: the apply correlation rides this POST-commit notify, not Suspend —
Suspend is broadcast before the apply runs, when the apply-id/revision don't yet
exist. Tightens the convergence window without a hard peer->main dependency.
Deferred follow-up: per-node failure breakdown (DSLRuntimeState.failureReason
aggregated into GetApplyStatus) — the status is main-orchestrated today.
Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20), MalFileApplierTest (13),
SchemaApplyCoordinatorTest (10), MainRouterTest green; checkstyle + license clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
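The peer-side handling described above (suppress a node's own broadcast, otherwise hand the reconcile to an executor and ack right away) can be sketched as follows. The class is hypothetical and uses a plain `Executor` in place of the gRPC service and the daemon executor.

```java
import java.util.concurrent.Executor;

// Hypothetical peer-side handler for a post-commit notify.
final class NotifyAppliedSketch {
    private final String selfNodeId;
    private final Executor reconcileExecutor;
    private final Runnable fullReconcile;

    NotifyAppliedSketch(String selfNodeId, Executor reconcileExecutor, Runnable fullReconcile) {
        this.selfNodeId = selfNodeId;
        this.reconcileExecutor = reconcileExecutor;
        this.fullReconcile = fullReconcile;
    }

    // Returns true when a reconcile was handed to the executor.
    // The caller acks in either case; a failed hand-off is non-fatal because
    // the peer still converges on its own periodic tick.
    boolean onNotifyApplied(String sender) {
        if (selfNodeId.equals(sender)) {
            return false; // self-broadcast suppressed
        }
        try {
            reconcileExecutor.execute(fullReconcile);
            return true;
        } catch (RuntimeException rejected) {
            return false;
        }
    }
}
```

In a test, passing `Runnable::run` as the executor runs the reconcile inline, which makes the scheduling observable without threads.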
…ete drop fence Two HIGH issues found reviewing the batched-fence work on this branch: - NC2: StorageManipulationOpt.runDeferredFence() was not one-shot. A reconciler tick reuses ONE opt across every rule file (RuleSync#runOnce), so a later file that performed no DDL re-ran the previous file's stale fence, and the cumulative revision made each file over-fence on prior files' DDL. runDeferredFence now clears the closure and resets the accumulated revision after it runs (in finally, so a transport failure still isolates the next file); each file fences on its own DDL only. The closure reads getMaxModRevision() during await, so reset happens after. - NC1: BanyanDBIndexInstaller drop fence decided rev>0 vs AwaitSchemaDeleted on opt.getMaxModRevision() (cumulative across the shared opt). An earlier create/ binding revision on the same opt made a tombstone-less primary delete take the revision branch and silently skip the deletion barrier. dropTable now captures the primary resource's OWN delete revision and threads it to fenceOnRevisionOrDeletion, which decides on that value (0 for trace/property, whose delete RPCs have no revision variant — those always key-fence). Added doFenceOnRevisionValue(client, rev, ctx) as the value-based core. Tests: 3 new StorageManipulationOptTest cases (one-shot across files; revision seen during await then reset; reset even when the fence throws). Changelog: the batched-fence bullet now describes the one-shot flush and per-delete drop fence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
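The one-shot behaviour from NC2 can be sketched as a reset in `finally`, so a shared opt never replays a previous file's fence even when the fence throws. The names are hypothetical and the fence is again modelled as a callback.

```java
import java.util.function.LongConsumer;

// Hypothetical one-shot variant of a deferred fence holder.
final class OneShotFenceSketch {
    private long maxRevision;
    private LongConsumer fence;

    void record(long modRevision, LongConsumer awaitRevisionApplied) {
        maxRevision = Math.max(maxRevision, modRevision);
        fence = awaitRevisionApplied;
    }

    long currentMaxRevision() {
        return maxRevision;
    }

    // Runs the pending fence at most once, then clears the closure and the
    // accumulated revision. The reset sits in finally so a transport failure
    // still isolates the next file, and it happens after the await because
    // the fence reads the revision while it runs.
    void runDeferredFence() {
        LongConsumer pending = fence;
        try {
            if (pending != null) {
                pending.accept(maxRevision);
            }
        } finally {
            fence = null;
            maxRevision = 0L;
        }
    }
}
```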
…fixes Addresses the review findings on the apply-status / batched-fence work and adds the configurable async schema fence the design called for. Async configurable schema fence (operator decision: "3 min, generous progress"): - New receiver-runtime-rule.deferredFenceTimeoutSeconds (default 180), carried onto StorageManipulationOpt (fenceTimeoutMs; 0 = installer's short 2s default). The REST operator apply runs the deferred fence in the BACKGROUND after the durable commit + peer resume (fenceRunByCaller + a daemon fence executor), so a slow cluster never blocks the apply or holds peers suspended. The reconciler tick keeps the short inline 2s fence; inline/static/delete fences are unchanged. - POST /addOrUpdate returns its applyId immediately at ROLLING_OUT; the background fence drives FENCING -> APPLIED, or DEGRADED with the lagging data-node ids (fenceLaggards) surfaced on the status + proto (new repeated fence_laggards) + JSON. The fence-phase listener on the opt lets the installer emit FENCING the instant it starts blocking. Phase machine (H1/#3): wire transition(DDL) before the apply and transition(ROLLING_OUT) after persist; re-add FENCING (now a real, observable long wait) and trim the never-emitted VALIDATING from the enum/proto/toProtoPhase so the contract matches the code. Content-hash fallback (H2/#4): queryApplyStatus degrades to the durable rule row on coordinator-miss / main-unreachable / found=false — an ACTIVE row whose hash matches reports APPLIED (derivedFrom=durable-dao), persist-is-commit. Notify (#1/#2/M2): main-side NotifyApplied is fire-and-forget off the REST thread; the commit_deferred (durable) path also notifies; peer-side reconcile nudges coalesce a burst to one queued tick. Lows: applyId in the structural_applied envelope (#5); servedBy parity on the local-path JSON (L1); defensive stripApplyPhasePrefix maps unknown -> UNKNOWN (NC3); provider shutdown stops the notify + fence executors (L3). 
Tests: StorageManipulationOptTest one-shot/reset cases; RuntimeRuleRestHandlerTest + GuardrailIntegrationTest track the structural apply's 3-arg (fence-opt) overload. Verified: server-core 306/306, runtime-rule 139/139, checkstyle + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…doc link
Closes the review-flagged coverage gap on the new apply-status surfaces, and fixes a
dangling javadoc reference the whole-project javadoc build caught.
- RuntimeRuleClusterServiceImplTest (new): getApplyStatus maps every ApplyPhase to proto
along the happy path (PENDING→DDL→ROLLING_OUT→FENCING→APPLIED), surfaces DEGRADED with
the laggard node ids, maps FAILED, returns UNKNOWN/found=false when nothing is tracked,
and resolves by (catalog,name) when applyId is absent; notifyApplied suppresses a node's
own broadcast (no reconcile) and schedules an off-thread reconcile for a peer notify.
- MalFileApplierTest: a deferred-fence transport failure rolls the apply back and carries
the metrics registered before the fence for the caller's rollback set.
- RuntimeRuleRestHandlerTest: queryApplyStatus degrades to the durable rule row when the
live status is gone (ACTIVE row → APPLIED, derivedFrom=durable-dao), and returns
UNKNOWN/found=false when neither a live status nor a row exists.
- DSLManager: drop the stale {@link #newDeferredFenceOpt()} (method removed when the REST
orchestrator took over building the deferred-fence opt) so javadoc resolves.
Verified: server-core 306/306, runtime-rule 139/139 (+ the 10 new cases above), whole-project
checkstyle + compile + javadoc + license all clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…) + review fixes Second review round on the apply-status branch. HIGH — write-safety regression fixed. The async fence resumed dispatch (finalizeCommit + peer notify) BEFORE confirming schema propagation, but an un-propagated write is silently dropped at the data node (CLAUDE.md tip #16), so this lost data during the propagation window. The fence now GATES the resume: after persist the apply marks FENCING and returns the applyId immediately, but the background task (fenceThenResume) fences FIRST, then finalizes the local commit + resumes/notifies peers. Dispatch stays suspended through FENCING — a clean collection pause, not dropped writes. On a genuine laggard it resumes anyway after the budget and marks DEGRADED + the laggard ids (a stuck node must not park the metric forever). Phase order corrected to PENDING → DDL → FENCING → ROLLING_OUT → APPLIED across the enum, proto, status docs, and changelog. The executor-rejected fallback runs the fence inline so the suspend bracket never leaks. MEDIUM — shared-tick opt isolation. runDeferredFence now resets the accumulated revision unconditionally (even a no-DDL file that registered no closure), so a shared tick opt isolates each file; documented that a later commit-tail's drop revisions are inline-fenced and benign (the next file's own create revision is monotonically higher and dominates). LOW — executor shutdown. Added RuntimeRuleClusterServiceImpl.shutdown() (parity with RuntimeRuleService); corrected both shutdown() docs to state the framework's ModuleProvider has no stop hook, so they are daemon-thread + best-effort + for test teardown — not the prior false "called from provider shutdown". MEDIUM — MAL Elvis. Documented the eager-fallback as a known limitation at the codegen site (Javassist cannot lazy-eval without a Supplier-companion pass; real MAL fallbacks are pure, cheap reads so it is benign) rather than risk a codegen rewrite. 
Removed the now-redundant FencePhaseListener machinery (FENCING is marked synchronously before scheduling). Tests: StorageManipulationOptTest no-closure-reset case; the cluster test walks the corrected phase order. Verified: server-core 9/9, runtime-rule 149/149, whole-project checkstyle + compile + javadoc + license all clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rash-safe) Third review round. Root cause of the crash-recovery holes: the apply persisted the rule row BEFORE the background fence, so "durable" did not imply "schema propagated". A main crash after persist but before the fence left a durable row that peers then converged to via the periodic scan — which uses withoutSchemaChange (no fence) — so they resumed dispatch against a schema no node had confirmed propagated, and un-propagated writes are silently dropped (CLAUDE.md tip #16). Fix: reorder to suspend -> DDL -> fence -> persist -> commit -> resume. persistRuleSync moves into the background tail (fenceThenPersistThenResume), AFTER the fence. Now any durable rule row is guaranteed fence-confirmed, so peer / crash-recovery convergence is always safe; a crash before persist leaves no row (the orphaned measure from the DDL is inert) and the cluster stays on the prior, already-fenced content. The HTTP call returns applyId at FENCING = accepted, not yet durable; the operator polls for the rest. Also fixes the commit-tail bug (#3): finalizeCommit throwing left drained=false and fell into broadcastResume (peers ran the OLD bundle) — but the row is durable, so peers must converge to it. Now `commitFailure != null || drained` -> broadcastNotifyApplied; only a genuine no-change (force re-apply) does Resume. Persist failure -> FAILED status (rolled back); fence laggard / commit-tail -> DEGRADED; all polled (the POST already returned at FENCING). Docs synced to fence-then-persist (#5): the concept doc runtime-rule-hot-update.md (structural-path + schema-fence + failure-handling sections), the admin-API doc, the changelog, application.yml, and the ApplyPhase javadoc. Verified: runtime-rule 149/149, whole-project checkstyle + compile + javadoc + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
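The outcome rules in this commit reduce to two small decisions, sketched below. The enum and method names are hypothetical; the boolean inputs stand in for the real orchestration state (`commitFailure != null`, `drained`, and so on).

```java
// Hypothetical decision helpers for the fence -> persist -> commit -> resume tail.
final class FenceThenPersistSketch {
    enum Terminal { APPLIED, DEGRADED, FAILED }
    enum Broadcast { NOTIFY_APPLIED, RESUME }

    // Persist failure is pre-durable, so it rolls back to FAILED.
    // A fence laggard or a commit-tail failure leaves a durable row: DEGRADED.
    static Terminal terminalPhase(boolean fenceLagged, boolean persistFailed, boolean commitFailed) {
        if (persistFailed) {
            return Terminal.FAILED;
        }
        if (fenceLagged || commitFailed) {
            return Terminal.DEGRADED;
        }
        return Terminal.APPLIED;
    }

    // Once the row is durable, peers must converge to it, even if the local
    // commit tail threw. Only a genuine no-change re-apply resumes the old bundle.
    static Broadcast chooseBroadcast(boolean commitFailed, boolean drained) {
        return (commitFailed || drained) ? Broadcast.NOTIFY_APPLIED : Broadcast.RESUME;
    }
}
```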
…kill Documents the after-merge sync + feature-branch deletion flow (prune, ff-only master, verify the squash landed, then -D), with the note that `git branch -d` reporting "not fully merged" is expected for SkyWalking's squash-merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atus) /delete?mode=revertToBundled now begins a SchemaApplyCoordinator apply, returns 200 reverted_to_bundled + applyId immediately, and runs the revert pipeline + row delete on the fence executor, mapping the orchestrator outcome to a terminal phase. Mirrors the structural /addOrUpdate async model so a UI can poll /runtime/rule/status for the same progress. Precondition rejections (inactivate-first / no_bundled_twin / requires_revert_to_bundled) stay synchronous; revert-pipeline failures surface as /status FAILED. Revert never broadcasts Suspend, so the background tail needs no Resume.
A structural addOrUpdate now returns immediately at FENCING and persists the rule row in the BACKGROUND (fence -> persist -> commit -> resume), so the old read-back right after the 2xx raced the background persist and saw an empty /list (the Storage Elasticsearch job failed at Phase 1 CREATE). This is backend-independent: the persist itself is deferred, so it bit ES even though ES has synchronous DDL. - mal-storage/runtime-rule-flow.sh: add await_apply_terminal, which polls the new GET /runtime/rule/status?applyId=&catalog=&name= surface to a terminal phase (APPLIED/DEGRADED = durable; FAILED = fail) after every structural post_rule, and give list_row an optional contentHash-advance gate so the Phase 2/3 hash assertions wait for the new content rather than reading the stale pre-apply hash (status stays ACTIVE across an update). swctl has no runtime-rule `status` subcommand, so the poll goes through curl. - lal/lal-flow.sh: make the single-shot list reads poll - await_status for the NEW (async) log-mal and LAL v1 applies, await_hash_changed for the v1->v2 swap. - cluster/cluster-flow.sh: Phase 2's local hash read had the same status-unchanged-ACTIVE stale-hash race; add await_hash_change (which also surfaces lastApplyError on timeout).
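The e2e scripts do this polling in shell with curl; the same wait-for-terminal loop is sketched here in Java for consistency with the rest of the code in this PR. The class is hypothetical and reads the phase from a supplied function rather than calling the real `/runtime/rule/status` endpoint.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical poll loop: read the phase until it is terminal or polls run out.
final class AwaitTerminalSketch {
    private static final Set<String> TERMINAL =
        new HashSet<>(Arrays.asList("APPLIED", "DEGRADED", "FAILED"));

    // Returns the terminal phase, or null when maxPolls is exhausted
    // or the thread is interrupted while sleeping between polls.
    static String awaitTerminal(Supplier<String> readPhase, int maxPolls, long sleepMs) {
        for (int i = 0; i < maxPolls; i++) {
            String phase = readPhase.get();
            if (TERMINAL.contains(phase)) {
                return phase;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return null;
    }
}
```

A caller would treat APPLIED or DEGRADED as durable and FAILED or a null result as a test failure, matching the gate the scripts apply before reading `/list`.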
wankai123
previously approved these changes
Jun 16, 2026
RuntimeRuleModuleConfig gained deferredFenceTimeoutSeconds (default 180); the storage config-dump e2e compares the full config map, so the golden file needs the new key (sorted before refreshRulesPeriod). Value matches the code default.
wankai123
approved these changes
Jun 16, 2026
Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics
Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo.
1. BanyanDB schema-cache self-heal. Peer nodes flooded `<metric> is not registered` when a node held a live persist worker but never populated its local `MetadataRegistry` schema cache for that model (a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate; the registry never evicts and the 30s reconcile only covers runtime-rule rows). The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation (`MetadataRegistry.repopulateLocally`) before failing, and the no-init defer poll loop retries a transient backend probe error (`isRetryableNoInitProbeFailure`, default `false`; BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod.
2. v2 MAL `CounterWindow` key collision. `rate()`/`increase()`/`irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after `.sum(...)` therefore shared one window slot and computed rates against each other's values, fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). Fixed by keying on the counter's own metric name. `BanyanDBErrorRateReproTest` reproduces it with the real frozen values (966 → 0 after the fix).
3. v2 MAL Elvis `?:` falsy semantics. Compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only on `null`, so an empty-string primary kept `""` (a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`). Now single-evaluated through `MalRuntimeHelper.elvis`/`isTruthy`, matching Groovy falsy (null, false, numeric zero, empty string/collection/map/array). `MALElvisFalsyTest` covers empty/null/non-empty/side-effecting primaries.
4. banyandb otel-rules. `PT15S` → `PT1M` rate window to match the collector scrape / OAP minute-bucket cadence (MAL `rate()` is a two-point `CounterWindow` delta, not PromQL).

`CHANGES` log.