Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics#13911

Merged: wankai123 merged 20 commits into master from fix/runtime-rule-schema-cache-self-heal on Jun 16, 2026

@wu-sheng (Member)

Fix BanyanDB runtime-rule schema-cache flood + v2 MAL CounterWindow collision and Elvis falsy semantics

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Bundles related correctness fixes surfaced while validating BanyanDB self-observability against the live demo.

1. BanyanDB schema-cache self-heal. Peer nodes flooded <metric> is not registered when a node held a live persist worker but never populated its local MetadataRegistry schema cache for that model (a withoutSchemaChange peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate; the registry never evicts and the 30s reconcile only covers runtime-rule rows). The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation (MetadataRegistry.repopulateLocally) before failing, and the no-init defer poll loop retries a transient backend probe error (isRetryableNoInitProbeFailure — default false, BanyanDB opts in for transient gRPC codes) instead of crash-looping the pod.

2. v2 MAL CounterWindow key collision. rate() / increase() / irate() keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name. Two or more counters that reduce to the same label set after .sum(...) therefore shared one window slot and computed rates against each other's values — fabricating non-zero rates from unchanged counters (the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). Fixed by keying on the counter's own metric name. BanyanDBErrorRateReproTest reproduces it with the real frozen values (966 → 0 after the fix).

3. v2 MAL Elvis ?: falsy semantics. Compiled to Optional.ofNullable(primary).orElse(fallback), applying the fallback only on null, so an empty-string primary kept "" (a BanyanDB liaison ServiceInstance stored node_type="" rather than n/a, because .sum([...,'node_type']) fills an absent group-by label with ""). Now single-evaluated through MalRuntimeHelper.elvis / isTruthy, matching Groovy falsy (null, false, numeric zero, empty string/collection/map/array). MALElvisFalsyTest covers empty/null/non-empty/side-effecting primaries.

4. banyandb otel-rules. PT15S → PT1M rate window to match the collector scrape / OAP minute-bucket cadence (MAL rate() is a two-point CounterWindow delta, not PromQL).
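
To illustrate item 3, here is a minimal sketch of the difference between a null-only fallback and a Groovy-style falsy fallback. The class `ElvisSketch` and its method bodies are assumptions written for this illustration; they are not the project's `MalRuntimeHelper` source, only a plausible reading of the falsy rules listed above.

```java
import java.lang.reflect.Array;
import java.util.Collection;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for illustration; not the actual MalRuntimeHelper code.
final class ElvisSketch {
    // Groovy-style truthiness: null, false, numeric zero, and empty
    // string/collection/map/array are all treated as falsy.
    static boolean isTruthy(Object v) {
        if (v == null) return false;
        if (v instanceof Boolean) return (Boolean) v;
        if (v instanceof Number) return ((Number) v).doubleValue() != 0.0;
        if (v instanceof CharSequence) return ((CharSequence) v).length() > 0;
        if (v instanceof Collection) return !((Collection<?>) v).isEmpty();
        if (v instanceof Map) return !((Map<?, ?>) v).isEmpty();
        if (v.getClass().isArray()) return Array.getLength(v) > 0;
        return true;
    }

    // Falsy-aware Elvis: the primary is evaluated once by the caller and
    // replaced by the fallback whenever it is falsy, not only when null.
    static <T> T elvis(T primary, T fallback) {
        return isTruthy(primary) ? primary : fallback;
    }

    // The old behaviour described above: the fallback applies only on null.
    static String nullOnly(String primary, String fallback) {
        return Optional.ofNullable(primary).orElse(fallback);
    }
}
```

With an empty-string primary, `nullOnly("", "n/a")` keeps `""`, while `elvis("", "n/a")` yields `"n/a"`, which is the node_type case described in item 3.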

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

…& Elvis falsy semantics

* BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod.
* v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values -- fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name.
* v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated -- fixes BanyanDB liaison node_type="" stored instead of "n/a".
* banyandb otel-rules: PT15S -> PT1M rate window.
* Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest.
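
The CounterWindow collision can be sketched with a toy two-point window. Everything here is an assumption made for illustration: the class `CounterWindowSketch`, the counter names, and the frozen values 10/25/40 are hypothetical and are not the real values used by BanyanDBErrorRateReproTest.

```java
import java.util.HashMap;
import java.util.Map;

// Toy two-point window for illustration; not the project's CounterWindow.
final class CounterWindowSketch {
    private final Map<String, Double> previous = new HashMap<>();

    // Returns current minus the value last stored in this slot (0 on first sight).
    double increase(String windowKey, String labels, double current) {
        String slot = windowKey + "|" + labels;
        Double prev = previous.put(slot, current);
        return prev == null ? 0.0 : current - prev;
    }

    // Feeds three frozen counters twice and returns the sum of absolute
    // deltas observed on the second pass. Unchanged counters should give 0.
    static double secondPassDelta(boolean keyByCounterName) {
        CounterWindowSketch window = new CounterWindowSketch();
        String[] names = {"err_a", "err_b", "err_c"}; // hypothetical counters
        double[] frozen = {10, 25, 40};               // hypothetical frozen values
        double total = 0;
        for (int pass = 0; pass < 2; pass++) {
            for (int i = 0; i < names.length; i++) {
                // Buggy keying uses one rule-level name for every counter.
                String key = keyByCounterName ? names[i] : "rule_output_metric";
                double delta = window.increase(key, "{}", frozen[i]);
                if (pass == 1) total += Math.abs(delta);
            }
        }
        return total;
    }
}
```

Keyed by the rule-level name, the three counters overwrite one slot and each delta is computed against a neighbour's value, so `secondPassDelta(false)` is non-zero even though nothing changed. Keyed by each counter's own name, `secondPassDelta(true)` is zero.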
@wu-sheng wu-sheng closed this Jun 14, 2026
@wu-sheng wu-sheng reopened this Jun 14, 2026
ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for
these actions, so the stale pins were rejected and the CI workflow failed
with startup_failure. Updated to the newest approved v4 SHA each:

* docker/login-action       v3.7.0  -> v4.2.0 (650006c6)
* docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5)
* docker/setup-qemu-action   v3.6.0  -> v4.1.0 (06116385)
* dorny/paths-filter         v3.0.2  -> v4.0.1 (fbd0ab8f)
@wu-sheng wu-sheng added the bug and backend labels Jun 14, 2026
@wu-sheng wu-sheng added this to the 11.0.0 milestone Jun 14, 2026
The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows on
each counter's own sample name instead of the rule-level context.metricName.
MALExpressionExecutionTest relied on context.metricName (set to a unique
sourceFile/metricName) to keep each rule's prime/real pair isolated in the
process-wide CounterWindow.INSTANCE singleton — the new keying ignores that
field, so leftover samples from one rule leaked into the next across the ~1350
sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 = 50/6,
a lower bound pulled from an earlier rule).

Reset CounterWindow.INSTANCE per rule (the pattern BanyanDBErrorRateReproTest
already uses via @BeforeEach) and drop the now-dead setMetricName scaffolding
(context.metricName has no readers after the keying change). No production code
or expected values changed; 1350/1350 tests pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wankai123 previously approved these changes Jun 15, 2026
The runtime-rule schema-cache self-heal only filled a MISSING MetadataRegistry
entry, never refreshed a STALE one. A cluster peer applies schema changes with
withoutSchemaChange (inspectBackend=false), whose contract says the installer
"must populate local caches from the declared model" — but whenCreating gated
ahead of isExists and skipped that populate. Combined with an insert-only
registry that never evicts, a reshape (remove+add) left the peer translating
writes with the old shape; a drop left a stale entry behind.

C-1: ModelInstaller.whenCreating now calls a new RPC-free populateLocalCacheOnly
hook on the inspectBackend=false branch (honoring the flag contract).
BanyanDBIndexInstaller overrides it -> registerLocallyByKind, a blind overwrite,
so a reshape's re-add refreshes the entry. No-op for ES/JDBC.

C-2: ModelInstaller.whenRemoving now calls a new evictLocalCache hook on both
the peer (skip-drop) and main (post-dropTable) paths. BanyanDBIndexInstaller
overrides it -> MetadataRegistry.evict(model), keyed exactly as findMetadata, so
a dropped/reshaped model leaves no stale translation. The read-side self-heal
stays as a defensive backstop.
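
A minimal sketch of why C-1 and C-2 matter on a peer, assuming a plain map as the local schema cache. The class `PeerCacheSketch` and its `fixed` switch are hypothetical; they only model the populate and evict hooks described above and are not ModelInstaller or MetadataRegistry code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a peer's local schema cache; for illustration only.
final class PeerCacheSketch {
    private final Map<String, String> registry = new ConcurrentHashMap<>();
    private final boolean fixed;

    PeerCacheSketch(boolean fixed) {
        this.fixed = fixed;
    }

    // Simulates an entry that was cached before the reshape.
    void seed(String model, String shape) {
        registry.put(model, shape);
    }

    // Peer-side remove: the backend drop is skipped either way; with the fix
    // the local entry is evicted (the C-2 behaviour).
    void peerRemove(String model) {
        if (fixed) registry.remove(model);
    }

    // Peer-side create with inspectBackend=false: with the fix the declared
    // shape blindly overwrites the local entry (the C-1 behaviour).
    void peerCreate(String model, String declaredShape) {
        if (fixed) registry.put(model, declaredShape);
    }

    String find(String model) {
        return registry.get(model);
    }
}
```

Without the hooks, a remove followed by an add leaves the seeded shape in place, so the peer keeps translating writes with the old shape. With them, the re-add refreshes the entry and a plain drop leaves nothing behind.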

Also refresh CLAUDE.md tip #16: etcd was removed; schema now lives in BanyanDB's
_schema property store, mod_revision is a client-stamped UnixNano timestamp, and
data-node propagation is async (WatchSchemas + 30s reconcile), so the fence is
still required. API names unchanged; SchemaWatcher lives in OAP's in-tree client.

Tests: ModelInstallerNoInitTest +3 (populate-on-peer-create, evict-on-peer-remove,
drop-then-evict-on-main); MetadataRegistryTest +1 (evict across all key branches).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wu-sheng and others added 15 commits June 15, 2026 11:08
A runtime-rule file changes dozens of rules at once, but the post-DDL fence
(SchemaWatcher.awaitRevisionApplied) ran once per metric/downsampling — a large
file did K×M sequential <=2s fences, overrunning the apply's REST budget on a
laggy cluster.

Add StorageManipulationOpt.withSchemaChangeDeferredFence(): same flags as
withSchemaChange() plus a deferFence toggle + a DeferredFence callback holder.
Under deferFence, BanyanDBIndexInstaller.fenceOnRevision records each resource's
mod_revision and registers a single flush instead of fencing inline; the apply
(MalFileApplier, after MetricConvert) runs StorageManipulationOpt.runDeferredFence()
ONCE on the cumulative max revision — collapsing the whole file to one barrier.
The main apply paths (DSLManager picker + applyNowForRuleFile, DSLRuntimeDelete
revert) switch to the deferred-fence opt.

Drops keep fencing inline (doFenceOnRevision) — a deletion's visibility is
per-key and must not ride a batched revision flush. Peer / withoutSchemaChange
applies are unaffected (no revision recorded -> runDeferredFence is a no-op).

Tests: StorageManipulationOptTest (5) covers the deferred-fence mechanics
(same-flags, run-once, no-op-when-empty, exception propagation, latest-wins).
Verified: full -Pall build + javadoc, checkstyle, license; runtime-rule suite
(MalFileApplierTest etc.) + MetadataRegistryTest + ModelInstallerNoInitTest green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In-memory owner of runtime-rule apply progress on the cluster main: each apply
opens a status entry keyed by a generated applyId and advances through ApplyPhase
(PENDING→VALIDATING→DDL→FENCING→ROLLING_OUT→APPLIED) with DEGRADED (committed but
fence-unconfirmed) and FAILED (pre-commit error) off-ramps. Two indexes back the
two query shapes: by applyId (live handle) and by (catalog,name)→latest applyId
(content-based path, resolved against the durable content hash for when the
apply-id is gone after a refresh / main restart).

Immutable ApplyStatus snapshots in a ConcurrentHashMap — single-writer per apply
(apply orchestration serializes per file), lock-free concurrent reads. Clock is
injectable for deterministic tests and the timed watch added later.

Self-contained building block; the apply-lifecycle wiring, DSLRuntimeState
failureReason + per-node breakdown, the GetApplyStatus query surface, and the
background convergence watch with TTL eviction land in Phase 3. State is
in-memory by design — the durable content hash reconstructs truth after restart.

Tests: SchemaApplyCoordinatorTest (8) — begin/index, phase transitions + updatedAt,
terminal markApplied/markDegraded/markFailed + reason, forward-transition clears
stale reason, unknown-id no-ops, content-hash-gated content lookup, latest-wins.
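
The coordinator shape described in this commit can be sketched roughly as follows. This is an assumption-laden illustration, not the SchemaApplyCoordinator source: the class name, field names, and id format are hypothetical, and the phase list is the one stated in this commit message (later commits in this PR trim VALIDATING and reorder the phases).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of an in-memory apply-status tracker.
final class ApplyCoordinatorSketch {
    enum Phase { PENDING, VALIDATING, DDL, FENCING, ROLLING_OUT, APPLIED, DEGRADED, FAILED }

    // Immutable snapshot: a transition replaces the whole object.
    static final class Status {
        final String applyId;
        final String contentHash;
        final Phase phase;
        final long updatedAt;

        Status(String applyId, String contentHash, Phase phase, long updatedAt) {
            this.applyId = applyId;
            this.contentHash = contentHash;
            this.phase = phase;
            this.updatedAt = updatedAt;
        }
    }

    private final Map<String, Status> byApplyId = new ConcurrentHashMap<>();
    private final Map<String, String> latestByFile = new ConcurrentHashMap<>();
    private final AtomicLong seq = new AtomicLong();
    private final LongSupplier clock; // injectable for deterministic tests

    ApplyCoordinatorSketch(LongSupplier clock) {
        this.clock = clock;
    }

    String begin(String catalog, String name, String contentHash) {
        String id = "apply-" + seq.incrementAndGet();
        byApplyId.put(id, new Status(id, contentHash, Phase.PENDING, clock.getAsLong()));
        latestByFile.put(catalog + "/" + name, id);
        return id;
    }

    // Unknown ids are a no-op, mirroring the behaviour listed in the tests.
    void transition(String applyId, Phase next) {
        byApplyId.computeIfPresent(applyId,
            (id, s) -> new Status(id, s.contentHash, next, clock.getAsLong()));
    }

    Status byId(String applyId) {
        return byApplyId.get(applyId);
    }

    // Content-based lookup, gated on the content hash of the latest apply.
    Status byContent(String catalog, String name, String contentHash) {
        String id = latestByFile.get(catalog + "/" + name);
        Status s = id == null ? null : byApplyId.get(id);
        return s != null && s.contentHash.equals(contentHash) ? s : null;
    }
}
```

Because each snapshot is immutable and swapped atomically in the map, readers never observe a half-updated status, which is what allows lock-free concurrent reads with a single writer per apply.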

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add SchemaApplyCoordinator.INSTANCE (process-wide, mirrors MetadataRegistry.INSTANCE)
so the apply / cluster-RPC / reconcile paths share one coordinator without
constructor threading; tests keep using new SchemaApplyCoordinator(clock).

RuntimeRuleService.applyStructural now begins an apply (keyed by content hash)
right before the apply attempt and marks a terminal phase on every exit:
APPLIED on success, FAILED (with the specific reason — layer conflict, apply
threw, getLastApplyError, persist-failed) on the pre-commit/failure paths, and
DEGRADED on commit-deferred (DB persisted but this node's commit tail threw —
durable, peers converge, not a revert). A missed branch leaves only a stale
PENDING the background watch reaps (Phase 3c) — not a correctness bug. Filter-only
applies do no DDL/fence, so they are not tracked here.

The query surface (GetApplyStatus RPC + REST progress endpoint), the async
apply-id response, and the background convergence watch land in 3b/3c.

Verified: RuntimeRuleRestHandlerTest (20) unchanged, SchemaApplyCoordinatorTest (8),
MalFileApplierTest (13) green; checkstyle + license clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose the apply-status the main tracks (Phase 3a) to the UI/operator:

- proto: GetApplyStatus(ApplyStatusRequest) -> ApplyStatusResponse on
  RuntimeRuleClusterService, with ApplyStatusPhase mirroring the Java ApplyPhase.
- RuntimeRuleClusterServiceImpl.getApplyStatus: main-served; reads
  SchemaApplyCoordinator.INSTANCE by apply_id, else by (catalog,name) gated on
  content_hash; maps to the response (found=false / UNKNOWN when nothing matches).
- RuntimeRuleClusterClient.getApplyStatus: routes the read to the deterministic
  main (MainRouter.mainPeer); null on unreachable -> caller degrades.
- RuntimeRuleService.queryApplyStatus + GET /runtime/rule/status: self-main (or
  single-process) reads the local coordinator; otherwise routes to the main.
  Query by applyId (live handle) or catalog+name(+contentHash) once it's gone
  (page refresh / main restart). Always 200 JSON; found=false for no match.

Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20) unchanged,
SchemaApplyCoordinatorTest (8), MainRouterTest green; checkstyle + license clean.
Background convergence watch + TTL eviction land in 3c.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SchemaApplyCoordinator.evictExpired(ttlMs) reaps tracked applies whose last
update is older than the retention window — terminal ones linger long enough for
a post-apply UI poll / post-refresh content query, then are dropped to bound
memory; a stale PENDING left by a missed apply branch is reaped too (a later
query then returns UNKNOWN and the caller falls back to the durable content
hash). The (catalog,name)->latest index entry is cleared only when it still
points at an evicted apply, so a newer apply for the same file keeps its mapping.
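
The index-clearing rule in the last sentence maps naturally onto the two-argument `Map.remove(key, value)`, which removes an entry only if it still holds the given value. The sketch below assumes that approach; the class `EvictionSketch` and its fields are hypothetical and are not the project's implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical TTL sweep over tracked applies; for illustration only.
final class EvictionSketch {
    static final class Entry {
        final String file;
        final long updatedAt;

        Entry(String file, long updatedAt) {
            this.file = file;
            this.updatedAt = updatedAt;
        }
    }

    private final Map<String, Entry> applies = new ConcurrentHashMap<>();
    private final Map<String, String> latestByFile = new ConcurrentHashMap<>();

    void track(String applyId, String file, long updatedAt) {
        applies.put(applyId, new Entry(file, updatedAt));
        latestByFile.put(file, applyId);
    }

    // Reaps applies older than ttlMs and returns how many were dropped.
    int evictExpired(long nowMs, long ttlMs) {
        int reaped = 0;
        for (Map.Entry<String, Entry> e : applies.entrySet()) {
            String applyId = e.getKey();
            Entry entry = e.getValue();
            if (nowMs - entry.updatedAt > ttlMs && applies.remove(applyId, entry)) {
                // Clear the file index only if it still points at this apply,
                // so a newer apply for the same file keeps its mapping.
                latestByFile.remove(entry.file, applyId);
                reaped++;
            }
        }
        return reaped;
    }

    String latestFor(String file) {
        return latestByFile.get(file);
    }
}
```

The conditional remove avoids a read-then-delete race on the index without any extra locking.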

RuntimeRuleModuleProvider schedules the sweep on the existing reconciler executor
(every 5 min, 1 h TTL). The live DEGRADED->APPLIED re-fence is intentionally NOT
done here: it needs BanyanDB-client access runtime-rule does not hold, and
BanyanDB's own 30s reconcile converges the actual schema regardless — an operator
re-query reflects it.

Tests: SchemaApplyCoordinatorTest +2 (evict reaps old + clears its file index;
newer apply keeps the index when the older is evicted). checkstyle + license clean.
Completes Phase 3 (3a wiring + 3b query surface + 3c eviction).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After a successful structural commit the main broadcasts a new NotifyApplied
admin-internal RPC so peers converge NOW instead of waiting up to one ~30s
refresh tick to notice the new DB row.

- proto: NotifyApplied(NotifyAppliedRequest{catalog,name,content_hash,sender,ts})
  -> NotifyAppliedAck.
- RuntimeRuleClusterServiceImpl.notifyApplied: self-broadcast suppressed; else
  submits a full reconcile (dslManager.tick) to a single daemon executor off the
  gRPC thread and acks immediately. The reconcile is per-file-locked + idempotent
  (unchanged files short-circuit on hash); a lost/failed notify is non-fatal — the
  peer self-converges on its own tick.
- RuntimeRuleClusterClient.broadcastNotifyApplied: best-effort fan-out to non-self
  peers, same sequential-with-deadline transport as Suspend/Resume.
- RuntimeRuleService: on the drained success path, broadcastNotifyApplied with the
  committed content hash (the !drained force-no-change path keeps its Resume).

Design note: the apply correlation rides this POST-commit notify, not Suspend —
Suspend is broadcast before the apply runs, when the apply-id/revision don't yet
exist. Tightens the convergence window without a hard peer->main dependency.

Deferred follow-up: per-node failure breakdown (DSLRuntimeState.failureReason
aggregated into GetApplyStatus) — the status is main-orchestrated today.

Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20), MalFileApplierTest (13),
SchemaApplyCoordinatorTest (10), MainRouterTest green; checkstyle + license clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ete drop fence

Two HIGH issues found reviewing the batched-fence work on this branch:

- NC2: StorageManipulationOpt.runDeferredFence() was not one-shot. A reconciler
  tick reuses ONE opt across every rule file (RuleSync#runOnce), so a later file
  that performed no DDL re-ran the previous file's stale fence, and the cumulative
  revision made each file over-fence on prior files' DDL. runDeferredFence now
  clears the closure and resets the accumulated revision after it runs (in finally,
  so a transport failure still isolates the next file); each file fences on its own
  DDL only. The closure reads getMaxModRevision() during await, so reset happens
  after.

- NC1: BanyanDBIndexInstaller drop fence decided rev>0 vs AwaitSchemaDeleted on
  opt.getMaxModRevision() (cumulative across the shared opt). An earlier create/
  binding revision on the same opt made a tombstone-less primary delete take the
  revision branch and silently skip the deletion barrier. dropTable now captures
  the primary resource's OWN delete revision and threads it to
  fenceOnRevisionOrDeletion, which decides on that value (0 for trace/property,
  whose delete RPCs have no revision variant — those always key-fence). Added
  doFenceOnRevisionValue(client, rev, ctx) as the value-based core.
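
The one-shot behaviour from NC2 can be sketched as below. This is a simplified assumption of the mechanics, not StorageManipulationOpt itself: `DeferredFenceSketch`, `record`, and the `LongConsumer` callback are hypothetical, and the sketch passes the revision into the callback rather than having the closure read it back.

```java
import java.util.function.LongConsumer;

// Hypothetical one-shot deferred fence holder; for illustration only.
final class DeferredFenceSketch {
    private long maxRevision;
    private LongConsumer pendingFence;

    // Each DDL records its revision; only the cumulative max is fenced on.
    void record(long revision, LongConsumer fence) {
        maxRevision = Math.max(maxRevision, revision);
        pendingFence = fence; // latest registration wins
    }

    // Runs the fence once, then clears the closure and the accumulated
    // revision in finally, so a failure still isolates the next file.
    void runDeferredFence() {
        LongConsumer fence = pendingFence;
        try {
            if (fence != null) {
                fence.accept(maxRevision); // revision is still visible during the await
            }
        } finally {
            pendingFence = null;
            maxRevision = 0;
        }
    }

    long maxRevision() {
        return maxRevision;
    }
}
```

When one opt is reused across several rule files, a second `runDeferredFence()` with no new DDL does nothing instead of replaying the previous file's fence.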

Tests: 3 new StorageManipulationOptTest cases (one-shot across files; revision
seen during await then reset; reset even when the fence throws). Changelog: the
batched-fence bullet now describes the one-shot flush and per-delete drop fence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fixes

Addresses the review findings on the apply-status / batched-fence work and adds the
configurable async schema fence the design called for.

Async configurable schema fence (operator decision: "3 min, generous progress"):
- New receiver-runtime-rule.deferredFenceTimeoutSeconds (default 180), carried onto
  StorageManipulationOpt (fenceTimeoutMs; 0 = installer's short 2s default). The REST
  operator apply runs the deferred fence in the BACKGROUND after the durable commit +
  peer resume (fenceRunByCaller + a daemon fence executor), so a slow cluster never
  blocks the apply or holds peers suspended. The reconciler tick keeps the short inline
  2s fence; inline/static/delete fences are unchanged.
- POST /addOrUpdate returns its applyId immediately at ROLLING_OUT; the background fence
  drives FENCING -> APPLIED, or DEGRADED with the lagging data-node ids (fenceLaggards)
  surfaced on the status + proto (new repeated fence_laggards) + JSON. The fence-phase
  listener on the opt lets the installer emit FENCING the instant it starts blocking.

Phase machine (H1/#3): wire transition(DDL) before the apply and transition(ROLLING_OUT)
after persist; re-add FENCING (now a real, observable long wait) and trim the never-emitted
VALIDATING from the enum/proto/toProtoPhase so the contract matches the code.

Content-hash fallback (H2/#4): queryApplyStatus degrades to the durable rule row on
coordinator-miss / main-unreachable / found=false — an ACTIVE row whose hash matches reports
APPLIED (derivedFrom=durable-dao), persist-is-commit.

Notify (#1/#2/M2): main-side NotifyApplied is fire-and-forget off the REST thread; the
commit_deferred (durable) path also notifies; peer-side reconcile nudges coalesce a burst
to one queued tick.

Lows: applyId in the structural_applied envelope (#5); servedBy parity on the local-path
JSON (L1); defensive stripApplyPhasePrefix maps unknown -> UNKNOWN (NC3); provider shutdown
stops the notify + fence executors (L3).

Tests: StorageManipulationOptTest one-shot/reset cases; RuntimeRuleRestHandlerTest +
GuardrailIntegrationTest track the structural apply's 3-arg (fence-opt) overload.
Verified: server-core 306/306, runtime-rule 139/139, checkstyle + license clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…doc link

Closes the review-flagged coverage gap on the new apply-status surfaces, and fixes a
dangling javadoc reference the whole-project javadoc build caught.

- RuntimeRuleClusterServiceImplTest (new): getApplyStatus maps every ApplyPhase to proto
  along the happy path (PENDING→DDL→ROLLING_OUT→FENCING→APPLIED), surfaces DEGRADED with
  the laggard node ids, maps FAILED, returns UNKNOWN/found=false when nothing is tracked,
  and resolves by (catalog,name) when applyId is absent; notifyApplied suppresses a node's
  own broadcast (no reconcile) and schedules an off-thread reconcile for a peer notify.
- MalFileApplierTest: a deferred-fence transport failure rolls the apply back and carries
  the metrics registered before the fence for the caller's rollback set.
- RuntimeRuleRestHandlerTest: queryApplyStatus degrades to the durable rule row when the
  live status is gone (ACTIVE row → APPLIED, derivedFrom=durable-dao), and returns
  UNKNOWN/found=false when neither a live status nor a row exists.
- DSLManager: drop the stale {@link #newDeferredFenceOpt()} (method removed when the REST
  orchestrator took over building the deferred-fence opt) so javadoc resolves.

Verified: server-core 306/306, runtime-rule 139/139 (+ the 10 new cases above), whole-project
checkstyle + compile + javadoc + license all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…) + review fixes

Second review round on the apply-status branch.

HIGH — write-safety regression fixed. The async fence resumed dispatch (finalizeCommit +
peer notify) BEFORE confirming schema propagation, but an un-propagated write is silently
dropped at the data node (CLAUDE.md tip #16), so this lost data during the propagation
window. The fence now GATES the resume: after persist the apply marks FENCING and returns
the applyId immediately, but the background task (fenceThenResume) fences FIRST, then
finalizes the local commit + resumes/notifies peers. Dispatch stays suspended through
FENCING — a clean collection pause, not dropped writes. On a genuine laggard it resumes
anyway after the budget and marks DEGRADED + the laggard ids (a stuck node must not park the
metric forever). Phase order corrected to PENDING → DDL → FENCING → ROLLING_OUT → APPLIED
across the enum, proto, status docs, and changelog. The executor-rejected fallback runs the
fence inline so the suspend bracket never leaks.

MEDIUM — shared-tick opt isolation. runDeferredFence now resets the accumulated revision
unconditionally (even a no-DDL file that registered no closure), so a shared tick opt
isolates each file; documented that a later commit-tail's drop revisions are inline-fenced
and benign (the next file's own create revision is monotonically higher and dominates).

LOW — executor shutdown. Added RuntimeRuleClusterServiceImpl.shutdown() (parity with
RuntimeRuleService); corrected both shutdown() docs to state the framework's ModuleProvider
has no stop hook, so they are daemon-thread + best-effort + for test teardown — not the
prior false "called from provider shutdown".

MEDIUM — MAL Elvis. Documented the eager-fallback as a known limitation at the codegen site
(Javassist cannot lazy-eval without a Supplier-companion pass; real MAL fallbacks are pure,
cheap reads so it is benign) rather than risk a codegen rewrite.

Removed the now-redundant FencePhaseListener machinery (FENCING is marked synchronously
before scheduling). Tests: StorageManipulationOptTest no-closure-reset case; the cluster
test walks the corrected phase order. Verified: server-core 9/9, runtime-rule 149/149,
whole-project checkstyle + compile + javadoc + license all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rash-safe)

Third review round. Root cause of the crash-recovery holes: the apply persisted the
rule row BEFORE the background fence, so "durable" did not imply "schema propagated".
A main crash after persist but before the fence left a durable row that peers then
converged to via the periodic scan — which uses withoutSchemaChange (no fence) — so
they resumed dispatch against a schema no node had confirmed propagated, and
un-propagated writes are silently dropped (CLAUDE.md tip #16).

Fix: reorder to suspend -> DDL -> fence -> persist -> commit -> resume. persistRuleSync
moves into the background tail (fenceThenPersistThenResume), AFTER the fence. Now any
durable rule row is guaranteed fence-confirmed, so peer / crash-recovery convergence is
always safe; a crash before persist leaves no row (the orphaned measure from the DDL is
inert) and the cluster stays on the prior, already-fenced content. The HTTP call returns
applyId at FENCING = accepted, not yet durable; the operator polls for the rest.

Also fixes the commit-tail bug (#3): finalizeCommit throwing left drained=false and
fell into broadcastResume (peers ran the OLD bundle) — but the row is durable, so peers
must converge to it. Now `commitFailure != null || drained` -> broadcastNotifyApplied;
only a genuine no-change (force re-apply) does Resume. Persist failure -> FAILED status
(rolled back); fence laggard / commit-tail -> DEGRADED; all polled (the POST already
returned at FENCING).

Docs synced to fence-then-persist (#5): the concept doc runtime-rule-hot-update.md
(structural-path + schema-fence + failure-handling sections), the admin-API doc, the
changelog, application.yml, and the ApplyPhase javadoc.

Verified: runtime-rule 149/149, whole-project checkstyle + compile + javadoc + license clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…kill

Documents the after-merge sync + feature-branch deletion flow (prune, ff-only
master, verify the squash landed, then -D), with the note that `git branch -d`
reporting "not fully merged" is expected for SkyWalking's squash-merges.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atus)

/delete?mode=revertToBundled now begins a SchemaApplyCoordinator apply, returns 200 reverted_to_bundled + applyId immediately, and runs the revert pipeline + row delete on the fence executor, mapping the orchestrator outcome to a terminal phase. Mirrors the structural /addOrUpdate async model so a UI can poll /runtime/rule/status for the same progress. Precondition rejections (inactivate-first / no_bundled_twin / requires_revert_to_bundled) stay synchronous; revert-pipeline failures surface as /status FAILED. Revert never broadcasts Suspend, so the background tail needs no Resume.
A structural addOrUpdate now returns immediately at FENCING and persists the
rule row in the BACKGROUND (fence -> persist -> commit -> resume), so the old
read-back right after the 2xx raced the background persist and saw an empty
/list (the Storage Elasticsearch job failed at Phase 1 CREATE). This is
backend-independent: the persist itself is deferred, so it bit ES even though
ES has synchronous DDL.

- mal-storage/runtime-rule-flow.sh: add await_apply_terminal, which polls the
  new GET /runtime/rule/status?applyId=&catalog=&name= surface to a terminal
  phase (APPLIED/DEGRADED = durable; FAILED = fail) after every structural
  post_rule, and give list_row an optional contentHash-advance gate so the
  Phase 2/3 hash assertions wait for the new content rather than reading the
  stale pre-apply hash (status stays ACTIVE across an update). swctl has no
  runtime-rule `status` subcommand, so the poll goes through curl.
- lal/lal-flow.sh: make the single-shot list reads poll - await_status for the
  NEW (async) log-mal and LAL v1 applies, await_hash_changed for the v1->v2
  swap.
- cluster/cluster-flow.sh: Phase 2's local hash read had the same
  status-unchanged-ACTIVE stale-hash race; add await_hash_change (which also
  surfaces lastApplyError on timeout).
wankai123 previously approved these changes Jun 16, 2026
RuntimeRuleModuleConfig gained deferredFenceTimeoutSeconds (default 180); the
storage config-dump e2e compares the full config map, so the golden file needs
the new key (sorted before refreshRulesPeriod). Value matches the code default.
@wankai123 wankai123 merged commit fe6f04e into master Jun 16, 2026
648 of 658 checks passed
@wankai123 wankai123 deleted the fix/runtime-rule-schema-cache-self-heal branch June 16, 2026 03:57
Labels

backend OAP backend related. bug Something isn't working and you are sure it's a bug!

2 participants