apache · wankai123 · Jun 16, 2026 · Jun 14, 2026 · Jun 14, 2026 · Jun 14, 2026
diff --git a/.claude/skills/gh-pull-request/SKILL.md b/.claude/skills/gh-pull-request/SKILL.md
@@ -150,3 +150,35 @@ EOF
 - Add `copilot` as a reviewer: `gh pr edit <number> --add-reviewer copilot`
 - Do NOT add AI assistant as co-author. Code responsibility is on the committer's hands.
 - Return the PR URL when done.
+
+## After the PR is merged
+
+Once the PR is merged, sync the default branch and clean up the feature branch:
+
+```bash
+# 1. Prune stale remote refs. GitHub auto-deletes the PR's branch on merge, so
+#    the remote feature branch is usually already gone; --prune removes the
+#    dangling local tracking ref.
+git fetch origin --prune
+
+# 2. Switch back to the default branch and fast-forward it to include the merge.
+git checkout master
+git pull --ff-only origin master
+
+# 3. Confirm the change actually landed in master before deleting anything —
+#    `git log --oneline -1` should show the merge/squash commit with the PR
+#    number, or grep for a symbol the PR introduced.
+git log --oneline -1
+
+# 4. Delete the local feature branch. SkyWalking SQUASH-merges PRs, so the
+#    feature branch's commit is NOT an ancestor of master (master gets a new
+#    squash commit instead). `git branch -d` therefore reports "not fully
+#    merged" — that is expected, not an error. After confirming the content is
+#    in master (step 3), force-delete:
+git branch -d <branch> 2>/dev/null || git branch -D <branch>
+```
+
+Notes:
+- `git branch -d` failing with "not fully merged" on a squash-merged PR is normal — the squash commit has a different SHA than the feature commit. Verify via step 3, then `-D`.
+- If the remote branch was not auto-deleted (some repo settings), remove it explicitly: `git push origin --delete <branch>`.
+- Do NOT skip step 3. Force-deleting a local branch whose work didn't actually merge loses it.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -328,7 +328,7 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
 13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
 14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
 15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
-16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
+16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: schema no longer lives in etcd — etcd was removed. It is now stored as property documents in BanyanDB's own `_schema` store and propagated to each data node's in-memory cache **asynchronously** (a `SchemaUpdateService.WatchSchemas` gRPC stream + a 30s reconcile sync, see `banyand/metadata/schema/property/client.go`). DDL `Create` / `Update` / `Delete` broadcast only to the schema-server (`ROLE_META`) nodes and **return immediately with a `mod_revision`** — now a client-stamped `time.Now().UnixNano()` timestamp, NOT an etcd index (which is why it still changes on every content-changing DDL; it is `0` on a delete that didn't record a tombstone) — **without waiting for data nodes**. A data write whose schema hasn't propagated yet is dropped at the data-node executor (`cannot find measure definition`, logged + skipped, `banyand/measure/write_standalone.go`), so the fence is still required — arguably more than under the old etcd-watch model. After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks (liaison-side, all alive nodes, bounded by the timeout) until every data node's `notifiedModRevision` watermark reaches `maxRev`, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom predates the `SchemaBarrierService` proto and has been replaced — do not reintroduce it. The `SchemaWatcher` class is in OAP's **in-tree** `library-banyandb-client` (`org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher`); the standalone `skywalking-banyandb-java-client` repo does NOT have these symbols, so don't conclude the API is gone from grepping it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
 
 ## Analysis and Design Principles
 

diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md
@@ -242,6 +242,12 @@
   admin-host only" entry above for the public REST retirement.
 
 #### OAP Server
+* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence).
+* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart.
+* Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable.
+* Fix BanyanDB peer nodes permanently flooding `<metric> is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod.
+* Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name.
+* Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers.
 * SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model (requires BanyanDB 0.11+). A BanyanDB cluster is modeled as one `Service`, each container as a `ServiceInstance` (role/tier as attributes), and each storage group as an `Endpoint`. The `otel-rules/banyandb/` rules are category-separated by role (`node_*` / `liaison_*` / `data_*` / `lifecycle_*`) and by data type (`measure_*` / `stream_*` / `trace_*` / `property_*`), mirroring the upstream FODC-proxy Grafana boards, and include queue batch/message granularity (apache/skywalking-banyandb#1169). Adds a `SERVICE_INSTANCE_RELATION` MAL scope and `serviceInstanceRelation(...)` builder powering a new intra-cluster pod-to-pod deployment topology (`banyandb-instance-relation.yaml`). The stale single-node `host_name` model is removed.
 * Runtime MAL/LAL hot-update rules can declare `layerDefinitions:` to introduce new
   layers. Ordinals are operator-pinned in the `100_000+` tier; the layer is