Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
4598124
Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision …
wu-sheng Jun 14, 2026
a684d61
CI: bump GHA third-party actions to ASF-approved v4 SHAs
wu-sheng Jun 14, 2026
0032d96
Fix MALExpressionExecutionTest isolation after CounterWindow key change
wu-sheng Jun 14, 2026
18bcf0f
Couple BanyanDB local schema-cache to worker lifecycle; refresh tip #16
wu-sheng Jun 15, 2026
512bab9
Merge branch 'master' into fix/runtime-rule-schema-cache-self-heal
wu-sheng Jun 15, 2026
24e7d74
Batch the BanyanDB schema fence per runtime-rule apply (Phase 1)
wu-sheng Jun 15, 2026
6a340f6
Add SchemaApplyCoordinator apply-status state machine (Phase 2)
wu-sheng Jun 15, 2026
e58326c
Wire apply lifecycle into SchemaApplyCoordinator (Phase 3a)
wu-sheng Jun 15, 2026
4f684f8
Add GetApplyStatus RPC + /runtime/rule/status query surface (Phase 3b)
wu-sheng Jun 15, 2026
bd10ea9
Add apply-status TTL eviction sweep (Phase 3c)
wu-sheng Jun 15, 2026
7cb9b60
Push runtime-rule convergence to peers on commit (Phase 4)
wu-sheng Jun 15, 2026
6c6325f
Fix runtime-rule fence correctness: one-shot deferred flush + per-del…
wu-sheng Jun 15, 2026
c2ea08b
Runtime-rule apply status: async 3-min schema fence + review-finding …
wu-sheng Jun 15, 2026
8e602b4
Add tests for the runtime-rule query/notify/fence surfaces + fix java…
wu-sheng Jun 15, 2026
a6fd9f9
Make the runtime-rule schema fence gate dispatch resume (write-safety…
wu-sheng Jun 15, 2026
5b1a5a0
Runtime-rule apply: fence before persist so durable implies fenced (c…
wu-sheng Jun 15, 2026
6982709
docs(skill): add post-merge branch-cleanup steps to gh-pull-request s…
wu-sheng Jun 15, 2026
a4c2696
Track runtime-rule revert-to-bundled as an async apply (applyId + /st…
wu-sheng Jun 16, 2026
254d763
Adopt the async runtime-rule apply mode in the e2e flow scripts
wu-sheng Jun 16, 2026
98389bd
Add deferredFenceTimeoutSeconds to the e2e config-dump golden file
wu-sheng Jun 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 32 additions & 0 deletions .claude/skills/gh-pull-request/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -150,3 +150,35 @@ EOF
- Add `copilot` as a reviewer: `gh pr edit <number> --add-reviewer copilot`
- Do NOT add AI assistant as co-author. Code responsibility is on the committer's hands.
- Return the PR URL when done.

## After the PR is merged

Once the PR is merged, sync the default branch and clean up the feature branch:

```bash
# 1. Prune stale remote refs. GitHub auto-deletes the PR's branch on merge, so
# the remote feature branch is usually already gone; --prune removes the
# dangling local tracking ref.
git fetch origin --prune

# 2. Switch back to the default branch and fast-forward it to include the merge.
git checkout master
git pull --ff-only origin master

# 3. Confirm the change actually landed in master before deleting anything —
# `git log --oneline -1` should show the merge/squash commit with the PR
# number, or grep for a symbol the PR introduced.
git log --oneline -1

# 4. Delete the local feature branch. SkyWalking SQUASH-merges PRs, so the
# feature branch's commit is NOT an ancestor of master (master gets a new
# squash commit instead). `git branch -d` therefore reports "not fully
# merged" — that is expected, not an error. After confirming the content is
# in master (step 3), force-delete:
git branch -d <branch> 2>/dev/null || git branch -D <branch>
```

Notes:
- `git branch -d` failing with "not fully merged" on a squash-merged PR is normal — the squash commit has a different SHA than the feature commit. Verify via step 3, then `-D`.
- If the remote branch was not auto-deleted (some repo settings), remove it explicitly: `git push origin --delete <branch>`.
- Do NOT skip step 3. Force-deleting a local branch whose work didn't actually merge loses it.
2 changes: 1 addition & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -328,7 +328,7 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all
13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup).
14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly.
15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines.
16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.
16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: schema no longer lives in etcd — etcd was removed. It is now stored as property documents in BanyanDB's own `_schema` store and propagated to each data node's in-memory cache **asynchronously** (a `SchemaUpdateService.WatchSchemas` gRPC stream + a 30s reconcile sync, see `banyand/metadata/schema/property/client.go`). DDL `Create` / `Update` / `Delete` broadcast only to the schema-server (`ROLE_META`) nodes and **return immediately with a `mod_revision`** — now a client-stamped `time.Now().UnixNano()` timestamp, NOT an etcd index (which is why it still changes on every content-changing DDL; it is `0` on a delete that didn't record a tombstone) — **without waiting for data nodes**. A data write whose schema hasn't propagated yet is dropped at the data-node executor (`cannot find measure definition`, logged + skipped, `banyand/measure/write_standalone.go`), so the fence is still required — arguably more than under the old etcd-watch model. After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks (liaison-side, all alive nodes, bounded by the timeout) until every data node's `notifiedModRevision` watermark reaches `maxRev`, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom predates the `SchemaBarrierService` proto and has been replaced — do not reintroduce it. The `SchemaWatcher` class is in OAP's **in-tree** `library-banyandb-client` (`org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher`); the standalone `skywalking-banyandb-java-client` repo does NOT have these symbols, so don't conclude the API is gone from grepping it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence.

## Analysis and Design Principles

Expand Down
6 changes: 6 additions & 0 deletions docs/en/changes/changes.md
Original file line number Diff line number Diff line change
Expand Up @@ -242,6 +242,12 @@
admin-host only" entry above for the public REST retirement.

#### OAP Server
* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence).
* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart.
* Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable.
* Fix BanyanDB peer nodes permanently flooding `<metric> is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod.
* Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name.
* Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers.
* SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model (requires BanyanDB 0.11+). A BanyanDB cluster is modeled as one `Service`, each container as a `ServiceInstance` (role/tier as attributes), and each storage group as an `Endpoint`. The `otel-rules/banyandb/` rules are category-separated by role (`node_*` / `liaison_*` / `data_*` / `lifecycle_*`) and by data type (`measure_*` / `stream_*` / `trace_*` / `property_*`), mirroring the upstream FODC-proxy Grafana boards, and include queue batch/message granularity (apache/skywalking-banyandb#1169). Adds a `SERVICE_INSTANCE_RELATION` MAL scope and `serviceInstanceRelation(...)` builder powering a new intra-cluster pod-to-pod deployment topology (`banyandb-instance-relation.yaml`). The stale single-node `host_name` model is removed.
* Runtime MAL/LAL hot-update rules can declare `layerDefinitions:` to introduce new
layers. Ordinals are operator-pinned in the `100_000+` tier; the layer is
Expand Down
Loading
Loading