diff --git a/.claude/skills/gh-pull-request/SKILL.md b/.claude/skills/gh-pull-request/SKILL.md index 657cd50c42e3..f26e76ad202d 100644 --- a/.claude/skills/gh-pull-request/SKILL.md +++ b/.claude/skills/gh-pull-request/SKILL.md @@ -150,3 +150,35 @@ EOF - Add `copilot` as a reviewer: `gh pr edit --add-reviewer copilot` - Do NOT add AI assistant as co-author. Code responsibility is on the committer's hands. - Return the PR URL when done. + +## After the PR is merged + +Once the PR is merged, sync the default branch and clean up the feature branch: + +```bash +# 1. Prune stale remote refs. GitHub auto-deletes the PR's branch on merge, so +# the remote feature branch is usually already gone; --prune removes the +# dangling local tracking ref. +git fetch origin --prune + +# 2. Switch back to the default branch and fast-forward it to include the merge. +git checkout master +git pull --ff-only origin master + +# 3. Confirm the change actually landed in master before deleting anything — +# `git log --oneline -1` should show the merge/squash commit with the PR +# number, or grep for a symbol the PR introduced. +git log --oneline -1 + +# 4. Delete the local feature branch. SkyWalking SQUASH-merges PRs, so the +# feature branch's commit is NOT an ancestor of master (master gets a new +# squash commit instead). `git branch -d` therefore reports "not fully +# merged" — that is expected, not an error. After confirming the content is +# in master (step 3), force-delete: +git branch -d 2>/dev/null || git branch -D +``` + +Notes: +- `git branch -d` failing with "not fully merged" on a squash-merged PR is normal — the squash commit has a different SHA than the feature commit. Verify via step 3, then `-D`. +- If the remote branch was not auto-deleted (some repo settings), remove it explicitly: `git push origin --delete `. +- Do NOT skip step 3. Force-deleting a local branch whose work didn't actually merge loses it. diff --git a/CLAUDE.md b/CLAUDE.md index 7503cdf517e8..99a004e7964e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -328,7 +328,7 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all 13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... } catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup). 14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly. 15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines. -16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence. +16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: schema no longer lives in etcd — etcd was removed. It is now stored as property documents in BanyanDB's own `_schema` store and propagated to each data node's in-memory cache **asynchronously** (a `SchemaUpdateService.WatchSchemas` gRPC stream + a 30s reconcile sync, see `banyand/metadata/schema/property/client.go`). DDL `Create` / `Update` / `Delete` broadcast only to the schema-server (`ROLE_META`) nodes and **return immediately with a `mod_revision`** — now a client-stamped `time.Now().UnixNano()` timestamp, NOT an etcd index (which is why it still changes on every content-changing DDL; it is `0` on a delete that didn't record a tombstone) — **without waiting for data nodes**. A data write whose schema hasn't propagated yet is dropped at the data-node executor (`cannot find measure definition`, logged + skipped, `banyand/measure/write_standalone.go`), so the fence is still required — arguably more than under the old etcd-watch model. After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks (liaison-side, all alive nodes, bounded by the timeout) until every data node's `notifiedModRevision` watermark reaches `maxRev`, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom predates the `SchemaBarrierService` proto and has been replaced — do not reintroduce it. The `SchemaWatcher` class is in OAP's **in-tree** `library-banyandb-client` (`org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher`); the standalone `skywalking-banyandb-java-client` repo does NOT have these symbols, so don't conclude the API is gone from grepping it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence. ## Analysis and Design Principles diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index c11a51d92582..0de690baf9c7 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,6 +242,12 @@ admin-host only" entry above for the public REST retirement. #### OAP Server +* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). +* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. +* Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. +* Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. +* Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. +* Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. * SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model (requires BanyanDB 0.11+). A BanyanDB cluster is modeled as one `Service`, each container as a `ServiceInstance` (role/tier as attributes), and each storage group as an `Endpoint`. The `otel-rules/banyandb/` rules are category-separated by role (`node_*` / `liaison_*` / `data_*` / `lifecycle_*`) and by data type (`measure_*` / `stream_*` / `trace_*` / `property_*`), mirroring the upstream FODC-proxy Grafana boards, and include queue batch/message granularity (apache/skywalking-banyandb#1169). Adds a `SERVICE_INSTANCE_RELATION` MAL scope and `serviceInstanceRelation(...)` builder powering a new intra-cluster pod-to-pod deployment topology (`banyandb-instance-relation.yaml`). The stale single-node `host_name` model is removed. * Runtime MAL/LAL hot-update rules can declare `layerDefinitions:` to introduce new layers. Ordinals are operator-pinned in the `100_000+` tier; the layer is diff --git a/docs/en/concepts-and-designs/runtime-rule-hot-update.md b/docs/en/concepts-and-designs/runtime-rule-hot-update.md index 4abf49f7dd1f..48ad38202433 100644 --- a/docs/en/concepts-and-designs/runtime-rule-hot-update.md +++ b/docs/en/concepts-and-designs/runtime-rule-hot-update.md @@ -220,29 +220,51 @@ Two paths, picked from the diff between the new content and the current entry: the same fast path. No cluster pause, no backend schema change, no alarm reset. - **Structural path** — anything that moves metric identity (metric set added or removed, scope or downsampling function changed, LAL `(layer, ruleName)` set - changed). The main runs: - 1. **Pause the cluster** — broadcast a pause to every peer over the cluster bus. + changed). The main runs, in order: + 1. **Pause the cluster** — self-suspend first (so a concurrent peer apply is + detected, not merged), then broadcast a pause to every peer over the cluster bus. Peers stop dispatching samples for the affected metrics and drain in-flight batches. Unreachable peers are logged and skipped; they self-recover via the periodic scan. - 2. **Update backend storage on this node**, including the schema-visibility fence on - BanyanDB (see below). - 3. **Persist the entry** — this is the cluster-wide commit point. - 4. **Resume the cluster** — broadcast a resume so peers re-open dispatch. Peers - that missed the resume self-heal within 60 s. - 5. **Reset alarm windows** for any metric whose identity changed, so accumulated - state doesn't carry across the change. - -If any step before persist fails, the entry is **not** advanced, the local node -rolls back to the previous rule state, peers self-heal back to the old content within -60 s, and the operator gets `HTTP 500` with `applyStatus` indicating the failure. - -If persist itself fails, the same rollback happens — the durable state never moved, -so neither does the cluster. - -If persist succeeds but the local finishing step fails (a rare path), the operator -gets `HTTP 500 commit_deferred`: storage holds the new content (peers will converge -on it), but this node hasn't fully applied it yet and will retry on its next scan. + 2. **Fire the backend DDL on this node** — create / update the BanyanDB measures + for the metrics this rule produces. This returns the schema revision but does + **not** yet make the rule the cluster's truth. + + At this point the HTTP call returns `applyId` with phase `FENCING` — **accepted, not + yet durable** — and the remaining steps run in the background so a slow data node + never blocks the operator. The operator polls `GET /runtime/rule/status` (see the + admin-API doc) to watch the rest: + 3. **Schema-visibility fence** — wait (up to a configurable budget, default 3 min) + for every BanyanDB data node to apply the new measure schema (see below). + 4. **Persist the entry** — this is the cluster-wide commit point, and it happens + **after** the fence, so a durable entry always implies "schema propagated + cluster-wide". + 5. **Finalize + resume** — finalize the local commit (swap the bundle, reset alarm + windows for any metric whose identity changed, unpark local dispatch) and + resume / notify peers so they converge to the new entry. Peers that missed the + notify self-heal within 60 s. + +Because persist (step 4) is gated behind the fence (step 3), the ordering is +**pause → DDL → fence → persist → commit → resume**, and crash recovery is safe at +every point: a crash before persist leaves no entry — peers and the recovered main +stay on the old content (the orphaned measure from the DDL is inert) — and any entry +that *is* durable was fence-confirmed before it was written, so a peer converging to +it via the periodic scan never resumes dispatch against an unpropagated schema. + +Outcomes (all observed by polling `/runtime/rule/status`, not by blocking the HTTP +call, which already returned at `FENCING`): + +- **Pre-DDL error** (compile / verify) — phase `FAILED`; the entry was never advanced + and the cluster keeps serving the prior rule. The HTTP call returns the error + synchronously (it happens before the `FENCING` return). +- **Persist fails** — phase `FAILED`; the local node rolls back to the prior rule and + resumes peers. The durable state never moved, so neither does the cluster. +- **Fence does not confirm within the budget** — phase `DEGRADED` with the lagging + node ids; dispatch resumes anyway (a stuck node must not park the metric forever), + and the schema converges through BanyanDB's own watcher. +- **Local finalize fails after persist** — phase `DEGRADED`: storage holds the new + content and peers converge to it, but this node will retry the local finish on its + next scan. ### Lifecycle @@ -367,36 +389,43 @@ gate — without it the system still converges. ## Schema-visibility fence (BanyanDB) -BanyanDB's distributed mode propagates registry writes from the meta-server to -every data node asynchronously. A naive flow — register the schema, immediately -resume dispatch — has a race: the registry holds the new measure but a data node -may not yet have caught up, so the first sample after the apply lands on an -unprepared node. - -For runtime hot-updates this would mean the operator's `200 OK` could come back -before the cluster's data boundary actually moved. The runtime-rule install path -narrows the gap on a best-effort basis: every BanyanDB schema write returns an -etcd `mod_revision`, and the installer waits — synchronously, before resuming -dispatch, up to a bounded timeout (default 2s) — for every BanyanDB data node -to catch up to the highest revision the apply produced. - -The visible contract for operators is: - -- Between operator request and `200 OK`, all sample dispatch for the affected - metric is paused on every node. In-flight samples are dropped (this is by - design: a structural change means the schema is moving and in-flight data has - no valid landing). -- When all data nodes confirm within the bounded window, the `200 OK` marks the - moment the cluster's data boundary moves: samples written at or after the `200` - use the new shape; samples written before use the old shape. -- When one or more nodes haven't applied within the window, OAP logs a warning - naming the laggards and resumes dispatch anyway. The schema is already - authoritative in etcd, so late nodes apply it asynchronously through their - watcher — until they do, samples landing on those specific nodes for that - metric may be rejected by the local data node briefly. This trades strict - cluster-wide cutover for not wedging an apply behind a single slow node; - operators who need strict behavior should fix the slow node, not loosen the - timeout. +BanyanDB's distributed mode propagates schema writes from the meta-server to every +data node asynchronously. A naive flow — register the schema, immediately resume +dispatch — has a race: the registry holds the new measure but a data node may not +yet have caught up, so a sample written before that node applies the schema is +**silently dropped** at the data node (not retried). The fence is therefore a +write-safety barrier, not just an observability check: it must gate dispatch resume. + +Every BanyanDB schema write returns a `mod_revision`. After firing the DDL the main +waits — on a configurable budget (`deferredFenceTimeoutSeconds`, default 3 min) — for +every data node to catch up to the highest revision the apply produced, and only then +persists the rule entry, finalizes the local commit, and resumes dispatch. The wait +runs in the **background**: the HTTP call has already returned `applyId` at phase +`FENCING`, so a slow data node never blocks the operator, yet nothing durable or +visible happens until the schema is confirmed. + +The contract for operators is: + +- The HTTP call returns at `FENCING` (accepted, not yet durable). Dispatch for the + affected metric stays paused on every node from the pause broadcast through the + fence; this is a clean collection gap, not dropped writes (no node is writing the + new shape yet). In-flight samples drained at pause are dropped by design — a + structural change moves the schema and in-flight data has no valid landing. +- When all data nodes confirm within the budget, the entry is persisted and dispatch + resumes — the cluster's data boundary moves at that moment. The status advances + `FENCING → ROLLING_OUT → APPLIED`. +- When one or more nodes haven't applied within the budget, OAP logs a warning naming + the laggards, persists + resumes **anyway**, and reports `DEGRADED` with the laggard + ids. The schema is already authoritative, so late nodes apply it through their own + watcher; until they do, samples landing on those specific nodes may be dropped + briefly. This trades strict cluster-wide cutover for not wedging an apply behind a + single slow node; operators who need strict behavior fix the slow node. + +Because persist is gated behind the fence, the fence's guarantee survives a main +crash: a durable entry is always one whose schema was confirmed propagated, so a peer +converging to it (via the periodic scan, which performs no backend RPC of its own) +never resumes against an unpropagated schema. A crash before persist simply leaves no +entry, and the cluster stays on the prior, already-fenced content. Elasticsearch and JDBC don't have multi-node schema fan-out; their storage change is visible when the call returns, so the fence is a no-op for those backends. @@ -413,23 +442,29 @@ recovery path is the same path operators already use. storage_change_requires_explicit_approval`. No pause broadcast, no persist, no side effects. Re-push with `?allowStorageChange=true` if the change is intentional. -- **Backend storage verification failed mid-apply** — `HTTP 500 ddl_verify_failed`. - Newly added metrics are rolled back so the backend doesn't accumulate orphans; the prior +- **Backend storage verification failed mid-apply** — `HTTP 400 ddl_verify_failed` + (this happens during the synchronous DDL step, before the `FENCING` return). Newly + added metrics are rolled back so the backend doesn't accumulate orphans; the prior rule keeps serving every metric that wasn't being added or reshaped. - `lastApplyError` on `/runtime/rule/list` carries the failure message. -- **Persist failed** — `HTTP 500 persist_failed`. Local state is rolled back to - the pre-apply rule; peers self-heal within 60 s. The cluster never advanced - past the failure. -- **Persist succeeded but the local finishing step failed** — `HTTP 500 commit_deferred`. - Storage is authoritative (peers will converge), but this node will retry on - its next periodic scan. - **Cluster routing fail-safe** — `HTTP 421 cluster_view_split` when a forwarded request reaches a node that also doesn't believe it's the main. Wait for the peer-list to settle (seconds) and retry. -`GET /runtime/rule/list` is the canonical operator view of cluster state: persisted -status, per-node `localState`, and `lastApplyError` for any rule whose most recent -apply failed. There is no separate alert channel — `/list` plus the OAP log are +The errors above are returned synchronously, before the call returns `applyId` at +`FENCING`. The outcomes of the background tail (fence → persist → commit → resume) are +observed by polling `GET /runtime/rule/status?applyId=…`, not by an HTTP code: + +- **Fence didn't confirm within the budget** — phase `DEGRADED` with the lagging node + ids; dispatch resumed anyway, schema converges via the data nodes' watcher. +- **Persist failed** — phase `FAILED`; local state rolled back to the pre-apply rule, + peers self-heal within 60 s. The cluster never advanced (no durable entry). +- **Local finishing step failed after persist** — phase `DEGRADED`; storage is + authoritative (peers converge), this node retries on its next periodic scan. + +`GET /runtime/rule/list` is the canonical operator view of cluster state (persisted +status, per-node `localState`, `lastApplyError`); `GET /runtime/rule/status` reports a +specific apply's live phase / laggards (and degrades to the durable entry when the +apply-id is gone). There is no separate alert channel — those two plus the OAP log are the entire diagnostic surface. ## Dynamic layers diff --git a/docs/en/setup/backend/admin-api/runtime-rule.md b/docs/en/setup/backend/admin-api/runtime-rule.md index 4f056fb543f1..b0a8850ded8e 100644 --- a/docs/en/setup/backend/admin-api/runtime-rule.md +++ b/docs/en/setup/backend/admin-api/runtime-rule.md @@ -111,6 +111,7 @@ server returns `400 compile_failed`. | GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to true; `false` omits each YAML body. Each item includes whether an operator override exists. | | GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. | | GET | `/runtime/rule/dump[/]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. | +| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `FENCING`, `ROLLING_OUT`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. The `/addOrUpdate` response returns at `FENCING` with the `applyId` — **accepted, not yet durable**: the DDL has fired but the rule row is persisted only after the fence confirms. Dispatch for that rule stays paused while the background fence waits for cluster-wide schema propagation (an un-propagated write is silently dropped); once it confirms, the row is persisted (the durable commit point) and the apply advances to `ROLLING_OUT` (finalize commit, resume + notify peers) and `APPLIED`. A persist failure → `FAILED` (rolled back, nothing durable); a laggard → `DEGRADED` with `fenceLaggards` (data-node ids that did not confirm within `deferredFenceTimeoutSeconds`), resuming dispatch anyway. Because the row is persisted only after the fence, any durable row is propagation-confirmed — so when the live status is gone (or the main is unreachable) the query safely degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. | ### `/delete` storage semantics — per backend diff --git a/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java index 1d0f22a1ec18..296be78fb5b3 100644 --- a/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java +++ b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java @@ -198,33 +198,27 @@ private void executeWithInput( final Map inputSection, final Map expectedSection) { final String metricName = rule.getName(); - // Unique per file+rule to isolate CounterWindow entries across files - final String cwMetricName = rule.getSourceFile().getName() + "/" + metricName; final String expression = rule.getFullExpression(); final boolean hasIncrease = expression.contains(".increase(") || expression.contains(".rate("); - // v2 prime + v2 real (also consecutive, same delta) + // rate()/increase() resolve their lower bound from the process-wide + // CounterWindow.INSTANCE, keyed by each counter's own (name, labels) — not + // the rule-level metric name. Input counter names recur across rules and + // files, so without a reset one rule's prime/real pair would rate against + // another rule's leftover window samples. Clear it so each rule is isolated + // to its own prime (t0) + real (t0+2s) pair. + org.apache.skywalking.oap.meter.analyzer.v2.dsl.counter.CounterWindow.INSTANCE.reset(); + + // v2 prime + v2 real (consecutive scrapes 2 s apart, same delta) final Map v2Data; if (hasIncrease) { try { - final Map primeData = - buildV2MockDataFromInput(inputSection, 0.5); - for (final org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily s : primeData.values()) { - if (s != org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily.EMPTY) { - s.context.setMetricName(cwMetricName); - } - } - v2MalExpr.run(primeData); + v2MalExpr.run(buildV2MockDataFromInput(inputSection, 0.5)); } catch (Exception ignored) { } } v2Data = buildV2MockDataFromInput(inputSection, 1.0); - for (final org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily s : v2Data.values()) { - if (s != org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily.EMPTY) { - s.context.setMetricName(cwMetricName); - } - } // V2 run org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily v2Sf; diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java index 8b46f68cf91d..e12a6b7fdcf3 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java @@ -300,9 +300,24 @@ void generateClosureExpr(final StringBuilder sb, } else if (expr instanceof MALExpressionModel.ClosureElvisExpr) { final MALExpressionModel.ClosureElvisExpr elvis = (MALExpressionModel.ClosureElvisExpr) expr; - sb.append("java.util.Optional.ofNullable("); + // Groovy `?:` applies the fallback when the primary is falsy (null, + // empty string/container, numeric zero, false), not only when null. + // Keep the primary single-evaluated so expressions such as tags.remove(...) + // do not observe different values between the truth check and result. + // + // KNOWN LIMITATION (eager fallback): elvis(primary, fallback) is a plain method call, + // so Java evaluates BOTH arguments before elvis() runs — the fallback is always + // computed, even when the primary is truthy. This matches Groovy's falsy-SELECTION but + // not its lazy EVALUATION: a fallback that mutates state or is expensive still runs. + // True laziness would need the fallback wrapped in a Supplier (or an inlined ternary + // over a generated temp), which Javassist cannot emit (no lambdas; expression context + // has no statement slot for a temp). Accepted because real MAL fallbacks are pure, + // cheap reads (`tags['x'] ?: tags['y']`, `metricA ?: metricB`) — side-effecting/expensive + // fallbacks do not occur in practice. Revisit via a Supplier-companion codegen pass if + // that ever changes. + sb.append(MALCodegenHelper.RUNTIME_HELPER_FQCN).append(".elvis("); generateClosureExpr(sb, elvis.getPrimary(), paramName, beanMode); - sb.append(").orElse("); + sb.append(", "); generateClosureExpr(sb, elvis.getFallback(), paramName, beanMode); sb.append(")"); } else if (expr instanceof MALExpressionModel.ClosureRegexMatchExpr) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java index 669be6ef7cec..7275e11db459 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java @@ -17,6 +17,9 @@ package org.apache.skywalking.oap.meter.analyzer.v2.compiler.rt; +import java.lang.reflect.Array; +import java.util.Collection; +import java.util.Map; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.Sample; @@ -55,12 +58,9 @@ public static String[][] regexMatch(final String input, final String regex) { return new String[][] {row}; } - /** - * Reverse division: computes {@code numerator / v} for each sample value {@code v}. - * Used by generated code for {@code Number / SampleFamily} expressions. - */ /** * Groovy truth check: {@code null → false}, empty string → {@code false}, + * numeric zero → {@code false}, empty collection/map/array → {@code false}, * {@code Boolean.FALSE → false}, everything else → {@code true}. * Used by generated filter code for standalone expressions in boolean context * (e.g., {@code tags.ApiId || tags.ApiName}). @@ -75,9 +75,29 @@ public static boolean isTruthy(final Object value) { if (value instanceof CharSequence) { return ((CharSequence) value).length() > 0; } + if (value instanceof Number) { + return ((Number) value).doubleValue() != 0.0D; + } + if (value instanceof Collection) { + return !((Collection) value).isEmpty(); + } + if (value instanceof Map) { + return !((Map) value).isEmpty(); + } + if (value.getClass().isArray()) { + return Array.getLength(value) > 0; + } return true; } + public static T elvis(final T primary, final T fallback) { + return isTruthy(primary) ? primary : fallback; + } + + /** + * Reverse division: computes {@code numerator / v} for each sample value {@code v}. + * Used by generated code for {@code Number / SampleFamily} expressions. + */ public static SampleFamily divReverse(final double numerator, final SampleFamily sf) { if (sf == SampleFamily.EMPTY) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java index e3d1aec7b3a7..f07ccc8bf38c 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java @@ -431,7 +431,7 @@ public SampleFamily increase(String range) { Arrays.stream(samples) .map(sample -> sample.increase( range, - context.metricName, + sample.getName(), (lowerBoundValue, unused) -> sample.value - lowerBoundValue )) .toArray(Sample[]::new) @@ -448,7 +448,7 @@ public SampleFamily rate(String range) { Arrays.stream(samples) .map(sample -> sample.increase( range, - context.metricName, + sample.getName(), (lowerBoundValue, lowerBoundTime) -> { final long timeDiff = (sample.timestamp - lowerBoundTime) / 1000; return timeDiff < 1L ? 0.0 : (sample.value - lowerBoundValue) / timeDiff; @@ -466,7 +466,7 @@ public SampleFamily irate() { this.context, Arrays.stream(samples) .map(sample -> sample.increase( - context.metricName, + sample.getName(), (lowerBoundValue, lowerBoundTime) -> { final long timeDiff = (sample.timestamp - lowerBoundTime) / 1000; return timeDiff < 1L ? 0.0 : (sample.value - lowerBoundValue) / timeDiff; diff --git a/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java new file mode 100644 index 000000000000..361a35b9fff0 --- /dev/null +++ b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2.dsl; + +import com.google.common.collect.ImmutableMap; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.skywalking.oap.meter.analyzer.v2.dsl.counter.CounterWindow; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +/** + * Reproduces the BanyanDB liaison_grpc_error_rate fabrication using the EXACT rule expression and the + * real (frozen) counter values scraped from the live demo FODC proxy. Counters never change across the + * simulated scrapes, so every rate term — and the summed result — MUST be 0. Any non-zero output proves + * the CounterWindow key collision: the three distinct error counters reduce to identical labels after + * .sum([...]) and, because the rate keys on the (shared, rule-level) context.metricName instead of each + * counter's own name, they share one CounterWindow slot and rate against each other's values. + */ +public class BanyanDBErrorRateReproTest { + + private static final String GROUP_BY = "['cluster','pod_name','container_name','node_role','node_type']"; + + // Verbatim from otel-rules/banyandb/banyandb-instance.yaml : liaison_grpc_error_rate (value part). + private static final String EXPR = + "(banyandb_liaison_grpc_total_err.sum(" + GROUP_BY + ").rate('PT1M')" + + " + banyandb_liaison_grpc_total_registry_err.sum(" + GROUP_BY + ").rate('PT1M')" + + " + banyandb_liaison_grpc_total_stream_msg_received_err.sum(" + GROUP_BY + ").rate('PT1M')) * 60"; + + @BeforeEach + void resetWindow() { + CounterWindow.INSTANCE.reset(); + } + + private static Sample s(final String name, final double value, final long ts, final String... kv) { + final ImmutableMap.Builder b = ImmutableMap.builder(); + for (int i = 0; i < kv.length; i += 2) { + b.put(kv[i], kv[i + 1]); + } + return Sample.builder().name(name).labels(b.build()).value(value).timestamp(ts).build(); + } + + // The three liaison-1 families, with the real frozen values (total_err=5, registry_err=166, stream=5). + // node_type is intentionally ABSENT on liaison samples, exactly as the FODC proxy exposes them. + private Map scrape(final long ts) { + final String[] common = { + "cluster", "showcase-banyandb", + "pod_name", "demo-banyandb-liaison-1", + "container_name", "liaison", + "node_role", "ROLE_LIAISON", + }; + final List totalErr = new ArrayList<>(); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metadata"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 2, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsMinute"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsHour"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsDay"))); + + final List registryErr = new ArrayList<>(); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsHour"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsMinute"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsDay"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, with(common, "service", "indexRule", "method", "create", "group", "sw_metricsDay"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, with(common, "service", "indexRule", "method", "create", "group", "sw_metricsHour"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, with(common, "service", "indexRule", "method", "create", "group", "sw_metricsMinute"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 2, ts, with(common, "service", "trace", "method", "get", "group", "sw_trace"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 2, ts, with(common, "service", "trace", "method", "get", "group", "sw_zipkinTrace"))); + + final List streamErr = new ArrayList<>(); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 1, ts, with(common, "service", "measure", "method", "write", "group", "sw_metadata"))); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 2, ts, with(common, "service", "trace", "method", "write", "group", "sw_trace"))); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 2, ts, with(common, "service", "stream", "method", "write", "group", "sw_recordsLog"))); + + final Map map = new HashMap<>(); + map.put("banyandb_liaison_grpc_total_err", SampleFamilyBuilder.newBuilder(totalErr.toArray(new Sample[0])).build()); + map.put("banyandb_liaison_grpc_total_registry_err", SampleFamilyBuilder.newBuilder(registryErr.toArray(new Sample[0])).build()); + map.put("banyandb_liaison_grpc_total_stream_msg_received_err", SampleFamilyBuilder.newBuilder(streamErr.toArray(new Sample[0])).build()); + return map; + } + + private static String[] with(final String[] common, final String... extra) { + final String[] out = new String[common.length + extra.length]; + System.arraycopy(common, 0, out, 0, common.length); + System.arraycopy(extra, 0, out, common.length, extra.length); + return out; + } + + @Test + void unchangedCounters_errorRate_mustBeZero() { + final Expression expr = DSL.parse("meter_banyandb_instance_liaison_grpc_error_rate", EXPR); + long ts = 1_700_000_000_000L; + final long step = 10_000L; // 10s scrape, matching the showcase collector + for (int scrape = 0; scrape < 6; scrape++, ts += step) { + final Result result = expr.run(scrape(ts)); + double maxAbs = 0.0; + if (result.isSuccess() && result.getData() != SampleFamily.EMPTY) { + for (final Sample out : result.getData().samples) { + maxAbs = Math.max(maxAbs, Math.abs(out.getValue())); + } + } + // Counters never changed -> error rate MUST be 0 on every scrape. + assertEquals(0.0, maxAbs, 1e-9, + "Unchanged counters must yield 0 error rate, but scrape " + scrape + " produced " + maxAbs); + } + } +} diff --git a/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java new file mode 100644 index 000000000000..08f334551a19 --- /dev/null +++ b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2.dsl; + +import com.google.common.collect.ImmutableMap; +import java.util.Collections; +import java.util.List; +import java.util.Map; +import org.apache.skywalking.oap.meter.analyzer.v2.compiler.rt.MalRuntimeHelper; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Groovy's Elvis `?:` applies the fallback when the primary is falsy — including the empty string. + * The v2 codegen previously emitted Optional.ofNullable(P).orElse(F), which only fires on null, so an + * empty-string primary (e.g. a label that .sum() filled with "" for an absent key) leaked "" instead + * of the fallback. This is the exact mechanism behind BanyanDB liaison instances storing node_type="" + * instead of "n/a". + */ +public class MALElvisFalsyTest { + + private static String tagAfterElvis(final String nodeTypeValue) { + final ImmutableMap labels = nodeTypeValue == null + ? ImmutableMap.of("svc", "s") + : ImmutableMap.of("svc", "s", "node_type", nodeTypeValue); + final SampleFamily sf = SampleFamilyBuilder.newBuilder( + Sample.builder().name("metric").labels(labels).value(1.0).timestamp(1L).build()).build(); + final Expression expr = DSL.parse("test_elvis", + "metric.tag({tags -> tags['nt'] = tags.node_type ?: 'n/a'})"); + final Result r = expr.run(Map.of("metric", sf)); + return r.getData().samples[0].getLabels().get("nt"); + } + + private static String tagAfterSideEffectingElvis(final String nodeTypeValue) { + final SampleFamily sf = SampleFamilyBuilder.newBuilder( + Sample.builder() + .name("metric") + .labels(ImmutableMap.of("svc", "s", "node_type", nodeTypeValue)) + .value(1.0) + .timestamp(1L) + .build()).build(); + final Expression expr = DSL.parse("test_elvis_remove", + "metric.tag({tags -> tags['nt'] = tags.remove('node_type') ?: 'n/a'})"); + final Result r = expr.run(Map.of("metric", sf)); + return r.getData().samples[0].getLabels().get("nt"); + } + + @Test + void emptyStringPrimary_usesFallback() { + assertEquals("n/a", tagAfterElvis(""), "empty-string primary must fall through to 'n/a' (Groovy-falsy)"); + } + + @Test + void absentPrimary_usesFallback() { + assertEquals("n/a", tagAfterElvis(null), "absent (null) primary must fall through to 'n/a'"); + } + + @Test + void nonEmptyPrimary_keptAsIs() { + assertEquals("hot", tagAfterElvis("hot"), "non-empty primary must be kept"); + } + + @Test + void sideEffectingPrimary_evaluatedOnce() { + assertEquals("hot", tagAfterSideEffectingElvis("hot"), + "Elvis must not evaluate the primary twice; tags.remove(...) returns a value only once"); + } + + @Test + void runtimeTruthiness_matchesGroovyFalsyValues() { + assertFalse(MalRuntimeHelper.isTruthy(0)); + assertFalse(MalRuntimeHelper.isTruthy(0.0D)); + assertFalse(MalRuntimeHelper.isTruthy(Collections.emptyList())); + assertFalse(MalRuntimeHelper.isTruthy(Collections.emptyMap())); + assertFalse(MalRuntimeHelper.isTruthy(new String[0])); + assertTrue(MalRuntimeHelper.isTruthy(-1)); + assertTrue(MalRuntimeHelper.isTruthy(List.of("value"))); + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java index 25d1c349a7e8..17cf7a5560ef 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java @@ -39,6 +39,7 @@ import org.apache.skywalking.oap.server.core.classloader.Catalog; import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.AppliedClaims; @@ -168,6 +169,24 @@ public Applied apply(final String yamlContent, final String sourceName, layerRegistry.rollback(appliedClaims); throw new ApplyException("MAL compile failed for " + sourceName, t, Collections.emptySet()); } + // All DDL for this file's metrics is now fired. If the opt deferred its schema fence + // (batched apply via withSchemaChangeDeferredFence), run the single barrier here so the + // whole file waits ONCE instead of one fence per metric/downsampling. A fence timeout is + // a non-fatal WARN inside the closure; only a barrier transport error throws, which + // aborts this apply exactly as an inline per-resource fence would have. + // + // EXCEPTION: when fenceRunByCaller is set (the runtime-rule REST apply), the orchestrator + // runs the fence itself AFTER the durable commit + peer resume, on a background thread, so + // a long (3-min) cluster-propagation wait neither blocks the apply nor holds peers + // suspended. We only fire the DDL here and leave the closure for the caller to run. + if (!storageOpt.isFenceRunByCaller()) { + try { + storageOpt.runDeferredFence(); + } catch (final StorageException e) { + layerRegistry.rollback(appliedClaims); + throw new ApplyException("schema fence failed for " + sourceName, e, metricNames); + } + } return new Applied(rule, convert, metricNames, ruleLoader, appliedClaims); } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java index 3652f17feac4..f2bc2438926b 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java @@ -29,6 +29,10 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.RuntimeRuleClusterServiceGrpc; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendRequest; @@ -113,6 +117,84 @@ private SuspendAck suspendOne(final AdminClusterChannelManager.Peer peer, final } } + /** + * Route a read-only apply-status query to the cluster main (the only node that runs applies + * and holds the status). Returns {@code null} if no main is resolvable or the call fails — + * the REST caller then degrades to a content-hash comparison against the durable DAO row. + */ + public ApplyStatusResponse getApplyStatus(final ApplyStatusRequest request) { + final AdminClusterChannelManager.Peer main = MainRouter.mainPeer(peerChannelManager); + if (main == null) { + log.warn("runtime-rule GetApplyStatus skipped: no cluster main resolvable"); + return null; + } + final ManagedChannel channel = main.getChannel(); + if (channel == null) { + log.warn("runtime-rule GetApplyStatus skipped: main {} channel not yet established", + main.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.getApplyStatus(request); + } catch (final Throwable t) { + log.warn("runtime-rule GetApplyStatus to main {} failed: {}", + main.getAddress(), t.getMessage()); + return null; + } + } + + /** + * Fan out NotifyApplied to every non-self peer after a successful commit so peers converge + * NOW (run a reconcile against the just-persisted DB row) rather than on their next ~30s tick. + * Best-effort, same sequential-with-deadline transport as the others; unreachable peers + * self-converge on their own tick. + */ + public List broadcastNotifyApplied(final String catalog, final String name, + final String contentHash) { + final List peers = peerChannelManager.getPeers(); + final List acks = new ArrayList<>(peers.size()); + for (final AdminClusterChannelManager.Peer peer : peers) { + if (peer.isSelf()) { + continue; + } + final NotifyAppliedAck ack = notifyAppliedOne(peer, catalog, name, contentHash); + if (ack != null) { + acks.add(ack); + } + } + return acks; + } + + private NotifyAppliedAck notifyAppliedOne(final AdminClusterChannelManager.Peer peer, + final String catalog, final String name, + final String contentHash) { + final ManagedChannel channel = peer.getChannel(); + if (channel == null) { + log.warn("runtime-rule NotifyApplied skipped for peer {}: channel not yet established", + peer.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.notifyApplied(NotifyAppliedRequest.newBuilder() + .setCatalog(catalog) + .setName(name) + .setContentHash(contentHash == null ? "" : contentHash) + .setSenderNodeId(selfNodeId) + .setIssuedAtMs(System.currentTimeMillis()) + .build()); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied to peer {} failed for {}/{}: {}", + peer.getAddress(), catalog, name, t.getMessage()); + return null; + } + } + /** * Fan out Resume to every non-self peer. Same transport, same sequential-with-deadline * policy as {@link #broadcastSuspend}. Called by the REST handler's failure branches so diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java index 7b01b6f5d8ea..0924bd7ad41f 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -23,10 +23,18 @@ import io.grpc.stub.StreamObserver; import java.nio.charset.StandardCharsets; import java.util.Objects; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.atomic.AtomicBoolean; import lombok.Setter; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeState; @@ -41,6 +49,9 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleService; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyStatus; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; /** * Server-side handler for the three cluster-internal runtime-rule RPCs — see @@ -97,11 +108,149 @@ public class RuntimeRuleClusterServiceImpl @Setter private volatile RuntimeRuleService runtimeRuleService; + /** Off-RPC-thread runner for notify-triggered reconciles so {@link #notifyApplied} acks + * immediately. Single daemon thread — reconciles are per-file-locked + idempotent, so + * serializing them is fine; daemon so it never blocks JVM shutdown. */ + private final ExecutorService reconcileNudgeExecutor = Executors.newSingleThreadExecutor(r -> { + final Thread t = new Thread(r, "runtime-rule-notify-reconcile"); + t.setDaemon(true); + return t; + }); + + /** Coalesces a burst of NotifyApplied into a single queued reconcile. {@code dslManager.tick()} + * is a full reconcile over every rule file, so when a multi-file apply (or several applies) + * fires many notifies, the first queued tick already converges all of them — the rest would be + * redundant full {@code dao.getAll()} scans. Set on schedule, cleared at the START of the task + * so a notify arriving while a tick runs still queues exactly one follow-up (no lost update). */ + private final AtomicBoolean tickPending = new AtomicBoolean(false); + public RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String selfNodeId) { this.dslManager = dslManager; this.selfNodeId = selfNodeId; } + /** Stop the off-thread reconcile-nudge executor. The framework's {@code ModuleProvider} has no + * stop lifecycle hook, so in production this is not auto-invoked — the executor is a daemon + * thread that never blocks JVM exit. Provided for clean test teardown and for a future module + * shutdown hook; mirrors {@code RuntimeRuleService.shutdown()}. */ + public void shutdown() { + reconcileNudgeExecutor.shutdownNow(); + } + + /** + * Push-notify from the main after a successful commit: converge NOW rather than on the next + * ~30s tick. Runs a full reconcile off the gRPC thread (idempotent, per-file-locked — unchanged + * files short-circuit on hash). Best-effort: the peer self-converges on its own tick if this is + * lost, so a self-broadcast or a schedule failure is non-fatal. + */ + @Override + public void notifyApplied(final NotifyAppliedRequest request, + final StreamObserver responseObserver) { + if (Objects.equals(selfNodeId, request.getSenderNodeId())) { + responseObserver.onNext(NotifyAppliedAck.newBuilder() + .setNodeId(selfNodeId).setAccepted(false) + .setDetail("self-broadcast suppressed").build()); + responseObserver.onCompleted(); + return; + } + boolean accepted = true; + // Not final: assigned in both the try and catch arms of the schedule attempt below. + String detail; + if (tickPending.compareAndSet(false, true)) { + try { + reconcileNudgeExecutor.submit(() -> { + // Clear before running so a notify that arrives during this tick queues exactly + // one follow-up rather than being dropped. + tickPending.set(false); + try { + dslManager.tick(); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied reconcile for {}/{} failed; peer will " + + "self-converge on its next tick: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + }); + detail = "reconcile scheduled"; + } catch (final Throwable t) { + // Submit rejected (executor shut down). Release the flag so the next notify retries. + tickPending.set(false); + accepted = false; + detail = "schedule failed; self-converge on next tick"; + log.warn("runtime-rule NotifyApplied could not schedule reconcile for {}/{}: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + } else { + // A reconcile is already queued; this notify is covered by it. tick() is a full + // cluster-wide reconcile, so no per-file work is lost by coalescing. + detail = "coalesced into pending reconcile"; + } + responseObserver.onNext(NotifyAppliedAck.newBuilder() + .setNodeId(selfNodeId) + .setAccepted(accepted) + .setDetail(detail) + .build()); + responseObserver.onCompleted(); + } + + /** + * Read-only apply-status query, served by the main (only the main runs applies and holds the + * status). Resolves by apply_id when present, else by (catalog, name) gated on content_hash. + * Returns {@code found=false} / {@code APPLY_PHASE_UNKNOWN} when nothing matches — the caller + * falls back to comparing the durable content hash against the DAO row. + */ + @Override + public void getApplyStatus(final ApplyStatusRequest request, + final StreamObserver responseObserver) { + final ApplyStatus status; + if (!request.getApplyId().isEmpty()) { + status = SchemaApplyCoordinator.INSTANCE.get(request.getApplyId()); + } else { + final String hash = request.getContentHash().isEmpty() ? null : request.getContentHash(); + status = SchemaApplyCoordinator.INSTANCE.getLatestByFile( + request.getCatalog(), request.getName(), hash); + } + final ApplyStatusResponse.Builder resp = ApplyStatusResponse.newBuilder().setNodeId(selfNodeId); + if (status == null) { + resp.setFound(false).setPhase(ApplyStatusPhase.APPLY_PHASE_UNKNOWN); + } else { + resp.setFound(true) + .setApplyId(status.getApplyId()) + .setCatalog(status.getCatalog()) + .setName(status.getName()) + .setContentHash(status.getContentHash() == null ? "" : status.getContentHash()) + .setPhase(toProtoPhase(status.getPhase())) + .setFailureReason(status.getFailureReason() == null ? "" : status.getFailureReason()) + .setStartedAtMs(status.getStartedAtMs()) + .setUpdatedAtMs(status.getUpdatedAtMs()); + if (status.getFenceLaggards() != null && !status.getFenceLaggards().isEmpty()) { + resp.addAllFenceLaggards(status.getFenceLaggards()); + } + } + responseObserver.onNext(resp.build()); + responseObserver.onCompleted(); + } + + private static ApplyStatusPhase toProtoPhase(final ApplyPhase phase) { + switch (phase) { + case PENDING: + return ApplyStatusPhase.APPLY_PHASE_PENDING; + case DDL: + return ApplyStatusPhase.APPLY_PHASE_DDL; + case FENCING: + return ApplyStatusPhase.APPLY_PHASE_FENCING; + case ROLLING_OUT: + return ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT; + case APPLIED: + return ApplyStatusPhase.APPLY_PHASE_APPLIED; + case DEGRADED: + return ApplyStatusPhase.APPLY_PHASE_DEGRADED; + case FAILED: + return ApplyStatusPhase.APPLY_PHASE_FAILED; + default: + return ApplyStatusPhase.APPLY_PHASE_UNKNOWN; + } + } + @Override public void suspend(final SuspendRequest request, final StreamObserver responseObserver) { diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java index 58eec1fc78f9..4c4394105e1d 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java @@ -40,4 +40,16 @@ public class RuntimeRuleModuleConfig extends ModuleConfig { * refresh + storage replica lag + RPC jitter. Default 60 s is conservative. */ private long selfHealThresholdSeconds = 60; + /** + * Timeout (seconds) for the runtime-rule deferred/batched BanyanDB schema fence — the wait + * for every data node to apply a newly created/updated measure's schema revision. Applies + * ONLY to the operator REST apply path, where the fence runs in the background after the + * durable commit + peer resume (so it never blocks the apply or holds peers suspended), and + * the operator polls {@code GET /runtime/rule/status} to watch {@code FENCING → APPLIED} or + * {@code DEGRADED} (with the lagging node ids). The inline / static / delete fences keep their + * short 2 s constant. Default 180 s is the multi-node cluster-propagation budget for a large + * reshape; on timeout the apply is reported {@code DEGRADED} (durable + forward-progressing, + * never reverted) with the laggard node list. + */ + private long deferredFenceTimeoutSeconds = 180; } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java index ec7197d5de63..6c50beab2a97 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java @@ -41,6 +41,7 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterServiceImpl; import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; import org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleRestHandler; import org.apache.skywalking.oap.server.telemetry.TelemetryModule; import org.apache.skywalking.oap.server.telemetry.api.TelemetryRelatedContext; @@ -218,6 +219,11 @@ public class RuntimeRuleModuleProvider extends ModuleProvider { * unchanged bundles. */ private static final long SCHEDULER_INITIAL_DELAY_SECONDS = 2L; + /** Retain a tracked apply-status this long after its last update so a post-apply UI poll (and + * a post-refresh content query) still resolves, then reap it to bound memory. */ + private static final long APPLY_STATUS_TTL_MS = 3_600_000L; + /** How often the apply-status eviction sweep runs on the reconciler executor. */ + private static final long APPLY_STATUS_EVICT_INTERVAL_SECONDS = 300L; /** * Env var carrying this OAP's unique per-node identity — the Kubernetes pod UID, injected @@ -271,7 +277,8 @@ public void start() throws ServiceNotProvidedException, ModuleStartException { // RuleEngineRegistry from the per-DSL state maps it owns. dslManager = new DSLManager( getManager(), - moduleConfig.getSelfHealThresholdSeconds() * 1000L + moduleConfig.getSelfHealThresholdSeconds() * 1000L, + moduleConfig.getDeferredFenceTimeoutSeconds() * 1000L ); // Cluster-facing Suspend client: fans out to every non-self peer on the OAP cluster bus @@ -406,6 +413,16 @@ public void notifyAfterCompleted() throws ModuleStartException { ); log.info("Runtime rule dslManager scheduled: first tick in {} s, then every {} s.", SCHEDULER_INITIAL_DELAY_SECONDS, intervalSeconds); + + // Bound the apply-status coordinator's memory: reap tracked applies past the retention + // window (terminal ones linger long enough for a post-apply UI poll; a stale PENDING from + // a missed branch is reaped too — a later query then returns UNKNOWN and the caller falls + // back to the durable content hash). Reuses the same single-thread executor; the sweep is + // O(tracked) and cheap. + reconcilerExecutor.scheduleWithFixedDelay( + () -> SchemaApplyCoordinator.INSTANCE.evictExpired(APPLY_STATUS_TTL_MS), + APPLY_STATUS_EVICT_INTERVAL_SECONDS, APPLY_STATUS_EVICT_INTERVAL_SECONDS, TimeUnit.SECONDS + ); } /** diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java index 102a05ebbe85..c9d2ece6fb6f 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java @@ -150,6 +150,13 @@ public final class DSLManager { @Getter private final long selfHealThresholdMs; + /** Timeout for the runtime-rule deferred/batched BanyanDB schema fence on the operator REST + * apply path (default 3 min). Read via {@link #getDeferredFenceTimeoutMs()} by the REST + * orchestrator, which builds the deferred-fence opt with it + * ({@code RuntimeRuleModuleConfig.deferredFenceTimeoutSeconds}). */ + @Getter + private final long deferredFenceTimeoutMs; + /** Lock-observability wrapper. Owned by the DSLManager; the REST handler borrows via * {@link #getLockMetrics()} so every lock acquire path reports to the same histograms. */ @Getter @@ -183,12 +190,14 @@ public final class DSLManager { private final RuleEngineRegistry engineRegistry; public DSLManager(final ModuleManager moduleManager, - final long selfHealThresholdMs) { + final long selfHealThresholdMs, + final long deferredFenceTimeoutMs) { this.moduleManager = Objects.requireNonNull(moduleManager, "moduleManager"); this.engineRegistry = new RuleEngineRegistry(); this.engineRegistry.register(new MalRuleEngine(this.rules, this.moduleManager)); this.engineRegistry.register(new LalRuleEngine(this.rules, this.moduleManager)); this.selfHealThresholdMs = selfHealThresholdMs; + this.deferredFenceTimeoutMs = deferredFenceTimeoutMs; this.lockMetrics = new LockMetrics(moduleManager); this.suspendCoord = new SuspendResumeCoordinator( @@ -361,7 +370,7 @@ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.Runtim */ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, final boolean deferCommit) { - return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.withSchemaChange()); + return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.withSchemaChangeDeferredFence()); } /** @@ -759,10 +768,10 @@ private StorageManipulationOpt tickStorageOpt(final boolean atBoot) { selfMain = MainRouter.isSelfMain(apm); } catch (final Throwable t) { // Cluster module not wired (embedded / single-process) — always main. - return StorageManipulationOpt.withSchemaChange(); + return StorageManipulationOpt.withSchemaChangeDeferredFence(); } if (selfMain) { - return StorageManipulationOpt.withSchemaChange(); + return StorageManipulationOpt.withSchemaChangeDeferredFence(); } return atBoot ? StorageManipulationOpt.verifySchemaOnly() diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java index ce6af280a0ce..0579e9771576 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java @@ -230,7 +230,7 @@ private Result runRevert(final RuleEngine engine, new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, bundledContent, /* status */ null, /* updateTime */ 0L); final ApplyInputs withSchema = new ApplyInputs( - moduleManager, StorageManipulationOpt.withSchemaChange(), + moduleManager, StorageManipulationOpt.withSchemaChangeDeferredFence(), alarmResetter, rules); final DSLRuntimeApply.Outcome outcome = dslRuntimeApply.apply( bundledFile, Classification.STRUCTURAL, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java index 77084f614fcb..bdb4dd0c5b74 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java @@ -113,6 +113,20 @@ public HttpResponse list(@Param("catalog") @Default("") final String catalog) { return service.list(catalog); } + /** + * Apply-status query for the UI / operator. Served by the cluster main (the service routes + * there when self isn't main). Query by {@code applyId} (the live handle from a just-submitted + * apply) or, once it's gone, by {@code catalog} + {@code name} (+ optional {@code contentHash} + * — the durable identity). Always 200 with a JSON status; {@code found=false} when nothing matches. + */ + @Get("/runtime/rule/status") + public HttpResponse applyStatus(@Param("catalog") @Default("") final String catalog, + @Param("name") @Default("") final String name, + @Param("contentHash") @Default("") final String contentHash, + @Param("applyId") @Default("") final String applyId) { + return service.queryApplyStatus(catalog, name, contentHash, applyId); + } + @Get("/runtime/rule") public HttpResponse get(@Param("catalog") final String catalog, @Param("name") final String name, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 01d645a4e8a3..fb5ad4fb9af9 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -42,6 +42,9 @@ import java.util.Objects; import java.util.Optional; import java.util.Set; +import java.time.Duration; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; import java.util.concurrent.locks.ReentrantLock; import java.util.regex.Pattern; import java.util.zip.GZIPOutputStream; @@ -69,6 +72,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.layer.LayerConflictException; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.MainRouter; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendState; @@ -78,6 +83,9 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyStatus; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; /** @@ -178,6 +186,34 @@ private Set validCatalogs() { */ private volatile AdminClusterChannelManager peerChannelManager; + /** + * Off-REST-thread runner for the best-effort peer {@code NotifyApplied} fan-out. The fan-out is + * sequential with a per-peer deadline, so an unreachable peer would otherwise add + * {@code peerCount × deadline} to the operator's apply latency — the notify is a convergence + * optimization (peers self-converge on their next tick if it's lost), so it must never hold the + * HTTP response. Single daemon thread: applies are serialized per file upstream and the fan-out + * is idempotent, so ordering across applies doesn't matter; daemon so it never blocks JVM + * shutdown. Shut down by {@link #shutdown()}. + */ + private final ExecutorService notifyExecutor = Executors.newSingleThreadExecutor(r -> { + final Thread t = new Thread(r, "runtime-rule-notify-broadcast"); + t.setDaemon(true); + return t; + }); + + /** + * Runs the runtime-rule deferred schema fence in the background after a structural apply's + * durable commit (see {@link #scheduleBackgroundFence}). A small fixed daemon pool so several + * concurrent applies (different files) can each confirm cluster-wide propagation without one + * file's up-to-3-min fence blocking another's status; daemon so it never blocks JVM shutdown. + * Shut down by {@link #shutdown()}. + */ + private final ExecutorService fenceExecutor = Executors.newFixedThreadPool(4, r -> { + final Thread t = new Thread(r, "runtime-rule-bg-fence"); + t.setDaemon(true); + return t; + }); + public RuntimeRuleService(final ModuleManager moduleManager, final DSLManager dslManager, final RuntimeRuleClusterClient clusterClient, @@ -269,6 +305,169 @@ private HttpResponse forwardToMain(final String mainAddr, } } + /** + * Read-only apply-status query for the UI / operator. Served by the main: when self is main + * (or single-process / no cluster client), reads the local {@link SchemaApplyCoordinator}; + * otherwise routes to the main via {@code GetApplyStatus}. Resolve by {@code applyId} (the + * live handle) or by {@code catalog}/{@code name} (+ optional {@code contentHash}) once the + * apply-id is gone (page refresh / main restart). Always 200 with a JSON status; {@code + * found=false} / phase {@code UNKNOWN} when nothing matches (caller compares the durable + * content hash itself). + */ + public HttpResponse queryApplyStatus(final String catalog, final String name, + final String contentHash, final String applyId) { + final AdminClusterChannelManager apm = resolvePeerChannelManager(); + if (apm == null || MainRouter.isSelfMain(apm) || clusterClient == null) { + final ApplyStatus local = applyId.isEmpty() + ? SchemaApplyCoordinator.INSTANCE.getLatestByFile( + catalog, name, contentHash.isEmpty() ? null : contentHash) + : SchemaApplyCoordinator.INSTANCE.get(applyId); + if (local != null) { + return applyStatusJson(local); + } + // Live status gone (apply-id evicted / never began on this main). Fall back to the + // durable rule row for a content-derived answer; only UNKNOWN if even that is absent. + final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + return fromDao != null ? fromDao : applyStatusJson(null); + } + final ApplyStatusResponse remote = clusterClient.getApplyStatus(ApplyStatusRequest.newBuilder() + .setApplyId(applyId).setCatalog(catalog).setName(name).setContentHash(contentHash).build()); + if (remote == null) { + // Main unreachable, or a mixed-version main answered UNIMPLEMENTED. Degrade to the + // durable rule row (shared storage) before giving up with a transport error. + final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + if (fromDao != null) { + return fromDao; + } + return HttpResponse.of(HttpStatus.BAD_GATEWAY, MediaType.JSON_UTF_8, + jsonBody("status_unavailable", catalog, name, + "could not reach the cluster main for apply status; retry shortly")); + } + if (!remote.getFound()) { + // The main reached us but no longer holds a live status (evicted / restarted). The + // durable row is the same truth the main would derive — read it locally. + final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + if (fromDao != null) { + return fromDao; + } + } + return applyStatusJsonFromProto(remote); + } + + /** + * Content-derived apply status from the durable rule row, used when the live coordinator status + * is gone (apply-id evicted, main restarted, or the main is unreachable). The persist-is-commit + * invariant means an {@code ACTIVE} row IS the durable record that the apply of its content + * committed, so a matching {@code ACTIVE} row reports {@link ApplyPhase#APPLIED} (flagged + * {@code derivedFrom=durable-dao}). A hash mismatch means the queried content is not the current + * applied content; an {@code INACTIVE} row means it was paused. Returns {@code null} when the DAO + * is unresolvable or the read fails, so the caller keeps its own unavailable/unknown response. + */ + private HttpResponse applyStatusFromDao(final String catalog, final String name, final String contentHash) { + final RuntimeRuleManagementDAO.RuntimeRuleFile row; + try { + row = currentRuleFile(catalog, name); + } catch (final IOException e) { + log.warn("apply-status DAO fallback read failed for {}/{}; reporting unknown", catalog, name, e); + return null; + } + if (row == null) { + return null; + } + final String rowHash = ContentHash.sha256Hex(row.getContent()); + final boolean active = !RuntimeRule.STATUS_INACTIVE.equals(row.getStatus()); + final boolean hashMatches = contentHash.isEmpty() || contentHash.equals(rowHash); + final JsonObject o = new JsonObject(); + o.addProperty("catalog", catalog); + o.addProperty("name", name); + o.addProperty("contentHash", rowHash); + o.addProperty("derivedFrom", "durable-dao"); + if (active && hashMatches) { + o.addProperty("found", true); + o.addProperty("phase", ApplyPhase.APPLIED.name()); + o.addProperty("note", "live apply status unavailable; derived from the durable rule row " + + "(persist-is-commit: an ACTIVE row means this content's apply committed)"); + } else { + o.addProperty("found", false); + o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + o.addProperty("note", active + ? "the queried content is not the currently applied content (hash mismatch)" + : "the rule row is INACTIVE (paused)"); + } + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + + private HttpResponse applyStatusJson(final ApplyStatus s) { + final JsonObject o = new JsonObject(); + if (s == null) { + o.addProperty("found", false); + o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + } else { + o.addProperty("found", true); + o.addProperty("applyId", s.getApplyId()); + o.addProperty("catalog", s.getCatalog()); + o.addProperty("name", s.getName()); + o.addProperty("contentHash", s.getContentHash()); + o.addProperty("phase", s.getPhase().name()); + if (s.getFailureReason() != null) { + o.addProperty("failureReason", s.getFailureReason()); + } + o.addProperty("startedAtMs", s.getStartedAtMs()); + o.addProperty("updatedAtMs", s.getUpdatedAtMs()); + if (!s.getFenceLaggards().isEmpty()) { + final JsonArray laggards = new JsonArray(); + s.getFenceLaggards().forEach(laggards::add); + o.add("fenceLaggards", laggards); + } + } + // Schema parity with the routed-from-main path (applyStatusJsonFromProto), which always + // carries servedBy: this node answered locally (self is main / single process). + o.addProperty("servedBy", "self"); + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + + private HttpResponse applyStatusJsonFromProto(final ApplyStatusResponse r) { + final JsonObject o = new JsonObject(); + o.addProperty("found", r.getFound()); + if (r.getFound()) { + o.addProperty("applyId", r.getApplyId()); + o.addProperty("catalog", r.getCatalog()); + o.addProperty("name", r.getName()); + o.addProperty("contentHash", r.getContentHash()); + o.addProperty("phase", stripApplyPhasePrefix(r.getPhase().name())); + if (!r.getFailureReason().isEmpty()) { + o.addProperty("failureReason", r.getFailureReason()); + } + o.addProperty("startedAtMs", r.getStartedAtMs()); + o.addProperty("updatedAtMs", r.getUpdatedAtMs()); + if (r.getFenceLaggardsCount() > 0) { + final JsonArray laggards = new JsonArray(); + r.getFenceLaggardsList().forEach(laggards::add); + o.add("fenceLaggards", laggards); + } + } else { + o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + } + o.addProperty("servedBy", r.getNodeId()); + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + + /** Strip the proto {@code APPLY_PHASE_} prefix so the routed-from-main JSON phase matches the + * local-path {@link ApplyPhase} names (e.g. {@code APPLY_PHASE_APPLIED} → {@code APPLIED}). + * Any value that isn't a live {@link ApplyPhase} — the proto default {@code UNSPECIFIED}, a + * reserved/removed phase, or a value from a newer-version peer — maps to {@code UNKNOWN} so the + * JSON phase is always a name the client's enum recognizes. */ + private static String stripApplyPhasePrefix(final String protoName) { + final String prefix = "APPLY_PHASE_"; + final String stripped = protoName.startsWith(prefix) ? protoName.substring(prefix.length()) : protoName; + for (final ApplyPhase p : ApplyPhase.values()) { + if (p.name().equals(stripped)) { + return stripped; + } + } + return ApplyPhase.UNKNOWN.name(); + } + private AdminClusterChannelManager resolvePeerChannelManager() { AdminClusterChannelManager local = peerChannelManager; if (local != null) { @@ -1002,12 +1201,29 @@ private HttpResponse applyStructural(final String catalog, final String name, // removedMetrics, swap appliedMal/appliedContent, retire old loader, alarm reset, // advance snapshot) is stashed in the dslManager — we drain it below once persist // resolves. + // Track this apply in the coordinator so a progress query (and peers, later) can observe + // its outcome. From here every exit path marks a terminal phase (APPLIED / FAILED / + // DEGRADED); a missed branch leaves only a stale PENDING the background watch reaps. + final String applyId = SchemaApplyCoordinator.INSTANCE.begin( + catalog, name, ContentHash.sha256Hex(content)); final long updateTime = System.currentTimeMillis(); final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); final DSLRuntimeState postApply; + // Build the deferred-fence opt ourselves so WE own the post-DDL schema fence. It must run + // AFTER persist but BEFORE dispatch resumes: an un-propagated write is silently dropped at + // the data node (CLAUDE.md tip #16), so the local commit (which swaps the bundle + unparks + // dispatch) and the peer resume/notify must wait for the fence. fenceRunByCaller tells the + // installer to register the fence closure but NOT run it inline — we run it (in the + // background) right before resuming, so the HTTP response is not held for the wait. + final StorageManipulationOpt fenceOpt = StorageManipulationOpt.withSchemaChangeDeferredFence( + Duration.ofMillis(dslManager.getDeferredFenceTimeoutMs())); + fenceOpt.setFenceRunByCaller(true); + // The apply call compiles, verifies, and fires the schema-change DDL (the fence is deferred + // to us); mark DDL before it so an in-flight query sees progress past PENDING. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.DDL); try { - postApply = dslManager.applyNowForRuleFile(ruleFile, true); + postApply = dslManager.applyNowForRuleFile(ruleFile, true, fenceOpt); } catch (final LayerConflictException lce) { // Runtime-DSL layer-declaration conflict — operator-actionable, not a server // error. Resume local + peers immediately so the cluster keeps serving the @@ -1016,6 +1232,7 @@ private HttpResponse applyStructural(final String catalog, final String name, catalog, name, lce.getMessage()); dslManager.getSuspendCoord().localResume(catalog, name); broadcastResume(catalog, name, "layer_conflict"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "layer conflict: " + lce.getMessage()); return badRequest(lce.applyStatus(), catalog, name, lce.getMessage()); } catch (final Throwable t) { // Some exceptions wrap LayerConflictException as cause; unwrap so the operator @@ -1027,6 +1244,7 @@ private HttpResponse applyStructural(final String catalog, final String name, + "(wrapped): {}", catalog, name, wrapped.getMessage()); dslManager.getSuspendCoord().localResume(catalog, name); broadcastResume(catalog, name, "layer_conflict"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "layer conflict: " + wrapped.getMessage()); return badRequest(wrapped.applyStatus(), catalog, name, wrapped.getMessage()); } log.error("runtime-rule STRUCTURAL apply threw for {}/{}", catalog, name, t); @@ -1034,9 +1252,11 @@ private HttpResponse applyStructural(final String catalog, final String name, // Peers went SUSPENDED on our earlier broadcast; let them know the apply // aborted so they flip back to RUNNING within an RPC round-trip. broadcastResume(catalog, name, "apply_threw"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "apply threw: " + t.getMessage()); return serverError("apply_failed", catalog, name, t.getMessage()); } if (postApply != null && postApply.getLastApplyError() != null) { + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, postApply.getLastApplyError()); // Apply failed (DDL verify mismatch, compile surprise, applier exception). Row // is NOT yet persisted. applyOneRuleFile already rolled back its own partial // registration on the exception path; the pendingCommits stash is only @@ -1063,10 +1283,69 @@ private HttpResponse applyStructural(final String catalog, final String name, return serverError("apply_failed", catalog, name, err); } - // Apply succeeded + verified. Commit the row — the design's commit point. Retry a - // couple of times on transient failures before giving up; the per-backend - // RuntimeRuleManagementDAO.save can throw on a brief storage outage. A narrow retry - // here avoids turning a blip into a cluster-divergence event. + // Apply succeeded + verified — DDL fired, the schema fence is deferred to us, and the + // pending commit is stashed (deferCommit). The write-safe tail runs in the BACKGROUND so the + // (up to 3-min) fence wait doesn't block this response: fence → persist → commit → resume. + // Persist (the durable commit point) happens AFTER the fence, so "durable" implies "schema + // propagated cluster-wide": a main crash before persist leaves NO row — peers/crash-recovery + // safely stay on the old content (the orphaned measure from DDL is inert) — and any durable + // row is guaranteed fence-confirmed, so convergence never resumes dispatch against an + // unpropagated schema. The response returns now at FENCING with the applyId; the operator + // polls GET /runtime/rule/status for FENCING → ROLLING_OUT → APPLIED (or DEGRADED + laggards, + // or FAILED on persist failure). + SchemaApplyCoordinator.INSTANCE.markFencing(applyId); + boolean scheduled = true; + try { + fenceExecutor.submit( + () -> fenceThenPersistThenResume(applyId, fenceOpt, catalog, name, content, updateTime)); + } catch (final Throwable t) { + // Executor rejected (shutting down). Run inline (blocking) so the suspend bracket does + // not leak and the apply still completes — write-safety wins over a non-blocking response. + log.warn("runtime-rule could not schedule the background fence for {}/{}; running it " + + "inline so dispatch is not left suspended", catalog, name, t); + scheduled = false; + } + if (!scheduled) { + fenceThenPersistThenResume(applyId, fenceOpt, catalog, name, content, updateTime); + } + return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, + "structural apply accepted; fencing schema propagation, then persisting + resuming in " + + "the background — poll /runtime/rule/status?applyId=" + applyId + describeDelta(delta)); + } + + /** + * Write-safe post-apply tail, run on the fence executor (or inline if it's gone): + * fence → persist → commit → resume. The fence waits for the new measure schema + * to reach every data node BEFORE anything durable or visible happens (an un-propagated write is + * silently dropped — CLAUDE.md tip #16); only then is the rule row persisted (the durable commit + * point), so a crash before persist leaves no row and peers stay safely on the old content, and + * any durable row is guaranteed fence-confirmed. After persist it finalizes the local commit + * (drop removed metrics, swap the bundle, unpark dispatch) and resumes/notifies peers. + * + *

Terminal status: {@link ApplyPhase#FAILED} if persist fails (rolled back — nothing + * committed); {@link ApplyPhase#DEGRADED} if the fence didn't confirm within the budget (laggards + * listed; dispatch resumed anyway so a stuck node can't park the metric forever) or the local + * commit-tail threw (DB durable, peers converge from it); {@link ApplyPhase#APPLIED} otherwise. + */ + private void fenceThenPersistThenResume(final String applyId, final StorageManipulationOpt fenceOpt, + final String catalog, final String name, + final String content, final long updateTime) { + // 1. Fence the new measure schema's propagation BEFORE persisting or resuming. A fence + // error/timeout is non-fatal (best-effort) — we proceed but mark DEGRADED below. + StorageManipulationOpt.FenceOutcome fenceOutcome = null; + Throwable fenceError = null; + try { + fenceOpt.runDeferredFence(); + fenceOutcome = fenceOpt.getFenceOutcome(); + } catch (final Throwable t) { + fenceError = t; + log.warn("runtime-rule schema fence for {}/{} errored; proceeding anyway " + + "(BanyanDB still propagates the schema asynchronously)", catalog, name, t); + } + + // 2. Persist the rule row — THE durable commit point, AFTER the fence so durable ⟹ fenced. + // Narrow retry over a transient storage blip. On failure nothing is committed: roll back + // the local pending commit and resume peers to the old content (the DB never advanced). HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); if (persistError != null) { try { @@ -1077,71 +1356,116 @@ private HttpResponse applyStructural(final String catalog, final String name, persistError = persistRuleSync(catalog, name, content, updateTime); } if (persistError != null) { - // Persist still failing. The local node has registered added + shape-break - // metrics in MeterSystem (DDL fired, isExists verified) while the DB and peers - // remain on the old content. Discard drains the pending commit by removing only - // the added + shape-break metrics — it does NOT drop removedMetrics (the commit - // was stashed before that step, so those are still alive) and does NOT swap - // appliedMal/appliedContent (still on the pre-apply bundle). Net outcome: - // local node converges back to the pre-apply bundle exactly, no divergence from - // what the DB still says is current. try { dslManager.getCommitCoord().discardCommit(catalog, name); } catch (final Throwable rt) { - log.error("runtime-rule CRITICAL: persist-failure discard itself failed for " - + "{}/{}; state is inconsistent and requires operator intervention", - catalog, name, rt); + log.error("runtime-rule CRITICAL: persist-failure discard itself failed for {}/{}; " + + "state is inconsistent and requires operator intervention", catalog, name, rt); } - // Peers are still SUSPENDED on our earlier broadcast. The DB didn't advance, - // so self-heal would eventually flip them back, but broadcasting Resume now - // cuts the dispatch gap from 60 s to a single RPC round-trip. broadcastResume(catalog, name, "persist_failed"); - log.error("runtime-rule CRITICAL: STRUCTURAL persist FAILED after successful apply " - + "for {}/{} — discarded pending commit; local node re-aligned with old " - + "content. Operator action: re-push via /addOrUpdate once storage is healthy.", - catalog, name); - return persistError; + log.error("runtime-rule CRITICAL: STRUCTURAL persist FAILED after apply + fence for {}/{} " + + "— discarded the pending commit; local node re-aligned with old content. Operator " + + "action: re-push via /addOrUpdate once storage is healthy.", catalog, name); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "persist failed after apply + fence"); + return; } - // Persist succeeded — drain the pending commit now that the DB reflects the new - // content. commitCoord.finalizeCommit drops removedMetrics, swaps the applied - // pointers, retires the old loader, fires alarm reset, and advances the snapshot. - // - // Commit-tail failure handling: the DB row is durable (persist already succeeded), - // so peers converge from the DB — but on THIS node the local drop+recreate may - // not have fully landed. Return 500 commit_deferred so the operator sees a clear - // "DB row flipped, local commit threw" signal and can retry. Returning 200 would - // tell the operator "done" while the backend schema on this node may still be - // stale — that's the failure mode the review flagged. + // 3. Durable — finalize the local commit (swap bundle, unpark dispatch) + resume/notify peers. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); Throwable commitFailure = null; boolean drained = false; try { drained = dslManager.getCommitCoord().finalizeCommit(catalog, name); } catch (final Throwable t) { commitFailure = t; - log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist " - + "succeeded — DB is authoritative, peers will converge. Operator action: " - + "inspect log for the underlying cause.", catalog, name, t); + log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist — DB is " + + "authoritative, peers converge from it; this node retries on the next tick.", + catalog, name, t); } + if (commitFailure != null || drained) { + // The durable row advanced (commit drained, OR the local commit-tail threw but the row IS + // persisted) — peers must reconcile against it NOW rather than wait one ~30s tick. + broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); + } else { + // Nothing changed (force re-apply on byte-identical content) — just un-suspend peers. + broadcastResume(catalog, name, "structural_resume"); + } + + // 4. Terminal status. if (commitFailure != null) { - return serverError("commit_deferred", catalog, name, - "DB row persisted, but local commit-tail threw — backend shape on this " - + "node may not have fully landed. Peers converge from DB; this node " - + "will retry on the next dslManager tick. Cause: " + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "commit-tail deferred: DB persisted, local backend may be stale until the next tick: " + commitFailure.getMessage()); + } else if (fenceError != null) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence error (committed + durable; cluster converges): " + fenceError.getMessage()); + } else if (fenceOutcome != null && !fenceOutcome.isApplied()) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence did not confirm cluster-wide propagation within " + + dslManager.getDeferredFenceTimeoutMs() + " ms", + fenceOutcome.getLaggardNodeIds()); + } else { + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); } + } - // No commit was drained — typical for {@code force=true} re-applies on byte- - // identical content (engine returned NO_CHANGE so nothing was stashed). Peers are - // still PEER-suspended from our earlier broadcast and would only converge via the - // 60 s self-heal window without an explicit Resume. Send the Resume now so peers - // recover within an RPC round-trip. - if (!drained) { - broadcastResume(catalog, name, "force_no_change"); + /** + * Background tail for {@code /delete?mode=revertToBundled}, run on the fence executor (or + * inline if it's gone). Reinstalls the bundled rule via the apply pipeline — which runs the + * deferred schema fence — then removes the runtime row, and maps the orchestrator outcome to + * the apply's terminal phase so an operator polling {@code GET /runtime/rule/status?applyId} + * sees {@code APPLIED}, or {@code FAILED} with the reason. Unlike a structural /addOrUpdate + * there is no Suspend bracket to release here: revert never broadcast a pause (peers converge + * on the persisted row via the periodic scan), so a failure needs no Resume. + */ + private void revertToBundledTracked(final String applyId, final String catalog, + final String name, final String priorContent, + final RuntimeRuleManagementDAO dao) { + final DSLRuntimeDelete.Result revert; + try { + revert = dslManager.getDslRuntimeDelete().revertToBundled(catalog, name, priorContent); + } catch (final Throwable t) { + log.error("runtime-rule /delete: revertToBundled threw for {}/{}", catalog, name, t); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "revert threw: " + t.getMessage()); + return; } - - return ok(HttpStatus.OK, "structural_applied", catalog, name, - "structural apply succeeded" + describeDelta(delta)); + switch (revert.status) { + case REFUSED_CONFLICT: + log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "revert refused: " + revert.error); + return; + case PRECONDITION_FAILED: + log.error("runtime-rule /delete: revertToBundled precondition failed for {}/{}: {}", + catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "revert precondition failed: " + revert.error); + return; + case BUNDLED_APPLY_FAILED: + log.error("runtime-rule /delete: bundled apply failed for {}/{}: {}", + catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "bundled apply failed (typically a storage-backend DDL/verify issue); the " + + "orchestrator unwound the runtime install so local state matches the " + + "persisted INACTIVE row — retry once storage recovers. Cause: " + revert.error); + return; + case REVERTED: + default: + break; + } + // Bundled is durable cluster-wide (the apply's fence confirmed) — finalize by removing + // the runtime row. A delete failure here leaves bundled applied but the INACTIVE row + // lingering; the reconcile retries it, so report FAILED with that context. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); + try { + dao.delete(catalog, name); + } catch (final IOException e) { + log.error("failed to delete runtime rule {}/{} after revert", catalog, name, e); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "bundled was reinstalled but removing the runtime row failed; it will be retried " + + "on the next reconcile: " + e.getMessage()); + return; + } + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); } /** @@ -1306,6 +1630,42 @@ private List broadcastSuspend(final String catalog, final String nam } } + /** + * Fire NotifyApplied to every non-self peer after a successful commit so peers converge NOW + * instead of waiting up to one ~30s refresh tick. Best-effort and fire-and-forget: a failure + * is non-fatal because peers self-converge on their own tick regardless. + */ + private void broadcastNotifyApplied(final String catalog, final String name, final String contentHash) { + if (clusterClient == null) { + return; + } + // Fire-and-forget: never block the REST response on a sequential, per-peer-deadline fan-out. + // A lost notify is harmless — the peer self-converges on its next tick. + try { + notifyExecutor.submit(() -> { + try { + clusterClient.broadcastNotifyApplied(catalog, name, contentHash); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied broadcast failed for {}/{}; peers will " + + "converge on their next tick", catalog, name, t); + } + }); + } catch (final Throwable t) { + // Executor rejected (shutting down). Drop it — peers self-converge on their next tick. + log.warn("runtime-rule NotifyApplied could not be scheduled for {}/{}; peers converge " + + "on their next tick", catalog, name, t); + } + } + + /** Stop the best-effort background executors (notify fan-out + schema fence). The framework's + * {@code ModuleProvider} has no stop lifecycle hook, so this is not auto-invoked in production — + * both executors are daemon threads that never block JVM exit. Provided for clean test teardown + * and for a future module shutdown hook; mirrors {@code RuntimeRuleClusterServiceImpl.shutdown()}. */ + public void shutdown() { + notifyExecutor.shutdownNow(); + fenceExecutor.shutdownNow(); + } + /** * Inspect Suspend acks for the split-brain guard: if any peer responded with REJECTED * (origin conflict — it believes it is the main and is mid-apply), surface that to the @@ -1649,55 +2009,39 @@ private HttpResponse doDeleteLocked(final String catalog, final String name, } if (mode == DeleteMode.REVERT_TO_BUNDLED) { - // Bundled-revert path is the schema-change path: bundled may have a different - // shape than runtime. The orchestrator runs the unified pipeline: - // (1) installRuntime to put prior runtime claims back locally, - // (2) apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange) — engine.commit - // drops runtime-only metrics through the standard delta path, - // (3) reset rules-map state to boot-seeded so gone-keys reconcile leaves - // it alone after dao.delete. - // dao.delete only runs after revertToBundled returns REVERTED — a precondition - // or compile failure aborts the row deletion so the operator can retry. - final DSLRuntimeDelete.Result revert; + // Bundled-revert is the schema-change path: bundled may have a different shape + // than runtime, so — like a structural /addOrUpdate — it is tracked and runs + // ASYNCHRONOUSLY. The unified pipeline (installRuntime → apply(bundled, STRUCTURAL, + // withSchemaChange, deferred fence) → reset state → dao.delete) plus its schema + // fence (up to deferredFenceTimeoutMs) run on the background executor, and the + // response returns immediately with an applyId the operator polls via + // GET /runtime/rule/status. The precondition rejections above (inactivate-first, + // no_bundled_twin, requires_revert_to_bundled) are still reported synchronously; + // the revert pipeline's own outcomes (REFUSED_CONFLICT / PRECONDITION_FAILED / + // BUNDLED_APPLY_FAILED / row-delete failure) surface as the apply's terminal phase + // (FAILED, with the reason) on /status rather than an HTTP error. + final String bundled = + StaticRuleRegistry.active().find(catalog, name).orElse(prior.getContent()); + final String applyId = SchemaApplyCoordinator.INSTANCE.begin( + catalog, name, ContentHash.sha256Hex(bundled)); + SchemaApplyCoordinator.INSTANCE.markFencing(applyId); + final String priorContent = prior.getContent(); + boolean scheduled = true; try { - revert = dslManager.getDslRuntimeDelete() - .revertToBundled(catalog, name, prior.getContent()); + fenceExecutor.submit(() -> revertToBundledTracked(applyId, catalog, name, priorContent, dao)); } catch (final Throwable t) { - log.error("runtime-rule /delete: revertToBundled threw for {}/{}", catalog, name, t); - return serverError("revert_to_bundled_failed", catalog, name, t.getMessage()); + // Executor rejected (shutting down) — run inline so the apply still completes. + log.warn("runtime-rule could not schedule the background revert for {}/{}; " + + "running it inline", catalog, name, t); + scheduled = false; } - switch (revert.status) { - case REFUSED_CONFLICT: - log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, revert.error); - return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, - jsonBody("delete_refused", catalog, name, revert.error)); - case PRECONDITION_FAILED: - log.error("runtime-rule /delete: revertToBundled precondition failed for {}/{}: {}", - catalog, name, revert.error); - return serverError("revert_to_bundled_precondition_failed", catalog, name, revert.error); - case BUNDLED_APPLY_FAILED: - log.error("runtime-rule /delete: bundled apply failed for {}/{}: {}", - catalog, name, revert.error); - return serverError("revert_to_bundled_failed", catalog, name, - "bundled apply failed (typically a storage-backend DDL/verify " - + "issue — BanyanDB unreachable, shape rejection, or schema-" - + "barrier timeout). The orchestrator unwound the step-1 " - + "runtime install so local state matches the persisted " - + "INACTIVE row. Retry once storage recovers. Cause: " - + revert.error); - case REVERTED: - default: - break; - } - try { - dao.delete(catalog, name); - } catch (final IOException e) { - log.error("failed to delete runtime rule {}/{}", catalog, name, e); - return serverError("delete_failed", catalog, name, e.getMessage()); + if (!scheduled) { + revertToBundledTracked(applyId, catalog, name, priorContent, dao); } - return ok(HttpStatus.OK, "reverted_to_bundled", catalog, name, - "runtime row removed; bundled rule installed via apply pipeline (schema " - + "change handled by the standard delta path); peers converge on next tick"); + return okWithApplyId(HttpStatus.OK, "reverted_to_bundled", catalog, name, applyId, + "revert-to-bundled accepted; reinstalling the bundled rule across the cluster " + + "and removing the runtime row in the background — poll " + + "/runtime/rule/status?applyId=" + applyId); } // No-bundled-twin DEFAULT path. /inactivate already tore down local handlers under @@ -1783,6 +2127,21 @@ private static HttpResponse ok(final HttpStatus status, final String applyStatus return HttpResponse.of(status, MediaType.JSON_UTF_8, jsonBody(applyStatus, catalog, name, message)); } + /** {@link #ok} variant that also carries the apply's {@code applyId} so the caller can poll + * {@code GET /runtime/rule/status?applyId=…} directly. Used by the structural-apply success + * envelope. */ + private static HttpResponse okWithApplyId(final HttpStatus status, final String applyStatus, + final String catalog, final String name, + final String applyId, final String message) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", applyStatus); + body.addProperty("catalog", catalog == null ? "" : catalog); + body.addProperty("name", name == null ? "" : name); + body.addProperty("applyId", applyId == null ? "" : applyId); + body.addProperty("message", message == null ? "" : message); + return HttpResponse.of(status, MediaType.JSON_UTF_8, GSON.toJson(body)); + } + private static HttpResponse badRequest(final String applyStatus, final String catalog, final String name, final String message) { return HttpResponse.of(HttpStatus.BAD_REQUEST, MediaType.JSON_UTF_8, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java new file mode 100644 index 000000000000..2c0c39c6c151 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +/** + * Lifecycle phase of one runtime-rule apply, as tracked by {@link SchemaApplyCoordinator} on the + * cluster main. A progress query (added in a later phase) reports this so an operator can see how + * far an apply got — and, when it stops short, whether it is converging ({@link #DEGRADED}) or + * failed outright ({@link #FAILED}). + * + *

Normal progression: + * {@link #PENDING} → {@link #DDL} → {@link #FENCING} → {@link #ROLLING_OUT} → {@link #APPLIED}. + * These are the phases the main observes from its apply orchestration: {@code PENDING} once the + * apply is accepted, {@code DDL} while the compile/verify/schema-change call runs (a single opaque + * step from the orchestrator's vantage — sub-steps such as validation are not separately + * observable, so they are not modelled). Once DDL fires the HTTP response returns with the + * {@code applyId} at {@code FENCING} — the apply is accepted but NOT yet durable — while the main + * waits (in the background, on a generous timeout) for every BanyanDB data node to apply the new + * schema revision. The rule row is persisted (the durable commit point) only AFTER the fence + * confirms, so "durable" implies "schema propagated" and crash recovery never resumes against an + * unpropagated schema. Dispatch stays suspended through {@code FENCING} because an un-propagated + * write is silently dropped. Once the fence confirms and the row is persisted, the apply reaches + * {@code ROLLING_OUT} — finalize the local commit, unpark dispatch, resume/notify peers — and then + * {@code APPLIED}. The operator polls to watch {@code FENCING → ROLLING_OUT → APPLIED} (or + * {@code DEGRADED}/{@code FAILED}). Two off-ramps: + *

    + *
  • {@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). Nothing was + * committed (a crash before persist likewise leaves no durable row); the change was rolled + * back and peers stay on the prior content.
  • + *
  • {@link #DEGRADED} — committed and durable, but the fence did not confirm cluster-wide + * propagation within the timeout (one or more data nodes lagging — exposed as + * {@code fenceLaggards}; dispatch is resumed anyway so a stuck node doesn't park the metric + * forever) or the local commit-tail threw. Forward-progress: peers converge from the durable + * row and BanyanDB keeps converging; this is NOT a revert.
  • + *
+ * {@link #UNKNOWN} is returned for an apply-id the main no longer holds (evicted / main restarted); + * callers fall back to a content-hash comparison. + */ +public enum ApplyPhase { + PENDING, + DDL, + FENCING, + ROLLING_OUT, + APPLIED, + DEGRADED, + FAILED, + UNKNOWN; + + /** + * A terminal phase no longer advances on its own: the apply succeeded ({@link #APPLIED}), + * failed and rolled back ({@link #FAILED}), or committed-but-unconfirmed ({@link #DEGRADED}, + * which a background re-check may still flip to {@link #APPLIED}). + */ + public boolean isTerminal() { + return this == APPLIED || this == DEGRADED || this == FAILED; + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java new file mode 100644 index 000000000000..57dbb35ea175 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java @@ -0,0 +1,91 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import java.util.Collections; +import java.util.List; +import lombok.Getter; + +/** + * Immutable snapshot of one runtime-rule apply's progress on the cluster main. Held by + * {@link SchemaApplyCoordinator}; transitions produce a new snapshot (copy-on-write) so concurrent + * readers (a progress query) never see a torn value. + * + *

{@code contentHash} is the durable identity of the applied file — a status query can be + * answered by it even after the ephemeral {@code applyId} is gone. {@code failureReason} is set + * only for {@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}. + * + *

This snapshot carries the apply-level (main-orchestrated) phase. Per-node breakdown + * (storage-plane laggards, per-OAP-node applied state) is layered on in a later phase via the + * status query DTO; this type stays a small, immutable core. + */ +@Getter +public final class ApplyStatus { + private final String applyId; + private final String catalog; + private final String name; + private final String contentHash; + private final ApplyPhase phase; + /** Non-null only for {@link ApplyPhase#FAILED} (pre-commit error) and {@link ApplyPhase#DEGRADED} + * (committed but a node lagging at fence timeout, or local commit-tail threw). Null otherwise. */ + private final String failureReason; + private final long startedAtMs; + private final long updatedAtMs; + /** Data-node ids that had not confirmed the new schema revision when the fence timed out. + * Non-empty only for {@link ApplyPhase#DEGRADED} caused by fence non-confirmation; always a + * non-null (possibly empty) immutable list. */ + private final List fenceLaggards; + + public ApplyStatus(final String applyId, final String catalog, final String name, + final String contentHash, final ApplyPhase phase, final String failureReason, + final long startedAtMs, final long updatedAtMs, final List fenceLaggards) { + this.applyId = applyId; + this.catalog = catalog; + this.name = name; + this.contentHash = contentHash; + this.phase = phase; + this.failureReason = failureReason; + this.startedAtMs = startedAtMs; + this.updatedAtMs = updatedAtMs; + this.fenceLaggards = fenceLaggards == null + ? Collections.emptyList() + : Collections.unmodifiableList(fenceLaggards); + } + + /** A copy advanced to {@code newPhase}, clearing any prior failure reason + laggards (forward + * progress). */ + public ApplyStatus withPhase(final ApplyPhase newPhase, final long nowMs) { + return new ApplyStatus(applyId, catalog, name, contentHash, newPhase, null, startedAtMs, nowMs, + Collections.emptyList()); + } + + /** A copy moved to a non-success terminal ({@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}) + * carrying {@code reason} and no laggards. */ + public ApplyStatus withFailure(final ApplyPhase terminalPhase, final String reason, final long nowMs) { + return withFailure(terminalPhase, reason, Collections.emptyList(), nowMs); + } + + /** A copy moved to a non-success terminal carrying {@code reason} and the {@code laggards} that + * caused a fence-non-confirmation {@link ApplyPhase#DEGRADED}. */ + public ApplyStatus withFailure(final ApplyPhase terminalPhase, final String reason, + final List laggards, final long nowMs) { + return new ApplyStatus(applyId, catalog, name, contentHash, terminalPhase, reason, startedAtMs, nowMs, + laggards); + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java new file mode 100644 index 000000000000..c1fffe185f60 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import java.util.HashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.UUID; +import java.util.concurrent.ConcurrentHashMap; +import java.util.function.LongSupplier; +import java.util.function.UnaryOperator; +import lombok.extern.slf4j.Slf4j; + +/** + * Single in-memory owner of runtime-rule apply progress on the cluster main. Each apply opens a + * status entry keyed by a generated {@code applyId} and advances it through {@link ApplyPhase} + * as the apply runs (validate → DDL → fence → roll out → applied), or to a terminal + * {@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}. + * + *

Two indexes back the two ways a caller asks "is this live yet?": + *

    + *
  • by {@code applyId} — the live handle a just-submitted apply polls;
  • + *
  • by {@code (catalog, name)} → latest applyId — the content-based path used when the + * apply-id is gone (page refresh / main restart), resolved against the durable content hash.
  • + *
+ * + *

Concurrency: a {@link ConcurrentHashMap} of immutable {@link ApplyStatus} snapshots. Writes + * for a given apply are effectively single-threaded (the apply orchestration serializes per file), + * reads (progress queries) are concurrent and lock-free against the snapshot. This phase is the + * building block; the apply-lifecycle wiring, per-node breakdown, the {@code GetApplyStatus} query + * surface, and a background convergence watch with TTL eviction layer on later. State is in-memory + * by design — the durable content hash reconstructs truth after a main restart. + */ +@Slf4j +public class SchemaApplyCoordinator { + + /** + * Process-wide instance the production apply / cluster-RPC / reconcile paths share, mirroring + * the {@code MetadataRegistry.INSTANCE} / {@code DSLClassLoaderManager.INSTANCE} pattern so the + * coordinator need not be threaded through every constructor. Uses the wall-clock; tests + * construct their own instance with an injected clock instead. + */ + public static final SchemaApplyCoordinator INSTANCE = new SchemaApplyCoordinator(); + + private final Map byApplyId = new ConcurrentHashMap<>(); + private final Map latestApplyIdByFile = new ConcurrentHashMap<>(); + private final LongSupplier clock; + + public SchemaApplyCoordinator() { + this(System::currentTimeMillis); + } + + /** Clock-injectable for deterministic tests (and the timed background watch added later). */ + public SchemaApplyCoordinator(final LongSupplier clock) { + this.clock = clock; + } + + /** + * Open a new apply in {@link ApplyPhase#PENDING} and return its {@code applyId}. Records it as + * the latest apply for {@code (catalog, name)} so a content-based query resolves to it. + */ + public String begin(final String catalog, final String name, final String contentHash) { + final String applyId = UUID.randomUUID().toString(); + final long now = clock.getAsLong(); + byApplyId.put(applyId, new ApplyStatus( + applyId, catalog, name, contentHash, ApplyPhase.PENDING, null, now, now, null)); + latestApplyIdByFile.put(fileKey(catalog, name), applyId); + if (log.isDebugEnabled()) { + log.debug("apply [{}] begin: {}/{} hash={}", applyId, catalog, name, contentHash); + } + return applyId; + } + + /** Advance an apply to {@code phase} (forward progress; clears any prior failure reason). No-op + * for an unknown apply-id. */ + public void transition(final String applyId, final ApplyPhase phase) { + update(applyId, s -> s.withPhase(phase, clock.getAsLong())); + } + + /** Terminal success. */ + public void markApplied(final String applyId) { + transition(applyId, ApplyPhase.APPLIED); + } + + /** Move an apply to {@link ApplyPhase#FENCING}: the background wait for cluster-wide schema + * propagation is in flight (after the durable commit + peer resume). No-op for an unknown id. */ + public void markFencing(final String applyId) { + transition(applyId, ApplyPhase.FENCING); + } + + /** Terminal: committed and durable, but cluster-wide propagation unconfirmed within budget + * (a node is lagging). Not a revert — a background re-check may flip it to APPLIED later. */ + public void markDegraded(final String applyId, final String reason) { + update(applyId, s -> s.withFailure(ApplyPhase.DEGRADED, reason, clock.getAsLong())); + } + + /** {@link #markDegraded(String, String)} carrying the data-node ids that had not confirmed the + * schema revision at fence timeout, surfaced to the operator on the status. */ + public void markDegraded(final String applyId, final String reason, final List laggards) { + update(applyId, s -> s.withFailure(ApplyPhase.DEGRADED, reason, laggards, clock.getAsLong())); + } + + /** Terminal: a pre-commit error (compile / verify / DDL RPC / persist); the change was rolled + * back. */ + public void markFailed(final String applyId, final String reason) { + update(applyId, s -> s.withFailure(ApplyPhase.FAILED, reason, clock.getAsLong())); + } + + private void update(final String applyId, final UnaryOperator op) { + byApplyId.computeIfPresent(applyId, (k, s) -> op.apply(s)); + } + + /** The live status for an apply-id, or {@code null} when the main no longer holds it (caller + * treats null as {@link ApplyPhase#UNKNOWN} and falls back to the content-based path). */ + public ApplyStatus get(final String applyId) { + return byApplyId.get(applyId); + } + + /** + * The latest apply status for a file, for the content-based query when the apply-id is unknown. + * Returns {@code null} if no apply for that file is tracked, or if {@code expectedContentHash} + * is non-null and does not match the latest apply's hash (the latest tracked apply is for a + * different content than the caller asked about). + */ + public ApplyStatus getLatestByFile(final String catalog, final String name, + final String expectedContentHash) { + final String applyId = latestApplyIdByFile.get(fileKey(catalog, name)); + if (applyId == null) { + return null; + } + final ApplyStatus status = byApplyId.get(applyId); + if (status == null) { + return null; + } + if (expectedContentHash != null && !expectedContentHash.equals(status.getContentHash())) { + return null; + } + return status; + } + + /** Number of tracked applies — for tests and the TTL-eviction watch. */ + public int trackedCount() { + return byApplyId.size(); + } + + /** + * Evict tracked applies whose last update is older than {@code ttlMs}, bounding memory. Both + * terminal entries (kept around so a post-refresh query within the window still resolves) and + * stale non-terminal ones (a missed apply branch left in PENDING) are reaped once past the + * TTL — a later query then returns {@code null}/UNKNOWN and the caller falls back to comparing + * the durable content hash against the DAO row. The {@code (catalog, name) → latest} index + * entry is cleared only when it still points at an evicted apply, so a newer apply for the + * same file keeps its mapping. Returns the number evicted. + */ + public int evictExpired(final long ttlMs) { + final long cutoff = clock.getAsLong() - ttlMs; + final Set evicted = new HashSet<>(); + byApplyId.entrySet().removeIf(e -> { + if (e.getValue().getUpdatedAtMs() < cutoff) { + evicted.add(e.getKey()); + return true; + } + return false; + }); + if (!evicted.isEmpty()) { + latestApplyIdByFile.values().removeIf(evicted::contains); + if (log.isDebugEnabled()) { + log.debug("apply-status: evicted {} entr{} older than {} ms", + evicted.size(), evicted.size() == 1 ? "y" : "ies", ttlMs); + } + } + return evicted.size(); + } + + private static String fileKey(final String catalog, final String name) { + return catalog + "/" + name; + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto index 898204985bad..54acb2032ca8 100644 --- a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto +++ b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto @@ -51,6 +51,16 @@ service RuntimeRuleClusterService { // body in the ack. Eliminates the "client must resubmit" round-trip from the old // 421 Misdirected Request behaviour. rpc Forward(ForwardRequest) returns (ForwardResponse); + // Query the apply-status the main tracks for a runtime-rule file. Read-only; served by the + // main (non-main nodes route here, since only the main runs applies and holds the status). + // Lets the UI/operator poll an apply's progress, and — after the ephemeral apply_id is gone + // (page refresh / main restart) — ask by (catalog, name[, content_hash]), the durable identity. + rpc GetApplyStatus(ApplyStatusRequest) returns (ApplyStatusResponse); + // Fired by the main right after it commits a structural apply, so peers converge NOW + // (run a reconcile against the just-persisted DB row) instead of waiting up to one refresh + // tick (~30s). Best-effort: the ack only confirms the peer accepted the nudge; the peer + // still self-converges on its next tick if the notify is lost. + rpc NotifyApplied(NotifyAppliedRequest) returns (NotifyAppliedAck); } message SuspendRequest { @@ -166,3 +176,80 @@ message ForwardResponse { // operator can correlate which node refused. string node_id = 3; } + +// Identifies the apply whose status the caller wants. Either apply_id (the live handle a +// just-submitted apply was given) OR catalog+name (+ optional content_hash) — the content-based +// path used once the apply_id is gone. content_hash, when set, requires the latest tracked apply +// for that file to match (else found=false), so a stale query doesn't misreport a newer apply. +message ApplyStatusRequest { + string apply_id = 1; + string catalog = 2; + string name = 3; + string content_hash = 4; +} + +// Lifecycle phase of an apply on the main. Mirrors the Java ApplyPhase; prefixed to avoid +// collision with the other enums in this package. +enum ApplyStatusPhase { + APPLY_PHASE_UNSPECIFIED = 0; + APPLY_PHASE_PENDING = 1; + // 2 (VALIDATING) was removed: it is a sub-step inside the single compile/verify/schema-change + // call and is not separately observable by the main's orchestration. Reserved to keep the wire + // number stable. + reserved 2; + reserved "APPLY_PHASE_VALIDATING"; + APPLY_PHASE_DDL = 3; + // After persist, the main waits (in the background, on a generous timeout) for every data node + // to apply the new schema revision BEFORE resuming dispatch — an un-propagated write is dropped. + // The HTTP response returns here with the apply_id; dispatch stays suspended through FENCING. + APPLY_PHASE_FENCING = 4; + // Fence confirmed — finalize the local commit, unpark dispatch, resume/notify peers. + APPLY_PHASE_ROLLING_OUT = 5; + APPLY_PHASE_APPLIED = 6; + APPLY_PHASE_DEGRADED = 7; + APPLY_PHASE_FAILED = 8; + // The main no longer holds an apply matching the request (evicted / main restarted, or the + // content_hash did not match the latest tracked apply). Caller falls back to a content-hash + // comparison against the durable DAO row. + APPLY_PHASE_UNKNOWN = 9; +} + +message ApplyStatusResponse { + // False when no matching apply is tracked (phase is then APPLY_PHASE_UNKNOWN and the rest + // are empty/zero). True otherwise, with the fields below populated. + bool found = 1; + string apply_id = 2; + string catalog = 3; + string name = 4; + string content_hash = 5; + ApplyStatusPhase phase = 6; + // Non-empty only for APPLY_PHASE_FAILED / APPLY_PHASE_DEGRADED. + string failure_reason = 7; + int64 started_at_ms = 8; + int64 updated_at_ms = 9; + // Instance id of the node that produced this status (the main). + string node_id = 10; + // Data-node ids that had not confirmed the schema revision when the fence timed out. + // Non-empty only on APPLY_PHASE_DEGRADED (fence non-confirmation). + repeated string fence_laggards = 11; +} + +// Sent by the main after a successful structural commit so peers reconcile immediately. +message NotifyAppliedRequest { + string catalog = 1; + string name = 2; + // The committed content hash — informational/correlation; the peer reconciles from the DB row. + string content_hash = 3; + // Main's instance id; receivers compare against their own to suppress self-broadcast loops. + string sender_node_id = 4; + // Epoch millis on the main; diagnostic only. + int64 issued_at_ms = 5; +} + +message NotifyAppliedAck { + string node_id = 1; + // True when the peer scheduled an immediate reconcile; false on self-broadcast suppression + // or when the receiver isn't ready (it still self-converges on its next tick). + bool accepted = 2; + string detail = 3; +} diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java index dc4222b49185..66f147557c71 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java @@ -24,6 +24,7 @@ import javassist.ClassPool; import org.apache.skywalking.oap.server.core.analysis.Layer; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.LayerConflictException; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.RuntimeLayerRegistry; @@ -134,6 +135,37 @@ void successfulApplyRegistersDerivedMetricNames() throws Exception { any(StorageManipulationOpt.class)); } + @Test + void deferredFenceFailureRollsBackAndCarriesPartialMetricsForCallerRollback() throws Exception { + // The DDL for every metric fires first; the batched schema fence runs once at the end. A + // barrier transport error there must abort the apply as an ApplyException (so the REST + // caller rolls back), carrying the metric names registered before the fence so the caller + // knows what to unwind. Inline path (fenceRunByCaller=false) so the applier runs the fence. + final String yaml = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: cpu_total_percentage\n" + + " exp: node_cpu_seconds_total.sum(['host']).rate('PT1M')\n" + + " - name: mem_total_used_percentage\n" + + " exp: node_memory_MemTotal_bytes.sum(['host'])\n"; + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport down"); + }); + + final MalFileApplier.ApplyException ex = assertThrows( + MalFileApplier.ApplyException.class, + () -> applier.apply(yaml, "otel-rules/vm-fence", "hashF", opt)); + + assertTrue(ex.getMessage().contains("schema fence failed"), + "fence failure must surface as an apply error, not be swallowed"); + assertEquals( + setOf("meter_vm_cpu_total_percentage", "meter_vm_mem_total_used_percentage"), + ex.getPartiallyRegistered(), + "metrics registered before the fence must be carried for the caller's rollback"); + } + @Test void removeCallsMeterSystemPerName() { // The inverse side of the contract: on unregister every metric name the prior apply diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java new file mode 100644 index 000000000000..3d1746f38d1f --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import io.grpc.stub.StreamObserver; +import java.util.Arrays; +import java.util.concurrent.CountDownLatch; +import java.util.concurrent.TimeUnit; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.Mockito.doAnswer; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.never; +import static org.mockito.Mockito.verify; + +/** + * Unit coverage for the admin-internal gRPC handlers the cluster main serves: + * {@code getApplyStatus} (the apply-status query, including the {@link ApplyPhase} → proto mapping + * and the laggard list) and {@code notifyApplied} (self-suppression + the off-thread reconcile + * nudge). Status flows through the {@link SchemaApplyCoordinator#INSTANCE} singleton, so each test + * opens its own apply (a fresh UUID applyId) to stay isolated from the others. + */ +class RuntimeRuleClusterServiceImplTest { + + private static final String SELF = "self-node_17129"; + + /** Minimal StreamObserver that captures the single onNext value the handlers emit. */ + private static final class Capturing implements StreamObserver { + private T value; + + @Override + public void onNext(final T v) { + this.value = v; + } + + @Override + public void onError(final Throwable t) { + } + + @Override + public void onCompleted() { + } + } + + private ApplyStatusResponse query(final RuntimeRuleClusterServiceImpl impl, final String applyId) { + final Capturing obs = new Capturing<>(); + impl.getApplyStatus(ApplyStatusRequest.newBuilder().setApplyId(applyId).build(), obs); + return obs.value; + } + + @Test + void getApplyStatusMapsEachPhaseToProtoAlongTheHappyPath() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-phase", "h1"); + + assertEquals(ApplyStatusPhase.APPLY_PHASE_PENDING, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.DDL); + assertEquals(ApplyStatusPhase.APPLY_PHASE_DDL, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.markFencing(id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_FENCING, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.ROLLING_OUT); + assertEquals(ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.markApplied(id); + + final ApplyStatusResponse applied = query(impl, id); + assertTrue(applied.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_APPLIED, applied.getPhase()); + assertEquals("otel-rules", applied.getCatalog()); + assertEquals("vm-phase", applied.getName()); + assertEquals(SELF, applied.getNodeId()); + assertTrue(applied.getFenceLaggardsList().isEmpty()); + } + + @Test + void getApplyStatusSurfacesDegradedWithLaggardNodeIds() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-degraded", "h2"); + SchemaApplyCoordinator.INSTANCE.markDegraded(id, "fence did not confirm", + Arrays.asList("data-1_17912", "data-2_17912")); + + final ApplyStatusResponse resp = query(impl, id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_DEGRADED, resp.getPhase()); + assertEquals("fence did not confirm", resp.getFailureReason()); + assertEquals(Arrays.asList("data-1_17912", "data-2_17912"), resp.getFenceLaggardsList()); + } + + @Test + void getApplyStatusMapsFailed() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-failed", "h3"); + SchemaApplyCoordinator.INSTANCE.markFailed(id, "ddl_verify_failed"); + + final ApplyStatusResponse resp = query(impl, id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_FAILED, resp.getPhase()); + assertEquals("ddl_verify_failed", resp.getFailureReason()); + assertTrue(resp.getFenceLaggardsList().isEmpty()); + } + + @Test + void getApplyStatusReturnsUnknownWhenNothingTracked() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + + final ApplyStatusResponse resp = query(impl, "no-such-apply-id"); + assertFalse(resp.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_UNKNOWN, resp.getPhase()); + // node_id is still stamped so the caller knows which node answered. + assertEquals(SELF, resp.getNodeId()); + } + + @Test + void getApplyStatusResolvesByCatalogNameWhenApplyIdAbsent() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-byname", "hbyname"); + SchemaApplyCoordinator.INSTANCE.markApplied(id); + + final Capturing obs = new Capturing<>(); + impl.getApplyStatus(ApplyStatusRequest.newBuilder() + .setCatalog("otel-rules").setName("vm-byname").setContentHash("hbyname").build(), obs); + + assertTrue(obs.value.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_APPLIED, obs.value.getPhase()); + assertEquals(id, obs.value.getApplyId()); + } + + @Test + void notifyAppliedSuppressesSelfBroadcastAndDoesNotReconcile() { + final DSLManager dslManager = mock(DSLManager.class); + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(dslManager, SELF); + final Capturing obs = new Capturing<>(); + + impl.notifyApplied(NotifyAppliedRequest.newBuilder() + .setSenderNodeId(SELF).setCatalog("otel-rules").setName("vm").build(), obs); + + assertFalse(obs.value.getAccepted(), "a node's own broadcast must be suppressed"); + verify(dslManager, never()).tick(); + } + + @Test + void notifyAppliedFromPeerSchedulesAReconcileTick() throws Exception { + final DSLManager dslManager = mock(DSLManager.class); + final CountDownLatch ticked = new CountDownLatch(1); + doAnswer(inv -> { + ticked.countDown(); + return null; + }).when(dslManager).tick(); + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(dslManager, SELF); + final Capturing obs = new Capturing<>(); + + impl.notifyApplied(NotifyAppliedRequest.newBuilder() + .setSenderNodeId("other-node_17129").setCatalog("otel-rules").setName("vm").build(), obs); + + assertTrue(obs.value.getAccepted(), "a peer notify must be accepted"); + assertTrue(ticked.await(3, TimeUnit.SECONDS), + "the reconcile tick must run off the gRPC thread"); + } +} diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java index 7eccc122c234..d7a7af0b76ec 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java @@ -146,8 +146,9 @@ void setUp() { .thenReturn(mock(StructuralCommitCoordinator.class)); when(dslManager.getDslRuntimeDelete()) .thenReturn(mock(DSLRuntimeDelete.class)); - // Stub both overloads — the REST handler calls the single-arg form on the - // FILTER_ONLY path and the two-arg form (deferCommit=true) on STRUCTURAL. + // Stub all overloads — the REST handler calls the single-arg form on the FILTER_ONLY + // path, and the three-arg form (deferCommit=true + the orchestrator-owned fence opt) on + // STRUCTURAL; the two-arg form is kept stubbed for any legacy caller. when(dslManager.applyNowForRuleFile(any())).thenAnswer(inv -> { final Object arg = inv.getArgument(0); if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { @@ -166,6 +167,15 @@ void setUp() { } return null; }); + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean(), any())).thenAnswer(inv -> { + final Object arg = inv.getArgument(0); + if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { + final RuntimeRuleManagementDAO.RuntimeRuleFile file = + (RuntimeRuleManagementDAO.RuntimeRuleFile) arg; + return DSLRuntimeState.running(file.getCatalog(), file.getName(), "h", 0L); + } + return null; + }); handler = new RuntimeRuleRestHandler(moduleManager, dslManager, clusterClient, 30_000L); } @@ -199,7 +209,7 @@ void malScopeChangeAcceptedWithAllowStorageChangeTrue() throws Exception { assertHttp(resp, HttpStatus.OK); // STRUCTURAL path uses the two-arg overload (deferCommit=true) so row-persist // failure can cleanly roll back. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -212,7 +222,7 @@ void malScopeChangeAcceptedThroughFixRoute() throws Exception { HttpData.ofUtf8(INSTANCE_YAML)); assertHttp(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -249,7 +259,7 @@ void malAddedMetricNeverTripsGuardrail() throws Exception { assertHttp(resp, HttpStatus.OK); // Non-empty addedMetrics makes this STRUCTURAL (NEW classification on first apply // or STRUCTURAL on update) — goes through the deferred-commit path. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } // ---- LAL scenarios -------------------------------------------------------------------- @@ -300,7 +310,7 @@ void lalBodyOnlyEditAccepted() throws Exception { assertHttp(resp, HttpStatus.OK); // LAL always routes through the STRUCTURAL path (classifyLal reports STRUCTURAL on // every content change), so the two-arg overload fires. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -315,7 +325,7 @@ void lalRuleAddedAcceptedWithAllowStorageChangeTrue() throws Exception { HttpData.ofUtf8(twoRules)); assertHttp(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } // ---- helpers -------------------------------------------------------------------------- diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java index ac1fd1880b1c..fcb72b3c97d8 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java @@ -58,6 +58,7 @@ import org.mockito.Mockito; import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; import static org.mockito.ArgumentMatchers.any; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.never; @@ -183,10 +184,10 @@ void addOrUpdateBypassesNoChangeOnInactiveRow() throws Exception { // Reactivation pushes through the STRUCTURAL/NEW path — expect 200 with a status // other than no_change. We don't assert on the exact applyStatus string here (that - // depends on classifier output); the key assertion is that the two-arg deferred- - // commit form of applyNowForRuleFile ran (STRUCTURAL path signature). + // depends on classifier output); the key assertion is that the three-arg deferred- + // commit + fence-opt form of applyNowForRuleFile ran (STRUCTURAL path signature). assertHttpStatus(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -205,8 +206,39 @@ void fixBypassesNoChangeEvenOnByteIdenticalActiveRow() throws Exception { assertHttpStatus(resp, HttpStatus.OK); // /addOrUpdate?force=true with byte-identical content → classifier returns // NO_CHANGE, handler falls through to applyStructural (not applyFilterOnly since - // NO_CHANGE != FILTER_ONLY) which uses the two-arg deferred-commit form. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + // NO_CHANGE != FILTER_ONLY) which uses the three-arg deferred-commit + fence-opt form. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); + } + + @Test + void applyStatusDegradesToDurableRowWhenLiveStatusGone() throws Exception { + // No live coordinator status for this (catalog,name) and no apply-id supplied — the query + // must fall back to the durable rule row. An ACTIVE row reports APPLIED derived from the + // content hash (persist-is-commit), so a page refresh after the applyId is gone still + // resolves instead of returning an opaque UNKNOWN. + final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm-statusfallback", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.applyStatus("otel-rules", "vm-statusfallback", "", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertTrue(body.contains("\"found\":true"), body); + assertTrue(body.contains("\"phase\":\"APPLIED\""), body); + assertTrue(body.contains("\"derivedFrom\":\"durable-dao\""), body); + } + + @Test + void applyStatusUnknownWhenNoLiveStatusAndNoDurableRow() throws Exception { + // Neither a live status nor a durable row for the queried apply → found=false / UNKNOWN. + when(dao.getAll()).thenReturn(Arrays.asList()); + + final HttpResponse resp = handler.applyStatus("otel-rules", "vm-none", "", "no-such-apply-id"); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertTrue(body.contains("\"found\":false"), body); + assertTrue(body.contains("UNKNOWN"), body); } @Test @@ -534,6 +566,8 @@ private void whenReconcilerApplySucceeds(final String catalog, final String name // NEW paths use the two-arg form (deferCommit=true). when(dslManager.applyNowForRuleFile(any())).thenReturn(state); when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean())).thenReturn(state); + // STRUCTURAL apply now passes the orchestrator-owned fence opt → the three-arg overload. + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean(), any())).thenReturn(state); // Apply path needs SUSPENDED on localSuspend so the workflow proceeds. The other // subsystem getters were stubbed in setUp() with default mocks; here we just // override localSuspend to return SUSPENDED instead of the default null. diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java new file mode 100644 index 000000000000..38a16e842401 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import java.util.concurrent.atomic.AtomicLong; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertNull; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; + +class SchemaApplyCoordinatorTest { + + private final AtomicLong clock = new AtomicLong(1_000L); + + private SchemaApplyCoordinator newCoordinator() { + return new SchemaApplyCoordinator(clock::get); + } + + @Test + void beginOpensPendingStatusAndIndexesByApplyIdAndFile() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "hash-1"); + + final ApplyStatus byId = coord.get(applyId); + assertNotNull(byId, "a just-begun apply must be retrievable by apply-id"); + assertEquals(ApplyPhase.PENDING, byId.getPhase()); + assertEquals("otel-rules", byId.getCatalog()); + assertEquals("vm", byId.getName()); + assertEquals("hash-1", byId.getContentHash()); + assertEquals(1_000L, byId.getStartedAtMs()); + assertNull(byId.getFailureReason(), "a fresh apply has no failure reason"); + + final ApplyStatus byFile = coord.getLatestByFile("otel-rules", "vm", null); + assertEquals(applyId, byFile.getApplyId(), "content-based lookup must resolve to the latest apply"); + } + + @Test + void transitionsAdvancePhaseAndStampUpdatedAt() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + clock.set(2_000L); + coord.transition(applyId, ApplyPhase.DDL); + clock.set(3_000L); + coord.transition(applyId, ApplyPhase.ROLLING_OUT); + + final ApplyStatus s = coord.get(applyId); + assertEquals(ApplyPhase.ROLLING_OUT, s.getPhase()); + assertEquals(1_000L, s.getStartedAtMs(), "startedAt is fixed at begin"); + assertEquals(3_000L, s.getUpdatedAtMs(), "updatedAt advances with each transition"); + } + + @Test + void markAppliedIsTerminalSuccess() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + coord.markApplied(applyId); + final ApplyStatus s = coord.get(applyId); + assertEquals(ApplyPhase.APPLIED, s.getPhase()); + assertTrue(s.getPhase().isTerminal()); + assertNull(s.getFailureReason()); + } + + @Test + void markFailedAndDegradedCarryReasonAndAreTerminal() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String failed = coord.begin("otel-rules", "a", "h"); + coord.markFailed(failed, "previous schema not found"); + assertEquals(ApplyPhase.FAILED, coord.get(failed).getPhase()); + assertEquals("previous schema not found", coord.get(failed).getFailureReason()); + assertTrue(coord.get(failed).getPhase().isTerminal()); + + final String degraded = coord.begin("otel-rules", "b", "h"); + coord.markDegraded(degraded, "data node lagging: node-3"); + assertEquals(ApplyPhase.DEGRADED, coord.get(degraded).getPhase()); + assertEquals("data node lagging: node-3", coord.get(degraded).getFailureReason()); + assertTrue(coord.get(degraded).getPhase().isTerminal()); + } + + @Test + void forwardTransitionClearsAStaleFailureReason() { + // A DEGRADED apply whose background re-check later confirms convergence flips to APPLIED, + // and the stale "lagging" reason must not linger. + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + coord.markDegraded(applyId, "node-3 lagging"); + coord.markApplied(applyId); + assertEquals(ApplyPhase.APPLIED, coord.get(applyId).getPhase()); + assertNull(coord.get(applyId).getFailureReason(), + "advancing past DEGRADED must clear the failure reason"); + } + + @Test + void unknownApplyIdReturnsNullAndTransitionsAreNoOps() { + final SchemaApplyCoordinator coord = newCoordinator(); + assertNull(coord.get("does-not-exist")); + assertDoesNotThrow(() -> coord.transition("does-not-exist", ApplyPhase.DDL)); + assertDoesNotThrow(() -> coord.markFailed("does-not-exist", "x")); + } + + @Test + void getLatestByFileHonorsExpectedContentHash() { + final SchemaApplyCoordinator coord = newCoordinator(); + coord.begin("otel-rules", "vm", "hash-A"); + + // Matching hash resolves; a different hash (caller asks about content the latest apply is + // NOT for) returns null so the caller doesn't misreport an unrelated apply. + assertNotNull(coord.getLatestByFile("otel-rules", "vm", "hash-A")); + assertNull(coord.getLatestByFile("otel-rules", "vm", "hash-B")); + assertNull(coord.getLatestByFile("otel-rules", "absent", null)); + } + + @Test + void evictExpiredReapsOldEntriesAndClearsTheirFileIndex() { + final SchemaApplyCoordinator coord = newCoordinator(); + clock.set(1_000L); + final String old = coord.begin("otel-rules", "old", "h"); + coord.markApplied(old); + clock.set(10_000L); + final String fresh = coord.begin("otel-rules", "fresh", "h2"); + + // cutoff = now(10_000) - ttl(5_000) = 5_000; 'old' (updatedAt 1_000) is evicted, 'fresh' kept. + final int evicted = coord.evictExpired(5_000L); + + assertEquals(1, evicted); + assertNull(coord.get(old), "expired apply must be evicted"); + assertNull(coord.getLatestByFile("otel-rules", "old", null), + "the file index entry pointing at an evicted apply must be cleared"); + assertNotNull(coord.get(fresh), "a fresh apply must survive eviction"); + assertEquals(1, coord.trackedCount()); + } + + @Test + void evictExpiredKeepsFileIndexWhenANewerApplyReplacedTheEvictedOne() { + final SchemaApplyCoordinator coord = newCoordinator(); + clock.set(1_000L); + final String first = coord.begin("otel-rules", "vm", "h1"); + coord.markApplied(first); + clock.set(10_000L); + final String second = coord.begin("otel-rules", "vm", "h2"); + + coord.evictExpired(5_000L); + + assertNull(coord.get(first), "the older apply for the file is evicted"); + final ApplyStatus latest = coord.getLatestByFile("otel-rules", "vm", null); + assertNotNull(latest, "the file index must still resolve via the newer apply"); + assertEquals(second, latest.getApplyId()); + } + + @Test + void latestByFileFollowsTheNewestApply() { + final SchemaApplyCoordinator coord = newCoordinator(); + coord.begin("otel-rules", "vm", "hash-old"); + final String newer = coord.begin("otel-rules", "vm", "hash-new"); + final ApplyStatus latest = coord.getLatestByFile("otel-rules", "vm", null); + assertEquals(newer, latest.getApplyId()); + assertEquals("hash-new", latest.getContentHash()); + assertEquals(2, coord.trackedCount(), "both applies remain tracked until eviction"); + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java index d65d7fb75bc1..cbb6602b9928 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java @@ -55,11 +55,20 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage // which made the contract a half-truth. Gate ahead of isExists so a peer apply // is genuinely zero-RPC. if (!flags.isInspectBackend()) { + // Local-cache-only (peer reconciler) tick: zero server RPCs, but the local schema + // cache MUST still be (re)derived from the declared model — the inspectBackend flag + // contract requires exactly this. Without it, a peer holds a live dispatch worker + // whose cache entry is either missing (first apply) or STALE (a reshape re-fires + // whenCreating with a new shape after StorageModels.remove+add). The read-side + // self-heal only fills a MISSING entry, never refreshes a stale one, so the peer + // would keep translating writes with the old shape. RPC-free; no-op for backends + // without a local schema cache (ES, JDBC). + populateLocalCacheOnly(model, opt); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, "local-cache-only mode; main-node is expected to have installed this resource"); log.debug( - "install: model [{}] not installed; local-cache-only mode — skipping (no isExists probe)", + "install: model [{}] not installed; local-cache-only mode — local schema cache refreshed, no isExists probe", model.getName() ); return; @@ -99,20 +108,39 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage // resource that only this very apply would ever create. if (deferDDLToInitNode(opt)) { while (true) { - InstallInfo info = isExists(model, opt); - if (!info.isAllExist()) { - try { + boolean allExist; + try { + InstallInfo info = isExists(model, opt); + allExist = info.isAllExist(); + if (!allExist) { log.info( "install info: {}.table for model: [{}] not all required resources exist. OAP is running in 'no-init' mode, waiting create or update... retry 3s later.", info.buildInstallInfoMsg(), model.getName() ); - Thread.sleep(3000L); - } catch (InterruptedException e) { - log.error(e.getMessage()); } - } else { + } catch (final StorageException e) { + if (!isRetryableNoInitProbeFailure(e)) { + throw e; + } + // A transient backend error during the probe (e.g. a BanyanDB cluster data node + // still Init-ing, "client connection is closing") is NOT a reason to abort boot: + // the init OAP will create the resource and the next probe succeeds. Treat it like + // "not present yet" and retry in-loop, rather than letting it escape and crash-loop + // the pod — which would only re-enter this same loop after a full restart. + allExist = false; + log.warn("install info: existence probe for model: [{}] threw a transient backend " + + "error. OAP is running in 'no-init' mode, retry 3s later.", model.getName(), e); + } + if (allExist) { break; } + try { + Thread.sleep(3000L); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new StorageException( + "interrupted while waiting for no-init backend resources for model " + model.getName(), e); + } } return; } @@ -143,12 +171,20 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage @Override public void whenRemoving(Model model, StorageManipulationOpt opt) throws StorageException { if (!opt.getFlags().isDropOnRemoval()) { + // Peer (or boot path that never drops): the backend drop is the main node's job, + // but this node must still evict its own local schema-cache entry so a removed + // model leaves no stale translation behind in an otherwise insert-only cache. + // RPC-free; no-op for backends without a local cache. + evictLocalCache(model); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, "dropOnRemoval flag is off; server drop is main-node responsibility (or boot path that never drops)"); return; } dropTable(model, opt); + // Evict only after a successful drop — a thrown dropTable leaves the model in the + // registry for retry (see StorageModels.remove), so its cache entry must stay too. + evictLocalCache(model); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.DROPPED, null); } @@ -170,6 +206,16 @@ protected static boolean deferDDLToInitNode(final StorageManipulationOpt opt) { return RunningMode.isNoInitMode() && opt.getFlags().isDeferDDLToInitNode(); } + /** + * Whether a {@link StorageException} from the no-init defer-loop existence probe is + * known to be transient and should be retried in-loop. The base implementation is + * conservative so permanent model/config errors do not become an infinite boot wait; + * storage backends opt in only for transport-level probe failures they can classify. + */ + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + return false; + } + public void start() { } @@ -237,6 +283,30 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc dropTable(model); } + /** + * Refresh THIS node's local schema cache for {@code model} from the declared model, with + * no server RPC. Called on the local-cache-only path + * ({@link StorageManipulationOpt.Flags#isInspectBackend() inspectBackend == false}, i.e. + * {@link StorageManipulationOpt#withoutSchemaChange()}), where the cluster main owns + * backend DDL and this node only needs an up-to-date entry to translate its own + * reads/writes. Backends with a local schema cache (BanyanDB) override to (re)derive and + * overwrite the entry — overwrite, not fill-if-absent, so a reshape that + * re-fires {@link #whenCreating} replaces a now-stale entry instead of leaving the old + * shape in place. Default no-op: backends without a local cache (ES, JDBC) have nothing to + * refresh. + */ + protected void populateLocalCacheOnly(Model model, StorageManipulationOpt opt) throws StorageException { + } + + /** + * Drop THIS node's local schema-cache entry for a removed {@code model}, with no server + * RPC. Called from {@link #whenRemoving} on every node so a removed model never leaves a + * stale translation in an otherwise insert-only cache. Default no-op: backends without a + * local schema cache (ES, JDBC) have nothing to evict; BanyanDB overrides. + */ + protected void evictLocalCache(Model model) { + } + @Getter @Setter public abstract static class InstallInfo { diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index 9b6d8cb04a3e..01f526e53a65 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -18,12 +18,15 @@ package org.apache.skywalking.oap.server.core.storage.model; +import java.time.Duration; import java.util.Collections; import java.util.List; import java.util.concurrent.CopyOnWriteArrayList; import java.util.concurrent.atomic.AtomicLong; import lombok.Builder; import lombok.Getter; +import lombok.Setter; +import org.apache.skywalking.oap.server.core.storage.StorageException; /** * Per-call policy + outcome for a storage model manipulation — threaded through the @@ -307,6 +310,37 @@ public static StorageManipulationOpt withoutSchemaChange() { return new StorageManipulationOpt(Mode.WITHOUT_SCHEMA_CHANGE); } + /** + * {@link Mode#WITH_SCHEMA_CHANGE} but with the post-install schema fence DEFERRED. The + * installer records each resource's {@code mod_revision} without fencing, then registers a + * single flush via {@link #setDeferredFence(DeferredFence)}; the caller runs that flush ONCE + * with {@link #runDeferredFence()} after the whole apply (e.g. a multi-rule file) so the + * bundle waits on one barrier instead of one fence per metric/downsampling. All flags are + * identical to {@link #withSchemaChange()} — only the create/update fence is batched; drops + * still fence inline. + */ + public static StorageManipulationOpt withSchemaChangeDeferredFence() { + final StorageManipulationOpt opt = new StorageManipulationOpt(Mode.WITH_SCHEMA_CHANGE); + opt.deferFence = true; + return opt; + } + + /** + * {@link #withSchemaChangeDeferredFence()} with an explicit batched-fence timeout. Used by the + * runtime-rule operator apply, which fences on a generous cluster-propagation budget (default + * 3 min, configurable) instead of the installer's short inline default — the apply is async + + * progress-queryable, so a long single wait is affordable. The inline/static/delete fence paths + * (which never set a timeout here) keep the installer's short constant. + * + * @param timeout the batched-fence wait; passed to the backend via {@link #getFenceTimeoutMs()}. + * @return a deferred-fence opt carrying {@code timeout}. + */ + public static StorageManipulationOpt withSchemaChangeDeferredFence(final Duration timeout) { + final StorageManipulationOpt opt = withSchemaChangeDeferredFence(); + opt.fenceTimeoutMs = timeout == null ? 0L : timeout.toMillis(); + return opt; + } + /** * True for {@link Mode#WITH_SCHEMA_CHANGE}. The on-demand operator workflow — drops, * updates, and reshapes are permitted because the caller explicitly asked for them. @@ -380,6 +414,115 @@ public long getMaxModRevision() { return maxModRevision.get(); } + /** + * A storage-backend schema fence whose execution is deferred to the end of a batched + * apply. The backend installer (e.g. BanyanDB) registers one on a + * {@link #withSchemaChangeDeferredFence()} opt instead of fencing per resource; the apply + * orchestration runs it once via {@link #runDeferredFence()}. Implemented as a closure in + * the storage plugin so core stays backend-agnostic (same pattern as the local-cache + * populator). A timeout inside the fence is a non-fatal WARN; only a barrier transport + * error surfaces as {@link StorageException}. + */ + @FunctionalInterface + public interface DeferredFence { + void await() throws StorageException; + } + + /** + * True only for {@link #withSchemaChangeDeferredFence()}. The installer reads this to skip + * the per-resource create/update fence and register a single {@link DeferredFence} instead. + */ + @Getter + private boolean deferFence = false; + + private volatile DeferredFence deferredFence; + + /** + * Batched-fence wait in millis, or {@code 0} (the default) meaning "use the backend installer's + * own short constant". Set only by {@link #withSchemaChangeDeferredFence(Duration)} on the + * runtime-rule operator path; the backend reads it when running the deferred fence so the + * inline/static/delete paths (which leave it {@code 0}) keep the short timeout. + */ + @Getter + private long fenceTimeoutMs = 0L; + + /** + * True when the CALLER (the apply orchestrator) runs {@link #runDeferredFence()} itself — + * typically on a background thread after the durable commit — rather than the backend installer + * running it inline at the end of the apply. The runtime-rule REST apply sets this so the long + * (3-min) fence does not block the apply / hold peers suspended; the reconciler tick leaves it + * {@code false} so the installer keeps fencing inline with the short timeout. + */ + @Getter + @Setter + private boolean fenceRunByCaller = false; + + /** + * Outcome of a deferred fence, recorded by the backend so the orchestrator can mark + * {@code APPLIED} vs {@code DEGRADED}-with-laggards after {@link #runDeferredFence()} returns. + */ + public static final class FenceOutcome { + @Getter + private final boolean applied; + @Getter + private final List laggardNodeIds; + + public FenceOutcome(final boolean applied, final List laggardNodeIds) { + this.applied = applied; + this.laggardNodeIds = laggardNodeIds == null + ? Collections.emptyList() + : Collections.unmodifiableList(laggardNodeIds); + } + } + + /** Recorded by the backend during {@link #runDeferredFence()}; read by the orchestrator after. + * Null when no deferred fence ran (no DDL) or the backend records no outcome. */ + @Getter + @Setter + private volatile FenceOutcome fenceOutcome; + + /** + * Register the single fence to run after the batched apply completes. Idempotent — the + * installer may call it once per resource; the latest (equivalent) closure wins. No-op + * carrier for backends without a revision concept (they never call it). + */ + public void setDeferredFence(final DeferredFence fence) { + this.deferredFence = fence; + } + + /** + * Run the registered {@link DeferredFence} once, if any. Called by the apply orchestration + * after all DDL for the batch is fired. No-op when nothing was registered (peer/no-change + * applies, or non-BanyanDB backends). + * + *

One-shot. A single reconciler tick reuses ONE opt across every rule + * file ({@code RuleSync#runOnce}), calling this once per file. The closure + accumulated + * {@link #maxModRevision} are always reset (even when this file performed no + * DDL and registered no closure), so the next file neither re-runs this file's stale fence nor + * inherits this file's revision — each file fences on its own DDL only. (Drop revisions that a + * later commit-tail records on a shared opt are inline-fenced at drop time and benign here: the + * next file's own create revision is monotonically higher, so it dominates the fence.) The reset + * is in a {@code finally} so a fence transport failure still isolates the next file; the closure + * reads {@link #getMaxModRevision()} during {@code await()}, so it is reset only after. + * + *

{@link #fenceOutcome} is cleared before the fence runs (so a shared tick opt + * starts each file clean) and the backend sets it during the run; it is intentionally + * NOT cleared afterward so the caller can read it (the runtime-rule orchestrator reads it to + * decide {@code APPLIED} vs {@code DEGRADED}). + */ + public void runDeferredFence() throws StorageException { + final DeferredFence fence = this.deferredFence; + this.deferredFence = null; + this.fenceOutcome = null; + try { + if (fence != null) { + fence.await(); + } + } finally { + maxModRevision.set(0L); + } + } + /** * Append a per-resource outcome. Called by the installer as it examines each * underlying storage resource. diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java index d9cda58cd7ae..8e91b28f0470 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java @@ -26,6 +26,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertThrows; import static org.junit.jupiter.api.Assertions.assertTimeoutPreemptively; import static org.junit.jupiter.api.Assertions.assertTrue; import static org.mockito.Mockito.mock; @@ -102,29 +103,177 @@ void withSchemaChangeSkipsCreateWhenResourceAlreadyExists() throws StorageExcept "withSchemaChange must not re-create a resource that already exists"); } + @Test + void noInitDeferLoopRetriesTransientProbeErrorInsteadOfCrashing() { + RunningMode.setMode("no-init"); + // The first existence probe throws a transient StorageException (mimicking a BanyanDB + // cluster data node still Init-ing); the next probe reports the resource present. + final RecordingInstaller installer = new RecordingInstaller(true /* present after transient */, + 1 /* one transient probe failure */, true /* retryable probe failure */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_transient"); + + // Must NOT propagate the transient (which would escape whenCreating and crash-loop the pod); + // must retry in-loop, then return on the defer path without creating. 10s covers the 3s sleep. + assertTimeoutPreemptively(Duration.ofSeconds(10), () -> + installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent())); + assertEquals(0, installer.createTableCalls, + "a transient probe error must be retried, then defer to the init node without creating"); + assertTrue(installer.probeCalls >= 2, + "the loop must probe again after the transient instead of escaping on the first throw"); + } + + @Test + void noInitDeferLoopPropagatesNonRetryableProbeError() { + RunningMode.setMode("no-init"); + final RecordingInstaller installer = new RecordingInstaller(true /* unused */, + 1 /* one probe failure */, false /* permanent/non-retryable */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_bad_model"); + + assertThrows(StorageException.class, + () -> installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent()), + "permanent model/config probe failures must not be converted into an infinite no-init wait"); + assertEquals(1, installer.probeCalls, + "a non-retryable failure must escape without sleeping and probing again"); + assertEquals(0, installer.createTableCalls); + } + + @Test + void noInitDeferLoopPropagatesInterruptedSleep() { + RunningMode.setMode("no-init"); + final RecordingInstaller installer = new RecordingInstaller(false /* resource absent */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_wait_interrupted"); + + Thread.currentThread().interrupt(); + try { + assertThrows(StorageException.class, + () -> installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent()), + "an interrupted no-init wait must fail fast so shutdown can proceed"); + assertTrue(Thread.currentThread().isInterrupted(), + "the interrupt flag must be restored for upstream shutdown handling"); + } finally { + Thread.interrupted(); + } + } + + @Test + void withoutSchemaChangePopulatesLocalCacheAndIssuesNoBackendRpc() throws StorageException { + // Peer reconciler tick (inspectBackend=false): the installer must refresh the local + // schema cache (so a reshape re-add overwrites a now-stale entry) WITHOUT any backend + // existence probe or create. This is the C-1 fix for the worker-without-cache / + // stale-cache desync. + final RecordingInstaller installer = new RecordingInstaller(false /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_peer"); + + installer.whenCreating(model, StorageManipulationOpt.withoutSchemaChange()); + + assertEquals(1, installer.populateLocalCacheCalls, + "a withoutSchemaChange (peer) apply must refresh the local schema cache"); + assertEquals(0, installer.probeCalls, + "a withoutSchemaChange (peer) apply must issue zero backend existence probes"); + assertEquals(0, installer.createTableCalls, + "a withoutSchemaChange (peer) apply must never create backend resources"); + } + + @Test + void whenRemovingPeerEvictsLocalCacheWithoutDroppingBackend() throws StorageException { + // Peer remove (dropOnRemoval=false): the backend drop is the main's job, but the peer + // must still evict its own cache entry so a removed model leaves no stale translation. + final RecordingInstaller installer = new RecordingInstaller(true /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_remove_peer"); + + installer.whenRemoving(model, StorageManipulationOpt.withoutSchemaChange()); + + assertEquals(1, installer.evictLocalCacheCalls, + "a peer remove must evict the local schema cache entry"); + assertEquals(0, installer.dropTableCalls, + "a peer remove (dropOnRemoval off) must not drop backend resources"); + } + + @Test + void whenRemovingMainDropsBackendThenEvictsLocalCache() throws StorageException { + // Main remove (withSchemaChange, dropOnRemoval=true): drop the backend AND evict the + // local cache so the insert-only registry does not keep a tombstoned model's entry. + final RecordingInstaller installer = new RecordingInstaller(true /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_remove_main"); + + installer.whenRemoving(model, StorageManipulationOpt.withSchemaChange()); + + assertEquals(1, installer.dropTableCalls, + "a main remove must drop the backend resource"); + assertEquals(1, installer.evictLocalCacheCalls, + "a main remove must evict the local schema cache entry after the drop"); + } + /** Minimal concrete {@link ModelInstaller} that records createTable calls and reports a * fixed existence result, so the base whenCreating branching can be exercised without a - * real storage backend. */ + * real storage backend. Optionally throws a transient {@link StorageException} on the first + * {@code transientProbeFailures} existence probes to exercise the no-init defer-loop retry. */ private static final class RecordingInstaller extends ModelInstaller { private final boolean resourcePresent; + private final int transientProbeFailures; + private final boolean retryableProbeFailure; + private int probeCalls; private int createTableCalls; + private int populateLocalCacheCalls; + private int evictLocalCacheCalls; + private int dropTableCalls; private RecordingInstaller(final boolean resourcePresent) { + this(resourcePresent, 0, false); + } + + private RecordingInstaller(final boolean resourcePresent, final int transientProbeFailures) { + this(resourcePresent, transientProbeFailures, true); + } + + private RecordingInstaller(final boolean resourcePresent, final int transientProbeFailures, + final boolean retryableProbeFailure) { super(null, null); this.resourcePresent = resourcePresent; + this.transientProbeFailures = transientProbeFailures; + this.retryableProbeFailure = retryableProbeFailure; } @Override - public InstallInfo isExists(final Model model, final StorageManipulationOpt opt) { + public InstallInfo isExists(final Model model, final StorageManipulationOpt opt) throws StorageException { + if (probeCalls++ < transientProbeFailures) { + throw new StorageException("transient backend error"); + } final TestInstallInfo info = new TestInstallInfo(model); info.setAllExist(resourcePresent); return info; } + @Override + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + return retryableProbeFailure; + } + @Override public void createTable(final Model model) { createTableCalls++; } + + @Override + public void dropTable(final Model model) { + dropTableCalls++; + } + + @Override + protected void populateLocalCacheOnly(final Model model, final StorageManipulationOpt opt) { + populateLocalCacheCalls++; + } + + @Override + protected void evictLocalCache(final Model model) { + evictLocalCacheCalls++; + } } private static final class TestInstallInfo extends ModelInstaller.InstallInfo { diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java new file mode 100644 index 000000000000..a6fd54be1474 --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java @@ -0,0 +1,161 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.storage.model; + +import java.util.concurrent.atomic.AtomicInteger; +import org.apache.skywalking.oap.server.core.storage.StorageException; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Unit coverage for the batched-fence machinery on {@link StorageManipulationOpt}: a + * {@link StorageManipulationOpt#withSchemaChangeDeferredFence()} opt carries the same flags as + * {@link StorageManipulationOpt#withSchemaChange()} but lets the installer register a single + * {@link StorageManipulationOpt.DeferredFence} that the apply orchestration runs ONCE instead of + * one fence per metric/downsampling. + */ +class StorageManipulationOptTest { + + @Test + void deferredFenceOptHasSameFlagsAsWithSchemaChange() { + final StorageManipulationOpt deferred = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final StorageManipulationOpt plain = StorageManipulationOpt.withSchemaChange(); + // Behaviour must be identical except for the batching toggle — same mode/flags so every + // create/update/drop privilege the installer checks is unchanged. + assertEquals(plain.getMode(), deferred.getMode(), + "deferred-fence opt must keep the WITH_SCHEMA_CHANGE mode"); + assertTrue(deferred.isWithSchemaChange(), + "deferred-fence opt must still report withSchemaChange semantics"); + assertTrue(deferred.isDeferFence(), + "deferred-fence opt must flag deferFence"); + assertFalse(plain.isDeferFence(), + "the plain withSchemaChange opt must NOT defer the fence"); + assertFalse(StorageManipulationOpt.withoutSchemaChange().isDeferFence()); + } + + @Test + void runDeferredFenceInvokesRegisteredFenceOnce() throws StorageException { + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger calls = new AtomicInteger(); + opt.setDeferredFence(calls::incrementAndGet); + + opt.runDeferredFence(); + + assertEquals(1, calls.get(), "the registered fence must run exactly once on flush"); + } + + @Test + void runDeferredFenceIsNoOpWhenNothingRegistered() { + // Peer / no-change applies (and non-BanyanDB backends) never register a fence; flushing + // must be a safe no-op, not an NPE. + assertDoesNotThrow(() -> StorageManipulationOpt.withSchemaChangeDeferredFence().runDeferredFence()); + assertDoesNotThrow(() -> StorageManipulationOpt.withoutSchemaChange().runDeferredFence()); + } + + @Test + void runDeferredFencePropagatesStorageException() { + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport error"); + }); + // A barrier transport error must surface so the apply aborts exactly like an inline fence. + assertThrows(StorageException.class, opt::runDeferredFence); + } + + @Test + void runDeferredFenceIsOneShotAcrossFiles() throws StorageException { + // A reconciler tick reuses ONE opt across every rule file. After a file flushes its + // fence, a later file that performed no DDL (registers no new closure) must NOT re-run + // the earlier file's stale fence. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger calls = new AtomicInteger(); + opt.setDeferredFence(calls::incrementAndGet); + + opt.runDeferredFence(); + opt.runDeferredFence(); + + assertEquals(1, calls.get(), "the fence must run once and be cleared, not re-run for the next file"); + } + + @Test + void runDeferredFenceResetsRevisionAfterAwait() throws StorageException { + // The closure must observe this file's accumulated revision DURING await, then the opt + // resets so the next file fences on its own DDL only — not the cumulative max. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(42L); + final AtomicInteger seenDuringAwait = new AtomicInteger(); + opt.setDeferredFence(() -> seenDuringAwait.set((int) opt.getMaxModRevision())); + + opt.runDeferredFence(); + + assertEquals(42L, seenDuringAwait.get(), "the fence must see the recorded revision during await"); + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "the revision must reset after the fence so a later file is not over-fenced"); + } + + @Test + void runDeferredFenceResetsRevisionEvenWithNoClosureRegistered() { + // A no-DDL file on a shared tick opt registers no closure, but a prior file (or its + // commit-tail drops) may have left a revision on the opt. runDeferredFence must still clear + // it so the next file is not over-fenced on a stale revision. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(77L); + + assertDoesNotThrow(opt::runDeferredFence); + + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "a no-closure flush must still reset the accumulated revision (shared-tick isolation)"); + } + + @Test + void runDeferredFenceResetsRevisionEvenWhenFenceThrows() { + // A barrier transport failure on one file must not leave a stale revision that the next + // file would inherit; the reset runs in finally. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(99L); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport error"); + }); + + assertThrows(StorageException.class, opt::runDeferredFence); + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "the revision must reset even when the fence throws"); + } + + @Test + void laterSetDeferredFenceWins() throws StorageException { + // The installer may register the closure once per resource; the latest (equivalent) one + // wins and still runs a single time. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger first = new AtomicInteger(); + final AtomicInteger second = new AtomicInteger(); + opt.setDeferredFence(first::incrementAndGet); + opt.setDeferredFence(second::incrementAndGet); + + opt.runDeferredFence(); + + assertEquals(0, first.get(), "an overwritten fence must not run"); + assertEquals(1, second.get(), "the latest registered fence runs once"); + } +} diff --git a/oap-server/server-starter/src/main/resources/application.yml b/oap-server/server-starter/src/main/resources/application.yml index ec2d3d7bae4f..e02938cc8e60 100644 --- a/oap-server/server-starter/src/main/resources/application.yml +++ b/oap-server/server-starter/src/main/resources/application.yml @@ -718,6 +718,13 @@ receiver-runtime-rule: refreshRulesPeriod: ${SW_RECEIVER_RUNTIME_RULE_REFRESH_RULES_PERIOD:30} # SUSPENDED state self-heal threshold (seconds). selfHealThresholdSeconds: ${SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS:60} + # Timeout (seconds) for the runtime-rule deferred BanyanDB schema fence on the operator REST + # apply: the background wait for every data node to apply a new measure's schema revision. The + # apply returns immediately at FENCING (accepted, not yet durable); the rule row is persisted + # only after the fence confirms, then dispatch resumes. Poll GET /runtime/rule/status for + # fencing -> rolling-out -> applied (or degraded/failed). Inline/static/delete fences keep the + # short 2s constant. + deferredFenceTimeoutSeconds: ${SW_RECEIVER_RUNTIME_RULE_DEFERRED_FENCE_TIMEOUT_SECONDS:180} # DSL Debug API — sampling debugger for MAL / LAL / OAL rules. Read-only OAL # listing under /runtime/oal/*; session control plane under /dsl-debugging/*. diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml index def21fc3086d..add9cdc45522 100644 --- a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml +++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml @@ -52,7 +52,7 @@ metricsRules: exp: banyandb_system_up_time # CPU usage (cores). process_* rides on every container including lifecycle. - name: cpu_usage - exp: process_cpu_seconds_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: process_cpu_seconds_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # resident memory (bytes). Raw gauge, present on all containers. - name: rss_memory exp: process_resident_memory_bytes @@ -75,45 +75,45 @@ metricsRules: exp: banyandb_system_disk.tagEqual('kind','used_percent').avg(['cluster','pod_name','container_name','node_role','node_type','path']) * 100 # network throughput (bytes/s) by interface name. - name: network_recv - exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S') + exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT1M') - name: network_sent - exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S') + exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT1M') # Go runtime. - name: goroutines exp: go_goroutines # average GC pause (s) = rate(Σpause) / rate(Σcount). go_gc_duration_seconds is a summary (no buckets), # so this ratio of _sum/_count is the only valid average — do not apply histogram_percentile to it. - name: gc_pause_avg - exp: go_gc_duration_seconds_sum.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(go_gc_duration_seconds_count.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + exp: go_gc_duration_seconds_sum.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(go_gc_duration_seconds_count.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) - name: heap_inuse exp: go_memstats_heap_inuse_bytes - name: heap_next_gc exp: go_memstats_next_gc_bytes - name: alloc_rate - exp: go_memstats_alloc_bytes_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: go_memstats_alloc_bytes_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # ---- Liaison only (front door; the UI gates these on container_name == liaison) ---- # query rate (req/s) by data-model service (measure/stream/trace/property). method literal is "query". - name: liaison_query_rate - exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type','service']).rate('PT15S') + exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type','service']).rate('PT1M') # gRPC errors/min. Three liaison-side error families (mirrors the Grafana "gRPC Error Rate" panel, # which sums total_err + registry_err + stream_msg_received_err). All lazily registered -> empty on a # healthy cluster; each pre-aggregated to the same label set before '+'. - name: liaison_grpc_error_rate - exp: (banyandb_liaison_grpc_total_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_registry_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) * 60 + exp: (banyandb_liaison_grpc_total_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_liaison_grpc_total_registry_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) * 60 # registry operation rate (req/s): schema registry ops on the liaison front door. total_started is # query-only on the wire, so the former tagNotEqual('method','query') term was empty and is dropped; # registry_started carries the non-query op count. - name: liaison_registry_op_rate - exp: banyandb_liaison_grpc_total_registry_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_liaison_grpc_total_registry_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # write rate (writes/s) seen at the liaison front door. group label dropped (instance-level total). - name: liaison_write_rate - exp: banyandb_measure_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # tier-2 publish pipeline (liaison -> data): throughput by operation, bytes/s, and p99 send latency. - name: liaison_publish_throughput - exp: banyandb_queue_pub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_pub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: liaison_publish_bytes - exp: banyandb_queue_pub_sent_bytes.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_queue_pub_sent_bytes.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: liaison_publish_latency_p99 exp: banyandb_queue_pub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # tier-2 publish, batch granularity (BanyanDB #1169): batches published/s by operation and the batch @@ -121,7 +121,7 @@ metricsRules: # BUILD-GATED: _batch_finished/_batch_latency are absent on current builds -> emit nothing until the # shipped BanyanDB build registers them. - name: liaison_publish_batch_throughput - exp: banyandb_queue_pub_total_batch_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_pub_total_batch_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: liaison_publish_batch_latency_p99 exp: banyandb_queue_pub_total_batch_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # liaison write-queue (wqueue) depth: pending records buffered at the front door before publish. @@ -144,34 +144,34 @@ metricsRules: exp: banyandb_measure_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # merge-loop iterations/s. - name: data_merge_file_rate - exp: banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # avg parts merged per merge loop on the file path (matches Grafana = rate(merged_parts{type=file}) / # rate(merge_loop_started)). type='file' is data-only on the wire (liaison emits only type='mem'). - name: data_merge_file_partitions - exp: banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_stream_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_trace_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + exp: banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_stream_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_trace_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) # avg file-merge latency (ms) per merge loop. - name: data_merge_file_latency - exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_stream_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_trace_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S'))) * 1000 + exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_stream_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_trace_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M'))) * 1000 # inverted-index (series) write rate / term-search rate / total docs. *_inverted_index_total_* are # # TYPE=gauge but cumulative, so rate() yields a per-window delta. Stream's series index is the # storage scope (stream_storage_*); the tst scope is reported separately below. Trace's series index # (trace_storage_*) is included so trace series writes/docs are not silently dropped. - name: data_series_write_rate - exp: banyandb_measure_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_series_term_search_rate - exp: banyandb_measure_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_total_series exp: banyandb_measure_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_storage_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_storage_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # stream time-series-table (tst) index, distinct from the stream series (storage) index above. - name: data_stream_tst_write_rate - exp: banyandb_stream_tst_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_stream_tst_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_stream_tst_term_search_rate - exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_stream_tst_total_docs exp: banyandb_stream_tst_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # subscribe-side queue (data receives from liaison): throughput by operation + p99 latency. - name: data_queue_sub_throughput - exp: banyandb_queue_sub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_sub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: data_queue_sub_latency_p99 exp: banyandb_queue_sub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # subscribe-side per-message throughput (BanyanDB #1169). A data node ingests writes via the @@ -181,7 +181,7 @@ metricsRules: # intentionally not modeled here. Batch-level granularity lives on the liaison's publish side # (liaison_publish_batch_throughput / liaison_publish_batch_latency_p99 above). - name: data_queue_sub_message_throughput - exp: banyandb_queue_sub_total_message_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_sub_total_message_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') # retention disk-usage % per data-model scope (0-100 gauge). Kept per scope rather than summed (a sum # of three percentages is meaningless). Not in the upstream Grafana boards; a SkyWalking addition. - name: data_retention_measure_disk_usage_percent diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml index 97c6cac8f6c8..45d505d8f43c 100644 --- a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml +++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml @@ -29,11 +29,11 @@ metricsRules: # cluster writes/s across the three data-model scopes (measure, stream, trace). Each scope's # write counter is collapsed to one per-cluster series before `+`. - name: cluster_write_rate - exp: (banyandb_measure_total_written.sum(['cluster']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster']).rate('PT15S')) + exp: (banyandb_measure_total_written.sum(['cluster']).rate('PT1M') + banyandb_stream_tst_total_written.sum(['cluster']).rate('PT1M') + banyandb_trace_tst_total_written.sum(['cluster']).rate('PT1M')) # cluster queries/s. `service` on this family is BanyanDB's data-model facet # (measure/stream/trace/property), not a SkyWalking service; method literal is "query". - name: cluster_query_rate - exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster']).rate('PT15S') + exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster']).rate('PT1M') # cluster errors/min. The seven liaison-side error families mirror the upstream Grafana # "Error Rate" stat (grafana-fodc-workload.json). Each is pre-aggregated to ['cluster'] # BEFORE `+` because their wire label sets differ (stream_msg_received_err carries @@ -42,7 +42,7 @@ metricsRules: # registered and emit no series; MAL treats an empty operand as the additive identity, so the # sum emits from whatever has fired and renders absent-as-0 when nothing has. - name: cluster_error_rate - exp: (banyandb_liaison_grpc_total_err.sum(['cluster']).rate('PT15S') + banyandb_liaison_grpc_total_registry_err.sum(['cluster']).rate('PT15S') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster']).rate('PT15S') + banyandb_queue_pub_total_err.sum(['cluster']).rate('PT15S') + banyandb_measure_total_sync_loop_err.sum(['cluster']).rate('PT15S') + banyandb_stream_tst_total_sync_loop_err.sum(['cluster']).rate('PT15S') + banyandb_trace_tst_total_sync_loop_err.sum(['cluster']).rate('PT15S')) * 60 + exp: (banyandb_liaison_grpc_total_err.sum(['cluster']).rate('PT1M') + banyandb_liaison_grpc_total_registry_err.sum(['cluster']).rate('PT1M') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster']).rate('PT1M') + banyandb_queue_pub_total_err.sum(['cluster']).rate('PT1M') + banyandb_measure_total_sync_loop_err.sum(['cluster']).rate('PT1M') + banyandb_stream_tst_total_sync_loop_err.sum(['cluster']).rate('PT1M') + banyandb_trace_tst_total_sync_loop_err.sum(['cluster']).rate('PT1M')) * 60 # live container count by role. count(['cluster','container_name','pod_name']) groups by all # three then re-groups excluding the last key (pod_name), yielding one sample per # (cluster, container_name) whose value = distinct pod_name count -> data=N, liaison=M. diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index cabfb75276cd..4318806cab14 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -19,9 +19,11 @@ package org.apache.skywalking.oap.server.storage.plugin.banyandb; import io.grpc.Status; +import java.time.Duration; import java.util.HashMap; import java.util.HashSet; import java.util.List; +import java.util.Locale; import java.util.Map; import java.util.Set; import java.util.stream.Collectors; @@ -40,7 +42,6 @@ import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRuleBinding; import org.apache.skywalking.banyandb.schema.v1.BanyandbSchema.SchemaKey; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.TopNAggregation; -import java.time.Duration; import org.apache.skywalking.library.banyandb.v1.client.BanyanDBClient; import org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher; import org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBException; @@ -103,6 +104,51 @@ public class BanyanDBIndexInstaller extends ModelInstaller { public BanyanDBIndexInstaller(Client client, ModuleManager moduleManager, BanyanDBStorageConfig config) { super(client, moduleManager); this.config = config; + // Let read/persist paths self-heal a missing local schema entry (MetadataRegistry.repopulateLocally): + // re-derive the model's Schema locally with zero server RPC via the same primitive the peer + // boot path uses. This closes the " is not registered" flood that arises when a + // withoutSchemaChange peer apply or a runtime-rule bundled fall-over rebuilds the dispatch + // worker but skips the populate. DownSamplingConfigService is resolved lazily per call — a + // self-heal only fires post-boot, when CoreModule is long up. + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(model -> { + final DownSamplingConfigService downSamplingConfigService = moduleManager.find(CoreModule.NAME) + .provider() + .getService(DownSamplingConfigService.class); + registerLocallyByKind(model, downSamplingConfigService); + }); + } + + @Override + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + Throwable cause = e.getCause(); + while (cause != null) { + if (cause instanceof BanyanDBException) { + return isTransientBanyanDBProbeFailure((BanyanDBException) cause); + } + cause = cause.getCause(); + } + return false; + } + + private static boolean isTransientBanyanDBProbeFailure(final BanyanDBException e) { + final Status.Code code = e.getStatus(); + if (Status.Code.UNAVAILABLE.equals(code) + || Status.Code.DEADLINE_EXCEEDED.equals(code) + || Status.Code.CANCELLED.equals(code) + || Status.Code.RESOURCE_EXHAUSTED.equals(code) + || Status.Code.ABORTED.equals(code)) { + return true; + } + if (!Status.Code.UNKNOWN.equals(code)) { + return false; + } + final String message = String.valueOf(e.getMessage()).toLowerCase(Locale.ROOT); + return message.contains("client connection is closing") + || message.contains("connection is closing") + || message.contains("transport is closing") + || message.contains("connection refused") + || message.contains("connection reset") + || message.contains("broken pipe"); } @Override @@ -242,10 +288,65 @@ public InstallInfo isExists(Model model, StorageManipulationOpt opt) throws Stor */ private void fenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, final String context) throws BanyanDBException { + if (opt.isDeferFence()) { + // Batched apply: do NOT fence per resource. Register a single flush that the apply + // orchestration runs once after all DDL is fired (StorageManipulationOpt#runDeferredFence), + // so a multi-rule file waits on ONE barrier on the cumulative max revision instead of + // one fence per metric/downsampling. The closure reads opt.getMaxModRevision() at flush + // time, after every resource has recorded its revision. + opt.setDeferredFence(() -> { + try { + doDeferredFence(client, opt, "batched apply"); + } catch (final BanyanDBException e) { + throw new StorageException("batched schema fence failed", e); + } + }); + return; + } + doFenceOnRevision(client, opt, context); + } + + private void doFenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, + final String context) throws BanyanDBException { + doFenceOnRevisionValue(client, opt.getMaxModRevision(), context); + } + + /** + * The deferred (batched) fence the runtime-rule apply runs once after all DDL. Unlike the inline + * {@link #doFenceOnRevisionValue}, this (1) honors the opt's configured timeout + * ({@link StorageManipulationOpt#getFenceTimeoutMs()}, the runtime-rule 3-min budget) instead of + * the short inline {@link #FENCE_TIMEOUT}, and (2) records the outcome (applied + laggard node + * ids) on the opt so the orchestrator can mark {@code APPLIED} vs {@code DEGRADED} and gate the + * dispatch resume on this fence. A laggard timeout is still a non-fatal WARN. + */ + private void doDeferredFence(final BanyanDBClient client, final StorageManipulationOpt opt, + final String context) throws BanyanDBException { final long rev = opt.getMaxModRevision(); if (rev <= 0L) { return; } + final Duration timeout = opt.getFenceTimeoutMs() > 0L + ? Duration.ofMillis(opt.getFenceTimeoutMs()) + : FENCE_TIMEOUT; + final SchemaWatcher.Result result = client.getSchemaWatcher().awaitRevisionApplied(rev, timeout); + if (!result.isApplied()) { + log.warn("BanyanDB schema-watch fence did NOT confirm revision {} within {} ms for {}; " + + "proceeding anyway. Laggards: {}", rev, timeout.toMillis(), context, result.getLaggards()); + final List laggardIds = result.getLaggards().stream() + .map(l -> l.getNode()) + .collect(Collectors.toList()); + opt.setFenceOutcome(new StorageManipulationOpt.FenceOutcome(false, laggardIds)); + } else { + log.debug("BanyanDB schema-watch fence confirmed revision {} for {}", rev, context); + opt.setFenceOutcome(new StorageManipulationOpt.FenceOutcome(true, List.of())); + } + } + + private void doFenceOnRevisionValue(final BanyanDBClient client, final long rev, + final String context) throws BanyanDBException { + if (rev <= 0L) { + return; + } final SchemaWatcher.Result result = client.getSchemaWatcher().awaitRevisionApplied(rev, FENCE_TIMEOUT); if (!result.isApplied()) { log.warn("BanyanDB schema-watch fence did NOT confirm revision {} within {} ms for {}; " @@ -412,6 +513,13 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc final String group = metadata.getGroup(); final String name = metadata.name(); log.info("drop BanyanDB schema kind={} {}:{}", metadata.getKind(), group, name); + // Tombstone revision of THIS drop's primary resource only — used to decide the + // deletion fence. It must be the primary's own revision, NOT opt.getMaxModRevision(): + // a single opt is reused across many files in a tick, so the cumulative max can carry + // an unrelated earlier create/binding revision and make a tombstone-less delete + // (primary revision 0) skip the AwaitSchemaDeleted fallback. 0 for trace/property, + // whose delete RPCs have no revision-returning variant — those always key-fence. + long primaryDeleteRev = StorageManipulationOpt.DEFAULT_MOD_REVISION; switch (metadata.getKind()) { case MEASURE: // Drop the TopN aggregations first (if any), then index rule bindings, index rules, then the measure. @@ -423,11 +531,13 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc } } dropIndexRuleBindingsBestEffort(client, group, name, opt); - opt.recordModRevision(client.deleteMeasureWithRevision(group, name)); + primaryDeleteRev = client.deleteMeasureWithRevision(group, name); + opt.recordModRevision(primaryDeleteRev); break; case STREAM: dropIndexRuleBindingsBestEffort(client, group, name, opt); - opt.recordModRevision(client.deleteStreamWithRevision(group, name)); + primaryDeleteRev = client.deleteStreamWithRevision(group, name); + opt.recordModRevision(primaryDeleteRev); break; case TRACE: dropIndexRuleBindingsBestEffort(client, group, name, opt); @@ -441,9 +551,9 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc "dropTable unsupported kind=" + metadata.getKind() + " for model " + model.getName()); } // Fence: prefer the revision-based wait when the server recorded a tombstone - // mod_revision; otherwise fall back to AwaitSchemaDeleted keyed on the - // primary resource so callers get a hard "removed everywhere" signal. - fenceOnRevisionOrDeletion(client, opt, metadata, "dropTable:" + model.getName()); + // mod_revision for THIS resource; otherwise fall back to AwaitSchemaDeleted keyed on + // the primary resource so callers get a hard "removed everywhere" signal. + fenceOnRevisionOrDeletion(client, metadata, primaryDeleteRev, "dropTable:" + model.getName()); } catch (BanyanDBException ex) { if (Status.Code.NOT_FOUND.equals(ex.getStatus())) { log.info("BanyanDB schema {} already absent on drop (idempotent)", model.getName()); @@ -454,22 +564,32 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc } /** - * Prefer {@code AwaitRevisionApplied(maxRev)} when the registry returned a - * non-zero tombstone revision; otherwise fall back to + * Prefer {@code AwaitRevisionApplied(primaryDeleteRev)} when the registry returned a + * non-zero tombstone revision for the primary resource; otherwise fall back to * {@code AwaitSchemaDeleted(key)} keyed on the primary resource. The fallback * exists because {@code mod_revision == 0} on a delete response means the server * did not record a tombstone — the revision-based fence cannot observe a * deletion that didn't get one. + * + *

The decision keys on {@code primaryDeleteRev} — the primary resource's own delete + * revision — NOT {@code opt.getMaxModRevision()}. A single opt is shared across every file + * in a reconciler tick, so its cumulative max can hold an unrelated earlier create/binding + * revision; using it here would make a tombstone-less primary delete take the revision + * branch and silently skip {@code AwaitSchemaDeleted}. Because the primary delete is issued + * last (after TopN + bindings), its revision is the highest of this drop and fencing on it + * also covers the earlier lower-revision deletes of the same drop. */ - private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final StorageManipulationOpt opt, + private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final MetadataRegistry.SchemaMetadata metadata, + final long primaryDeleteRev, final String context) throws BanyanDBException { - final long rev = opt.getMaxModRevision(); - if (rev > 0L) { - fenceOnRevision(client, opt, context); + if (primaryDeleteRev > 0L) { + // Drops fence inline (never deferred): a deletion's visibility is per-key and must + // not ride a batched revision flush — drops stay correct even under a deferFence opt. + doFenceOnRevisionValue(client, primaryDeleteRev, context); return; } - // mod_revision was 0 on every delete — fall back to key-based deletion fence. + // mod_revision was 0 on the primary delete — fall back to key-based deletion fence. final String kind; switch (metadata.getKind()) { case MEASURE: @@ -1290,6 +1410,24 @@ private void checkTopNAggregation(Model model, BanyanDBClient client, StorageMan * schema cache the local DAOs read from so this node can translate Model ↔ BanyanDB * proto for sample ingest / queries. */ + @Override + protected void populateLocalCacheOnly(final Model model, final StorageManipulationOpt opt) { + // inspectBackend=false (peer / local-cache-only tick): the main owns the backend + // resource; this node only (re)derives its local MetadataRegistry entry so its DAOs + // can translate this model. RPC-free, and an overwrite via register*Model — keeps a + // peer's cache in lockstep with a reshaped model that re-fires whenCreating, which the + // read-side self-heal (fill-if-absent only) cannot do. + final DownSamplingConfigService downSamplingConfigService = moduleManager.find(CoreModule.NAME) + .provider() + .getService(DownSamplingConfigService.class); + registerLocallyByKind(model, downSamplingConfigService); + } + + @Override + protected void evictLocalCache(final Model model) { + MetadataRegistry.INSTANCE.evict(model); + } + private void registerLocallyByKind(final Model model, final DownSamplingConfigService downSamplingConfigService) { if (model.isTimeSeries()) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java index 09bd256e14a4..98763aa3e115 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java @@ -40,7 +40,13 @@ public BanyanDBNoneStreamDAO(BanyanDBStorageClient client, StorageBuilder registry = new HashMap<>(); + // ConcurrentHashMap (not HashMap): boot populates single-threaded, but the self-heal path + // (repopulateLocally) writes from persistence/query threads concurrently with reads. + private final Map registry = new ConcurrentHashMap<>(); + + /** + * Re-derive and locally register a model's BanyanDB {@link Schema} with NO server RPC. + * Registered once by the active {@code BanyanDBIndexInstaller} at boot and invoked by + * {@link #repopulateLocally(Model)} when a read path finds the cache empty for a model whose + * dispatch worker is already live — e.g. a {@code withoutSchemaChange} peer apply or a + * runtime-rule bundled fall-over rebuilt the worker but skipped the local populate. The + * {@code Model} is always known locally and its schema is a pure local derivation, so such a + * miss is always re-derivable without touching the backend. + */ + @FunctionalInterface + public interface LocalSchemaPopulator { + void populateLocally(Model model); + } + + private volatile LocalSchemaPopulator localSchemaPopulator; + + /** Register the boot-time, RPC-free local schema populator. Called once by the active installer. */ + public void registerLocalSchemaPopulator(final LocalSchemaPopulator populator) { + this.localSchemaPopulator = populator; + } + + /** + * Best-effort, RPC-free re-derivation of a model's local {@link Schema} so a read/persist path + * can self-heal a missing cache entry instead of throwing {@code " is not registered"} + * forever (the registry never evicts, so an entry that was never populated on this node stays + * absent otherwise). No-op when no populator is registered (e.g. non-BanyanDB unit tests). + * Swallows derivation exceptions so a self-heal attempt is never worse than the pre-existing + * throw — the caller re-reads and surfaces its own not-registered error if still absent. + */ + public void repopulateLocally(final Model model) { + final LocalSchemaPopulator populator = this.localSchemaPopulator; + if (populator == null) { + return; + } + try { + populator.populateLocally(model); + } catch (final Exception e) { + log.debug("local schema self-heal re-derivation failed for model [{}]; " + + "caller will surface the not-registered error", model.getName(), e); + } + } public StreamModel registerStreamModel(Model model, BanyanDBStorageConfig config) { final SchemaMetadata schemaMetadata = parseMetadata(model, config, null); @@ -398,6 +443,24 @@ public Schema findMetadata(final Model model) { return findMetricMetadata(model.getName(), model.getDownsampling()); } + /** + * Drop the local {@link Schema} cache entry for {@code model}, keyed exactly as + * {@link #findMetadata(Model)} looks it up. The registry is otherwise insert-only, so this + * is the one path that removes an entry — invoked from {@code ModelInstaller.whenRemoving} + * on a runtime-rule hot-remove / reshape so a stale translation never outlives the model. + */ + public void evict(final Model model) { + final String key; + if (!model.isTimeSeries() || model.isRecord()) { + key = model.getName(); + } else { + key = SchemaMetadata.formatName(model.getName(), model.getDownsampling()); + } + if (registry.remove(key) != null) { + log.debug("evicted local BanyanDB schema cache entry [{}]", key); + } + } + private FieldSpec parseFieldSpec(ModelColumn modelColumn) { String colName = modelColumn.getColumnName().getStorageName(); if (String.class.equals(modelColumn.getType())) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java index 4f1ff1928e88..d52c70ec5ffd 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java @@ -63,12 +63,29 @@ public BanyanDBMetricsDAO(BanyanDBStorageClient client, StorageBuilder this.storageBuilder = storageBuilder; } - @Override - public List multiGet(Model model, List metrics) throws IOException { + /** + * Resolve the model's BanyanDB schema, self-healing a missing local entry once before failing. + * A null here means this node has a live persist worker for the model but its schema cache was + * never populated (or lost) — typically a {@code withoutSchemaChange} peer apply or a + * runtime-rule bundled fall-over that rebuilt the worker without the populate. Re-derive the + * schema locally with no server RPC and re-read; throw only if the entry is still absent, so + * a genuinely unknown model still fails fast instead of flooding forever. + */ + private MetadataRegistry.Schema resolveSchema(Model model) throws IOException { MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); if (schema == null) { - throw new IOException(model.getName() + " is not registered"); + MetadataRegistry.INSTANCE.repopulateLocally(model); + schema = MetadataRegistry.INSTANCE.findMetadata(model); + if (schema == null) { + throw new IOException(model.getName() + " is not registered"); + } } + return schema; + } + + @Override + public List multiGet(Model model, List metrics) throws IOException { + MetadataRegistry.Schema schema = resolveSchema(model); final Map> seriesIDColumns = new HashMap<>(); if (model.getBanyanDBModelExtension().isIndexMode()) { seriesIDColumns.put(ID, new ArrayList<>()); @@ -144,10 +161,7 @@ protected void apply(MeasureQuery query) { @Override public InsertRequest prepareBatchInsert(Model model, Metrics metrics, SessionCacheCallback callback) throws IOException { - MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); - if (schema == null) { - throw new IOException(model.getName() + " is not registered"); - } + MetadataRegistry.Schema schema = resolveSchema(model); MeasureWrite measureWrite = getClient().createMeasureWrite(schema.getMetadata().getGroup(), // group name schema.getMetadata().name(), // measure-name TimeBucket.getTimestamp(metrics.getTimeBucket(), model.getDownsampling())); // timestamp @@ -161,10 +175,7 @@ public InsertRequest prepareBatchInsert(Model model, Metrics metrics, SessionCac @Override public UpdateRequest prepareBatchUpdate(Model model, Metrics metrics, SessionCacheCallback callback) throws IOException { - MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); - if (schema == null) { - throw new IOException(model.getName() + " is not registered"); - } + MetadataRegistry.Schema schema = resolveSchema(model); MeasureWrite measureWrite = getClient().createMeasureWrite(schema.getMetadata().getGroup(), // group name schema.getMetadata().name(), // measure-name TimeBucket.getTimestamp(metrics.getTimeBucket(), model.getDownsampling())); // timestamp diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java index 8bb0d28f8009..4632982b7631 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java @@ -50,7 +50,13 @@ public BanyanDBRecordDAO(BanyanDBStorageClient client, StorageBuilder st @Override public InsertRequest prepareBatchInsert(Model model, Record record) throws IOException { + // Self-heal a missing local schema entry once (RPC-free re-derivation) before failing — + // see MetadataRegistry.repopulateLocally. Throw only if the entry is still absent. MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); + if (schema == null) { + MetadataRegistry.INSTANCE.repopulateLocally(model); + schema = MetadataRegistry.INSTANCE.findMetadata(model); + } if (schema == null) { throw new IOException(model.getName() + " is not registered"); } @@ -60,6 +66,9 @@ public InsertRequest prepareBatchInsert(Model model, Record record) throws IOExc if (record instanceof BanyanDBTrace.MergeTable) { BanyanDBTrace.MergeTable mergeTable = (BanyanDBTrace.MergeTable) record; MetadataRegistry.Schema mergeTableSchema = MetadataRegistry.INSTANCE.findRecordMetadata(mergeTable.getMergeTableName()); + if (mergeTableSchema == null) { + throw new IOException(mergeTable.getMergeTableName() + " is not registered"); + } traceWrite = getClient().createTraceWrite( schema.getMetadata().getGroup(), diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java new file mode 100644 index 000000000000..6e7ef96e35d9 --- /dev/null +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.storage.plugin.banyandb; + +import java.util.concurrent.atomic.AtomicInteger; +import org.apache.skywalking.oap.server.core.analysis.DownSampling; +import org.apache.skywalking.oap.server.core.storage.model.Model; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +/** + * Unit coverage for the local schema-cache self-heal on {@link MetadataRegistry}. A read/persist + * path that finds the cache empty for a model whose dispatch worker is already live (e.g. a + * {@code withoutSchemaChange} peer apply or a runtime-rule bundled fall-over that rebuilt the + * worker but skipped the populate) must be able to re-derive the schema locally with no server + * RPC, instead of throwing {@code " is not registered"} forever. + */ +class MetadataRegistryTest { + + @AfterEach + void clearPopulator() { + // MetadataRegistry is an enum singleton; clear the populator so global state set by a test + // does not leak into others. + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(null); + } + + @Test + void repopulateLocallyInvokesRegisteredPopulator() { + final Model model = mock(Model.class); + when(model.getName()).thenReturn("meter_test_metric"); + final AtomicInteger calls = new AtomicInteger(); + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(m -> calls.incrementAndGet()); + + MetadataRegistry.INSTANCE.repopulateLocally(model); + + assertEquals(1, calls.get(), "a registered populator must be invoked on a self-heal attempt"); + } + + @Test + void repopulateLocallyIsNoOpWhenNoPopulatorRegistered() { + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(null); + final Model model = mock(Model.class); + assertDoesNotThrow(() -> MetadataRegistry.INSTANCE.repopulateLocally(model), + "self-heal with no populator (e.g. a non-BanyanDB context) must be a no-op"); + } + + @Test + void repopulateLocallySwallowsPopulatorError() { + final Model model = mock(Model.class); + when(model.getName()).thenReturn("meter_test_metric"); + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(m -> { + throw new RuntimeException("derivation boom"); + }); + + // A failed re-derivation must never be worse than the pre-existing throw: the caller + // re-reads and surfaces its own not-registered error, so repopulateLocally itself must + // not propagate. + assertDoesNotThrow(() -> MetadataRegistry.INSTANCE.repopulateLocally(model), + "a throwing populator must be swallowed so self-heal never worsens the failure"); + } + + @Test + void evictExercisesEveryFindMetadataKeyBranchSafely() { + // evict() must key its removal exactly as findMetadata() looks an entry up — across + // management (name), record (name) and metric (formatName(name, downSampling)) kinds. + // Exercise all three so a wrong-branch or formatName regression surfaces, and confirm + // evicting an absent entry is a safe no-op (the registry is otherwise insert-only). + final Model management = mock(Model.class); + when(management.isTimeSeries()).thenReturn(false); + when(management.getName()).thenReturn("management_model"); + + final Model record = mock(Model.class); + when(record.isTimeSeries()).thenReturn(true); + when(record.isRecord()).thenReturn(true); + when(record.getName()).thenReturn("record_model"); + + final Model metric = mock(Model.class); + when(metric.isTimeSeries()).thenReturn(true); + when(metric.isRecord()).thenReturn(false); + when(metric.getName()).thenReturn("metric_model"); + when(metric.getDownsampling()).thenReturn(DownSampling.Minute); + + assertDoesNotThrow(() -> { + MetadataRegistry.INSTANCE.evict(management); + MetadataRegistry.INSTANCE.evict(record); + MetadataRegistry.INSTANCE.evict(metric); + }, "evict must be a safe no-op for an absent entry across all model kinds"); + } +} diff --git a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh index 0740a9947c6b..50499d20c5fc 100755 --- a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh +++ b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh @@ -141,6 +141,23 @@ await_hash() { done } +# Poll a node until its row is ACTIVE with a contentHash different from prev, then echo the new hash. +# Needed after a STRUCTURAL apply on an already-ACTIVE row: the status stays ACTIVE across the async +# apply (which returns at FENCING and persists the new content in the background after the schema +# fence), so await_status "ACTIVE" returns on the first iteration with the OLD hash. Gating on the +# contentHash advancing is the only signal the new content is durable on this node. +await_hash_change() { + local base="$1" prev="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S )) + while true; do + local got; got="$(list_hash "${base}")" + [ -n "${got}" ] && [ "${got}" != "${prev}" ] && { echo "${got}"; return 0; } + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${base} contentHash did not advance past '${prev:0:8}…' within ${CONVERGE_TIMEOUT_S}s (last='${got:0:8}…', applyError='$(list_apply_error "${base}")')" + fi + sleep 2 + done +} + await_absent() { local base="$1" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S )) while true; do @@ -190,9 +207,10 @@ log "OAP-2 converged to ${hash_initial:0:8}…" # --- Phase 2: STRUCTURAL update on OAP-1 — second measure created on no-init -------- log "=== Phase 2: STRUCTURAL on OAP-1 ===" apply_on "${OAP1_BASE}" "${SEED_STRUCT}" "allowStorageChange=true" >/dev/null -await_status "${OAP1_BASE}" "ACTIVE" +# Structural apply is async: wait for OAP-1's own contentHash to ADVANCE (the row is ACTIVE before +# and after, so await_status "ACTIVE" alone would read the stale pre-apply hash) before capturing it. +hash_struct="$(await_hash_change "${OAP1_BASE}" "${hash_initial}")" assert_no_apply_error "${OAP1_BASE}" -hash_struct="$(list_hash "${OAP1_BASE}")" [ "${hash_struct}" != "${hash_initial}" ] || fail "OAP-1 contentHash unchanged after STRUCTURAL apply" log "OAP-1 → ACTIVE @ ${hash_struct:0:8}… (was ${hash_initial:0:8}…)" await_hash "${OAP2_BASE}" "${hash_struct}" diff --git a/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh index aa57fd58bda1..38e964aa343b 100755 --- a/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh +++ b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh @@ -76,6 +76,49 @@ list_field() { list_row "${catalog}" "${name}" | jq -r '."'"${field}"'" // empty' } +# Budget for an async apply to land in /list. A NEW / STRUCTURAL addOrUpdate is async: it returns +# immediately at FENCING and the rule row is persisted only AFTER a background schema fence, so a +# single read right after the 2xx can miss the row (or read a stale contentHash). The waiters below +# poll within this budget. FILTER_ONLY edits persist synchronously, so they return on the first poll. +APPLY_LAND_S="${APPLY_LAND_S:-200}" + +# Poll until the (catalog,name) row reaches expected_status, up to APPLY_LAND_S. +await_status() { + local catalog="$1" name="$2" expected="$3" + log " await /list ${catalog}/${name} status=${expected} (budget ${APPLY_LAND_S}s)" + local deadline=$(( $(date +%s) + APPLY_LAND_S )) got="" + while true; do + got="$(list_field "${catalog}" "${name}" status)" + [ "${got}" = "${expected}" ] && return 0 + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${catalog}/${name} did not reach status='${expected}' within ${APPLY_LAND_S}s (last='${got}')" + fi + sleep 2 + done +} + +# Poll until the (catalog,name) row is ACTIVE with a contentHash different from prev, then echo the +# new hash. The wait-condition for a swap whose status stays ACTIVE before and after the apply, so +# the contentHash advancing is the only signal the new content actually landed (subsumes the +# async-persist window and a possible STRUCTURAL classification of the swap). +await_hash_changed() { + local catalog="$1" name="$2" prev="$3" + log " await /list ${catalog}/${name} contentHash≠${prev:0:8}… (budget ${APPLY_LAND_S}s)" + local deadline=$(( $(date +%s) + APPLY_LAND_S )) status="" hash="" + while true; do + status="$(list_field "${catalog}" "${name}" status)" + hash="$(list_field "${catalog}" "${name}" contentHash)" + if [ "${status}" = "ACTIVE" ] && [ -n "${hash}" ] && [ "${hash}" != "${prev}" ]; then + echo "${hash}" + return 0 + fi + if [ "$(date +%s)" -ge "${deadline}" ]; then + fail "${catalog}/${name} contentHash did not advance past '${prev:0:8}…' within ${APPLY_LAND_S}s (last status='${status}' hash='${hash:0:8}…')" + fi + sleep 2 + done +} + apply_rule() { local catalog="$1" name="$2" body="$3" admin runtime-rule add --catalog "${catalog}" --name "${name}" -f "${body}" >/dev/null \ @@ -155,15 +198,13 @@ log "OAP ready" # --- Phase 0: apply log-mal aggregation ----------------------------------------------- log "=== Phase 0: apply log-mal aggregation rule ===" apply_rule "${MAL_CATALOG}" "${MAL_NAME}" "${SEED_MAL}" -mal_status="$(list_field "${MAL_CATALOG}" "${MAL_NAME}" status)" -[ "${mal_status}" = "ACTIVE" ] || fail "MAL rule expected ACTIVE, got '${mal_status}'" +await_status "${MAL_CATALOG}" "${MAL_NAME}" "ACTIVE" log "log-mal → ACTIVE" # --- Phase 1: apply LAL v1 ------------------------------------------------------------ log "=== Phase 1: apply LAL v1 (extractor stamps step=v1) ===" apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V1}" -status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)" -[ "${status}" = "ACTIVE" ] || fail "v1 expected ACTIVE, got '${status}'" +await_status "${LAL_CATALOG}" "${LAL_NAME}" "ACTIVE" hash_v1="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)" [ -n "${hash_v1}" ] || fail "v1 contentHash empty" log "v1 → ACTIVE @ ${hash_v1:0:8}…" @@ -172,10 +213,7 @@ await_metric_for_step "v1" # --- Phase 2: swap to LAL v2 (same key, step flips to v2) ----------------------------- log "=== Phase 2: swap to LAL v2 (extractor stamps step=v2) ===" apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V2}" -status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)" -[ "${status}" = "ACTIVE" ] || fail "v2 expected ACTIVE, got '${status}'" -hash_v2="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)" -[ "${hash_v2}" != "${hash_v1}" ] || fail "v2 contentHash unchanged from v1 (${hash_v2:0:8}…)" +hash_v2="$(await_hash_changed "${LAL_CATALOG}" "${LAL_NAME}" "${hash_v1}")" log "v2 → ACTIVE @ ${hash_v2:0:8}… (was ${hash_v1:0:8}…) — swap applied" await_metric_for_step "v2" diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh index d9832ec10a33..860ddd2f807f 100755 --- a/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh +++ b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh @@ -196,40 +196,123 @@ assert_apply_status() { || fail "expected applyStatus=${expected}, not found in: ${actual}" } -# GET /runtime/rule/list and ensure the row matches the expected status. Returns -# the matching JSON line on stdout for callers that want to inspect contentHash. +# Budget for the async apply state machine to reach a terminal phase on GET /runtime/rule/status. +APPLY_TERMINAL_S="${APPLY_TERMINAL_S:-200}" + +# Drive the new async apply surface to a terminal phase. A STRUCTURAL addOrUpdate (and a +# /delete?mode=revertToBundled) returns immediately at FENCING with an applyId; the row is persisted +# and dispatch resumed in the BACKGROUND, after the schema fence. Given the apply's JSON response, +# extract its applyId and poll GET /runtime/rule/status until the phase is terminal: +# APPLIED / DEGRADED → the durable row was written (DEGRADED == committed-and-durable, only the +# cluster-wide fence confirmation lagged) → return 0 +# FAILED → a pre-commit error, nothing was committed → fail +# anything else (FENCING / DDL / ROLLING_OUT / PENDING / UNKNOWN) → keep polling +# A synchronous apply (filter_only / inactivate / default delete) carries no applyId, so this is a +# no-op — that response is already durable on return. swctl has no runtime-rule `status` subcommand, +# so this goes through curl (the status endpoint lives on the same REST port). Passing catalog+name +# lets the main answer from the durable rule row once the live apply-id is TTL-evicted, so a slow +# poll converges instead of false-timing-out. +await_apply_terminal() { + local resp="$1" + local rule_name="${2:-${NAME}}" + local apply_id + apply_id="$(echo "${resp}" | jq -r '.applyId // empty' 2>/dev/null || true)" + if [[ -z "${apply_id}" ]]; then + return 0 + fi + log "runtime-rule status → polling (≤${APPLY_TERMINAL_S}s) for apply ${apply_id} of ${CATALOG}/${rule_name} to reach a terminal phase" + local deadline=$(( $(date +%s) + APPLY_TERMINAL_S )) + local body phase="" + while :; do + body="$(curl -s "${REST_BASE}/runtime/rule/status?applyId=${apply_id}&catalog=${CATALOG}&name=${rule_name}" 2>/dev/null || true)" + phase="$(echo "${body}" | jq -r '.phase // empty' 2>/dev/null || true)" + case "${phase}" in + APPLIED|DEGRADED) + log " ✓ apply ${apply_id} → ${phase} (durable)" + return 0 + ;; + FAILED) + fail "apply ${apply_id} of ${CATALOG}/${rule_name} reached FAILED: ${body}" + ;; + esac + if (( $(date +%s) >= deadline )); then + fail "apply ${apply_id} of ${CATALOG}/${rule_name} did not reach a terminal phase within ${APPLY_TERMINAL_S}s (last phase='${phase}', body: ${body})" + fi + sleep 2 + done +} + +# Budget for an async structural apply to land in /list. A structural addOrUpdate returns +# immediately at FENCING (accepted, not yet durable): the rule row is persisted only AFTER the +# background schema fence confirms, and BanyanDB's meta→data-node schema sync can take 1-2 minutes, +# so a single /list read right after the 2xx can miss the row. The /list assertions poll within this +# budget (covers the fence timeout + the sync). ES/JDBC have no such fence — they land in under a +# second, so the poll returns on its first iteration there. +APPLY_LAND_S="${APPLY_LAND_S:-200}" + +# Poll GET /runtime/rule/list until the row for (catalog, rule_name) shows the expected status, +# up to APPLY_LAND_S. Returns the matching JSON line on stdout for callers that inspect contentHash. +# +# Optional 3rd arg differ_hash: when set, the poll additionally requires the row's contentHash to +# differ from it. This is the wait-condition for a STRUCTURAL update of an ALREADY-ACTIVE row — the +# status is ACTIVE both before and after the async apply, so a status-only poll would return on the +# first iteration with the OLD (pre-apply) contentHash, before the background fence→persist tail +# has written the new content. Gating on "status==expected AND contentHash advanced" blocks until +# the new content is durable and visible. list_row() { local expected_status="$1" local rule_name="${2:-${NAME}}" - log "runtime-rule list → looking for ${CATALOG}/${rule_name} status=${expected_status}" - local lines - lines="$(admin runtime-rule list)" \ - || fail "runtime-rule list failed" - local match - match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" - [[ -n "${match}" ]] \ - || fail "/list has no row for ${CATALOG}/${rule_name} (got: ${lines})" - local actual_status - actual_status="$(echo "${match}" | jq -r '.status')" - [[ "${actual_status}" == "${expected_status}" ]] \ - || fail "expected /list status=${expected_status}, got '${actual_status}' (row: ${match})" - echo "${match}" + local differ_hash="${3:-}" + log "runtime-rule list → waiting (≤${APPLY_LAND_S}s) for ${CATALOG}/${rule_name} status=${expected_status}${differ_hash:+ contentHash≠${differ_hash:0:8}…}" + local deadline=$(( $(date +%s) + APPLY_LAND_S )) + local lines match actual_status="" actual_hash="" + while :; do + lines="$(admin runtime-rule list)" \ + || fail "runtime-rule list failed" + match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" + if [[ -n "${match}" ]]; then + actual_status="$(echo "${match}" | jq -r '.status')" + actual_hash="$(echo "${match}" | jq -r '.contentHash')" + if [[ "${actual_status}" == "${expected_status}" \ + && ( -z "${differ_hash}" || "${actual_hash}" != "${differ_hash}" ) ]]; then + echo "${match}" + return 0 + fi + fi + if (( $(date +%s) >= deadline )); then + if [[ -n "${match}" ]]; then + fail "expected /list status=${expected_status}${differ_hash:+ with advanced contentHash}, got status='${actual_status}' hash='${actual_hash}' within ${APPLY_LAND_S}s (row: ${match})" + fi + fail "/list has no row for ${CATALOG}/${rule_name} within ${APPLY_LAND_S}s (got: ${lines})" + fi + sleep 2 + done } -# Assert that /list does NOT have a row for the given (catalog, name). +# Poll until /list has NO row (or status n/a) for the given (catalog, name), up to APPLY_LAND_S. +# A /delete?mode=revertToBundled runs the async apply pipeline (the bundled re-apply), so the row's +# removal can lag the same way a structural apply's appearance does. list_no_row() { local rule_name="${1:-${NAME}}" - log "runtime-rule list → expect NO row for ${CATALOG}/${rule_name}" - local lines match - lines="$(admin runtime-rule list)" \ - || fail "runtime-rule list failed" - match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" - if [[ -n "${match}" ]]; then - local status + log "runtime-rule list → waiting (≤${APPLY_LAND_S}s) for NO row for ${CATALOG}/${rule_name}" + local deadline=$(( $(date +%s) + APPLY_LAND_S )) + local lines match status + while :; do + lines="$(admin runtime-rule list)" \ + || fail "runtime-rule list failed" + match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)" + if [[ -z "${match}" ]]; then + return 0 + fi status="$(echo "${match}" | jq -r '.status')" - [[ "${status}" == "n/a" ]] \ - || fail "/list still has row for ${CATALOG}/${rule_name} status=${status} (row: ${match})" - fi + if [[ "${status}" == "n/a" ]]; then + return 0 + fi + if (( $(date +%s) >= deadline )); then + fail "/list still has row for ${CATALOG}/${rule_name} status=${status} within ${APPLY_LAND_S}s (row: ${match})" + fi + sleep 2 + done } # Per-phase entity scope. SHAPE-BREAK reshapes the metric from SERVICE to @@ -445,6 +528,7 @@ log "=== Phase 1: CREATE seed-rule.yaml ===" step_set "create" resp="$(post_rule "${SEED_RULES_DIR}/seed-rule.yaml")" assert_apply_status "structural_applied" "${resp}" +await_apply_terminal "${resp}" list_row "ACTIVE" >/dev/null hash_initial="$(list_row ACTIVE | jq -r '.contentHash')" log " initial contentHash=${hash_initial}" @@ -455,7 +539,10 @@ log "=== Phase 2: UPDATE-FILTER seed-rule-filter-only.yaml ===" step_set "update_filter" resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-filter-only.yaml")" assert_apply_status "filter_only_applied" "${resp}" -hash_filter_only="$(list_row ACTIVE | jq -r '.contentHash')" +await_apply_terminal "${resp}" +# FILTER_ONLY persists synchronously (no applyId), so the new hash is already durable here; the +# differ-gate still hardens the read against any list lag and proves the row actually advanced. +hash_filter_only="$(list_row ACTIVE "${NAME}" "${hash_initial}" | jq -r '.contentHash')" [[ "${hash_filter_only}" != "${hash_initial}" ]] \ || fail "FILTER_ONLY apply did not advance /list contentHash" log " contentHash advanced to ${hash_filter_only}" @@ -466,7 +553,11 @@ log "=== Phase 3: UPDATE-STRUCTURAL seed-rule-structural.yaml ===" step_set "structural" resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-structural.yaml" "allowStorageChange=true")" assert_apply_status "structural_applied" "${resp}" -hash_structural="$(list_row ACTIVE | jq -r '.contentHash')" +await_apply_terminal "${resp}" +# STRUCTURAL update of an already-ACTIVE row: status stays ACTIVE across the async apply, so gate on +# the contentHash advancing past the filter-only hash, not just on status — otherwise the read races +# the background fence→persist tail and returns the stale pre-apply hash. +hash_structural="$(list_row ACTIVE "${NAME}" "${hash_filter_only}" | jq -r '.contentHash')" [[ "${hash_structural}" != "${hash_filter_only}" ]] \ || fail "STRUCTURAL apply did not advance /list contentHash" log " contentHash advanced to ${hash_structural}" @@ -569,6 +660,7 @@ log "=== Phase 5h: HAPPY-PATH + RESTART dynamic-LAYER round-trip ===" struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")" resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-sibling-with-layer.yaml" "" "${SIBLING_NAME}")" assert_apply_status "structural_applied" "${resp}" +await_apply_terminal "${resp}" "${SIBLING_NAME}" list_row "ACTIVE" "${SIBLING_NAME}" >/dev/null sleep 2 layers_after_create="$(swctl --display yaml \ @@ -644,6 +736,7 @@ step_set "shape_break_new" log " POST INSTANCE-scope rule v4" resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")" assert_apply_status "structural_applied" "${resp}" +await_apply_terminal "${resp}" hash_shape_break="$(list_row ACTIVE | jq -r '.contentHash')" log " contentHash after shape break = ${hash_shape_break}" # Rule v4 is INSTANCE-scope; swctl now needs --instance-name to resolve @@ -670,6 +763,7 @@ resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")" status="$(echo "${resp}" | jq -r '.applyStatus // empty')" [[ "${status}" == "structural_applied" || "${status}" == "no_change" ]] \ || fail "ACTIVATE: unexpected applyStatus=${status} (full: ${resp})" +await_apply_terminal "${resp}" list_row "ACTIVE" >/dev/null await_metric_for_step "e2e_rr_requests" "activate" # NOTE: we do NOT re-assert "no step=inactivate rows" here. Phase 7's in-window diff --git a/test/e2e-v2/cases/storage/expected/config-dump.yml b/test/e2e-v2/cases/storage/expected/config-dump.yml index e2a37ee996cb..960e6efc809b 100644 --- a/test/e2e-v2/cases/storage/expected/config-dump.yml +++ b/test/e2e-v2/cases/storage/expected/config-dump.yml @@ -183,6 +183,7 @@ "receiver-pprof.provider": "default", "receiver-profile.provider": "default", "receiver-register.provider": "default", + "receiver-runtime-rule.default.deferredFenceTimeoutSeconds": "180", "receiver-runtime-rule.default.refreshRulesPeriod": "30", "receiver-runtime-rule.default.selfHealThresholdSeconds": "60", "receiver-runtime-rule.provider": "default",