From 4598124f9902b081a005f72f9c332c5bdf231c52 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Sun, 14 Jun 2026 17:19:55 +0800 Subject: [PATCH 01/19] Fix BanyanDB runtime-rule self-heal + v2 MAL CounterWindow collision & Elvis falsy semantics * BanyanDB schema-cache self-heal: persist DAOs re-derive a missing local schema (RPC-free) once before failing; the no-init defer loop retries a transient backend probe error (isRetryableNoInitProbeFailure, default false / BanyanDB opt-in) instead of crash-looping the pod. * v2 MAL CounterWindow key collision: rate()/increase()/irate() keyed each counter's sliding window on the rule's output metric name (shared by every input metric of a rule) instead of the counter's own name, so counters that reduce to the same labels after .sum() shared one window slot and rated against each other's values -- fabricating non-zero rates from frozen counters (BanyanDB liaison gRPC error rate). Now keyed by the counter's own metric name. * v2 MAL Elvis ?: honored only null (Optional.ofNullable().orElse()); now Groovy-falsy via MalRuntimeHelper.elvis/isTruthy, single-evaluated -- fixes BanyanDB liaison node_type="" stored instead of "n/a". * banyandb otel-rules: PT15S -> PT1M rate window. * Tests: BanyanDBErrorRateReproTest, MALElvisFalsyTest, MetadataRegistryTest, ModelInstallerNoInitTest. 
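Illustration of the CounterWindow collision described above, as a toy model only. The class `CounterWindowKeySketch`, its `increase` method, and the key string `rule_output_metric` are hypothetical and do not mirror the real `CounterWindow` API; the sketch assumes a window that remembers just the previous value per key. The frozen totals (5, 166, 5) are the ones quoted in the repro test.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a keyed "previous value" window; NOT the real CounterWindow. */
class CounterWindowKeySketch {
    // key -> last value observed under that key
    private final Map<String, Double> lastValue = new HashMap<>();

    /** Increase since the previous observation stored under key (0 on first sight). */
    double increase(final String key, final double value) {
        final Double prev = lastValue.put(key, value);
        return prev == null ? 0.0 : value - prev;
    }

    /** Sum of absolute increases over two scrapes of three counters that never change. */
    static double fabricatedIncrease(final boolean keyByCounterName) {
        final String[] names = {"total_err", "registry_err", "stream_msg_received_err"};
        final double[] frozen = {5, 166, 5};
        // After .sum([...]) all three counters carry the same label set.
        final String labels = "{pod_name=liaison-1}";
        final CounterWindowKeySketch window = new CounterWindowKeySketch();
        double total = 0.0;
        for (int scrape = 0; scrape < 2; scrape++) {
            for (int i = 0; i < names.length; i++) {
                // Old behaviour: one rule-level name for every counter. New: the counter's own name.
                final String key = (keyByCounterName ? names[i] : "rule_output_metric") + labels;
                total += Math.abs(window.increase(key, frozen[i]));
            }
        }
        return total;
    }

    public static void main(final String[] args) {
        System.out.println("shared rule-level key: " + fabricatedIncrease(false));
        System.out.println("per-counter key:       " + fabricatedIncrease(true));
    }
}
```

With the shared key each counter is compared against whichever counter wrote the slot last, so the frozen inputs still produce a non-zero delta; with per-counter keys every delta is zero.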
--- docs/en/changes/changes.md | 3 + .../v2/compiler/MALClosureCodegen.java | 8 +- .../v2/compiler/rt/MalRuntimeHelper.java | 28 +++- .../meter/analyzer/v2/dsl/SampleFamily.java | 6 +- .../v2/dsl/BanyanDBErrorRateReproTest.java | 125 ++++++++++++++++++ .../analyzer/v2/dsl/MALElvisFalsyTest.java | 98 ++++++++++++++ .../core/storage/model/ModelInstaller.java | 43 +++++- .../model/ModelInstallerNoInitTest.java | 83 +++++++++++- .../banyandb/banyandb-instance.yaml | 42 +++--- .../otel-rules/banyandb/banyandb-service.yaml | 6 +- .../banyandb/BanyanDBIndexInstaller.java | 48 ++++++- .../banyandb/BanyanDBNoneStreamDAO.java | 6 + .../plugin/banyandb/MetadataRegistry.java | 47 ++++++- .../banyandb/measure/BanyanDBMetricsDAO.java | 33 +++-- .../banyandb/stream/BanyanDBRecordDAO.java | 9 ++ .../plugin/banyandb/MetadataRegistryTest.java | 81 ++++++++++++ 16 files changed, 611 insertions(+), 55 deletions(-) create mode 100644 oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java create mode 100644 oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java create mode 100644 oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 51aa0dfa7956..59aa378a6b5a 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,6 +242,9 @@ admin-host only" entry above for the public REST retirement. #### OAP Server +* Fix BanyanDB peer nodes permanently flooding ` is not registered` when a node holds a live persist worker but its local `MetadataRegistry` schema cache was never populated for that model — e.g. 
a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate, and nothing (the registry never evicts, the 30s reconcile only covers runtime-rule rows) ever re-derived it. The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation before failing, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. +* Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. +* Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. * SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model (requires BanyanDB 0.11+). A BanyanDB cluster is modeled as one `Service`, each container as a `ServiceInstance` (role/tier as attributes), and each storage group as an `Endpoint`. 
The `otel-rules/banyandb/` rules are category-separated by role (`node_*` / `liaison_*` / `data_*` / `lifecycle_*`) and by data type (`measure_*` / `stream_*` / `trace_*` / `property_*`), mirroring the upstream FODC-proxy Grafana boards, and include queue batch/message granularity (apache/skywalking-banyandb#1169). Adds a `SERVICE_INSTANCE_RELATION` MAL scope and `serviceInstanceRelation(...)` builder powering a new intra-cluster pod-to-pod deployment topology (`banyandb-instance-relation.yaml`). The stale single-node `host_name` model is removed. * Runtime MAL/LAL hot-update rules can declare `layerDefinitions:` to introduce new layers. Ordinals are operator-pinned in the `100_000+` tier; the layer is diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java index 8b46f68cf91d..bc2806e5e393 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java @@ -300,9 +300,13 @@ void generateClosureExpr(final StringBuilder sb, } else if (expr instanceof MALExpressionModel.ClosureElvisExpr) { final MALExpressionModel.ClosureElvisExpr elvis = (MALExpressionModel.ClosureElvisExpr) expr; - sb.append("java.util.Optional.ofNullable("); + // Groovy `?:` applies the fallback when the primary is falsy (null, + // empty string/container, numeric zero, false), not only when null. + // Keep the primary single-evaluated so expressions such as tags.remove(...) + // do not observe different values between the truth check and result. 
+ sb.append(MALCodegenHelper.RUNTIME_HELPER_FQCN).append(".elvis("); generateClosureExpr(sb, elvis.getPrimary(), paramName, beanMode); - sb.append(").orElse("); + sb.append(", "); generateClosureExpr(sb, elvis.getFallback(), paramName, beanMode); sb.append(")"); } else if (expr instanceof MALExpressionModel.ClosureRegexMatchExpr) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java index 669be6ef7cec..7275e11db459 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/rt/MalRuntimeHelper.java @@ -17,6 +17,9 @@ package org.apache.skywalking.oap.meter.analyzer.v2.compiler.rt; +import java.lang.reflect.Array; +import java.util.Collection; +import java.util.Map; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.skywalking.oap.meter.analyzer.v2.dsl.Sample; @@ -55,12 +58,9 @@ public static String[][] regexMatch(final String input, final String regex) { return new String[][] {row}; } - /** - * Reverse division: computes {@code numerator / v} for each sample value {@code v}. - * Used by generated code for {@code Number / SampleFamily} expressions. - */ /** * Groovy truth check: {@code null → false}, empty string → {@code false}, + * numeric zero → {@code false}, empty collection/map/array → {@code false}, * {@code Boolean.FALSE → false}, everything else → {@code true}. * Used by generated filter code for standalone expressions in boolean context * (e.g., {@code tags.ApiId || tags.ApiName}). 
@@ -75,9 +75,29 @@ public static boolean isTruthy(final Object value) { if (value instanceof CharSequence) { return ((CharSequence) value).length() > 0; } + if (value instanceof Number) { + return ((Number) value).doubleValue() != 0.0D; + } + if (value instanceof Collection) { + return !((Collection) value).isEmpty(); + } + if (value instanceof Map) { + return !((Map) value).isEmpty(); + } + if (value.getClass().isArray()) { + return Array.getLength(value) > 0; + } return true; } + public static <T> T elvis(final T primary, final T fallback) { + return isTruthy(primary) ? primary : fallback; + } + + /** + * Reverse division: computes {@code numerator / v} for each sample value {@code v}. + * Used by generated code for {@code Number / SampleFamily} expressions. + */ public static SampleFamily divReverse(final double numerator, final SampleFamily sf) { if (sf == SampleFamily.EMPTY) { diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java index e3d1aec7b3a7..f07ccc8bf38c 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/SampleFamily.java @@ -431,7 +431,7 @@ public SampleFamily increase(String range) { Arrays.stream(samples) .map(sample -> sample.increase( range, - context.metricName, + sample.getName(), (lowerBoundValue, unused) -> sample.value - lowerBoundValue )) .toArray(Sample[]::new) @@ -448,7 +448,7 @@ public SampleFamily rate(String range) { Arrays.stream(samples) .map(sample -> sample.increase( range, - context.metricName, + sample.getName(), (lowerBoundValue, lowerBoundTime) -> { final long timeDiff = (sample.timestamp - lowerBoundTime) / 1000; return timeDiff < 1L ? 
0.0 : (sample.value - lowerBoundValue) / timeDiff; @@ -466,7 +466,7 @@ public SampleFamily irate() { this.context, Arrays.stream(samples) .map(sample -> sample.increase( - context.metricName, + sample.getName(), (lowerBoundValue, lowerBoundTime) -> { final long timeDiff = (sample.timestamp - lowerBoundTime) / 1000; return timeDiff < 1L ? 0.0 : (sample.value - lowerBoundValue) / timeDiff; diff --git a/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java new file mode 100644 index 000000000000..361a35b9fff0 --- /dev/null +++ b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/BanyanDBErrorRateReproTest.java @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2.dsl; + +import com.google.common.collect.ImmutableMap; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import org.apache.skywalking.oap.meter.analyzer.v2.dsl.counter.CounterWindow; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; + +/** + * Reproduces the BanyanDB liaison_grpc_error_rate fabrication using the EXACT rule expression and the + * real (frozen) counter values scraped from the live demo FODC proxy. Counters never change across the + * simulated scrapes, so every rate term — and the summed result — MUST be 0. Any non-zero output proves + * the CounterWindow key collision: the three distinct error counters reduce to identical labels after + * .sum([...]) and, because the rate keys on the (shared, rule-level) context.metricName instead of each + * counter's own name, they share one CounterWindow slot and rate against each other's values. + */ +public class BanyanDBErrorRateReproTest { + + private static final String GROUP_BY = "['cluster','pod_name','container_name','node_role','node_type']"; + + // Verbatim from otel-rules/banyandb/banyandb-instance.yaml : liaison_grpc_error_rate (value part). + private static final String EXPR = + "(banyandb_liaison_grpc_total_err.sum(" + GROUP_BY + ").rate('PT1M')" + + " + banyandb_liaison_grpc_total_registry_err.sum(" + GROUP_BY + ").rate('PT1M')" + + " + banyandb_liaison_grpc_total_stream_msg_received_err.sum(" + GROUP_BY + ").rate('PT1M')) * 60"; + + @BeforeEach + void resetWindow() { + CounterWindow.INSTANCE.reset(); + } + + private static Sample s(final String name, final double value, final long ts, final String... 
kv) { + final ImmutableMap.Builder<String, String> b = ImmutableMap.builder(); + for (int i = 0; i < kv.length; i += 2) { + b.put(kv[i], kv[i + 1]); + } + return Sample.builder().name(name).labels(b.build()).value(value).timestamp(ts).build(); + } + + // The three liaison-1 families, with the real frozen values (total_err=5, registry_err=166, stream=5). + // node_type is intentionally ABSENT on liaison samples, exactly as the FODC proxy exposes them. + private Map<String, SampleFamily> scrape(final long ts) { + final String[] common = { + "cluster", "showcase-banyandb", + "pod_name", "demo-banyandb-liaison-1", + "container_name", "liaison", + "node_role", "ROLE_LIAISON", + }; + final List<Sample> totalErr = new ArrayList<>(); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metadata"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 2, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsMinute"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsHour"))); + totalErr.add(s("banyandb_liaison_grpc_total_err", 1, ts, with(common, "service", "measure", "method", "query", "group", "sw_metricsDay"))); + + final List<Sample> registryErr = new ArrayList<>(); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsHour"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsMinute"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 47, ts, with(common, "service", "measure", "method", "get", "group", "sw_metricsDay"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, with(common, "service", "indexRule", "method", "create", "group", "sw_metricsDay"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, 
with(common, "service", "indexRule", "method", "create", "group", "sw_metricsHour"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 7, ts, with(common, "service", "indexRule", "method", "create", "group", "sw_metricsMinute"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 2, ts, with(common, "service", "trace", "method", "get", "group", "sw_trace"))); + registryErr.add(s("banyandb_liaison_grpc_total_registry_err", 2, ts, with(common, "service", "trace", "method", "get", "group", "sw_zipkinTrace"))); + + final List<Sample> streamErr = new ArrayList<>(); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 1, ts, with(common, "service", "measure", "method", "write", "group", "sw_metadata"))); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 2, ts, with(common, "service", "trace", "method", "write", "group", "sw_trace"))); + streamErr.add(s("banyandb_liaison_grpc_total_stream_msg_received_err", 2, ts, with(common, "service", "stream", "method", "write", "group", "sw_recordsLog"))); + + final Map<String, SampleFamily> map = new HashMap<>(); + map.put("banyandb_liaison_grpc_total_err", SampleFamilyBuilder.newBuilder(totalErr.toArray(new Sample[0])).build()); + map.put("banyandb_liaison_grpc_total_registry_err", SampleFamilyBuilder.newBuilder(registryErr.toArray(new Sample[0])).build()); + map.put("banyandb_liaison_grpc_total_stream_msg_received_err", SampleFamilyBuilder.newBuilder(streamErr.toArray(new Sample[0])).build()); + return map; + } + + private static String[] with(final String[] common, final String... 
extra) { + final String[] out = new String[common.length + extra.length]; + System.arraycopy(common, 0, out, 0, common.length); + System.arraycopy(extra, 0, out, common.length, extra.length); + return out; + } + + @Test + void unchangedCounters_errorRate_mustBeZero() { + final Expression expr = DSL.parse("meter_banyandb_instance_liaison_grpc_error_rate", EXPR); + long ts = 1_700_000_000_000L; + final long step = 10_000L; // 10s scrape, matching the showcase collector + for (int scrape = 0; scrape < 6; scrape++, ts += step) { + final Result result = expr.run(scrape(ts)); + double maxAbs = 0.0; + if (result.isSuccess() && result.getData() != SampleFamily.EMPTY) { + for (final Sample out : result.getData().samples) { + maxAbs = Math.max(maxAbs, Math.abs(out.getValue())); + } + } + // Counters never changed -> error rate MUST be 0 on every scrape. + assertEquals(0.0, maxAbs, 1e-9, + "Unchanged counters must yield 0 error rate, but scrape " + scrape + " produced " + maxAbs); + } + } +} diff --git a/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java new file mode 100644 index 000000000000..08f334551a19 --- /dev/null +++ b/oap-server/analyzer/meter-analyzer/src/test/java/org/apache/skywalking/oap/meter/analyzer/v2/dsl/MALElvisFalsyTest.java @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.meter.analyzer.v2.dsl; + +import com.google.common.collect.ImmutableMap; +import java.util.Collections; +import java.util.List; +import java.util.Map; +import org.apache.skywalking.oap.meter.analyzer.v2.compiler.rt.MalRuntimeHelper; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Groovy's Elvis `?:` applies the fallback when the primary is falsy — including the empty string. + * The v2 codegen previously emitted Optional.ofNullable(P).orElse(F), which only fires on null, so an + * empty-string primary (e.g. a label that .sum() filled with "" for an absent key) leaked "" instead + * of the fallback. This is the exact mechanism behind BanyanDB liaison instances storing node_type="" + * instead of "n/a". + */ +public class MALElvisFalsyTest { + + private static String tagAfterElvis(final String nodeTypeValue) { + final ImmutableMap<String, String> labels = nodeTypeValue == null + ? 
ImmutableMap.of("svc", "s") + : ImmutableMap.of("svc", "s", "node_type", nodeTypeValue); + final SampleFamily sf = SampleFamilyBuilder.newBuilder( + Sample.builder().name("metric").labels(labels).value(1.0).timestamp(1L).build()).build(); + final Expression expr = DSL.parse("test_elvis", + "metric.tag({tags -> tags['nt'] = tags.node_type ?: 'n/a'})"); + final Result r = expr.run(Map.of("metric", sf)); + return r.getData().samples[0].getLabels().get("nt"); + } + + private static String tagAfterSideEffectingElvis(final String nodeTypeValue) { + final SampleFamily sf = SampleFamilyBuilder.newBuilder( + Sample.builder() + .name("metric") + .labels(ImmutableMap.of("svc", "s", "node_type", nodeTypeValue)) + .value(1.0) + .timestamp(1L) + .build()).build(); + final Expression expr = DSL.parse("test_elvis_remove", + "metric.tag({tags -> tags['nt'] = tags.remove('node_type') ?: 'n/a'})"); + final Result r = expr.run(Map.of("metric", sf)); + return r.getData().samples[0].getLabels().get("nt"); + } + + @Test + void emptyStringPrimary_usesFallback() { + assertEquals("n/a", tagAfterElvis(""), "empty-string primary must fall through to 'n/a' (Groovy-falsy)"); + } + + @Test + void absentPrimary_usesFallback() { + assertEquals("n/a", tagAfterElvis(null), "absent (null) primary must fall through to 'n/a'"); + } + + @Test + void nonEmptyPrimary_keptAsIs() { + assertEquals("hot", tagAfterElvis("hot"), "non-empty primary must be kept"); + } + + @Test + void sideEffectingPrimary_evaluatedOnce() { + assertEquals("hot", tagAfterSideEffectingElvis("hot"), + "Elvis must not evaluate the primary twice; tags.remove(...) 
returns a value only once"); + } + + @Test + void runtimeTruthiness_matchesGroovyFalsyValues() { + assertFalse(MalRuntimeHelper.isTruthy(0)); + assertFalse(MalRuntimeHelper.isTruthy(0.0D)); + assertFalse(MalRuntimeHelper.isTruthy(Collections.emptyList())); + assertFalse(MalRuntimeHelper.isTruthy(Collections.emptyMap())); + assertFalse(MalRuntimeHelper.isTruthy(new String[0])); + assertTrue(MalRuntimeHelper.isTruthy(-1)); + assertTrue(MalRuntimeHelper.isTruthy(List.of("value"))); + } +} diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java index d65d7fb75bc1..f73bd824273d 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java @@ -99,20 +99,39 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage // resource that only this very apply would ever create. if (deferDDLToInitNode(opt)) { while (true) { - InstallInfo info = isExists(model, opt); - if (!info.isAllExist()) { - try { + boolean allExist; + try { + InstallInfo info = isExists(model, opt); + allExist = info.isAllExist(); + if (!allExist) { log.info( "install info: {}.table for model: [{}] not all required resources exist. OAP is running in 'no-init' mode, waiting create or update... retry 3s later.", info.buildInstallInfoMsg(), model.getName() ); - Thread.sleep(3000L); - } catch (InterruptedException e) { - log.error(e.getMessage()); } - } else { + } catch (final StorageException e) { + if (!isRetryableNoInitProbeFailure(e)) { + throw e; + } + // A transient backend error during the probe (e.g. 
a BanyanDB cluster data node + // still Init-ing, "client connection is closing") is NOT a reason to abort boot: + // the init OAP will create the resource and the next probe succeeds. Treat it like + // "not present yet" and retry in-loop, rather than letting it escape and crash-loop + // the pod — which would only re-enter this same loop after a full restart. + allExist = false; + log.warn("install info: existence probe for model: [{}] threw a transient backend " + + "error. OAP is running in 'no-init' mode, retry 3s later.", model.getName(), e); + } + if (allExist) { break; } + try { + Thread.sleep(3000L); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + throw new StorageException( + "interrupted while waiting for no-init backend resources for model " + model.getName(), e); + } } return; } @@ -170,6 +189,16 @@ protected static boolean deferDDLToInitNode(final StorageManipulationOpt opt) { return RunningMode.isNoInitMode() && opt.getFlags().isDeferDDLToInitNode(); } + /** + * Whether a {@link StorageException} from the no-init defer-loop existence probe is + * known to be transient and should be retried in-loop. The base implementation is + * conservative so permanent model/config errors do not become an infinite boot wait; + * storage backends opt in only for transport-level probe failures they can classify. 
+ */ + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + return false; + } + public void start() { } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java index d9cda58cd7ae..5ee2bbcaf374 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java @@ -26,6 +26,7 @@ import static org.junit.jupiter.api.Assertions.assertEquals; import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertThrows; import static org.junit.jupiter.api.Assertions.assertTimeoutPreemptively; import static org.junit.jupiter.api.Assertions.assertTrue; import static org.mockito.Mockito.mock; @@ -102,25 +103,103 @@ void withSchemaChangeSkipsCreateWhenResourceAlreadyExists() throws StorageExcept "withSchemaChange must not re-create a resource that already exists"); } + @Test + void noInitDeferLoopRetriesTransientProbeErrorInsteadOfCrashing() { + RunningMode.setMode("no-init"); + // The first existence probe throws a transient StorageException (mimicking a BanyanDB + // cluster data node still Init-ing); the next probe reports the resource present. + final RecordingInstaller installer = new RecordingInstaller(true /* present after transient */, + 1 /* one transient probe failure */, true /* retryable probe failure */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_transient"); + + // Must NOT propagate the transient (which would escape whenCreating and crash-loop the pod); + // must retry in-loop, then return on the defer path without creating. 10s covers the 3s sleep. 
+ assertTimeoutPreemptively(Duration.ofSeconds(10), () -> + installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent())); + assertEquals(0, installer.createTableCalls, + "a transient probe error must be retried, then defer to the init node without creating"); + assertTrue(installer.probeCalls >= 2, + "the loop must probe again after the transient instead of escaping on the first throw"); + } + + @Test + void noInitDeferLoopPropagatesNonRetryableProbeError() { + RunningMode.setMode("no-init"); + final RecordingInstaller installer = new RecordingInstaller(true /* unused */, + 1 /* one probe failure */, false /* permanent/non-retryable */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_bad_model"); + + assertThrows(StorageException.class, + () -> installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent()), + "permanent model/config probe failures must not be converted into an infinite no-init wait"); + assertEquals(1, installer.probeCalls, + "a non-retryable failure must escape without sleeping and probing again"); + assertEquals(0, installer.createTableCalls); + } + + @Test + void noInitDeferLoopPropagatesInterruptedSleep() { + RunningMode.setMode("no-init"); + final RecordingInstaller installer = new RecordingInstaller(false /* resource absent */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("static_metric_wait_interrupted"); + + Thread.currentThread().interrupt(); + try { + assertThrows(StorageException.class, + () -> installer.whenCreating(model, StorageManipulationOpt.schemaCreateIfAbsent()), + "an interrupted no-init wait must fail fast so shutdown can proceed"); + assertTrue(Thread.currentThread().isInterrupted(), + "the interrupt flag must be restored for upstream shutdown handling"); + } finally { + Thread.interrupted(); + } + } + /** Minimal concrete {@link ModelInstaller} that records createTable calls and reports a * fixed existence result, so the 
base whenCreating branching can be exercised without a - * real storage backend. */ + * real storage backend. Optionally throws a transient {@link StorageException} on the first + * {@code transientProbeFailures} existence probes to exercise the no-init defer-loop retry. */ private static final class RecordingInstaller extends ModelInstaller { private final boolean resourcePresent; + private final int transientProbeFailures; + private final boolean retryableProbeFailure; + private int probeCalls; private int createTableCalls; private RecordingInstaller(final boolean resourcePresent) { + this(resourcePresent, 0, false); + } + + private RecordingInstaller(final boolean resourcePresent, final int transientProbeFailures) { + this(resourcePresent, transientProbeFailures, true); + } + + private RecordingInstaller(final boolean resourcePresent, final int transientProbeFailures, + final boolean retryableProbeFailure) { super(null, null); this.resourcePresent = resourcePresent; + this.transientProbeFailures = transientProbeFailures; + this.retryableProbeFailure = retryableProbeFailure; } @Override - public InstallInfo isExists(final Model model, final StorageManipulationOpt opt) { + public InstallInfo isExists(final Model model, final StorageManipulationOpt opt) throws StorageException { + if (probeCalls++ < transientProbeFailures) { + throw new StorageException("transient backend error"); + } final TestInstallInfo info = new TestInstallInfo(model); info.setAllExist(resourcePresent); return info; } + @Override + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + return retryableProbeFailure; + } + @Override public void createTable(final Model model) { createTableCalls++; diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml index def21fc3086d..add9cdc45522 100644 --- 
a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml +++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-instance.yaml @@ -52,7 +52,7 @@ metricsRules: exp: banyandb_system_up_time # CPU usage (cores). process_* rides on every container including lifecycle. - name: cpu_usage - exp: process_cpu_seconds_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: process_cpu_seconds_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # resident memory (bytes). Raw gauge, present on all containers. - name: rss_memory exp: process_resident_memory_bytes @@ -75,45 +75,45 @@ metricsRules: exp: banyandb_system_disk.tagEqual('kind','used_percent').avg(['cluster','pod_name','container_name','node_role','node_type','path']) * 100 # network throughput (bytes/s) by interface name. - name: network_recv - exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S') + exp: banyandb_system_net_state.tagEqual('kind','bytes_recv').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT1M') - name: network_sent - exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT15S') + exp: banyandb_system_net_state.tagEqual('kind','bytes_sent').sum(['cluster','pod_name','container_name','node_role','node_type','name']).rate('PT1M') # Go runtime. - name: goroutines exp: go_goroutines # average GC pause (s) = rate(Σpause) / rate(Σcount). go_gc_duration_seconds is a summary (no buckets), # so this ratio of _sum/_count is the only valid average — do not apply histogram_percentile to it. 
- name: gc_pause_avg - exp: go_gc_duration_seconds_sum.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(go_gc_duration_seconds_count.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + exp: go_gc_duration_seconds_sum.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(go_gc_duration_seconds_count.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) - name: heap_inuse exp: go_memstats_heap_inuse_bytes - name: heap_next_gc exp: go_memstats_next_gc_bytes - name: alloc_rate - exp: go_memstats_alloc_bytes_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: go_memstats_alloc_bytes_total.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # ---- Liaison only (front door; the UI gates these on container_name == liaison) ---- # query rate (req/s) by data-model service (measure/stream/trace/property). method literal is "query". - name: liaison_query_rate - exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type','service']).rate('PT15S') + exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster','pod_name','container_name','node_role','node_type','service']).rate('PT1M') # gRPC errors/min. Three liaison-side error families (mirrors the Grafana "gRPC Error Rate" panel, # which sums total_err + registry_err + stream_msg_received_err). All lazily registered -> empty on a # healthy cluster; each pre-aggregated to the same label set before '+'. 
- name: liaison_grpc_error_rate - exp: (banyandb_liaison_grpc_total_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_registry_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) * 60 + exp: (banyandb_liaison_grpc_total_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_liaison_grpc_total_registry_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) * 60 # registry operation rate (req/s): schema registry ops on the liaison front door. total_started is # query-only on the wire, so the former tagNotEqual('method','query') term was empty and is dropped; # registry_started carries the non-query op count. - name: liaison_registry_op_rate - exp: banyandb_liaison_grpc_total_registry_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_liaison_grpc_total_registry_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # write rate (writes/s) seen at the liaison front door. group label dropped (instance-level total). 
- name: liaison_write_rate - exp: banyandb_measure_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_tst_total_written.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # tier-2 publish pipeline (liaison -> data): throughput by operation, bytes/s, and p99 send latency. - name: liaison_publish_throughput - exp: banyandb_queue_pub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_pub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: liaison_publish_bytes - exp: banyandb_queue_pub_sent_bytes.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_queue_pub_sent_bytes.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: liaison_publish_latency_p99 exp: banyandb_queue_pub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # tier-2 publish, batch granularity (BanyanDB #1169): batches published/s by operation and the batch @@ -121,7 +121,7 @@ metricsRules: # BUILD-GATED: _batch_finished/_batch_latency are absent on current builds -> emit nothing until the # shipped BanyanDB build registers them. 
- name: liaison_publish_batch_throughput - exp: banyandb_queue_pub_total_batch_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_pub_total_batch_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: liaison_publish_batch_latency_p99 exp: banyandb_queue_pub_total_batch_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # liaison write-queue (wqueue) depth: pending records buffered at the front door before publish. @@ -144,34 +144,34 @@ metricsRules: exp: banyandb_measure_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_tst_pending_data_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # merge-loop iterations/s. - name: data_merge_file_rate - exp: banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') # avg parts merged per merge loop on the file path (matches Grafana = rate(merged_parts{type=file}) / # rate(merge_loop_started)). 
type='file' is data-only on the wire (liaison emits only type='mem'). - name: data_merge_file_partitions - exp: banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_stream_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_trace_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + exp: banyandb_measure_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_stream_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_trace_tst_total_merged_parts.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) # avg file-merge latency (ms) per merge loop. 
- name: data_merge_file_latency - exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_stream_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S')) + banyandb_trace_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S'))) * 1000 + exp: (banyandb_measure_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_measure_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_stream_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_stream_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M')) + banyandb_trace_tst_total_merge_latency.tagEqual('type','file').sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M').safeDiv(banyandb_trace_tst_total_merge_loop_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M'))) * 1000 # inverted-index (series) write rate / term-search rate / total docs. *_inverted_index_total_* are # # TYPE=gauge but cumulative, so rate() yields a per-window delta. 
Stream's series index is the # storage scope (stream_storage_*); the tst scope is reported separately below. Trace's series index # (trace_storage_*) is included so trace series writes/docs are not silently dropped. - name: data_series_write_rate - exp: banyandb_measure_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_trace_storage_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_series_term_search_rate - exp: banyandb_measure_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + banyandb_trace_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_measure_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + banyandb_stream_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') + 
banyandb_trace_storage_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_total_series exp: banyandb_measure_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_stream_storage_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) + banyandb_trace_storage_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # stream time-series-table (tst) index, distinct from the stream series (storage) index above. - name: data_stream_tst_write_rate - exp: banyandb_stream_tst_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_stream_tst_inverted_index_total_updates.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_stream_tst_term_search_rate - exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT15S') + exp: banyandb_stream_tst_inverted_index_total_term_searchers_started.sum(['cluster','pod_name','container_name','node_role','node_type']).rate('PT1M') - name: data_stream_tst_total_docs exp: banyandb_stream_tst_inverted_index_total_doc_count.sum(['cluster','pod_name','container_name','node_role','node_type']) # subscribe-side queue (data receives from liaison): throughput by operation + p99 latency. 
- name: data_queue_sub_throughput - exp: banyandb_queue_sub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_sub_total_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') - name: data_queue_sub_latency_p99 exp: banyandb_queue_sub_total_latency.sum(['cluster','pod_name','container_name','node_role','node_type','operation','le']).histogram().histogram_percentile([99]) # subscribe-side per-message throughput (BanyanDB #1169). A data node ingests writes via the @@ -181,7 +181,7 @@ metricsRules: # intentionally not modeled here. Batch-level granularity lives on the liaison's publish side # (liaison_publish_batch_throughput / liaison_publish_batch_latency_p99 above). - name: data_queue_sub_message_throughput - exp: banyandb_queue_sub_total_message_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT15S') + exp: banyandb_queue_sub_total_message_finished.sum(['cluster','pod_name','container_name','node_role','node_type','operation']).rate('PT1M') # retention disk-usage % per data-model scope (0-100 gauge). Kept per scope rather than summed (a sum # of three percentages is meaningless). Not in the upstream Grafana boards; a SkyWalking addition. - name: data_retention_measure_disk_usage_percent diff --git a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml index 97c6cac8f6c8..45d505d8f43c 100644 --- a/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml +++ b/oap-server/server-starter/src/main/resources/otel-rules/banyandb/banyandb-service.yaml @@ -29,11 +29,11 @@ metricsRules: # cluster writes/s across the three data-model scopes (measure, stream, trace). Each scope's # write counter is collapsed to one per-cluster series before `+`. 
- name: cluster_write_rate - exp: (banyandb_measure_total_written.sum(['cluster']).rate('PT15S') + banyandb_stream_tst_total_written.sum(['cluster']).rate('PT15S') + banyandb_trace_tst_total_written.sum(['cluster']).rate('PT15S')) + exp: (banyandb_measure_total_written.sum(['cluster']).rate('PT1M') + banyandb_stream_tst_total_written.sum(['cluster']).rate('PT1M') + banyandb_trace_tst_total_written.sum(['cluster']).rate('PT1M')) # cluster queries/s. `service` on this family is BanyanDB's data-model facet # (measure/stream/trace/property), not a SkyWalking service; method literal is "query". - name: cluster_query_rate - exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster']).rate('PT15S') + exp: banyandb_liaison_grpc_total_started.tagEqual('method','query').sum(['cluster']).rate('PT1M') # cluster errors/min. The seven liaison-side error families mirror the upstream Grafana # "Error Rate" stat (grafana-fodc-workload.json). Each is pre-aggregated to ['cluster'] # BEFORE `+` because their wire label sets differ (stream_msg_received_err carries @@ -42,7 +42,7 @@ metricsRules: # registered and emit no series; MAL treats an empty operand as the additive identity, so the # sum emits from whatever has fired and renders absent-as-0 when nothing has. 
- name: cluster_error_rate - exp: (banyandb_liaison_grpc_total_err.sum(['cluster']).rate('PT15S') + banyandb_liaison_grpc_total_registry_err.sum(['cluster']).rate('PT15S') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster']).rate('PT15S') + banyandb_queue_pub_total_err.sum(['cluster']).rate('PT15S') + banyandb_measure_total_sync_loop_err.sum(['cluster']).rate('PT15S') + banyandb_stream_tst_total_sync_loop_err.sum(['cluster']).rate('PT15S') + banyandb_trace_tst_total_sync_loop_err.sum(['cluster']).rate('PT15S')) * 60 + exp: (banyandb_liaison_grpc_total_err.sum(['cluster']).rate('PT1M') + banyandb_liaison_grpc_total_registry_err.sum(['cluster']).rate('PT1M') + banyandb_liaison_grpc_total_stream_msg_received_err.sum(['cluster']).rate('PT1M') + banyandb_queue_pub_total_err.sum(['cluster']).rate('PT1M') + banyandb_measure_total_sync_loop_err.sum(['cluster']).rate('PT1M') + banyandb_stream_tst_total_sync_loop_err.sum(['cluster']).rate('PT1M') + banyandb_trace_tst_total_sync_loop_err.sum(['cluster']).rate('PT1M')) * 60 # live container count by role. count(['cluster','container_name','pod_name']) groups by all # three then re-groups excluding the last key (pod_name), yielding one sample per # (cluster, container_name) whose value = distinct pod_name count -> data=N, liaison=M. 
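Note on the rate() terms in the rules above: the commit message says each counter's sliding window is now keyed by the counter's own metric name rather than by the rule's output name. The following is a minimal sketch of why the old keying could fabricate a rate from frozen counters; it is not the actual CounterWindow code. The class name, the slot-key format, the pod label and the sample values are all hypothetical; only the two banyandb_liaison_grpc_* counter names and the rule name come from the rules above.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterWindowSketch {
    /** Last observed sample for one window slot (hypothetical structure). */
    static final class Sample {
        final long epochSeconds;
        final double value;

        Sample(final long epochSeconds, final double value) {
            this.epochSeconds = epochSeconds;
            this.value = value;
        }
    }

    private final Map<String, Sample> slots = new HashMap<>();

    /** Per-second rate against the previous sample stored under slotKey; 0 when there is none. */
    double rate(final String slotKey, final long epochSeconds, final double value) {
        final Sample previous = slots.put(slotKey, new Sample(epochSeconds, value));
        if (previous == null || epochSeconds <= previous.epochSeconds) {
            return 0.0;
        }
        return (value - previous.value) / (epochSeconds - previous.epochSeconds);
    }

    /** Feeds two frozen counters through one window store and returns the largest absolute rate seen. */
    static double largestRate(final boolean keyByCounterName) {
        final CounterWindowSketch window = new CounterWindowSketch();
        final String rule = "liaison_grpc_error_rate";
        final String labels = "{pod_name=liaison-0}"; // hypothetical label set left after .sum()
        final String[] counters = {
            "banyandb_liaison_grpc_total_err",
            "banyandb_liaison_grpc_total_registry_err",
            "banyandb_liaison_grpc_total_err",
            "banyandb_liaison_grpc_total_registry_err"
        };
        final long[] times = {0, 30, 60, 90};
        final double[] values = {100, 250, 100, 250}; // neither counter ever changes
        double largest = 0.0;
        for (int i = 0; i < counters.length; i++) {
            // Old behaviour: every input counter of the rule shares one slot per label set.
            // New behaviour: each counter owns its slot.
            final String name = keyByCounterName ? counters[i] : rule;
            final double r = window.rate(name + labels, times[i], values[i]);
            largest = Math.max(largest, Math.abs(r));
        }
        return largest;
    }

    public static void main(final String[] args) {
        System.out.println("keyed by rule name:    " + largestRate(false));
        System.out.println("keyed by counter name: " + largestRate(true));
    }
}
```

With the shared key, each counter is rated against the other counter's last value, so two counters that never move still produce a non-zero result; with per-counter keys every delta is zero.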
diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index cabfb75276cd..34c00bfabffd 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -19,9 +19,11 @@ package org.apache.skywalking.oap.server.storage.plugin.banyandb; import io.grpc.Status; +import java.time.Duration; import java.util.HashMap; import java.util.HashSet; import java.util.List; +import java.util.Locale; import java.util.Map; import java.util.Set; import java.util.stream.Collectors; @@ -40,7 +42,6 @@ import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.IndexRuleBinding; import org.apache.skywalking.banyandb.schema.v1.BanyandbSchema.SchemaKey; import org.apache.skywalking.banyandb.database.v1.BanyandbDatabase.TopNAggregation; -import java.time.Duration; import org.apache.skywalking.library.banyandb.v1.client.BanyanDBClient; import org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher; import org.apache.skywalking.library.banyandb.v1.client.grpc.exception.BanyanDBException; @@ -103,6 +104,51 @@ public class BanyanDBIndexInstaller extends ModelInstaller { public BanyanDBIndexInstaller(Client client, ModuleManager moduleManager, BanyanDBStorageConfig config) { super(client, moduleManager); this.config = config; + // Let read/persist paths self-heal a missing local schema entry (MetadataRegistry.repopulateLocally): + // re-derive the model's Schema locally with zero server RPC via the same primitive the peer + // boot path uses. 
This closes the " is not registered" flood that arises when a + // withoutSchemaChange peer apply or a runtime-rule bundled fall-over rebuilds the dispatch + // worker but skips the populate. DownSamplingConfigService is resolved lazily per call — a + // self-heal only fires post-boot, when CoreModule is long up. + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(model -> { + final DownSamplingConfigService downSamplingConfigService = moduleManager.find(CoreModule.NAME) + .provider() + .getService(DownSamplingConfigService.class); + registerLocallyByKind(model, downSamplingConfigService); + }); + } + + @Override + protected boolean isRetryableNoInitProbeFailure(final StorageException e) { + Throwable cause = e.getCause(); + while (cause != null) { + if (cause instanceof BanyanDBException) { + return isTransientBanyanDBProbeFailure((BanyanDBException) cause); + } + cause = cause.getCause(); + } + return false; + } + + private static boolean isTransientBanyanDBProbeFailure(final BanyanDBException e) { + final Status.Code code = e.getStatus(); + if (Status.Code.UNAVAILABLE.equals(code) + || Status.Code.DEADLINE_EXCEEDED.equals(code) + || Status.Code.CANCELLED.equals(code) + || Status.Code.RESOURCE_EXHAUSTED.equals(code) + || Status.Code.ABORTED.equals(code)) { + return true; + } + if (!Status.Code.UNKNOWN.equals(code)) { + return false; + } + final String message = String.valueOf(e.getMessage()).toLowerCase(Locale.ROOT); + return message.contains("client connection is closing") + || message.contains("connection is closing") + || message.contains("transport is closing") + || message.contains("connection refused") + || message.contains("connection reset") + || message.contains("broken pipe"); } @Override diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java 
b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java index 09bd256e14a4..98763aa3e115 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBNoneStreamDAO.java @@ -40,7 +40,13 @@ public BanyanDBNoneStreamDAO(BanyanDBStorageClient client, StorageBuilder registry = new HashMap<>(); + // ConcurrentHashMap (not HashMap): boot populates single-threaded, but the self-heal path + // (repopulateLocally) writes from persistence/query threads concurrently with reads. + private final Map registry = new ConcurrentHashMap<>(); + + /** + * Re-derive and locally register a model's BanyanDB {@link Schema} with NO server RPC. + * Registered once by the active {@code BanyanDBIndexInstaller} at boot and invoked by + * {@link #repopulateLocally(Model)} when a read path finds the cache empty for a model whose + * dispatch worker is already live — e.g. a {@code withoutSchemaChange} peer apply or a + * runtime-rule bundled fall-over rebuilt the worker but skipped the local populate. The + * {@code Model} is always known locally and its schema is a pure local derivation, so such a + * miss is always re-derivable without touching the backend. + */ + @FunctionalInterface + public interface LocalSchemaPopulator { + void populateLocally(Model model); + } + + private volatile LocalSchemaPopulator localSchemaPopulator; + + /** Register the boot-time, RPC-free local schema populator. Called once by the active installer. 
*/ + public void registerLocalSchemaPopulator(final LocalSchemaPopulator populator) { + this.localSchemaPopulator = populator; + } + + /** + * Best-effort, RPC-free re-derivation of a model's local {@link Schema} so a read/persist path + * can self-heal a missing cache entry instead of throwing {@code " is not registered"} + * forever (the registry never evicts, so an entry that was never populated on this node stays + * absent otherwise). No-op when no populator is registered (e.g. non-BanyanDB unit tests). + * Swallows derivation exceptions so a self-heal attempt is never worse than the pre-existing + * throw — the caller re-reads and surfaces its own not-registered error if still absent. + */ + public void repopulateLocally(final Model model) { + final LocalSchemaPopulator populator = this.localSchemaPopulator; + if (populator == null) { + return; + } + try { + populator.populateLocally(model); + } catch (final Exception e) { + log.debug("local schema self-heal re-derivation failed for model [{}]; " + + "caller will surface the not-registered error", model.getName(), e); + } + } public StreamModel registerStreamModel(Model model, BanyanDBStorageConfig config) { final SchemaMetadata schemaMetadata = parseMetadata(model, config, null); diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java index 4f1ff1928e88..d52c70ec5ffd 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/measure/BanyanDBMetricsDAO.java @@ -63,12 +63,29 @@ public 
BanyanDBMetricsDAO(BanyanDBStorageClient client, StorageBuilder this.storageBuilder = storageBuilder; } - @Override - public List multiGet(Model model, List metrics) throws IOException { + /** + * Resolve the model's BanyanDB schema, self-healing a missing local entry once before failing. + * A null here means this node has a live persist worker for the model but its schema cache was + * never populated (or lost) — typically a {@code withoutSchemaChange} peer apply or a + * runtime-rule bundled fall-over that rebuilt the worker without the populate. Re-derive the + * schema locally with no server RPC and re-read; throw only if the entry is still absent, so + * a genuinely unknown model still fails fast instead of flooding forever. + */ + private MetadataRegistry.Schema resolveSchema(Model model) throws IOException { MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); if (schema == null) { - throw new IOException(model.getName() + " is not registered"); + MetadataRegistry.INSTANCE.repopulateLocally(model); + schema = MetadataRegistry.INSTANCE.findMetadata(model); + if (schema == null) { + throw new IOException(model.getName() + " is not registered"); + } } + return schema; + } + + @Override + public List multiGet(Model model, List metrics) throws IOException { + MetadataRegistry.Schema schema = resolveSchema(model); final Map> seriesIDColumns = new HashMap<>(); if (model.getBanyanDBModelExtension().isIndexMode()) { seriesIDColumns.put(ID, new ArrayList<>()); @@ -144,10 +161,7 @@ protected void apply(MeasureQuery query) { @Override public InsertRequest prepareBatchInsert(Model model, Metrics metrics, SessionCacheCallback callback) throws IOException { - MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); - if (schema == null) { - throw new IOException(model.getName() + " is not registered"); - } + MetadataRegistry.Schema schema = resolveSchema(model); MeasureWrite measureWrite = 
getClient().createMeasureWrite(schema.getMetadata().getGroup(), // group name schema.getMetadata().name(), // measure-name TimeBucket.getTimestamp(metrics.getTimeBucket(), model.getDownsampling())); // timestamp @@ -161,10 +175,7 @@ public InsertRequest prepareBatchInsert(Model model, Metrics metrics, SessionCac @Override public UpdateRequest prepareBatchUpdate(Model model, Metrics metrics, SessionCacheCallback callback) throws IOException { - MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); - if (schema == null) { - throw new IOException(model.getName() + " is not registered"); - } + MetadataRegistry.Schema schema = resolveSchema(model); MeasureWrite measureWrite = getClient().createMeasureWrite(schema.getMetadata().getGroup(), // group name schema.getMetadata().name(), // measure-name TimeBucket.getTimestamp(metrics.getTimeBucket(), model.getDownsampling())); // timestamp diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java index 8bb0d28f8009..4632982b7631 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/stream/BanyanDBRecordDAO.java @@ -50,7 +50,13 @@ public BanyanDBRecordDAO(BanyanDBStorageClient client, StorageBuilder st @Override public InsertRequest prepareBatchInsert(Model model, Record record) throws IOException { + // Self-heal a missing local schema entry once (RPC-free re-derivation) before failing — + // see MetadataRegistry.repopulateLocally. Throw only if the entry is still absent. 
MetadataRegistry.Schema schema = MetadataRegistry.INSTANCE.findMetadata(model); + if (schema == null) { + MetadataRegistry.INSTANCE.repopulateLocally(model); + schema = MetadataRegistry.INSTANCE.findMetadata(model); + } if (schema == null) { throw new IOException(model.getName() + " is not registered"); } @@ -60,6 +66,9 @@ public InsertRequest prepareBatchInsert(Model model, Record record) throws IOExc if (record instanceof BanyanDBTrace.MergeTable) { BanyanDBTrace.MergeTable mergeTable = (BanyanDBTrace.MergeTable) record; MetadataRegistry.Schema mergeTableSchema = MetadataRegistry.INSTANCE.findRecordMetadata(mergeTable.getMergeTableName()); + if (mergeTableSchema == null) { + throw new IOException(mergeTable.getMergeTableName() + " is not registered"); + } traceWrite = getClient().createTraceWrite( schema.getMetadata().getGroup(), diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java new file mode 100644 index 000000000000..479644ca94c6 --- /dev/null +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.storage.plugin.banyandb; + +import java.util.concurrent.atomic.AtomicInteger; +import org.apache.skywalking.oap.server.core.storage.model.Model; +import org.junit.jupiter.api.AfterEach; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.when; + +/** + * Unit coverage for the local schema-cache self-heal on {@link MetadataRegistry}. A read/persist + * path that finds the cache empty for a model whose dispatch worker is already live (e.g. a + * {@code withoutSchemaChange} peer apply or a runtime-rule bundled fall-over that rebuilt the + * worker but skipped the populate) must be able to re-derive the schema locally with no server + * RPC, instead of throwing {@code " is not registered"} forever. + */ +class MetadataRegistryTest { + + @AfterEach + void clearPopulator() { + // MetadataRegistry is an enum singleton; clear the populator so global state set by a test + // does not leak into others. 
+ MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(null); + } + + @Test + void repopulateLocallyInvokesRegisteredPopulator() { + final Model model = mock(Model.class); + when(model.getName()).thenReturn("meter_test_metric"); + final AtomicInteger calls = new AtomicInteger(); + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(m -> calls.incrementAndGet()); + + MetadataRegistry.INSTANCE.repopulateLocally(model); + + assertEquals(1, calls.get(), "a registered populator must be invoked on a self-heal attempt"); + } + + @Test + void repopulateLocallyIsNoOpWhenNoPopulatorRegistered() { + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(null); + final Model model = mock(Model.class); + assertDoesNotThrow(() -> MetadataRegistry.INSTANCE.repopulateLocally(model), + "self-heal with no populator (e.g. a non-BanyanDB context) must be a no-op"); + } + + @Test + void repopulateLocallySwallowsPopulatorError() { + final Model model = mock(Model.class); + when(model.getName()).thenReturn("meter_test_metric"); + MetadataRegistry.INSTANCE.registerLocalSchemaPopulator(m -> { + throw new RuntimeException("derivation boom"); + }); + + // A failed re-derivation must never be worse than the pre-existing throw: the caller + // re-reads and surfaces its own not-registered error, so repopulateLocally itself must + // not propagate. + assertDoesNotThrow(() -> MetadataRegistry.INSTANCE.repopulateLocally(model), + "a throwing populator must be swallowed so self-heal never worsens the failure"); + } +} From a684d6135a04082e8403f1bf1644a74d4b4b42ee Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Sun, 14 Jun 2026 22:19:02 +0800 Subject: [PATCH 02/19] CI: bump GHA third-party actions to ASF-approved v4 SHAs ASF infrastructure-actions approved_patterns.yml dropped the v3 SHAs for these actions, so the stale pins were rejected and the CI workflow failed with startup_failure. 
Updated to the newest approved v4 SHA each:

* docker/login-action v3.7.0 -> v4.2.0 (650006c6)
* docker/setup-buildx-action v3.12.0 -> v4.1.0 (d7f5e7f5)
* docker/setup-qemu-action v3.6.0 -> v4.1.0 (06116385)
* dorny/paths-filter v3.0.2 -> v4.0.1 (fbd0ab8f)
---
 .github/workflows/publish-docker-e2e-service.yaml |  6 +++---
 .github/workflows/publish-docker.yaml             |  6 +++---
 .github/workflows/skywalking.yaml                 | 10 +++++-----
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/publish-docker-e2e-service.yaml b/.github/workflows/publish-docker-e2e-service.yaml
index e96ab745eb6f..0887df009d12 100644
--- a/.github/workflows/publish-docker-e2e-service.yaml
+++ b/.github/workflows/publish-docker-e2e-service.yaml
@@ -51,14 +51,14 @@ jobs:
           restore-keys: |
             ${{ runner.os }}-maven-
       - name: Log in to the Container registry
-        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
+        uses: docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0
         with:
           registry: ${{ env.HUB }}
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
       - name: Set up QEMU
-        uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392
+        uses: docker/setup-qemu-action@06116385d9baf250c9f4dcb4858b16962ea869c3 # v4.1.0
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f
+        uses: docker/setup-buildx-action@d7f5e7f509e45cec5c76c4d5afdd7de93d0b3df5 # v4.1.0
       - name: Build and push images
         run: make -C test build.e2e-service docker.push-e2e-service
diff --git a/.github/workflows/publish-docker.yaml b/.github/workflows/publish-docker.yaml
index 765c8e2340dd..41bd15b3fa5d 100644
--- a/.github/workflows/publish-docker.yaml
+++ b/.github/workflows/publish-docker.yaml
@@ -65,15 +65,15 @@ jobs:
             echo "TAG=${{ github.sha }}" >> $GITHUB_ENV
           fi
       - name: Log in to the Container registry
-        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
+        uses:
docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0 with: registry: ${{ env.DOCKER_REGISTRY }} username: ${{ env.DOCKER_USERNAME }} password: ${{ env.DOCKER_PASSWORD }} - name: Set up QEMU - uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 + uses: docker/setup-qemu-action@06116385d9baf250c9f4dcb4858b16962ea869c3 # v4.1.0 - name: Set up Docker Buildx - uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f + uses: docker/setup-buildx-action@d7f5e7f509e45cec5c76c4d5afdd7de93d0b3df5 # v4.1.0 - name: Build and push docker images based on Java 11 env: SW_OAP_BASE_IMAGE: eclipse-temurin:11-jre diff --git a/.github/workflows/skywalking.yaml b/.github/workflows/skywalking.yaml index 05adfffa9e65..29102a7fc6e0 100644 --- a/.github/workflows/skywalking.yaml +++ b/.github/workflows/skywalking.yaml @@ -122,7 +122,7 @@ jobs: persist-credentials: false - name: Filter id: filter - uses: dorny/paths-filter@de90cc6fb38fc0963ad72b210f1f284cd68cea36 + uses: dorny/paths-filter@fbd0ab8f3e69293af611ebaee6363fc25e6d187d # v4.0.1 with: list-files: 'shell' predicate-quantifier: 'every' @@ -765,7 +765,7 @@ jobs: name: dist path: dist - name: Login to ghcr - uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 + uses: docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0 with: registry: ghcr.io username: ${{ github.repository_owner }} @@ -854,7 +854,7 @@ jobs: find docker-images -name "*.tar" -exec docker load -i {} \; find docker-images -name "*.tar" -exec rm {} \; - name: Login to ghcr - uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 + uses: docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0 with: registry: ghcr.io username: ${{ github.repository_owner }} @@ -926,7 +926,7 @@ jobs: find docker-images -name "*.tar" -exec docker load -i {} \; find docker-images -name "*.tar" -exec rm {} \; - name: Login to ghcr - uses: 
docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
+        uses: docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
@@ -1042,7 +1042,7 @@ jobs:
           name: dist
           path: dist
       - name: Login to ghcr
-        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
+        uses: docker/login-action@650006c6eb7dba73a995cc03b0b2d7f5ca915bee # v4.2.0
         with:
           registry: ghcr.io
           username: ${{ github.repository_owner }}
From 0032d96495e1241c9478f11f0df784a798027894 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Sun, 14 Jun 2026 23:29:22 +0800
Subject: [PATCH 03/19] Fix MALExpressionExecutionTest isolation after CounterWindow key change
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The v2 MAL CounterWindow collision fix re-keyed rate()/increase() windows
on each counter's own sample name instead of the rule-level
context.metricName. MALExpressionExecutionTest relied on
context.metricName (set to a unique sourceFile/metricName) to keep each
rule's prime/real pair isolated in the process-wide
CounterWindow.INSTANCE singleton; the new keying ignores that field, so
leftover samples from one rule leaked into the next across the ~1350
sequential dynamic tests, producing wrong/negative deltas (e.g. 8.333 =
50/6, a lower bound pulled from an earlier rule).

Reset CounterWindow.INSTANCE per rule (the pattern
BanyanDBErrorRateReproTest already uses via @BeforeEach) and drop the
now-dead setMetricName scaffolding (context.metricName has no readers
after the keying change). No production code or expected values changed;
1350/1350 tests pass.
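
For illustration only, the two effects described above can be sketched
with a toy window. This is a simplified model, not the real CounterWindow:
the class CounterWindowSketch, its methods, and the key and counter names
are all hypothetical, and the window here is not time-bounded. It only
assumes that a delta is taken against the oldest sample stored under a key.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative stand-in for a process-wide counter window. NOT the real
 * CounterWindow API: every name here is hypothetical, and the model is
 * reduced to "delta against the oldest sample stored under a key".
 */
public class CounterWindowSketch {
    // Oldest value seen per key (the real window is time-bounded; this is not).
    private final Map<String, Double> oldest = new HashMap<>();

    /** Delta of value against the oldest sample under key; 0 on first sight. */
    public double increase(String key, double value) {
        Double lower = oldest.putIfAbsent(key, value);
        return lower == null ? 0.0 : value - lower;
    }

    /** Forget every stored sample, as a per-rule test reset would. */
    public void reset() {
        oldest.clear();
    }

    /** Two frozen counters scraped twice; returns the summed delta of the second scrape. */
    public static double frozenCountersDelta(boolean keyByOwnName) {
        CounterWindowSketch window = new CounterWindowSketch();
        String[] names = {"errors_a", "errors_b"};
        double[] values = {100.0, 40.0}; // identical on both scrapes
        double sum = 0.0;
        for (int scrape = 0; scrape < 2; scrape++) {
            sum = 0.0;
            for (int i = 0; i < names.length; i++) {
                // Shared key models keying on the rule's output metric name.
                String key = keyByOwnName ? names[i] : "rule_output_metric";
                sum += window.increase(key, values[i]);
            }
        }
        return sum;
    }

    /** An earlier rule leaves a sample behind, then the next rule runs prime + real. */
    public static double secondRuleDelta(boolean resetBetweenRules) {
        CounterWindowSketch window = new CounterWindowSketch();
        window.increase("requests_total", 50.0); // leftover from an earlier rule
        if (resetBetweenRules) {
            window.reset();
        }
        window.increase("requests_total", 3.0);        // prime
        return window.increase("requests_total", 6.0); // real
    }

    public static void main(String[] args) {
        System.out.println("shared key, frozen counters: " + frozenCountersDelta(false));
        System.out.println("own key, frozen counters:    " + frozenCountersDelta(true));
        System.out.println("second rule, no reset:       " + secondRuleDelta(false));
        System.out.println("second rule, with reset:     " + secondRuleDelta(true));
    }
}
```

In this model a shared key makes two unchanged counters rate against each
other, while per-counter keys report no change. Per-counter keys also mean
a counter name reused by a later rule sees the earlier rule's sample unless
the window is cleared first, which is the case the per-rule reset covers.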
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../mal/MALExpressionExecutionTest.java | 26 +++++++------------ 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java index 1d0f22a1ec18..296be78fb5b3 100644 --- a/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java +++ b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/java/org/apache/skywalking/oap/server/dsl/tester/mal/MALExpressionExecutionTest.java @@ -198,33 +198,27 @@ private void executeWithInput( final Map inputSection, final Map expectedSection) { final String metricName = rule.getName(); - // Unique per file+rule to isolate CounterWindow entries across files - final String cwMetricName = rule.getSourceFile().getName() + "/" + metricName; final String expression = rule.getFullExpression(); final boolean hasIncrease = expression.contains(".increase(") || expression.contains(".rate("); - // v2 prime + v2 real (also consecutive, same delta) + // rate()/increase() resolve their lower bound from the process-wide + // CounterWindow.INSTANCE, keyed by each counter's own (name, labels) — not + // the rule-level metric name. Input counter names recur across rules and + // files, so without a reset one rule's prime/real pair would rate against + // another rule's leftover window samples. Clear it so each rule is isolated + // to its own prime (t0) + real (t0+2s) pair. 
+ org.apache.skywalking.oap.meter.analyzer.v2.dsl.counter.CounterWindow.INSTANCE.reset(); + + // v2 prime + v2 real (consecutive scrapes 2 s apart, same delta) final Map v2Data; if (hasIncrease) { try { - final Map primeData = - buildV2MockDataFromInput(inputSection, 0.5); - for (final org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily s : primeData.values()) { - if (s != org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily.EMPTY) { - s.context.setMetricName(cwMetricName); - } - } - v2MalExpr.run(primeData); + v2MalExpr.run(buildV2MockDataFromInput(inputSection, 0.5)); } catch (Exception ignored) { } } v2Data = buildV2MockDataFromInput(inputSection, 1.0); - for (final org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily s : v2Data.values()) { - if (s != org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily.EMPTY) { - s.context.setMetricName(cwMetricName); - } - } // V2 run org.apache.skywalking.oap.meter.analyzer.v2.dsl.SampleFamily v2Sf; From 18bcf0fb601cc7d57b6c6ccb69e2f284f208fe78 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 10:04:45 +0800 Subject: [PATCH 04/19] Couple BanyanDB local schema-cache to worker lifecycle; refresh tip #16 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The runtime-rule schema-cache self-heal only filled a MISSING MetadataRegistry entry, never refreshed a STALE one. A cluster peer applies schema changes with withoutSchemaChange (inspectBackend=false), whose contract says the installer "must populate local caches from the declared model" — but whenCreating gated ahead of isExists and skipped that populate. Combined with an insert-only registry that never evicts, a reshape (remove+add) left the peer translating writes with the old shape; a drop left a stale entry behind. C-1: ModelInstaller.whenCreating now calls a new RPC-free populateLocalCacheOnly hook on the inspectBackend=false branch (honoring the flag contract). 
BanyanDBIndexInstaller overrides it -> registerLocallyByKind, a blind overwrite, so a reshape's re-add refreshes the entry. No-op for ES/JDBC. C-2: ModelInstaller.whenRemoving now calls a new evictLocalCache hook on both the peer (skip-drop) and main (post-dropTable) paths. BanyanDBIndexInstaller overrides it -> MetadataRegistry.evict(model), keyed exactly as findMetadata, so a dropped/reshaped model leaves no stale translation. The read-side self-heal stays as a defensive backstop. Also refresh CLAUDE.md tip #16: etcd was removed; schema now lives in BanyanDB's _schema property store, mod_revision is a client-stamped UnixNano timestamp, and data-node propagation is async (WatchSchemas + 30s reconcile), so the fence is still required. API names unchanged; SchemaWatcher lives in OAP's in-tree client. Tests: ModelInstallerNoInitTest +3 (populate-on-peer-create, evict-on-peer-remove, drop-then-evict-on-main); MetadataRegistryTest +1 (evict across all key branches). Co-Authored-By: Claude Opus 4.8 (1M context) --- CLAUDE.md | 2 +- docs/en/changes/changes.md | 2 +- .../core/storage/model/ModelInstaller.java | 43 +++++++++++- .../model/ModelInstallerNoInitTest.java | 70 +++++++++++++++++++ .../banyandb/BanyanDBIndexInstaller.java | 18 +++++ .../plugin/banyandb/MetadataRegistry.java | 18 +++++ .../plugin/banyandb/MetadataRegistryTest.java | 29 ++++++++ 7 files changed, 179 insertions(+), 3 deletions(-) diff --git a/CLAUDE.md b/CLAUDE.md index 7503cdf517e8..99a004e7964e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -328,7 +328,7 @@ Actions owned by `actions/*` (GitHub), `github/*`, and `apache/*` are always all 13. **`moduleManager.find(X.NAME)` requires `X.NAME` in `requiredModules()`**: every call to `moduleManager.find(SomeModule.NAME)` (direct or through a helper) must have `SomeModule.NAME` in the provider's `requiredModules()` array. Missing declarations cause runtime exceptions the first time the code path fires — not at module boot. Wrapping the call in `try { ... 
} catch (Throwable)` is NOT a substitute; declare the module and keep the try/catch only for defensive handling of transient provider outages. When auditing a branch, grep for `moduleManager.find(` across the touched module and verify each target name appears in `requiredModules()`. Example modules that frequently catch teams out: `AlarmModule` (used by alarm-kernel reset), `LogAnalyzerModule` (used by LAL factory lookup). 14. **Don't look up `ClusterModule` services directly**: the `ClusterModule` (ZooKeeper / K8s / Nacos coordination) exposes `ClusterRegister` / `ClusterNodesQuery` / `ClusterCoordinator`. Most receiver / analyzer modules don't declare `ClusterModule` in `requiredModules()`, so calling `moduleManager.find(ClusterModule.NAME)` will throw at runtime. Instead, go through `CoreModule`'s `RemoteClientManager` service — it's already populated by the cluster module and exposes the peer list every OAP needs. If a module genuinely needs cluster-coordinator primitives, declare `ClusterModule.NAME` in `requiredModules()` explicitly. 15. **No `ThreadLocal` side-channels to hijack downstream behaviour**: routing a caller's intent through a `ThreadLocal` that downstream code reads (e.g., `if (PeerMode.isActive()) skipSomething()`) is almost always the wrong answer — it creates invisible coupling between far-apart code paths, leaks across async hand-offs (executors, gRPC threads, Armeria event loops), and makes the behaviour impossible to understand from a method signature. The correct fix is almost always to **extend the interface** — add a parameter, a new method, a new mode enum that appears in the signature. Rare exceptions: propagating OpenTelemetry context where the whole industry has standardised on `ThreadLocal`, or security principals enforced by a framework. In all other cases, prefer an explicit API extension, even if it costs more lines. -16. 
**BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: every BanyanDB Create / Update / Delete returns an etcd `mod_revision` (0 on a delete that didn't record a tombstone). After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks until every data node has caught up, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom existed before the `SchemaBarrierService` proto landed and has been replaced — do not reintroduce it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence. +16. **BanyanDB schema-visibility: fence on `mod_revision`, do NOT poll metadata**: schema no longer lives in etcd — etcd was removed. It is now stored as property documents in BanyanDB's own `_schema` store and propagated to each data node's in-memory cache **asynchronously** (a `SchemaUpdateService.WatchSchemas` gRPC stream + a 30s reconcile sync, see `banyand/metadata/schema/property/client.go`). DDL `Create` / `Update` / `Delete` broadcast only to the schema-server (`ROLE_META`) nodes and **return immediately with a `mod_revision`** — now a client-stamped `time.Now().UnixNano()` timestamp, NOT an etcd index (which is why it still changes on every content-changing DDL; it is `0` on a delete that didn't record a tombstone) — **without waiting for data nodes**. A data write whose schema hasn't propagated yet is dropped at the data-node executor (`cannot find measure definition`, logged + skipped, `banyand/measure/write_standalone.go`), so the fence is still required — arguably more than under the old etcd-watch model. 
After firing DDL, fence on `BanyanDBClient.getSchemaWatcher().awaitRevisionApplied(maxRev, timeout)` before unparking dispatch / firing data writes — this blocks (liaison-side, all alive nodes, bounded by the timeout) until every data node's `notifiedModRevision` watermark reaches `maxRev`, which the registry's read-after-write does not guarantee. For deletes that returned `mod_revision == 0`, fall back to `awaitSchemaDeleted(SchemaKey, timeout)`. The previous "poll `findMeasure` until you can read your own write" idiom predates the `SchemaBarrierService` proto and has been replaced — do not reintroduce it. The `SchemaWatcher` class is in OAP's **in-tree** `library-banyandb-client` (`org.apache.skywalking.library.banyandb.v1.client.SchemaWatcher`); the standalone `skywalking-banyandb-java-client` repo does NOT have these symbols, so don't conclude the API is gone from grepping it. JDBC and ES are synchronous-DDL on the coordinator so they don't need a fence. ## Analysis and Design Principles diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 59aa378a6b5a..ece0f2c4063c 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,7 +242,7 @@ admin-host only" entry above for the public REST retirement. #### OAP Server -* Fix BanyanDB peer nodes permanently flooding ` is not registered` when a node holds a live persist worker but its local `MetadataRegistry` schema cache was never populated for that model — e.g. a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the populate, and nothing (the registry never evicts, the 30s reconcile only covers runtime-rule rows) ever re-derived it. The persist DAOs now self-heal a missing entry once with an RPC-free local re-derivation before failing, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. 
+* Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. * Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. 
a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. * SWIP-15: rebuild BanyanDB self-observability around the cluster / container / group model (requires BanyanDB 0.11+). A BanyanDB cluster is modeled as one `Service`, each container as a `ServiceInstance` (role/tier as attributes), and each storage group as an `Endpoint`. The `otel-rules/banyandb/` rules are category-separated by role (`node_*` / `liaison_*` / `data_*` / `lifecycle_*`) and by data type (`measure_*` / `stream_*` / `trace_*` / `property_*`), mirroring the upstream FODC-proxy Grafana boards, and include queue batch/message granularity (apache/skywalking-banyandb#1169). Adds a `SERVICE_INSTANCE_RELATION` MAL scope and `serviceInstanceRelation(...)` builder powering a new intra-cluster pod-to-pod deployment topology (`banyandb-instance-relation.yaml`). The stale single-node `host_name` model is removed. diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java index f73bd824273d..cbb6602b9928 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstaller.java @@ -55,11 +55,20 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage // which made the contract a half-truth. Gate ahead of isExists so a peer apply // is genuinely zero-RPC. 
if (!flags.isInspectBackend()) { + // Local-cache-only (peer reconciler) tick: zero server RPCs, but the local schema + // cache MUST still be (re)derived from the declared model — the inspectBackend flag + // contract requires exactly this. Without it, a peer holds a live dispatch worker + // whose cache entry is either missing (first apply) or STALE (a reshape re-fires + // whenCreating with a new shape after StorageModels.remove+add). The read-side + // self-heal only fills a MISSING entry, never refreshes a stale one, so the peer + // would keep translating writes with the old shape. RPC-free; no-op for backends + // without a local schema cache (ES, JDBC). + populateLocalCacheOnly(model, opt); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, "local-cache-only mode; main-node is expected to have installed this resource"); log.debug( - "install: model [{}] not installed; local-cache-only mode — skipping (no isExists probe)", + "install: model [{}] not installed; local-cache-only mode — local schema cache refreshed, no isExists probe", model.getName() ); return; @@ -162,12 +171,20 @@ public void whenCreating(Model model, StorageManipulationOpt opt) throws Storage @Override public void whenRemoving(Model model, StorageManipulationOpt opt) throws StorageException { if (!opt.getFlags().isDropOnRemoval()) { + // Peer (or boot path that never drops): the backend drop is the main node's job, + // but this node must still evict its own local schema-cache entry so a removed + // model leaves no stale translation behind in an otherwise insert-only cache. + // RPC-free; no-op for backends without a local cache. 
+ evictLocalCache(model); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.SKIPPED_NOT_ALLOWED, "dropOnRemoval flag is off; server drop is main-node responsibility (or boot path that never drops)"); return; } dropTable(model, opt); + // Evict only after a successful drop — a thrown dropTable leaves the model in the + // registry for retry (see StorageModels.remove), so its cache entry must stay too. + evictLocalCache(model); opt.recordOutcome("table", model.getName(), StorageManipulationOpt.Outcome.DROPPED, null); } @@ -266,6 +283,30 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc dropTable(model); } + /** + * Refresh THIS node's local schema cache for {@code model} from the declared model, with + * no server RPC. Called on the local-cache-only path + * ({@link StorageManipulationOpt.Flags#isInspectBackend() inspectBackend == false}, i.e. + * {@link StorageManipulationOpt#withoutSchemaChange()}), where the cluster main owns + * backend DDL and this node only needs an up-to-date entry to translate its own + * reads/writes. Backends with a local schema cache (BanyanDB) override to (re)derive and + * overwrite the entry — overwrite, not fill-if-absent, so a reshape that + * re-fires {@link #whenCreating} replaces a now-stale entry instead of leaving the old + * shape in place. Default no-op: backends without a local cache (ES, JDBC) have nothing to + * refresh. + */ + protected void populateLocalCacheOnly(Model model, StorageManipulationOpt opt) throws StorageException { + } + + /** + * Drop THIS node's local schema-cache entry for a removed {@code model}, with no server + * RPC. Called from {@link #whenRemoving} on every node so a removed model never leaves a + * stale translation in an otherwise insert-only cache. Default no-op: backends without a + * local schema cache (ES, JDBC) have nothing to evict; BanyanDB overrides. 
+ */ + protected void evictLocalCache(Model model) { + } + @Getter @Setter public abstract static class InstallInfo { diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java index 5ee2bbcaf374..8e91b28f0470 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/ModelInstallerNoInitTest.java @@ -158,6 +158,58 @@ void noInitDeferLoopPropagatesInterruptedSleep() { } } + @Test + void withoutSchemaChangePopulatesLocalCacheAndIssuesNoBackendRpc() throws StorageException { + // Peer reconciler tick (inspectBackend=false): the installer must refresh the local + // schema cache (so a reshape re-add overwrites a now-stale entry) WITHOUT any backend + // existence probe or create. This is the C-1 fix for the worker-without-cache / + // stale-cache desync. 
+ final RecordingInstaller installer = new RecordingInstaller(false /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_peer"); + + installer.whenCreating(model, StorageManipulationOpt.withoutSchemaChange()); + + assertEquals(1, installer.populateLocalCacheCalls, + "a withoutSchemaChange (peer) apply must refresh the local schema cache"); + assertEquals(0, installer.probeCalls, + "a withoutSchemaChange (peer) apply must issue zero backend existence probes"); + assertEquals(0, installer.createTableCalls, + "a withoutSchemaChange (peer) apply must never create backend resources"); + } + + @Test + void whenRemovingPeerEvictsLocalCacheWithoutDroppingBackend() throws StorageException { + // Peer remove (dropOnRemoval=false): the backend drop is the main's job, but the peer + // must still evict its own cache entry so a removed model leaves no stale translation. + final RecordingInstaller installer = new RecordingInstaller(true /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_remove_peer"); + + installer.whenRemoving(model, StorageManipulationOpt.withoutSchemaChange()); + + assertEquals(1, installer.evictLocalCacheCalls, + "a peer remove must evict the local schema cache entry"); + assertEquals(0, installer.dropTableCalls, + "a peer remove (dropOnRemoval off) must not drop backend resources"); + } + + @Test + void whenRemovingMainDropsBackendThenEvictsLocalCache() throws StorageException { + // Main remove (withSchemaChange, dropOnRemoval=true): drop the backend AND evict the + // local cache so the insert-only registry does not keep a tombstoned model's entry. 
+ final RecordingInstaller installer = new RecordingInstaller(true /* unused */); + final Model model = mock(Model.class); + when(model.getName()).thenReturn("runtime_metric_remove_main"); + + installer.whenRemoving(model, StorageManipulationOpt.withSchemaChange()); + + assertEquals(1, installer.dropTableCalls, + "a main remove must drop the backend resource"); + assertEquals(1, installer.evictLocalCacheCalls, + "a main remove must evict the local schema cache entry after the drop"); + } + /** Minimal concrete {@link ModelInstaller} that records createTable calls and reports a * fixed existence result, so the base whenCreating branching can be exercised without a * real storage backend. Optionally throws a transient {@link StorageException} on the first @@ -168,6 +220,9 @@ private static final class RecordingInstaller extends ModelInstaller { private final boolean retryableProbeFailure; private int probeCalls; private int createTableCalls; + private int populateLocalCacheCalls; + private int evictLocalCacheCalls; + private int dropTableCalls; private RecordingInstaller(final boolean resourcePresent) { this(resourcePresent, 0, false); @@ -204,6 +259,21 @@ protected boolean isRetryableNoInitProbeFailure(final StorageException e) { public void createTable(final Model model) { createTableCalls++; } + + @Override + public void dropTable(final Model model) { + dropTableCalls++; + } + + @Override + protected void populateLocalCacheOnly(final Model model, final StorageManipulationOpt opt) { + populateLocalCacheCalls++; + } + + @Override + protected void evictLocalCache(final Model model) { + evictLocalCacheCalls++; + } } private static final class TestInstallInfo extends ModelInstaller.InstallInfo { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java 
b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index 34c00bfabffd..6df8eb793634 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -1336,6 +1336,24 @@ private void checkTopNAggregation(Model model, BanyanDBClient client, StorageMan * schema cache the local DAOs read from so this node can translate Model ↔ BanyanDB * proto for sample ingest / queries. */ + @Override + protected void populateLocalCacheOnly(final Model model, final StorageManipulationOpt opt) { + // inspectBackend=false (peer / local-cache-only tick): the main owns the backend + // resource; this node only (re)derives its local MetadataRegistry entry so its DAOs + // can translate this model. RPC-free, and an overwrite via register*Model — keeps a + // peer's cache in lockstep with a reshaped model that re-fires whenCreating, which the + // read-side self-heal (fill-if-absent only) cannot do. 
+ final DownSamplingConfigService downSamplingConfigService = moduleManager.find(CoreModule.NAME) + .provider() + .getService(DownSamplingConfigService.class); + registerLocallyByKind(model, downSamplingConfigService); + } + + @Override + protected void evictLocalCache(final Model model) { + MetadataRegistry.INSTANCE.evict(model); + } + private void registerLocallyByKind(final Model model, final DownSamplingConfigService downSamplingConfigService) { if (model.isTimeSeries()) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistry.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistry.java index c644b7435192..a8591e5ee9a3 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistry.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistry.java @@ -443,6 +443,24 @@ public Schema findMetadata(final Model model) { return findMetricMetadata(model.getName(), model.getDownsampling()); } + /** + * Drop the local {@link Schema} cache entry for {@code model}, keyed exactly as + * {@link #findMetadata(Model)} looks it up. The registry is otherwise insert-only, so this + * is the one path that removes an entry — invoked from {@code ModelInstaller.whenRemoving} + * on a runtime-rule hot-remove / reshape so a stale translation never outlives the model. 
+ */ + public void evict(final Model model) { + final String key; + if (!model.isTimeSeries() || model.isRecord()) { + key = model.getName(); + } else { + key = SchemaMetadata.formatName(model.getName(), model.getDownsampling()); + } + if (registry.remove(key) != null) { + log.debug("evicted local BanyanDB schema cache entry [{}]", key); + } + } + private FieldSpec parseFieldSpec(ModelColumn modelColumn) { String colName = modelColumn.getColumnName().getStorageName(); if (String.class.equals(modelColumn.getType())) { diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java index 479644ca94c6..6e7ef96e35d9 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/test/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/MetadataRegistryTest.java @@ -19,6 +19,7 @@ package org.apache.skywalking.oap.server.storage.plugin.banyandb; import java.util.concurrent.atomic.AtomicInteger; +import org.apache.skywalking.oap.server.core.analysis.DownSampling; import org.apache.skywalking.oap.server.core.storage.model.Model; import org.junit.jupiter.api.AfterEach; import org.junit.jupiter.api.Test; @@ -78,4 +79,32 @@ void repopulateLocallySwallowsPopulatorError() { assertDoesNotThrow(() -> MetadataRegistry.INSTANCE.repopulateLocally(model), "a throwing populator must be swallowed so self-heal never worsens the failure"); } + + @Test + void evictExercisesEveryFindMetadataKeyBranchSafely() { + // evict() must key its removal exactly as findMetadata() looks an entry up — across + // management (name), record (name) and metric 
(formatName(name, downSampling)) kinds. + // Exercise all three so a wrong-branch or formatName regression surfaces, and confirm + // evicting an absent entry is a safe no-op (the registry is otherwise insert-only). + final Model management = mock(Model.class); + when(management.isTimeSeries()).thenReturn(false); + when(management.getName()).thenReturn("management_model"); + + final Model record = mock(Model.class); + when(record.isTimeSeries()).thenReturn(true); + when(record.isRecord()).thenReturn(true); + when(record.getName()).thenReturn("record_model"); + + final Model metric = mock(Model.class); + when(metric.isTimeSeries()).thenReturn(true); + when(metric.isRecord()).thenReturn(false); + when(metric.getName()).thenReturn("metric_model"); + when(metric.getDownsampling()).thenReturn(DownSampling.Minute); + + assertDoesNotThrow(() -> { + MetadataRegistry.INSTANCE.evict(management); + MetadataRegistry.INSTANCE.evict(record); + MetadataRegistry.INSTANCE.evict(metric); + }, "evict must be a safe no-op for an absent entry across all model kinds"); + } } From 24e7d744693ce4967d06a376909a3602a93c6b88 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 13:35:51 +0800 Subject: [PATCH 05/19] Batch the BanyanDB schema fence per runtime-rule apply (Phase 1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A runtime-rule file changes dozens of rules at once, but the post-DDL fence (SchemaWatcher.awaitRevisionApplied) ran once per metric/downsampling — a large file did K×M sequential <=2s fences, overrunning the apply's REST budget on a laggy cluster. Add StorageManipulationOpt.withSchemaChangeDeferredFence(): same flags as withSchemaChange() plus a deferFence toggle + a DeferredFence callback holder. 
Under deferFence, BanyanDBIndexInstaller.fenceOnRevision records each resource's mod_revision and registers a single flush instead of fencing inline; the apply (MalFileApplier, after MetricConvert) runs StorageManipulationOpt.runDeferredFence() ONCE on the cumulative max revision — collapsing the whole file to one barrier. The main apply paths (DSLManager picker + applyNowForRuleFile, DSLRuntimeDelete revert) switch to the deferred-fence opt. Drops keep fencing inline (doFenceOnRevision) — a deletion's visibility is per-key and must not ride a batched revision flush. Peer / withoutSchemaChange applies are unaffected (no revision recorded -> runDeferredFence is a no-op). Tests: StorageManipulationOptTest (5) covers the deferred-fence mechanics (same-flags, run-once, no-op-when-empty, exception propagation, latest-wins). Verified: full -Pall build + javadoc, checkstyle, license; runtime-rule suite (MalFileApplierTest etc.) + MetadataRegistryTest + ModelInstallerNoInitTest green. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 1 + .../runtimerule/apply/MalFileApplier.java | 12 +++ .../runtimerule/reconcile/DSLManager.java | 6 +- .../reconcile/DSLRuntimeDelete.java | 2 +- .../storage/model/StorageManipulationOpt.java | 60 +++++++++++ .../model/StorageManipulationOptTest.java | 101 ++++++++++++++++++ .../banyandb/BanyanDBIndexInstaller.java | 24 ++++- 7 files changed, 201 insertions(+), 5 deletions(-) create mode 100644 oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index a35cc21d952e..4a6689523e99 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,6 +242,7 @@ admin-host only" entry above for the public REST retirement. #### OAP Server +* Batch the BanyanDB schema fence per runtime-rule apply. 
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the cumulative max revision, collapsing the whole file to one barrier. Drops still fence inline (a deletion's visibility is per-key and must not ride a batched revision flush); peer / `withoutSchemaChange` applies are unaffected (no fence). * Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. 
* Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. * Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. 
diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java index 25d1c349a7e8..d835f61170ec 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java @@ -39,6 +39,7 @@ import org.apache.skywalking.oap.server.core.classloader.Catalog; import org.apache.skywalking.oap.server.core.classloader.DSLClassLoaderManager; import org.apache.skywalking.oap.server.core.classloader.RuleClassLoader; +import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.library.module.ModuleManager; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.AppliedClaims; @@ -168,6 +169,17 @@ public Applied apply(final String yamlContent, final String sourceName, layerRegistry.rollback(appliedClaims); throw new ApplyException("MAL compile failed for " + sourceName, t, Collections.emptySet()); } + // All DDL for this file's metrics is now fired. If the opt deferred its schema fence + // (batched apply via withSchemaChangeDeferredFence), run the single barrier here so the + // whole file waits ONCE instead of one fence per metric/downsampling. A fence timeout is + // a non-fatal WARN inside the closure; only a barrier transport error throws, which + // aborts this apply exactly as an inline per-resource fence would have. 
+ try { + storageOpt.runDeferredFence(); + } catch (final StorageException e) { + layerRegistry.rollback(appliedClaims); + throw new ApplyException("schema fence failed for " + sourceName, e, metricNames); + } return new Applied(rule, convert, metricNames, ruleLoader, appliedClaims); } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java index 102a05ebbe85..910cfb838dad 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java @@ -361,7 +361,7 @@ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.Runtim */ public DSLRuntimeState applyNowForRuleFile(final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile, final boolean deferCommit) { - return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.withSchemaChange()); + return applyNowForRuleFile(ruleFile, deferCommit, StorageManipulationOpt.withSchemaChangeDeferredFence()); } /** @@ -759,10 +759,10 @@ private StorageManipulationOpt tickStorageOpt(final boolean atBoot) { selfMain = MainRouter.isSelfMain(apm); } catch (final Throwable t) { // Cluster module not wired (embedded / single-process) — always main. - return StorageManipulationOpt.withSchemaChange(); + return StorageManipulationOpt.withSchemaChangeDeferredFence(); } if (selfMain) { - return StorageManipulationOpt.withSchemaChange(); + return StorageManipulationOpt.withSchemaChangeDeferredFence(); } return atBoot ? 
StorageManipulationOpt.verifySchemaOnly() diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java index ce6af280a0ce..0579e9771576 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLRuntimeDelete.java @@ -230,7 +230,7 @@ private Result runRevert(final RuleEngine engine, new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, bundledContent, /* status */ null, /* updateTime */ 0L); final ApplyInputs withSchema = new ApplyInputs( - moduleManager, StorageManipulationOpt.withSchemaChange(), + moduleManager, StorageManipulationOpt.withSchemaChangeDeferredFence(), alarmResetter, rules); final DSLRuntimeApply.Outcome outcome = dslRuntimeApply.apply( bundledFile, Classification.STRUCTURAL, diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index 9b6d8cb04a3e..43ab5b570c37 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -24,6 +24,7 @@ import java.util.concurrent.atomic.AtomicLong; import lombok.Builder; import lombok.Getter; +import org.apache.skywalking.oap.server.core.storage.StorageException; /** * Per-call policy + outcome for a storage model manipulation — threaded through the @@ -307,6 +308,21 @@ public static 
StorageManipulationOpt withoutSchemaChange() { return new StorageManipulationOpt(Mode.WITHOUT_SCHEMA_CHANGE); } + /** + * {@link Mode#WITH_SCHEMA_CHANGE} but with the post-install schema fence DEFERRED. The + * installer records each resource's {@code mod_revision} without fencing, then registers a + * single flush via {@link #setDeferredFence(DeferredFence)}; the caller runs that flush ONCE + * with {@link #runDeferredFence()} after the whole apply (e.g. a multi-rule file) so the + * bundle waits on one barrier instead of one fence per metric/downsampling. All flags are + * identical to {@link #withSchemaChange()} — only the create/update fence is batched; drops + * still fence inline. + */ + public static StorageManipulationOpt withSchemaChangeDeferredFence() { + final StorageManipulationOpt opt = new StorageManipulationOpt(Mode.WITH_SCHEMA_CHANGE); + opt.deferFence = true; + return opt; + } + /** * True for {@link Mode#WITH_SCHEMA_CHANGE}. The on-demand operator workflow — drops, * updates, and reshapes are permitted because the caller explicitly asked for them. @@ -380,6 +396,50 @@ public long getMaxModRevision() { return maxModRevision.get(); } + /** + * A storage-backend schema fence whose execution is deferred to the end of a batched + * apply. The backend installer (e.g. BanyanDB) registers one on a + * {@link #withSchemaChangeDeferredFence()} opt instead of fencing per resource; the apply + * orchestration runs it once via {@link #runDeferredFence()}. Implemented as a closure in + * the storage plugin so core stays backend-agnostic (same pattern as the local-cache + * populator). A timeout inside the fence is a non-fatal WARN; only a barrier transport + * error surfaces as {@link StorageException}. + */ + @FunctionalInterface + public interface DeferredFence { + void await() throws StorageException; + } + + /** + * True only for {@link #withSchemaChangeDeferredFence()}. 
The installer reads this to skip + * the per-resource create/update fence and register a single {@link DeferredFence} instead. + */ + @Getter + private boolean deferFence = false; + + private volatile DeferredFence deferredFence; + + /** + * Register the single fence to run after the batched apply completes. Idempotent — the + * installer may call it once per resource; the latest (equivalent) closure wins. No-op + * carrier for backends without a revision concept (they never call it). + */ + public void setDeferredFence(final DeferredFence fence) { + this.deferredFence = fence; + } + + /** + * Run the registered {@link DeferredFence} once, if any. Called by the apply orchestration + * after all DDL for the batch is fired. No-op when nothing was registered (peer/no-change + * applies, or non-BanyanDB backends). + */ + public void runDeferredFence() throws StorageException { + final DeferredFence fence = this.deferredFence; + if (fence != null) { + fence.await(); + } + } + /** * Append a per-resource outcome. Called by the installer as it examines each * underlying storage resource. diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java new file mode 100644 index 000000000000..76a900811b4d --- /dev/null +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.core.storage.model; + +import java.util.concurrent.atomic.AtomicInteger; +import org.apache.skywalking.oap.server.core.storage.StorageException; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertThrows; +import static org.junit.jupiter.api.Assertions.assertTrue; + +/** + * Unit coverage for the batched-fence machinery on {@link StorageManipulationOpt}: a + * {@link StorageManipulationOpt#withSchemaChangeDeferredFence()} opt carries the same flags as + * {@link StorageManipulationOpt#withSchemaChange()} but lets the installer register a single + * {@link StorageManipulationOpt.DeferredFence} that the apply orchestration runs ONCE instead of + * one fence per metric/downsampling. + */ +class StorageManipulationOptTest { + + @Test + void deferredFenceOptHasSameFlagsAsWithSchemaChange() { + final StorageManipulationOpt deferred = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final StorageManipulationOpt plain = StorageManipulationOpt.withSchemaChange(); + // Behaviour must be identical except for the batching toggle — same mode/flags so every + // create/update/drop privilege the installer checks is unchanged. 
+ assertEquals(plain.getMode(), deferred.getMode(), + "deferred-fence opt must keep the WITH_SCHEMA_CHANGE mode"); + assertTrue(deferred.isWithSchemaChange(), + "deferred-fence opt must still report withSchemaChange semantics"); + assertTrue(deferred.isDeferFence(), + "deferred-fence opt must flag deferFence"); + assertFalse(plain.isDeferFence(), + "the plain withSchemaChange opt must NOT defer the fence"); + assertFalse(StorageManipulationOpt.withoutSchemaChange().isDeferFence()); + } + + @Test + void runDeferredFenceInvokesRegisteredFenceOnce() throws StorageException { + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger calls = new AtomicInteger(); + opt.setDeferredFence(calls::incrementAndGet); + + opt.runDeferredFence(); + + assertEquals(1, calls.get(), "the registered fence must run exactly once on flush"); + } + + @Test + void runDeferredFenceIsNoOpWhenNothingRegistered() { + // Peer / no-change applies (and non-BanyanDB backends) never register a fence; flushing + // must be a safe no-op, not an NPE. + assertDoesNotThrow(() -> StorageManipulationOpt.withSchemaChangeDeferredFence().runDeferredFence()); + assertDoesNotThrow(() -> StorageManipulationOpt.withoutSchemaChange().runDeferredFence()); + } + + @Test + void runDeferredFencePropagatesStorageException() { + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport error"); + }); + // A barrier transport error must surface so the apply aborts exactly like an inline fence. + assertThrows(StorageException.class, opt::runDeferredFence); + } + + @Test + void laterSetDeferredFenceWins() throws StorageException { + // The installer may register the closure once per resource; the latest (equivalent) one + // wins and still runs a single time. 
+ final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger first = new AtomicInteger(); + final AtomicInteger second = new AtomicInteger(); + opt.setDeferredFence(first::incrementAndGet); + opt.setDeferredFence(second::incrementAndGet); + + opt.runDeferredFence(); + + assertEquals(0, first.get(), "an overwritten fence must not run"); + assertEquals(1, second.get(), "the latest registered fence runs once"); + } +} diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index 6df8eb793634..9657574ce1cf 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -288,6 +288,26 @@ public InstallInfo isExists(Model model, StorageManipulationOpt opt) throws Stor */ private void fenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, final String context) throws BanyanDBException { + if (opt.isDeferFence()) { + // Batched apply: do NOT fence per resource. Register a single flush that the apply + // orchestration runs once after all DDL is fired (StorageManipulationOpt#runDeferredFence), + // so a multi-rule file waits on ONE barrier on the cumulative max revision instead of + // one fence per metric/downsampling. The closure reads opt.getMaxModRevision() at flush + // time, after every resource has recorded its revision. 
+ opt.setDeferredFence(() -> { + try { + doFenceOnRevision(client, opt, "batched apply"); + } catch (final BanyanDBException e) { + throw new StorageException("batched schema fence failed", e); + } + }); + return; + } + doFenceOnRevision(client, opt, context); + } + + private void doFenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, + final String context) throws BanyanDBException { final long rev = opt.getMaxModRevision(); if (rev <= 0L) { return; @@ -512,7 +532,9 @@ private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final Storag final String context) throws BanyanDBException { final long rev = opt.getMaxModRevision(); if (rev > 0L) { - fenceOnRevision(client, opt, context); + // Drops fence inline (never deferred): a deletion's visibility is per-key and must + // not ride a batched revision flush — drops stay correct even under a deferFence opt. + doFenceOnRevision(client, opt, context); return; } // mod_revision was 0 on every delete — fall back to key-based deletion fence. From 6a340f68dddbe10a50c1915fa29dc778dbf9b811 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 13:47:20 +0800 Subject: [PATCH 06/19] Add SchemaApplyCoordinator apply-status state machine (Phase 2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In-memory owner of runtime-rule apply progress on the cluster main: each apply opens a status entry keyed by a generated applyId and advances through ApplyPhase (PENDING→VALIDATING→DDL→FENCING→ROLLING_OUT→APPLIED) with DEGRADED (committed but fence-unconfirmed) and FAILED (pre-commit error) off-ramps. Two indexes back the two query shapes: by applyId (live handle) and by (catalog,name)→latest applyId (content-based path, resolved against the durable content hash for when the apply-id is gone after a refresh / main restart). 
Immutable ApplyStatus snapshots in a ConcurrentHashMap — single-writer per apply (apply orchestration serializes per file), lock-free concurrent reads. Clock is injectable for deterministic tests and the timed watch added later. Self-contained building block; the apply-lifecycle wiring, DSLRuntimeState failureReason + per-node breakdown, the GetApplyStatus query surface, and the background convergence watch with TTL eviction land in Phase 3. State is in-memory by design — the durable content hash reconstructs truth after restart. Tests: SchemaApplyCoordinatorTest (8) — begin/index, phase transitions + updatedAt, terminal markApplied/markDegraded/markFailed + reason, forward-transition clears stale reason, unknown-id no-ops, content-hash-gated content lookup, latest-wins. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../runtimerule/status/ApplyPhase.java | 59 ++++++++ .../runtimerule/status/ApplyStatus.java | 72 +++++++++ .../status/SchemaApplyCoordinator.java | 143 ++++++++++++++++++ .../status/SchemaApplyCoordinatorTest.java | 141 +++++++++++++++++ 4 files changed, 415 insertions(+) create mode 100644 oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java create mode 100644 oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java create mode 100644 oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java create mode 100644 oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java new 
file mode 100644 index 000000000000..577186192268 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +/** + * Lifecycle phase of one runtime-rule apply, as tracked by {@link SchemaApplyCoordinator} on the + * cluster main. A progress query (added in a later phase) reports this so an operator can see how + * far an apply got — and, when it stops short, whether it is converging ({@link #DEGRADED}) or + * failed outright ({@link #FAILED}). + * + *

Normal progression: + * {@link #PENDING} → {@link #VALIDATING} → {@link #DDL} → {@link #FENCING} → {@link #ROLLING_OUT} + * → {@link #APPLIED}. Two off-ramps: + *

<ul> + * <li>{@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). The change + * was rolled back; nothing was committed.</li> + * <li>{@link #DEGRADED} — committed and durable, but the post-DDL fence did not confirm + * cluster-wide propagation within budget (a node is lagging). Forward-progress: BanyanDB + * keeps converging; this is NOT a revert.</li> + * </ul>
+ * {@link #UNKNOWN} is returned for an apply-id the main no longer holds (evicted / main restarted); + * callers fall back to a content-hash comparison. + */ +public enum ApplyPhase { + PENDING, + VALIDATING, + DDL, + FENCING, + ROLLING_OUT, + APPLIED, + DEGRADED, + FAILED, + UNKNOWN; + + /** + * A terminal phase no longer advances on its own: the apply succeeded ({@link #APPLIED}), + * failed and rolled back ({@link #FAILED}), or committed-but-unconfirmed ({@link #DEGRADED}, + * which a background re-check may still flip to {@link #APPLIED}). + */ + public boolean isTerminal() { + return this == APPLIED || this == DEGRADED || this == FAILED; + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java new file mode 100644 index 000000000000..15bdad441366 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import lombok.Getter; + +/** + * Immutable snapshot of one runtime-rule apply's progress on the cluster main. Held by + * {@link SchemaApplyCoordinator}; transitions produce a new snapshot (copy-on-write) so concurrent + * readers (a progress query) never see a torn value. + * + *

<p>{@code contentHash} is the durable identity of the applied file — a status query can be + * answered by it even after the ephemeral {@code applyId} is gone. {@code failureReason} is set + * only for {@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}. + * + *

<p>This snapshot carries the apply-level (main-orchestrated) phase. Per-node breakdown + * (storage-plane laggards, per-OAP-node applied state) is layered on in a later phase via the + * status query DTO; this type stays a small, immutable core. + */ +@Getter +public final class ApplyStatus { + private final String applyId; + private final String catalog; + private final String name; + private final String contentHash; + private final ApplyPhase phase; + /** Non-null only for {@link ApplyPhase#FAILED} (pre-commit error) and {@link ApplyPhase#DEGRADED} + * (committed but a node lagging at fence timeout). Null otherwise. */ + private final String failureReason; + private final long startedAtMs; + private final long updatedAtMs; + + public ApplyStatus(final String applyId, final String catalog, final String name, + final String contentHash, final ApplyPhase phase, final String failureReason, + final long startedAtMs, final long updatedAtMs) { + this.applyId = applyId; + this.catalog = catalog; + this.name = name; + this.contentHash = contentHash; + this.phase = phase; + this.failureReason = failureReason; + this.startedAtMs = startedAtMs; + this.updatedAtMs = updatedAtMs; + } + + /** A copy advanced to {@code newPhase}, clearing any prior failure reason (forward progress). */ + public ApplyStatus withPhase(final ApplyPhase newPhase, final long nowMs) { + return new ApplyStatus(applyId, catalog, name, contentHash, newPhase, null, startedAtMs, nowMs); + } + + /** A copy moved to a non-success terminal ({@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}) + * carrying {@code reason}. 
*/ + public ApplyStatus withFailure(final ApplyPhase terminalPhase, final String reason, final long nowMs) { + return new ApplyStatus(applyId, catalog, name, contentHash, terminalPhase, reason, startedAtMs, nowMs); + } +} diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java new file mode 100644 index 000000000000..887927df03d4 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import java.util.Map; +import java.util.UUID; +import java.util.concurrent.ConcurrentHashMap; +import java.util.function.LongSupplier; +import java.util.function.UnaryOperator; +import lombok.extern.slf4j.Slf4j; + +/** + * Single in-memory owner of runtime-rule apply progress on the cluster main. 
Each apply opens a + * status entry keyed by a generated {@code applyId} and advances it through {@link ApplyPhase} + * as the apply runs (validate → DDL → fence → roll out → applied), or to a terminal + * {@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}. + * + *

<p>Two indexes back the two ways a caller asks "is this live yet?": + *

<ul> + * <li>by {@code applyId} — the live handle a just-submitted apply polls;</li> + * <li>by {@code (catalog, name)} → latest applyId — the content-based path used when the + * apply-id is gone (page refresh / main restart), resolved against the durable content hash.</li> + * </ul>
+ * + *

<p>Concurrency: a {@link ConcurrentHashMap} of immutable {@link ApplyStatus} snapshots. Writes + * for a given apply are effectively single-threaded (the apply orchestration serializes per file), + * reads (progress queries) are concurrent and lock-free against the snapshot. This phase is the + * building block; the apply-lifecycle wiring, per-node breakdown, the {@code GetApplyStatus} query + * surface, and a background convergence watch with TTL eviction layer on later. State is in-memory + * by design — the durable content hash reconstructs truth after a main restart. + */ +@Slf4j +public class SchemaApplyCoordinator { + + private final Map<String, ApplyStatus> byApplyId = new ConcurrentHashMap<>(); + private final Map<String, String> latestApplyIdByFile = new ConcurrentHashMap<>(); + private final LongSupplier clock; + + public SchemaApplyCoordinator() { + this(System::currentTimeMillis); + } + + /** Clock-injectable for deterministic tests (and the timed background watch added later). */ + public SchemaApplyCoordinator(final LongSupplier clock) { + this.clock = clock; + } + + /** + * Open a new apply in {@link ApplyPhase#PENDING} and return its {@code applyId}. Records it as + * the latest apply for {@code (catalog, name)} so a content-based query resolves to it. + */ + public String begin(final String catalog, final String name, final String contentHash) { + final String applyId = UUID.randomUUID().toString(); + final long now = clock.getAsLong(); + byApplyId.put(applyId, new ApplyStatus( + applyId, catalog, name, contentHash, ApplyPhase.PENDING, null, now, now)); + latestApplyIdByFile.put(fileKey(catalog, name), applyId); + if (log.isDebugEnabled()) { + log.debug("apply [{}] begin: {}/{} hash={}", applyId, catalog, name, contentHash); + } + return applyId; + } + + /** Advance an apply to {@code phase} (forward progress; clears any prior failure reason). No-op + * for an unknown apply-id. 
*/ + public void transition(final String applyId, final ApplyPhase phase) { + update(applyId, s -> s.withPhase(phase, clock.getAsLong())); + } + + /** Terminal success. */ + public void markApplied(final String applyId) { + transition(applyId, ApplyPhase.APPLIED); + } + + /** Terminal: committed and durable, but cluster-wide propagation unconfirmed within budget + * (a node is lagging). Not a revert — a background re-check may flip it to APPLIED later. */ + public void markDegraded(final String applyId, final String reason) { + update(applyId, s -> s.withFailure(ApplyPhase.DEGRADED, reason, clock.getAsLong())); + } + + /** Terminal: a pre-commit error (compile / verify / DDL RPC / persist); the change was rolled + * back. */ + public void markFailed(final String applyId, final String reason) { + update(applyId, s -> s.withFailure(ApplyPhase.FAILED, reason, clock.getAsLong())); + } + + private void update(final String applyId, final UnaryOperator<ApplyStatus> op) { + byApplyId.computeIfPresent(applyId, (k, s) -> op.apply(s)); + } + + /** The live status for an apply-id, or {@code null} when the main no longer holds it (caller + * treats null as {@link ApplyPhase#UNKNOWN} and falls back to the content-based path). */ + public ApplyStatus get(final String applyId) { + return byApplyId.get(applyId); + } + + /** + * The latest apply status for a file, for the content-based query when the apply-id is unknown. + * Returns {@code null} if no apply for that file is tracked, or if {@code expectedContentHash} + * is non-null and does not match the latest apply's hash (the latest tracked apply is for a + * different content than the caller asked about). 
+ */ + public ApplyStatus getLatestByFile(final String catalog, final String name, + final String expectedContentHash) { + final String applyId = latestApplyIdByFile.get(fileKey(catalog, name)); + if (applyId == null) { + return null; + } + final ApplyStatus status = byApplyId.get(applyId); + if (status == null) { + return null; + } + if (expectedContentHash != null && !expectedContentHash.equals(status.getContentHash())) { + return null; + } + return status; + } + + /** Number of tracked applies — for tests and the future TTL-eviction watch. */ + public int trackedCount() { + return byApplyId.size(); + } + + private static String fileKey(final String catalog, final String name) { + return catalog + "/" + name; + } +} diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java new file mode 100644 index 000000000000..85e2765ab50c --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.status; + +import java.util.concurrent.atomic.AtomicLong; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertNull; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.junit.jupiter.api.Assertions.assertDoesNotThrow; + +class SchemaApplyCoordinatorTest { + + private final AtomicLong clock = new AtomicLong(1_000L); + + private SchemaApplyCoordinator newCoordinator() { + return new SchemaApplyCoordinator(clock::get); + } + + @Test + void beginOpensPendingStatusAndIndexesByApplyIdAndFile() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "hash-1"); + + final ApplyStatus byId = coord.get(applyId); + assertNotNull(byId, "a just-begun apply must be retrievable by apply-id"); + assertEquals(ApplyPhase.PENDING, byId.getPhase()); + assertEquals("otel-rules", byId.getCatalog()); + assertEquals("vm", byId.getName()); + assertEquals("hash-1", byId.getContentHash()); + assertEquals(1_000L, byId.getStartedAtMs()); + assertNull(byId.getFailureReason(), "a fresh apply has no failure reason"); + + final ApplyStatus byFile = coord.getLatestByFile("otel-rules", "vm", null); + assertEquals(applyId, byFile.getApplyId(), "content-based lookup must resolve to the latest apply"); + } + + @Test + void transitionsAdvancePhaseAndStampUpdatedAt() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + clock.set(2_000L); + coord.transition(applyId, ApplyPhase.DDL); + clock.set(3_000L); + coord.transition(applyId, ApplyPhase.FENCING); + + final ApplyStatus s = coord.get(applyId); + 
assertEquals(ApplyPhase.FENCING, s.getPhase()); + assertEquals(1_000L, s.getStartedAtMs(), "startedAt is fixed at begin"); + assertEquals(3_000L, s.getUpdatedAtMs(), "updatedAt advances with each transition"); + } + + @Test + void markAppliedIsTerminalSuccess() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + coord.markApplied(applyId); + final ApplyStatus s = coord.get(applyId); + assertEquals(ApplyPhase.APPLIED, s.getPhase()); + assertTrue(s.getPhase().isTerminal()); + assertNull(s.getFailureReason()); + } + + @Test + void markFailedAndDegradedCarryReasonAndAreTerminal() { + final SchemaApplyCoordinator coord = newCoordinator(); + final String failed = coord.begin("otel-rules", "a", "h"); + coord.markFailed(failed, "previous schema not found"); + assertEquals(ApplyPhase.FAILED, coord.get(failed).getPhase()); + assertEquals("previous schema not found", coord.get(failed).getFailureReason()); + assertTrue(coord.get(failed).getPhase().isTerminal()); + + final String degraded = coord.begin("otel-rules", "b", "h"); + coord.markDegraded(degraded, "data node lagging: node-3"); + assertEquals(ApplyPhase.DEGRADED, coord.get(degraded).getPhase()); + assertEquals("data node lagging: node-3", coord.get(degraded).getFailureReason()); + assertTrue(coord.get(degraded).getPhase().isTerminal()); + } + + @Test + void forwardTransitionClearsAStaleFailureReason() { + // A DEGRADED apply whose background re-check later confirms convergence flips to APPLIED, + // and the stale "lagging" reason must not linger. 
+ final SchemaApplyCoordinator coord = newCoordinator(); + final String applyId = coord.begin("otel-rules", "vm", "h"); + coord.markDegraded(applyId, "node-3 lagging"); + coord.markApplied(applyId); + assertEquals(ApplyPhase.APPLIED, coord.get(applyId).getPhase()); + assertNull(coord.get(applyId).getFailureReason(), + "advancing past DEGRADED must clear the failure reason"); + } + + @Test + void unknownApplyIdReturnsNullAndTransitionsAreNoOps() { + final SchemaApplyCoordinator coord = newCoordinator(); + assertNull(coord.get("does-not-exist")); + assertDoesNotThrow(() -> coord.transition("does-not-exist", ApplyPhase.DDL)); + assertDoesNotThrow(() -> coord.markFailed("does-not-exist", "x")); + } + + @Test + void getLatestByFileHonorsExpectedContentHash() { + final SchemaApplyCoordinator coord = newCoordinator(); + coord.begin("otel-rules", "vm", "hash-A"); + + // Matching hash resolves; a different hash (caller asks about content the latest apply is + // NOT for) returns null so the caller doesn't misreport an unrelated apply. 
+ assertNotNull(coord.getLatestByFile("otel-rules", "vm", "hash-A")); + assertNull(coord.getLatestByFile("otel-rules", "vm", "hash-B")); + assertNull(coord.getLatestByFile("otel-rules", "absent", null)); + } + + @Test + void latestByFileFollowsTheNewestApply() { + final SchemaApplyCoordinator coord = newCoordinator(); + coord.begin("otel-rules", "vm", "hash-old"); + final String newer = coord.begin("otel-rules", "vm", "hash-new"); + final ApplyStatus latest = coord.getLatestByFile("otel-rules", "vm", null); + assertEquals(newer, latest.getApplyId()); + assertEquals("hash-new", latest.getContentHash()); + assertEquals(2, coord.trackedCount(), "both applies remain tracked until eviction"); + } +} From e58326c83db92f8b8d20f86b525439c7ed95b2a0 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 14:37:51 +0800 Subject: [PATCH 07/19] Wire apply lifecycle into SchemaApplyCoordinator (Phase 3a) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add SchemaApplyCoordinator.INSTANCE (process-wide, mirrors MetadataRegistry.INSTANCE) so the apply / cluster-RPC / reconcile paths share one coordinator without constructor threading; tests keep using new SchemaApplyCoordinator(clock). RuntimeRuleService.applyStructural now begins an apply (keyed by content hash) right before the apply attempt and marks a terminal phase on every exit: APPLIED on success, FAILED (with the specific reason — layer conflict, apply threw, getLastApplyError, persist-failed) on the pre-commit/failure paths, and DEGRADED on commit-deferred (DB persisted but this node's commit tail threw — durable, peers converge, not a revert). A missed branch leaves only a stale PENDING the background watch reaps (Phase 3c) — not a correctness bug. Filter-only applies do no DDL/fence, so they are not tracked here. The query surface (GetApplyStatus RPC + REST progress endpoint), the async apply-id response, and the background convergence watch land in 3b/3c. 
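The "begin, then mark a terminal phase on every exit" wiring described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the RuntimeRuleService code: `ApplyLifecycleSketch`, `runApply`, `preCommit`, and `commitTail` are hypothetical names, and the real apply path has more failure branches (layer conflict, getLastApplyError, persist-failed) than the two shown here.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the apply path; these are not the real OAP classes. */
public class ApplyLifecycleSketch {
    enum Phase { PENDING, APPLIED, DEGRADED, FAILED }

    /** Immutable snapshot: a transition swaps in a new instance instead of mutating. */
    static final class Status {
        final Phase phase;
        final String reason;

        Status(final Phase phase, final String reason) {
            this.phase = phase;
            this.reason = reason;
        }
    }

    static final Map<String, Status> STATUS = new ConcurrentHashMap<>();

    static String begin() {
        final String applyId = UUID.randomUUID().toString();
        STATUS.put(applyId, new Status(Phase.PENDING, null));
        return applyId;
    }

    static void mark(final String applyId, final Phase phase, final String reason) {
        // computeIfPresent keeps an unknown apply-id a silent no-op.
        STATUS.computeIfPresent(applyId, (k, old) -> new Status(phase, reason));
    }

    /** Runs the two (assumed) stages and marks a terminal phase on every exit path. */
    static String runApply(final Runnable preCommit, final Runnable commitTail) {
        final String applyId = begin();
        try {
            preCommit.run();
        } catch (final RuntimeException e) {
            // Nothing was committed yet, so this is a plain failure.
            mark(applyId, Phase.FAILED, "apply threw: " + e.getMessage());
            return applyId;
        }
        try {
            commitTail.run();
        } catch (final RuntimeException e) {
            // Already durable at this point: committed but unconfirmed, not a revert.
            mark(applyId, Phase.DEGRADED, "commit-tail deferred: " + e.getMessage());
            return applyId;
        }
        mark(applyId, Phase.APPLIED, null);
        return applyId;
    }
}
```

In this sketch a branch that forgets to call `mark` would simply leave the entry in PENDING, which matches the "stale PENDING the background watch reaps" note above.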
Verified: RuntimeRuleRestHandlerTest (20) unchanged, SchemaApplyCoordinatorTest (8), MalFileApplierTest (13) green; checkstyle + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../runtimerule/rest/RuntimeRuleService.java | 16 ++++++++++++++++ .../status/SchemaApplyCoordinator.java | 8 ++++++++ 2 files changed, 24 insertions(+) diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 01d645a4e8a3..3638f2bf13c1 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -78,6 +78,7 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; /** @@ -1002,6 +1003,11 @@ private HttpResponse applyStructural(final String catalog, final String name, // removedMetrics, swap appliedMal/appliedContent, retire old loader, alarm reset, // advance snapshot) is stashed in the dslManager — we drain it below once persist // resolves. + // Track this apply in the coordinator so a progress query (and peers, later) can observe + // its outcome. From here every exit path marks a terminal phase (APPLIED / FAILED / + // DEGRADED); a missed branch leaves only a stale PENDING the background watch reaps. 
+ final String applyId = SchemaApplyCoordinator.INSTANCE.begin( + catalog, name, ContentHash.sha256Hex(content)); final long updateTime = System.currentTimeMillis(); final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); @@ -1016,6 +1022,7 @@ private HttpResponse applyStructural(final String catalog, final String name, catalog, name, lce.getMessage()); dslManager.getSuspendCoord().localResume(catalog, name); broadcastResume(catalog, name, "layer_conflict"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "layer conflict: " + lce.getMessage()); return badRequest(lce.applyStatus(), catalog, name, lce.getMessage()); } catch (final Throwable t) { // Some exceptions wrap LayerConflictException as cause; unwrap so the operator @@ -1027,6 +1034,7 @@ private HttpResponse applyStructural(final String catalog, final String name, + "(wrapped): {}", catalog, name, wrapped.getMessage()); dslManager.getSuspendCoord().localResume(catalog, name); broadcastResume(catalog, name, "layer_conflict"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "layer conflict: " + wrapped.getMessage()); return badRequest(wrapped.applyStatus(), catalog, name, wrapped.getMessage()); } log.error("runtime-rule STRUCTURAL apply threw for {}/{}", catalog, name, t); @@ -1034,9 +1042,11 @@ private HttpResponse applyStructural(final String catalog, final String name, // Peers went SUSPENDED on our earlier broadcast; let them know the apply // aborted so they flip back to RUNNING within an RPC round-trip. 
broadcastResume(catalog, name, "apply_threw"); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "apply threw: " + t.getMessage()); return serverError("apply_failed", catalog, name, t.getMessage()); } if (postApply != null && postApply.getLastApplyError() != null) { + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, postApply.getLastApplyError()); // Apply failed (DDL verify mismatch, compile surprise, applier exception). Row // is NOT yet persisted. applyOneRuleFile already rolled back its own partial // registration on the exception path; the pendingCommits stash is only @@ -1100,6 +1110,7 @@ private HttpResponse applyStructural(final String catalog, final String name, + "for {}/{} — discarded pending commit; local node re-aligned with old " + "content. Operator action: re-push via /addOrUpdate once storage is healthy.", catalog, name); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "persist failed after apply"); return persistError; } @@ -1124,6 +1135,10 @@ private HttpResponse applyStructural(final String catalog, final String name, + "inspect log for the underlying cause.", catalog, name, t); } if (commitFailure != null) { + // Durable (DB persisted) but this node's commit-tail threw — peers converge from DB, + // this node retries on the next tick. Committed-but-unconfirmed = DEGRADED, not FAILED. + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "commit-tail deferred: DB persisted, local backend may be stale until next tick"); return serverError("commit_deferred", catalog, name, "DB row persisted, but local commit-tail threw — backend shape on this " + "node may not have fully landed. 
Peers converge from DB; this node " @@ -1140,6 +1155,7 @@ private HttpResponse applyStructural(final String catalog, final String name, broadcastResume(catalog, name, "force_no_change"); } + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); return ok(HttpStatus.OK, "structural_applied", catalog, name, "structural apply succeeded" + describeDelta(delta)); } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java index 887927df03d4..20b96ce74c74 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java @@ -48,6 +48,14 @@ @Slf4j public class SchemaApplyCoordinator { + /** + * Process-wide instance the production apply / cluster-RPC / reconcile paths share, mirroring + * the {@code MetadataRegistry.INSTANCE} / {@code DSLClassLoaderManager.INSTANCE} pattern so the + * coordinator need not be threaded through every constructor. Uses the wall-clock; tests + * construct their own instance with an injected clock instead. 
+ */ + public static final SchemaApplyCoordinator INSTANCE = new SchemaApplyCoordinator(); + private final Map<String, ApplyStatus> byApplyId = new ConcurrentHashMap<>(); private final Map<String, String> latestApplyIdByFile = new ConcurrentHashMap<>(); private final LongSupplier clock; From 4f684f8323415a0ed8484db54a83313890af6569 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 15:13:00 +0800 Subject: [PATCH 08/19] Add GetApplyStatus RPC + /runtime/rule/status query surface (Phase 3b) Expose the apply-status the main tracks (Phase 3a) to the UI/operator: - proto: GetApplyStatus(ApplyStatusRequest) -> ApplyStatusResponse on RuntimeRuleClusterService, with ApplyStatusPhase mirroring the Java ApplyPhase. - RuntimeRuleClusterServiceImpl.getApplyStatus: main-served; reads SchemaApplyCoordinator.INSTANCE by apply_id, else by (catalog,name) gated on content_hash; maps to the response (found=false / UNKNOWN when nothing matches). - RuntimeRuleClusterClient.getApplyStatus: routes the read to the deterministic main (MainRouter.mainPeer); null on unreachable -> caller degrades. - RuntimeRuleService.queryApplyStatus + GET /runtime/rule/status: self-main (or single-process) reads the local coordinator; otherwise routes to the main. Query by applyId (live handle) or catalog+name(+contentHash) once it's gone (page refresh / main restart). Always 200 JSON; found=false for no match. Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20) unchanged, SchemaApplyCoordinatorTest (8), MainRouterTest green; checkstyle + license clean. Background convergence watch + TTL eviction land in 3c.
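The lookup order described above (apply_id first, then catalog+name gated on content_hash, otherwise UNKNOWN) might look roughly like this. It is a sketch only: `ApplyStatusLookupSketch`, `track`, and `resolvePhase` are hypothetical names, phases are plain strings rather than the generated proto enum, and two behaviours are assumptions not confirmed by the patch: that a null or empty hash means "no gate", and that an apply_id miss falls through to the file lookup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of the status lookup order; not the real RPC handler. */
public class ApplyStatusLookupSketch {
    static final class Entry {
        final String contentHash;
        final String phase;

        Entry(final String contentHash, final String phase) {
            this.contentHash = contentHash;
            this.phase = phase;
        }
    }

    static final Map<String, Entry> BY_APPLY_ID = new ConcurrentHashMap<>();
    static final Map<String, String> LATEST_BY_FILE = new ConcurrentHashMap<>();

    static void track(final String applyId, final String catalog, final String name,
                      final String contentHash, final String phase) {
        BY_APPLY_ID.put(applyId, new Entry(contentHash, phase));
        LATEST_BY_FILE.put(catalog + "/" + name, applyId);
    }

    static String resolvePhase(final String applyId, final String catalog,
                               final String name, final String contentHash) {
        // 1. The live handle wins when the main still holds it.
        if (applyId != null && !applyId.isEmpty()) {
            final Entry byId = BY_APPLY_ID.get(applyId);
            if (byId != null) {
                return byId.phase;
            }
        }
        // 2. Otherwise resolve the latest apply for the file.
        if (catalog == null || name == null) {
            return "UNKNOWN";
        }
        final String latest = LATEST_BY_FILE.get(catalog + "/" + name);
        final Entry byFile = latest == null ? null : BY_APPLY_ID.get(latest);
        if (byFile == null) {
            return "UNKNOWN";
        }
        // 3. Gate on the content hash so an unrelated apply is never reported (assumed rule).
        final boolean gated = contentHash != null && !contentHash.isEmpty();
        if (gated && !contentHash.equals(byFile.contentHash)) {
            return "UNKNOWN";
        }
        return byFile.phase;
    }
}
```

A caller that receives UNKNOWN here would, per the commit message, fall back to comparing content hashes against the durable row.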
Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 1 + .../cluster/RuntimeRuleClusterClient.java | 31 +++++++ .../RuntimeRuleClusterServiceImpl.java | 64 +++++++++++++++ .../rest/RuntimeRuleRestHandler.java | 14 ++++ .../runtimerule/rest/RuntimeRuleService.java | 82 +++++++++++++++++++ .../src/main/proto/runtime-rule-cluster.proto | 51 ++++++++++++ 6 files changed, 243 insertions(+) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 4a6689523e99..f118c6611c45 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -243,6 +243,7 @@ #### OAP Server * Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the cumulative max revision, collapsing the whole file to one barrier. Drops still fence inline (a deletion's visibility is per-key and must not ride a batched revision flush); peer / `withoutSchemaChange` applies are unaffected (no fence). +* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → validating → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-fence-unconfirmed apply and `failed` carrying the specific reason). A new `GET /runtime/rule/status` REST endpoint (and a `GetApplyStatus` admin-internal gRPC served by the main) returns it — query by the apply's `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. 
Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. * Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. * Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. 
It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java index 3652f17feac4..cd709345ff7d 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java @@ -29,6 +29,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.RuntimeRuleClusterServiceGrpc; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendRequest; @@ -113,6 +115,35 @@ private SuspendAck suspendOne(final AdminClusterChannelManager.Peer peer, final } } + /** + * Route a read-only apply-status query to the cluster main (the only node 
that runs applies + * and holds the status). Returns {@code null} if no main is resolvable or the call fails — + * the REST caller then degrades to a content-hash comparison against the durable DAO row. + */ + public ApplyStatusResponse getApplyStatus(final ApplyStatusRequest request) { + final AdminClusterChannelManager.Peer main = MainRouter.mainPeer(peerChannelManager); + if (main == null) { + log.warn("runtime-rule GetApplyStatus skipped: no cluster main resolvable"); + return null; + } + final ManagedChannel channel = main.getChannel(); + if (channel == null) { + log.warn("runtime-rule GetApplyStatus skipped: main {} channel not yet established", + main.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.getApplyStatus(request); + } catch (final Throwable t) { + log.warn("runtime-rule GetApplyStatus to main {} failed: {}", + main.getAddress(), t.getMessage()); + return null; + } + } + /** * Fan out Resume to every non-self peer. Same transport, same sequential-with-deadline * policy as {@link #broadcastSuspend}. 
Called by the REST handler's failure branches so diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java index 7b01b6f5d8ea..22d2ac83df25 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -25,6 +25,9 @@ import java.util.Objects; import lombok.Setter; import lombok.extern.slf4j.Slf4j; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; @@ -41,6 +44,9 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleService; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyStatus; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; /** * Server-side handler for the three cluster-internal runtime-rule RPCs — see @@ -102,6 +108,64 @@ public 
RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String s this.selfNodeId = selfNodeId; } + /** + * Read-only apply-status query, served by the main (only the main runs applies and holds the + * status). Resolves by apply_id when present, else by (catalog, name) gated on content_hash. + * Returns {@code found=false} / {@code APPLY_PHASE_UNKNOWN} when nothing matches — the caller + * falls back to comparing the durable content hash against the DAO row. + */ + @Override + public void getApplyStatus(final ApplyStatusRequest request, + final StreamObserver<ApplyStatusResponse> responseObserver) { + final ApplyStatus status; + if (!request.getApplyId().isEmpty()) { + status = SchemaApplyCoordinator.INSTANCE.get(request.getApplyId()); + } else { + final String hash = request.getContentHash().isEmpty() ? null : request.getContentHash(); + status = SchemaApplyCoordinator.INSTANCE.getLatestByFile( + request.getCatalog(), request.getName(), hash); + } + final ApplyStatusResponse.Builder resp = ApplyStatusResponse.newBuilder().setNodeId(selfNodeId); + if (status == null) { + resp.setFound(false).setPhase(ApplyStatusPhase.APPLY_PHASE_UNKNOWN); + } else { + resp.setFound(true) + .setApplyId(status.getApplyId()) + .setCatalog(status.getCatalog()) + .setName(status.getName()) + .setContentHash(status.getContentHash() == null ? "" : status.getContentHash()) + .setPhase(toProtoPhase(status.getPhase())) + .setFailureReason(status.getFailureReason() == null ? 
"" : status.getFailureReason()) + .setStartedAtMs(status.getStartedAtMs()) + .setUpdatedAtMs(status.getUpdatedAtMs()); + } + responseObserver.onNext(resp.build()); + responseObserver.onCompleted(); + } + + private static ApplyStatusPhase toProtoPhase(final ApplyPhase phase) { + switch (phase) { + case PENDING: + return ApplyStatusPhase.APPLY_PHASE_PENDING; + case VALIDATING: + return ApplyStatusPhase.APPLY_PHASE_VALIDATING; + case DDL: + return ApplyStatusPhase.APPLY_PHASE_DDL; + case FENCING: + return ApplyStatusPhase.APPLY_PHASE_FENCING; + case ROLLING_OUT: + return ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT; + case APPLIED: + return ApplyStatusPhase.APPLY_PHASE_APPLIED; + case DEGRADED: + return ApplyStatusPhase.APPLY_PHASE_DEGRADED; + case FAILED: + return ApplyStatusPhase.APPLY_PHASE_FAILED; + default: + return ApplyStatusPhase.APPLY_PHASE_UNKNOWN; + } + } + @Override public void suspend(final SuspendRequest request, final StreamObserver responseObserver) { diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java index 77084f614fcb..bdb4dd0c5b74 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandler.java @@ -113,6 +113,20 @@ public HttpResponse list(@Param("catalog") @Default("") final String catalog) { return service.list(catalog); } + /** + * Apply-status query for the UI / operator. Served by the cluster main (the service routes + * there when self isn't main). 
Query by {@code applyId} (the live handle from a just-submitted + * apply) or, once it's gone, by {@code catalog} + {@code name} (+ optional {@code contentHash} + * — the durable identity). Always 200 with a JSON status; {@code found=false} when nothing matches. + */ + @Get("/runtime/rule/status") + public HttpResponse applyStatus(@Param("catalog") @Default("") final String catalog, + @Param("name") @Default("") final String name, + @Param("contentHash") @Default("") final String contentHash, + @Param("applyId") @Default("") final String applyId) { + return service.queryApplyStatus(catalog, name, contentHash, applyId); + } + @Get("/runtime/rule") public HttpResponse get(@Param("catalog") final String catalog, @Param("name") final String name, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 3638f2bf13c1..164005b0959a 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -69,6 +69,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.layer.LayerConflictException; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.MainRouter; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; import 
org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendState; @@ -78,6 +80,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.SuspendResult; import org.apache.skywalking.oap.server.receiver.runtimerule.state.AppliedRuleScript; import org.apache.skywalking.oap.server.receiver.runtimerule.state.DSLRuntimeState; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyStatus; import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; import org.apache.skywalking.oap.server.receiver.runtimerule.util.ContentHash; @@ -270,6 +274,84 @@ private HttpResponse forwardToMain(final String mainAddr, } } + /** + * Read-only apply-status query for the UI / operator. Served by the main: when self is main + * (or single-process / no cluster client), reads the local {@link SchemaApplyCoordinator}; + * otherwise routes to the main via {@code GetApplyStatus}. Resolve by {@code applyId} (the + * live handle) or by {@code catalog}/{@code name} (+ optional {@code contentHash}) once the + * apply-id is gone (page refresh / main restart). Always 200 with a JSON status; {@code + * found=false} / phase {@code UNKNOWN} when nothing matches (caller compares the durable + * content hash itself). + */ + public HttpResponse queryApplyStatus(final String catalog, final String name, + final String contentHash, final String applyId) { + final AdminClusterChannelManager apm = resolvePeerChannelManager(); + if (apm == null || MainRouter.isSelfMain(apm) || clusterClient == null) { + final ApplyStatus local = applyId.isEmpty() + ? SchemaApplyCoordinator.INSTANCE.getLatestByFile( + catalog, name, contentHash.isEmpty() ? 
null : contentHash) + : SchemaApplyCoordinator.INSTANCE.get(applyId); + return applyStatusJson(local); + } + final ApplyStatusResponse remote = clusterClient.getApplyStatus(ApplyStatusRequest.newBuilder() + .setApplyId(applyId).setCatalog(catalog).setName(name).setContentHash(contentHash).build()); + if (remote == null) { + return HttpResponse.of(HttpStatus.BAD_GATEWAY, MediaType.JSON_UTF_8, + jsonBody("status_unavailable", catalog, name, + "could not reach the cluster main for apply status; retry shortly")); + } + return applyStatusJsonFromProto(remote); + } + + private HttpResponse applyStatusJson(final ApplyStatus s) { + final JsonObject o = new JsonObject(); + if (s == null) { + o.addProperty("found", false); + o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + } else { + o.addProperty("found", true); + o.addProperty("applyId", s.getApplyId()); + o.addProperty("catalog", s.getCatalog()); + o.addProperty("name", s.getName()); + o.addProperty("contentHash", s.getContentHash()); + o.addProperty("phase", s.getPhase().name()); + if (s.getFailureReason() != null) { + o.addProperty("failureReason", s.getFailureReason()); + } + o.addProperty("startedAtMs", s.getStartedAtMs()); + o.addProperty("updatedAtMs", s.getUpdatedAtMs()); + } + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + + private HttpResponse applyStatusJsonFromProto(final ApplyStatusResponse r) { + final JsonObject o = new JsonObject(); + o.addProperty("found", r.getFound()); + if (r.getFound()) { + o.addProperty("applyId", r.getApplyId()); + o.addProperty("catalog", r.getCatalog()); + o.addProperty("name", r.getName()); + o.addProperty("contentHash", r.getContentHash()); + o.addProperty("phase", stripApplyPhasePrefix(r.getPhase().name())); + if (!r.getFailureReason().isEmpty()) { + o.addProperty("failureReason", r.getFailureReason()); + } + o.addProperty("startedAtMs", r.getStartedAtMs()); + o.addProperty("updatedAtMs", r.getUpdatedAtMs()); + } else { + 
o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + } + o.addProperty("servedBy", r.getNodeId()); + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + + /** Strip the proto {@code APPLY_PHASE_} prefix so the routed-from-main JSON phase matches the + * local-path {@link ApplyPhase} names (e.g. {@code APPLY_PHASE_APPLIED} → {@code APPLIED}). */ + private static String stripApplyPhasePrefix(final String protoName) { + final String prefix = "APPLY_PHASE_"; + return protoName.startsWith(prefix) ? protoName.substring(prefix.length()) : protoName; + } + private AdminClusterChannelManager resolvePeerChannelManager() { AdminClusterChannelManager local = peerChannelManager; if (local != null) { diff --git a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto index 898204985bad..7f7e23935674 100644 --- a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto +++ b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto @@ -51,6 +51,11 @@ service RuntimeRuleClusterService { // body in the ack. Eliminates the "client must resubmit" round-trip from the old // 421 Misdirected Request behaviour. rpc Forward(ForwardRequest) returns (ForwardResponse); + // Query the apply-status the main tracks for a runtime-rule file. Read-only; served by the + // main (non-main nodes route here, since only the main runs applies and holds the status). + // Lets the UI/operator poll an apply's progress, and — after the ephemeral apply_id is gone + // (page refresh / main restart) — ask by (catalog, name[, content_hash]), the durable identity. + rpc GetApplyStatus(ApplyStatusRequest) returns (ApplyStatusResponse); } message SuspendRequest { @@ -166,3 +171,49 @@ message ForwardResponse { // operator can correlate which node refused. string node_id = 3; } + +// Identifies the apply whose status the caller wants. 
Either apply_id (the live handle a +// just-submitted apply was given) OR catalog+name (+ optional content_hash) — the content-based +// path used once the apply_id is gone. content_hash, when set, requires the latest tracked apply +// for that file to match (else found=false), so a stale query doesn't misreport a newer apply. +message ApplyStatusRequest { + string apply_id = 1; + string catalog = 2; + string name = 3; + string content_hash = 4; +} + +// Lifecycle phase of an apply on the main. Mirrors the Java ApplyPhase; prefixed to avoid +// collision with the other enums in this package. +enum ApplyStatusPhase { + APPLY_PHASE_UNSPECIFIED = 0; + APPLY_PHASE_PENDING = 1; + APPLY_PHASE_VALIDATING = 2; + APPLY_PHASE_DDL = 3; + APPLY_PHASE_FENCING = 4; + APPLY_PHASE_ROLLING_OUT = 5; + APPLY_PHASE_APPLIED = 6; + APPLY_PHASE_DEGRADED = 7; + APPLY_PHASE_FAILED = 8; + // The main no longer holds an apply matching the request (evicted / main restarted, or the + // content_hash did not match the latest tracked apply). Caller falls back to a content-hash + // comparison against the durable DAO row. + APPLY_PHASE_UNKNOWN = 9; +} + +message ApplyStatusResponse { + // False when no matching apply is tracked (phase is then APPLY_PHASE_UNKNOWN and the rest + // are empty/zero). True otherwise, with the fields below populated. + bool found = 1; + string apply_id = 2; + string catalog = 3; + string name = 4; + string content_hash = 5; + ApplyStatusPhase phase = 6; + // Non-empty only for APPLY_PHASE_FAILED / APPLY_PHASE_DEGRADED. + string failure_reason = 7; + int64 started_at_ms = 8; + int64 updated_at_ms = 9; + // Instance id of the node that produced this status (the main). 
+ string node_id = 10; +} From bd10ea922d4bd671aba74a42d3b988e7114df6cc Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 15:18:07 +0800 Subject: [PATCH 09/19] Add apply-status TTL eviction sweep (Phase 3c) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SchemaApplyCoordinator.evictExpired(ttlMs) reaps tracked applies whose last update is older than the retention window — terminal ones linger long enough for a post-apply UI poll / post-refresh content query, then are dropped to bound memory; a stale PENDING left by a missed apply branch is reaped too (a later query then returns UNKNOWN and the caller falls back to the durable content hash). The (catalog,name)->latest index entry is cleared only when it still points at an evicted apply, so a newer apply for the same file keeps its mapping. RuntimeRuleModuleProvider schedules the sweep on the existing reconciler executor (every 5 min, 1 h TTL). The live DEGRADED->APPLIED re-fence is intentionally NOT done here: it needs BanyanDB-client access runtime-rule does not hold, and BanyanDB's own 30s reconcile converges the actual schema regardless — an operator re-query reflects it. Tests: SchemaApplyCoordinatorTest +2 (evict reaps old + clears its file index; newer apply keeps the index when the older is evicted). checkstyle + license clean. Completes Phase 3 (3a wiring + 3b query surface + 3c eviction). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../module/RuntimeRuleModuleProvider.java | 16 ++++++++ .../status/SchemaApplyCoordinator.java | 33 ++++++++++++++++- .../status/SchemaApplyCoordinatorTest.java | 37 +++++++++++++++++++ 3 files changed, 85 insertions(+), 1 deletion(-) diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java index ec7197d5de63..b5e307b5b6e1 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java @@ -41,6 +41,7 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterClient; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.RuntimeRuleClusterServiceImpl; import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; import org.apache.skywalking.oap.server.receiver.runtimerule.rest.RuntimeRuleRestHandler; import org.apache.skywalking.oap.server.telemetry.TelemetryModule; import org.apache.skywalking.oap.server.telemetry.api.TelemetryRelatedContext; @@ -218,6 +219,11 @@ public class RuntimeRuleModuleProvider extends ModuleProvider { * unchanged bundles. */ private static final long SCHEDULER_INITIAL_DELAY_SECONDS = 2L; + /** Retain a tracked apply-status this long after its last update so a post-apply UI poll (and + * a post-refresh content query) still resolves, then reap it to bound memory. 
*/ + private static final long APPLY_STATUS_TTL_MS = 3_600_000L; + /** How often the apply-status eviction sweep runs on the reconciler executor. */ + private static final long APPLY_STATUS_EVICT_INTERVAL_SECONDS = 300L; /** * Env var carrying this OAP's unique per-node identity — the Kubernetes pod UID, injected @@ -406,6 +412,16 @@ public void notifyAfterCompleted() throws ModuleStartException { ); log.info("Runtime rule dslManager scheduled: first tick in {} s, then every {} s.", SCHEDULER_INITIAL_DELAY_SECONDS, intervalSeconds); + + // Bound the apply-status coordinator's memory: reap tracked applies past the retention + // window (terminal ones linger long enough for a post-apply UI poll; a stale PENDING from + // a missed branch is reaped too — a later query then returns UNKNOWN and the caller falls + // back to the durable content hash). Reuses the same single-thread executor; the sweep is + // O(tracked) and cheap. + reconcilerExecutor.scheduleWithFixedDelay( + () -> SchemaApplyCoordinator.INSTANCE.evictExpired(APPLY_STATUS_TTL_MS), + APPLY_STATUS_EVICT_INTERVAL_SECONDS, APPLY_STATUS_EVICT_INTERVAL_SECONDS, TimeUnit.SECONDS + ); } /** diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java index 20b96ce74c74..2ecf09d497ec 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java @@ -18,7 +18,9 @@ package org.apache.skywalking.oap.server.receiver.runtimerule.status; +import java.util.HashSet; import java.util.Map; +import java.util.Set; import java.util.UUID; import 
java.util.concurrent.ConcurrentHashMap; import java.util.function.LongSupplier; @@ -140,11 +142,40 @@ public ApplyStatus getLatestByFile(final String catalog, final String name, return status; } - /** Number of tracked applies — for tests and the future TTL-eviction watch. */ + /** Number of tracked applies — for tests and the TTL-eviction watch. */ public int trackedCount() { return byApplyId.size(); } + /** + * Evict tracked applies whose last update is older than {@code ttlMs}, bounding memory. Both + * terminal entries (kept around so a post-refresh query within the window still resolves) and + * stale non-terminal ones (a missed apply branch left in PENDING) are reaped once past the + * TTL — a later query then returns {@code null}/UNKNOWN and the caller falls back to comparing + * the durable content hash against the DAO row. The {@code (catalog, name) → latest} index + * entry is cleared only when it still points at an evicted apply, so a newer apply for the + * same file keeps its mapping. Returns the number evicted. + */ + public int evictExpired(final long ttlMs) { + final long cutoff = clock.getAsLong() - ttlMs; + final Set<String> evicted = new HashSet<>(); + byApplyId.entrySet().removeIf(e -> { + if (e.getValue().getUpdatedAtMs() < cutoff) { + evicted.add(e.getKey()); + return true; + } + return false; + }); + if (!evicted.isEmpty()) { + latestApplyIdByFile.values().removeIf(evicted::contains); + if (log.isDebugEnabled()) { + log.debug("apply-status: evicted {} entr{} older than {} ms", + evicted.size(), evicted.size() == 1 ? 
"y" : "ies", ttlMs); + } + } + return evicted.size(); + } + private static String fileKey(final String catalog, final String name) { return catalog + "/" + name; } diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java index 85e2765ab50c..14dfa98ad641 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java @@ -128,6 +128,43 @@ void getLatestByFileHonorsExpectedContentHash() { assertNull(coord.getLatestByFile("otel-rules", "absent", null)); } + @Test + void evictExpiredReapsOldEntriesAndClearsTheirFileIndex() { + final SchemaApplyCoordinator coord = newCoordinator(); + clock.set(1_000L); + final String old = coord.begin("otel-rules", "old", "h"); + coord.markApplied(old); + clock.set(10_000L); + final String fresh = coord.begin("otel-rules", "fresh", "h2"); + + // cutoff = now(10_000) - ttl(5_000) = 5_000; 'old' (updatedAt 1_000) is evicted, 'fresh' kept. 
+ final int evicted = coord.evictExpired(5_000L); + + assertEquals(1, evicted); + assertNull(coord.get(old), "expired apply must be evicted"); + assertNull(coord.getLatestByFile("otel-rules", "old", null), + "the file index entry pointing at an evicted apply must be cleared"); + assertNotNull(coord.get(fresh), "a fresh apply must survive eviction"); + assertEquals(1, coord.trackedCount()); + } + + @Test + void evictExpiredKeepsFileIndexWhenANewerApplyReplacedTheEvictedOne() { + final SchemaApplyCoordinator coord = newCoordinator(); + clock.set(1_000L); + final String first = coord.begin("otel-rules", "vm", "h1"); + coord.markApplied(first); + clock.set(10_000L); + final String second = coord.begin("otel-rules", "vm", "h2"); + + coord.evictExpired(5_000L); + + assertNull(coord.get(first), "the older apply for the file is evicted"); + final ApplyStatus latest = coord.getLatestByFile("otel-rules", "vm", null); + assertNotNull(latest, "the file index must still resolve via the newer apply"); + assertEquals(second, latest.getApplyId()); + } + @Test void latestByFileFollowsTheNewestApply() { final SchemaApplyCoordinator coord = newCoordinator(); From 7cb9b60ea6a12dad7398ae6d69d2325be9770569 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 15:33:09 +0800 Subject: [PATCH 10/19] Push runtime-rule convergence to peers on commit (Phase 4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit After a successful structural commit the main broadcasts a new NotifyApplied admin-internal RPC so peers converge NOW instead of waiting up to one ~30s refresh tick to notice the new DB row. - proto: NotifyApplied(NotifyAppliedRequest{catalog,name,content_hash,sender,ts}) -> NotifyAppliedAck. - RuntimeRuleClusterServiceImpl.notifyApplied: self-broadcast suppressed; else submits a full reconcile (dslManager.tick) to a single daemon executor off the gRPC thread and acks immediately. 
The reconcile is per-file-locked + idempotent (unchanged files short-circuit on hash); a lost/failed notify is non-fatal — the peer self-converges on its own tick. - RuntimeRuleClusterClient.broadcastNotifyApplied: best-effort fan-out to non-self peers, same sequential-with-deadline transport as Suspend/Resume. - RuntimeRuleService: on the drained success path, broadcastNotifyApplied with the committed content hash (the !drained force-no-change path keeps its Resume). Design note: the apply correlation rides this POST-commit notify, not Suspend — Suspend is broadcast before the apply runs, when the apply-id/revision don't yet exist. Tightens the convergence window without a hard peer->main dependency. Deferred follow-up: per-node failure breakdown (DSLRuntimeState.failureReason aggregated into GetApplyStatus) — the status is main-orchestrated today. Proto regenerates cleanly; RuntimeRuleRestHandlerTest (20), MalFileApplierTest (13), SchemaApplyCoordinatorTest (10), MainRouterTest green; checkstyle + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 1 + .../cluster/RuntimeRuleClusterClient.java | 51 ++++++++++++++++++ .../RuntimeRuleClusterServiceImpl.java | 53 +++++++++++++++++++ .../runtimerule/rest/RuntimeRuleService.java | 22 ++++++++ .../src/main/proto/runtime-rule-cluster.proto | 25 +++++++++ 5 files changed, 152 insertions(+) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index f118c6611c45..4d64101d8f65 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -244,6 +244,7 @@ #### OAP Server * Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. 
The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the cumulative max revision, collapsing the whole file to one barrier. Drops still fence inline (a deletion's visibility is per-key and must not ride a batched revision flush); peer / `withoutSchemaChange` applies are unaffected (no fence). * Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → validating → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-fence-unconfirmed apply and `failed` carrying the specific reason). A new `GET /runtime/rule/status` REST endpoint (and a `GetApplyStatus` admin-internal gRPC served by the main) returns it — query by the apply's `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. +* Push runtime-rule convergence to peers on commit. After a successful structural apply the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile off the gRPC thread; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. 
* Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. * Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. 
a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java index cd709345ff7d..f2bc2438926b 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterClient.java @@ -31,6 +31,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.RuntimeRuleClusterServiceGrpc; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.SuspendRequest; @@ -144,6 +146,55 @@ public ApplyStatusResponse getApplyStatus(final ApplyStatusRequest request) { } } + /** + * Fan out NotifyApplied to every non-self peer after a successful commit so peers converge + * NOW (run a reconcile against the just-persisted DB row) rather than on their next ~30s tick. 
+ * Best-effort, same sequential-with-deadline transport as the others; unreachable peers + * self-converge on their own tick. + */ + public List<NotifyAppliedAck> broadcastNotifyApplied(final String catalog, final String name, + final String contentHash) { + final List<AdminClusterChannelManager.Peer> peers = peerChannelManager.getPeers(); + final List<NotifyAppliedAck> acks = new ArrayList<>(peers.size()); + for (final AdminClusterChannelManager.Peer peer : peers) { + if (peer.isSelf()) { + continue; + } + final NotifyAppliedAck ack = notifyAppliedOne(peer, catalog, name, contentHash); + if (ack != null) { + acks.add(ack); + } + } + return acks; + } + + private NotifyAppliedAck notifyAppliedOne(final AdminClusterChannelManager.Peer peer, + final String catalog, final String name, + final String contentHash) { + final ManagedChannel channel = peer.getChannel(); + if (channel == null) { + log.warn("runtime-rule NotifyApplied skipped for peer {}: channel not yet established", + peer.getAddress()); + return null; + } + final RuntimeRuleClusterServiceGrpc.RuntimeRuleClusterServiceBlockingStub stub = + RuntimeRuleClusterServiceGrpc.newBlockingStub(channel) + .withDeadlineAfter(perCallDeadlineMs, TimeUnit.MILLISECONDS); + try { + return stub.notifyApplied(NotifyAppliedRequest.newBuilder() + .setCatalog(catalog) + .setName(name) + .setContentHash(contentHash == null ? "" : contentHash) + .setSenderNodeId(selfNodeId) + .setIssuedAtMs(System.currentTimeMillis()) + .build()); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied to peer {} failed for {}/{}: {}", + peer.getAddress(), catalog, name, t.getMessage()); + return null; + } + } + /** * Fan out Resume to every non-self peer. Same transport, same sequential-with-deadline * policy as {@link #broadcastSuspend}. 
Called by the REST handler's failure branches so diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java index 22d2ac83df25..706f6b3ed550 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -23,6 +23,8 @@ import io.grpc.stub.StreamObserver; import java.nio.charset.StandardCharsets; import java.util.Objects; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; import lombok.Setter; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; @@ -30,6 +32,8 @@ import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ForwardResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeAck; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeRequest; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ResumeState; @@ -103,11 +107,60 @@ public class RuntimeRuleClusterServiceImpl @Setter private volatile RuntimeRuleService runtimeRuleService; + /** Off-RPC-thread runner for notify-triggered reconciles so {@link 
#notifyApplied} acks + * immediately. Single daemon thread — reconciles are per-file-locked + idempotent, so + * serializing them is fine; daemon so it never blocks JVM shutdown. */ + private final ExecutorService reconcileNudgeExecutor = Executors.newSingleThreadExecutor(r -> { + final Thread t = new Thread(r, "runtime-rule-notify-reconcile"); + t.setDaemon(true); + return t; + }); + public RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String selfNodeId) { this.dslManager = dslManager; this.selfNodeId = selfNodeId; } + /** + * Push-notify from the main after a successful commit: converge NOW rather than on the next + * ~30s tick. Runs a full reconcile off the gRPC thread (idempotent, per-file-locked — unchanged + * files short-circuit on hash). Best-effort: the peer self-converges on its own tick if this is + * lost, so a self-broadcast or a schedule failure is non-fatal. + */ + @Override + public void notifyApplied(final NotifyAppliedRequest request, + final StreamObserver<NotifyAppliedAck> responseObserver) { + if (Objects.equals(selfNodeId, request.getSenderNodeId())) { + responseObserver.onNext(NotifyAppliedAck.newBuilder() + .setNodeId(selfNodeId).setAccepted(false) + .setDetail("self-broadcast suppressed").build()); + responseObserver.onCompleted(); + return; + } + boolean accepted = false; + try { + reconcileNudgeExecutor.submit(() -> { + try { + dslManager.tick(); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied reconcile for {}/{} failed; peer will " + + "self-converge on its next tick: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + }); + accepted = true; + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied could not schedule reconcile for {}/{}: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + responseObserver.onNext(NotifyAppliedAck.newBuilder() + .setNodeId(selfNodeId) + .setAccepted(accepted) + .setDetail(accepted ? 
"reconcile scheduled" : "schedule failed; self-converge on next tick") + .build()); + responseObserver.onCompleted(); + } + /** * Read-only apply-status query, served by the main (only the main runs applies and holds the * status). Resolves by apply_id when present, else by (catalog, name) gated on content_hash. diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 164005b0959a..846d1f44ffa4 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -1235,6 +1235,11 @@ private HttpResponse applyStructural(final String catalog, final String name, // recover within an RPC round-trip. if (!drained) { broadcastResume(catalog, name, "force_no_change"); + } else { + // Real change committed and durable — push peers to converge NOW (reconcile against + // the just-persisted DB row) instead of waiting up to one ~30s refresh tick. + // Best-effort; peers self-converge on their own tick if the notify is lost. + broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); } SchemaApplyCoordinator.INSTANCE.markApplied(applyId); @@ -1404,6 +1409,23 @@ private List broadcastSuspend(final String catalog, final String nam } } + /** + * Fire NotifyApplied to every non-self peer after a successful commit so peers converge NOW + * instead of waiting up to one ~30s refresh tick. Best-effort and fire-and-forget: a failure + * is non-fatal because peers self-converge on their own tick regardless. 
+ */ + private void broadcastNotifyApplied(final String catalog, final String name, final String contentHash) { + if (clusterClient == null) { + return; + } + try { + clusterClient.broadcastNotifyApplied(catalog, name, contentHash); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied broadcast failed for {}/{}; peers will converge " + + "on their next tick", catalog, name, t); + } + } + /** * Inspect Suspend acks for the split-brain guard: if any peer responded with REJECTED * (origin conflict — it believes it is the main and is mid-apply), surface that to the diff --git a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto index 7f7e23935674..d6d15c5c0e32 100644 --- a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto +++ b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto @@ -56,6 +56,11 @@ service RuntimeRuleClusterService { // Lets the UI/operator poll an apply's progress, and — after the ephemeral apply_id is gone // (page refresh / main restart) — ask by (catalog, name[, content_hash]), the durable identity. rpc GetApplyStatus(ApplyStatusRequest) returns (ApplyStatusResponse); + // Fired by the main right after it commits a structural apply, so peers converge NOW + // (run a reconcile against the just-persisted DB row) instead of waiting up to one refresh + // tick (~30s). Best-effort: the ack only confirms the peer accepted the nudge; the peer + // still self-converges on its next tick if the notify is lost. + rpc NotifyApplied(NotifyAppliedRequest) returns (NotifyAppliedAck); } message SuspendRequest { @@ -217,3 +222,23 @@ message ApplyStatusResponse { // Instance id of the node that produced this status (the main). string node_id = 10; } + +// Sent by the main after a successful structural commit so peers reconcile immediately. 
+message NotifyAppliedRequest { + string catalog = 1; + string name = 2; + // The committed content hash — informational/correlation; the peer reconciles from the DB row. + string content_hash = 3; + // Main's instance id; receivers compare against their own to suppress self-broadcast loops. + string sender_node_id = 4; + // Epoch millis on the main; diagnostic only. + int64 issued_at_ms = 5; +} + +message NotifyAppliedAck { + string node_id = 1; + // True when the peer scheduled an immediate reconcile; false on self-broadcast suppression + // or when the receiver isn't ready (it still self-converges on its next tick). + bool accepted = 2; + string detail = 3; +} From 6c6325fc690e10673aa1fc940ee3921222a6fd09 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 16:23:02 +0800 Subject: [PATCH 11/19] Fix runtime-rule fence correctness: one-shot deferred flush + per-delete drop fence MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two HIGH issues found reviewing the batched-fence work on this branch: - NC2: StorageManipulationOpt.runDeferredFence() was not one-shot. A reconciler tick reuses ONE opt across every rule file (RuleSync#runOnce), so a later file that performed no DDL re-ran the previous file's stale fence, and the cumulative revision made each file over-fence on prior files' DDL. runDeferredFence now clears the closure and resets the accumulated revision after it runs (in finally, so a transport failure still isolates the next file); each file fences on its own DDL only. The closure reads getMaxModRevision() during await, so reset happens after. - NC1: BanyanDBIndexInstaller drop fence decided rev>0 vs AwaitSchemaDeleted on opt.getMaxModRevision() (cumulative across the shared opt). An earlier create/ binding revision on the same opt made a tombstone-less primary delete take the revision branch and silently skip the deletion barrier. 
dropTable now captures the primary resource's OWN delete revision and threads it to fenceOnRevisionOrDeletion, which decides on that value (0 for trace/property, whose delete RPCs have no revision variant — those always key-fence). Added doFenceOnRevisionValue(client, rev, ctx) as the value-based core. Tests: 3 new StorageManipulationOptTest cases (one-shot across files; revision seen during await then reset; reset even when the fence throws). Changelog: the batched-fence bullet now describes the one-shot flush and per-delete drop fence. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 2 +- .../storage/model/StorageManipulationOpt.java | 16 ++++++- .../model/StorageManipulationOptTest.java | 46 ++++++++++++++++++ .../banyandb/BanyanDBIndexInstaller.java | 47 ++++++++++++++----- 4 files changed, 96 insertions(+), 15 deletions(-) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 4d64101d8f65..fc94048aaa2e 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,7 +242,7 @@ admin-host only" entry above for the public REST retirement. #### OAP Server -* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the cumulative max revision, collapsing the whole file to one barrier. Drops still fence inline (a deletion's visibility is per-key and must not ride a batched revision flush); peer / `withoutSchemaChange` applies are unaffected (no fence). +* Batch the BanyanDB schema fence per runtime-rule apply. 
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. Peer / `withoutSchemaChange` applies are unaffected (no fence). * Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → validating → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-fence-unconfirmed apply and `failed` carrying the specific reason). A new `GET /runtime/rule/status` REST endpoint (and a `GetApplyStatus` admin-internal gRPC served by the main) returns it — query by the apply's `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. * Push runtime-rule convergence to peers on commit. 
After a successful structural apply the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile off the gRPC thread; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. * Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. 
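The one-shot flush described in the batched-fence entry above can be illustrated with a minimal, self-contained sketch. It is not the real `StorageManipulationOpt`: the class `OneShotFenceSketch`, the `Fence` interface, and the method names are hypothetical stand-ins that only mirror the behaviour the entry describes (clear the closure before running it, reset the accumulated revision in `finally`).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the shared opt; all names here are illustrative only. */
public class OneShotFenceSketch {

    /** A fence that may fail, mirroring a barrier call that can throw. */
    @FunctionalInterface
    public interface Fence {
        void await() throws Exception;
    }

    private final AtomicLong maxRevision = new AtomicLong(0L);
    private volatile Fence deferredFence;

    /** Keep the highest revision recorded since the last flush. */
    public void recordRevision(long rev) {
        maxRevision.accumulateAndGet(rev, Math::max);
    }

    public long getMaxRevision() {
        return maxRevision.get();
    }

    public void setDeferredFence(Fence fence) {
        this.deferredFence = fence;
    }

    /** Runs the registered fence at most once, then resets the accumulated revision. */
    public void runDeferredFence() throws Exception {
        final Fence fence = this.deferredFence;
        if (fence == null) {
            return; // nothing registered since the last flush
        }
        this.deferredFence = null; // clear first so a later call cannot re-run it
        try {
            fence.await(); // the fence may still read getMaxRevision() here
        } finally {
            maxRevision.set(0L); // reset even when the fence throws
        }
    }

    public static void main(String[] args) throws Exception {
        OneShotFenceSketch opt = new OneShotFenceSketch();
        int[] calls = new int[1];
        long[] seen = new long[1];
        opt.recordRevision(42L);
        opt.setDeferredFence(() -> {
            calls[0]++;
            seen[0] = opt.getMaxRevision();
        });
        opt.runDeferredFence(); // "file 1": fences on revision 42
        opt.runDeferredFence(); // "file 2": performed no DDL, so no fence runs
        System.out.println(calls[0] + " " + seen[0] + " " + opt.getMaxRevision()); // prints "1 42 0"
    }
}
```

Clearing the closure before `await()` and resetting the revision only afterwards is what lets the fence observe its own file's revision while the next file starts from a clean state.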
diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index 43ab5b570c37..5ef07ff7ba35 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -432,11 +432,25 @@ public void setDeferredFence(final DeferredFence fence) { * Run the registered {@link DeferredFence} once, if any. Called by the apply orchestration * after all DDL for the batch is fired. No-op when nothing was registered (peer/no-change * applies, or non-BanyanDB backends). + * + *
<p>
One-shot. A single reconciler tick reuses ONE opt across every rule + * file ({@code RuleSync#runOnce}), calling this once per file. After the fence runs, the + * closure is cleared and {@link #maxModRevision} is reset so the next file neither re-runs + * this file's stale fence (when that file performed no DDL) nor waits on this file's + * revision — each file fences on its own DDL only. The reset happens in a {@code finally} + * so a fence transport failure still isolates the next file. The closure reads + * {@link #getMaxModRevision()} during {@code await()}, so it is reset only after. */ public void runDeferredFence() throws StorageException { final DeferredFence fence = this.deferredFence; - if (fence != null) { + if (fence == null) { + return; + } + this.deferredFence = null; + try { fence.await(); + } finally { + maxModRevision.set(0L); } } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java index 76a900811b4d..669dc1c095ab 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java @@ -83,6 +83,52 @@ void runDeferredFencePropagatesStorageException() { assertThrows(StorageException.class, opt::runDeferredFence); } + @Test + void runDeferredFenceIsOneShotAcrossFiles() throws StorageException { + // A reconciler tick reuses ONE opt across every rule file. After a file flushes its + // fence, a later file that performed no DDL (registers no new closure) must NOT re-run + // the earlier file's stale fence. 
+ final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + final AtomicInteger calls = new AtomicInteger(); + opt.setDeferredFence(calls::incrementAndGet); + + opt.runDeferredFence(); + opt.runDeferredFence(); + + assertEquals(1, calls.get(), "the fence must run once and be cleared, not re-run for the next file"); + } + + @Test + void runDeferredFenceResetsRevisionAfterAwait() throws StorageException { + // The closure must observe this file's accumulated revision DURING await, then the opt + // resets so the next file fences on its own DDL only — not the cumulative max. + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(42L); + final AtomicInteger seenDuringAwait = new AtomicInteger(); + opt.setDeferredFence(() -> seenDuringAwait.set((int) opt.getMaxModRevision())); + + opt.runDeferredFence(); + + assertEquals(42L, seenDuringAwait.get(), "the fence must see the recorded revision during await"); + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "the revision must reset after the fence so a later file is not over-fenced"); + } + + @Test + void runDeferredFenceResetsRevisionEvenWhenFenceThrows() { + // A barrier transport failure on one file must not leave a stale revision that the next + // file would inherit; the reset runs in finally. 
+ final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(99L); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport error"); + }); + + assertThrows(StorageException.class, opt::runDeferredFence); + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "the revision must reset even when the fence throws"); + } + @Test void laterSetDeferredFenceWins() throws StorageException { // The installer may register the closure once per resource; the latest (equivalent) one diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index 9657574ce1cf..ea9e2b169073 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -308,7 +308,11 @@ private void fenceOnRevision(final BanyanDBClient client, final StorageManipulat private void doFenceOnRevision(final BanyanDBClient client, final StorageManipulationOpt opt, final String context) throws BanyanDBException { - final long rev = opt.getMaxModRevision(); + doFenceOnRevisionValue(client, opt.getMaxModRevision(), context); + } + + private void doFenceOnRevisionValue(final BanyanDBClient client, final long rev, + final String context) throws BanyanDBException { if (rev <= 0L) { return; } @@ -478,6 +482,13 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc final String group = metadata.getGroup(); final String name = metadata.name(); log.info("drop 
BanyanDB schema kind={} {}:{}", metadata.getKind(), group, name); + // Tombstone revision of THIS drop's primary resource only — used to decide the + // deletion fence. It must be the primary's own revision, NOT opt.getMaxModRevision(): + // a single opt is reused across many files in a tick, so the cumulative max can carry + // an unrelated earlier create/binding revision and make a tombstone-less delete + // (primary revision 0) skip the AwaitSchemaDeleted fallback. 0 for trace/property, + // whose delete RPCs have no revision-returning variant — those always key-fence. + long primaryDeleteRev = StorageManipulationOpt.DEFAULT_MOD_REVISION; switch (metadata.getKind()) { case MEASURE: // Drop the TopN aggregations first (if any), then index rule bindings, index rules, then the measure. @@ -489,11 +500,13 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc } } dropIndexRuleBindingsBestEffort(client, group, name, opt); - opt.recordModRevision(client.deleteMeasureWithRevision(group, name)); + primaryDeleteRev = client.deleteMeasureWithRevision(group, name); + opt.recordModRevision(primaryDeleteRev); break; case STREAM: dropIndexRuleBindingsBestEffort(client, group, name, opt); - opt.recordModRevision(client.deleteStreamWithRevision(group, name)); + primaryDeleteRev = client.deleteStreamWithRevision(group, name); + opt.recordModRevision(primaryDeleteRev); break; case TRACE: dropIndexRuleBindingsBestEffort(client, group, name, opt); @@ -507,9 +520,9 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc "dropTable unsupported kind=" + metadata.getKind() + " for model " + model.getName()); } // Fence: prefer the revision-based wait when the server recorded a tombstone - // mod_revision; otherwise fall back to AwaitSchemaDeleted keyed on the - // primary resource so callers get a hard "removed everywhere" signal. 
- fenceOnRevisionOrDeletion(client, opt, metadata, "dropTable:" + model.getName()); + // mod_revision for THIS resource; otherwise fall back to AwaitSchemaDeleted keyed on + // the primary resource so callers get a hard "removed everywhere" signal. + fenceOnRevisionOrDeletion(client, metadata, primaryDeleteRev, "dropTable:" + model.getName()); } catch (BanyanDBException ex) { if (Status.Code.NOT_FOUND.equals(ex.getStatus())) { log.info("BanyanDB schema {} already absent on drop (idempotent)", model.getName()); @@ -520,24 +533,32 @@ public void dropTable(Model model, StorageManipulationOpt opt) throws StorageExc } /** - * Prefer {@code AwaitRevisionApplied(maxRev)} when the registry returned a - * non-zero tombstone revision; otherwise fall back to + * Prefer {@code AwaitRevisionApplied(primaryDeleteRev)} when the registry returned a + * non-zero tombstone revision for the primary resource; otherwise fall back to * {@code AwaitSchemaDeleted(key)} keyed on the primary resource. The fallback * exists because {@code mod_revision == 0} on a delete response means the server * did not record a tombstone — the revision-based fence cannot observe a * deletion that didn't get one. + * + *
<p>
The decision keys on {@code primaryDeleteRev} — the primary resource's own delete + * revision — NOT {@code opt.getMaxModRevision()}. A single opt is shared across every file + * in a reconciler tick, so its cumulative max can hold an unrelated earlier create/binding + * revision; using it here would make a tombstone-less primary delete take the revision + * branch and silently skip {@code AwaitSchemaDeleted}. Because the primary delete is issued + * last (after TopN + bindings), its revision is the highest of this drop and fencing on it + * also covers the earlier lower-revision deletes of the same drop. */ - private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final StorageManipulationOpt opt, + private void fenceOnRevisionOrDeletion(final BanyanDBClient client, final MetadataRegistry.SchemaMetadata metadata, + final long primaryDeleteRev, final String context) throws BanyanDBException { - final long rev = opt.getMaxModRevision(); - if (rev > 0L) { + if (primaryDeleteRev > 0L) { // Drops fence inline (never deferred): a deletion's visibility is per-key and must // not ride a batched revision flush — drops stay correct even under a deferFence opt. - doFenceOnRevision(client, opt, context); + doFenceOnRevisionValue(client, primaryDeleteRev, context); return; } - // mod_revision was 0 on every delete — fall back to key-based deletion fence. + // mod_revision was 0 on the primary delete — fall back to key-based deletion fence. final String kind; switch (metadata.getKind()) { case MEASURE: From c2ea08b66854eb87f0f273435fc981718fd645ae Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 19:14:14 +0800 Subject: [PATCH 12/19] Runtime-rule apply status: async 3-min schema fence + review-finding fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses the review findings on the apply-status / batched-fence work and adds the configurable async schema fence the design called for. 
Async configurable schema fence (operator decision: "3 min, generous progress"): - New receiver-runtime-rule.deferredFenceTimeoutSeconds (default 180), carried onto StorageManipulationOpt (fenceTimeoutMs; 0 = installer's short 2s default). The REST operator apply runs the deferred fence in the BACKGROUND after the durable commit + peer resume (fenceRunByCaller + a daemon fence executor), so a slow cluster never blocks the apply or holds peers suspended. The reconciler tick keeps the short inline 2s fence; inline/static/delete fences are unchanged. - POST /addOrUpdate returns its applyId immediately at ROLLING_OUT; the background fence drives FENCING -> APPLIED, or DEGRADED with the lagging data-node ids (fenceLaggards) surfaced on the status + proto (new repeated fence_laggards) + JSON. The fence-phase listener on the opt lets the installer emit FENCING the instant it starts blocking. Phase machine (H1/#3): wire transition(DDL) before the apply and transition(ROLLING_OUT) after persist; re-add FENCING (now a real, observable long wait) and trim the never-emitted VALIDATING from the enum/proto/toProtoPhase so the contract matches the code. Content-hash fallback (H2/#4): queryApplyStatus degrades to the durable rule row on coordinator-miss / main-unreachable / found=false — an ACTIVE row whose hash matches reports APPLIED (derivedFrom=durable-dao), persist-is-commit. Notify (#1/#2/M2): main-side NotifyApplied is fire-and-forget off the REST thread; the commit_deferred (durable) path also notifies; peer-side reconcile nudges coalesce a burst to one queued tick. Lows: applyId in the structural_applied envelope (#5); servedBy parity on the local-path JSON (L1); defensive stripApplyPhasePrefix maps unknown -> UNKNOWN (NC3); provider shutdown stops the notify + fence executors (L3). Tests: StorageManipulationOptTest one-shot/reset cases; RuntimeRuleRestHandlerTest + GuardrailIntegrationTest track the structural apply's 3-arg (fence-opt) overload. 
Verified: server-core 306/306, runtime-rule 139/139, checkstyle + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 6 +- .../setup/backend/admin-api/runtime-rule.md | 1 + .../runtimerule/apply/MalFileApplier.java | 17 +- .../RuntimeRuleClusterServiceImpl.java | 60 +++-- .../module/RuntimeRuleModuleConfig.java | 12 + .../module/RuntimeRuleModuleProvider.java | 3 +- .../runtimerule/reconcile/DSLManager.java | 10 +- .../runtimerule/rest/RuntimeRuleService.java | 251 ++++++++++++++++-- .../runtimerule/status/ApplyPhase.java | 21 +- .../runtimerule/status/ApplyStatus.java | 31 ++- .../status/SchemaApplyCoordinator.java | 15 +- .../src/main/proto/runtime-rule-cluster.proto | 12 +- .../rest/GuardrailIntegrationTest.java | 24 +- .../rest/RuntimeRuleRestHandlerTest.java | 12 +- .../status/SchemaApplyCoordinatorTest.java | 4 +- .../storage/model/StorageManipulationOpt.java | 82 ++++++ .../src/main/resources/application.yml | 5 + .../banyandb/BanyanDBIndexInstaller.java | 38 ++- 18 files changed, 530 insertions(+), 74 deletions(-) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index fc94048aaa2e..b33afa82e74e 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,9 +242,9 @@ admin-host only" entry above for the public REST retirement. #### OAP Server -* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. 
The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. Peer / `withoutSchemaChange` applies are unaffected (no fence). -* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → validating → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-fence-unconfirmed apply and `failed` carrying the specific reason). A new `GET /runtime/rule/status` REST endpoint (and a `GetApplyStatus` admin-internal gRPC served by the main) returns it — query by the apply's `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. -* Push runtime-rule convergence to peers on commit. After a successful structural apply the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile off the gRPC thread; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. +* Batch the BanyanDB schema fence per runtime-rule apply. 
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background after the durable commit (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). +* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → rolling-out → fencing → applied, with `degraded` for a committed-but-unconfirmed apply — local commit-tail threw, or the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` — and `failed` carrying the specific reason). 
The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) in the **background** after the durable commit + peer resume, so a slow cluster never blocks the apply or holds peers suspended: the `POST /addOrUpdate` returns its `applyId` immediately at `rolling-out`, and the operator polls `GET /runtime/rule/status` to watch `fencing → applied` (or `degraded`). A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (persist-is-commit). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. +* Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. 
The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. * Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. 
* Fix the v2 MAL Elvis operator `?:` to honor Groovy-falsy semantics. It compiled to `Optional.ofNullable(primary).orElse(fallback)`, applying the fallback only when the primary is `null`, so an empty-string primary kept `""` instead — e.g. a BanyanDB liaison `ServiceInstance` stored `node_type=""` rather than `n/a`, because `.sum([...,'node_type'])` fills an absent group-by label with `""`. The fallback now applies for falsy primaries such as null, false, numeric zero, and empty strings/containers. diff --git a/docs/en/setup/backend/admin-api/runtime-rule.md b/docs/en/setup/backend/admin-api/runtime-rule.md index 4f056fb543f1..a42f54ff33e7 100644 --- a/docs/en/setup/backend/admin-api/runtime-rule.md +++ b/docs/en/setup/backend/admin-api/runtime-rule.md @@ -111,6 +111,7 @@ server returns `400 compile_failed`. | GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to true; `false` omits each YAML body. Each item includes whether an operator override exists. | | GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. | | GET | `/runtime/rule/dump[/]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. | +| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. 
Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `ROLLING_OUT`, `FENCING`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. The `/addOrUpdate` response returns at `ROLLING_OUT` (durable + peers resumed) with the `applyId`; the schema fence then confirms cluster-wide propagation in the background, advancing to `FENCING` then `APPLIED`, or `DEGRADED` with `fenceLaggards` (the data-node ids that did not confirm within `deferredFenceTimeoutSeconds`). When the live status is gone (or the main is unreachable) it degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. | ### `/delete` storage semantics — per backend diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java index d835f61170ec..17cf7a5560ef 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplier.java @@ -174,11 +174,18 @@ public Applied apply(final String yamlContent, final String sourceName, // whole file waits ONCE instead of one fence per metric/downsampling. A fence timeout is // a non-fatal WARN inside the closure; only a barrier transport error throws, which // aborts this apply exactly as an inline per-resource fence would have. 
- try { - storageOpt.runDeferredFence(); - } catch (final StorageException e) { - layerRegistry.rollback(appliedClaims); - throw new ApplyException("schema fence failed for " + sourceName, e, metricNames); + // + // EXCEPTION: when fenceRunByCaller is set (the runtime-rule REST apply), the orchestrator + // runs the fence itself AFTER the durable commit + peer resume, on a background thread, so + // a long (3-min) cluster-propagation wait neither blocks the apply nor holds peers + // suspended. We only fire the DDL here and leave the closure for the caller to run. + if (!storageOpt.isFenceRunByCaller()) { + try { + storageOpt.runDeferredFence(); + } catch (final StorageException e) { + layerRegistry.rollback(appliedClaims); + throw new ApplyException("schema fence failed for " + sourceName, e, metricNames); + } } return new Applied(rule, convert, metricNames, ruleLoader, appliedClaims); } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java index 706f6b3ed550..de155f051d35 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -25,6 +25,7 @@ import java.util.Objects; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; +import java.util.concurrent.atomic.AtomicBoolean; import lombok.Setter; import lombok.extern.slf4j.Slf4j; import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; @@ -116,6 +117,13 @@ public class RuntimeRuleClusterServiceImpl return t; }); + /** Coalesces a burst of 
NotifyApplied into a single queued reconcile. {@code dslManager.tick()} + * is a full reconcile over every rule file, so when a multi-file apply (or several applies) + * fires many notifies, the first queued tick already converges all of them — the rest would be + * redundant full {@code dao.getAll()} scans. Set on schedule, cleared at the START of the task + * so a notify arriving while a tick runs still queues exactly one follow-up (no lost update). */ + private final AtomicBoolean tickPending = new AtomicBoolean(false); + public RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String selfNodeId) { this.dslManager = dslManager; this.selfNodeId = selfNodeId; @@ -137,26 +145,41 @@ public void notifyApplied(final NotifyAppliedRequest request, responseObserver.onCompleted(); return; } - boolean accepted = false; - try { - reconcileNudgeExecutor.submit(() -> { - try { - dslManager.tick(); - } catch (final Throwable t) { - log.warn("runtime-rule NotifyApplied reconcile for {}/{} failed; peer will " - + "self-converge on its next tick: {}", - request.getCatalog(), request.getName(), t.getMessage()); - } - }); - accepted = true; - } catch (final Throwable t) { - log.warn("runtime-rule NotifyApplied could not schedule reconcile for {}/{}: {}", - request.getCatalog(), request.getName(), t.getMessage()); + boolean accepted = true; + // Not final: assigned in both the try and catch arms of the schedule attempt below. + String detail; + if (tickPending.compareAndSet(false, true)) { + try { + reconcileNudgeExecutor.submit(() -> { + // Clear before running so a notify that arrives during this tick queues exactly + // one follow-up rather than being dropped. 
+ tickPending.set(false); + try { + dslManager.tick(); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied reconcile for {}/{} failed; peer will " + + "self-converge on its next tick: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + }); + detail = "reconcile scheduled"; + } catch (final Throwable t) { + // Submit rejected (executor shut down). Release the flag so the next notify retries. + tickPending.set(false); + accepted = false; + detail = "schedule failed; self-converge on next tick"; + log.warn("runtime-rule NotifyApplied could not schedule reconcile for {}/{}: {}", + request.getCatalog(), request.getName(), t.getMessage()); + } + } else { + // A reconcile is already queued; this notify is covered by it. tick() is a full + // cluster-wide reconcile, so no per-file work is lost by coalescing. + detail = "coalesced into pending reconcile"; } responseObserver.onNext(NotifyAppliedAck.newBuilder() .setNodeId(selfNodeId) .setAccepted(accepted) - .setDetail(accepted ? "reconcile scheduled" : "schedule failed; self-converge on next tick") + .setDetail(detail) .build()); responseObserver.onCompleted(); } @@ -191,6 +214,9 @@ public void getApplyStatus(final ApplyStatusRequest request, .setFailureReason(status.getFailureReason() == null ? 
"" : status.getFailureReason()) .setStartedAtMs(status.getStartedAtMs()) .setUpdatedAtMs(status.getUpdatedAtMs()); + if (status.getFenceLaggards() != null && !status.getFenceLaggards().isEmpty()) { + resp.addAllFenceLaggards(status.getFenceLaggards()); + } } responseObserver.onNext(resp.build()); responseObserver.onCompleted(); @@ -200,8 +226,6 @@ private static ApplyStatusPhase toProtoPhase(final ApplyPhase phase) { switch (phase) { case PENDING: return ApplyStatusPhase.APPLY_PHASE_PENDING; - case VALIDATING: - return ApplyStatusPhase.APPLY_PHASE_VALIDATING; case DDL: return ApplyStatusPhase.APPLY_PHASE_DDL; case FENCING: diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java index 58eec1fc78f9..4c4394105e1d 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleConfig.java @@ -40,4 +40,16 @@ public class RuntimeRuleModuleConfig extends ModuleConfig { * refresh + storage replica lag + RPC jitter. Default 60 s is conservative. */ private long selfHealThresholdSeconds = 60; + /** + * Timeout (seconds) for the runtime-rule deferred/batched BanyanDB schema fence — the wait + * for every data node to apply a newly created/updated measure's schema revision. Applies + * ONLY to the operator REST apply path, where the fence runs in the background after the + * durable commit + peer resume (so it never blocks the apply or holds peers suspended), and + * the operator polls {@code GET /runtime/rule/status} to watch {@code FENCING → APPLIED} or + * {@code DEGRADED} (with the lagging node ids). 
The inline / static / delete fences keep their + * short 2 s constant. Default 180 s is the multi-node cluster-propagation budget for a large + * reshape; on timeout the apply is reported {@code DEGRADED} (durable + forward-progressing, + * never reverted) with the laggard node list. + */ + private long deferredFenceTimeoutSeconds = 180; } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java index b5e307b5b6e1..6c50beab2a97 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/module/RuntimeRuleModuleProvider.java @@ -277,7 +277,8 @@ public void start() throws ServiceNotProvidedException, ModuleStartException { // RuleEngineRegistry from the per-DSL state maps it owns. 
dslManager = new DSLManager( getManager(), - moduleConfig.getSelfHealThresholdSeconds() * 1000L + moduleConfig.getSelfHealThresholdSeconds() * 1000L, + moduleConfig.getDeferredFenceTimeoutSeconds() * 1000L ); // Cluster-facing Suspend client: fans out to every non-self peer on the OAP cluster bus diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java index 910cfb838dad..8266668cbca0 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java @@ -150,6 +150,12 @@ public final class DSLManager { @Getter private final long selfHealThresholdMs; + /** Timeout for the runtime-rule deferred/batched BanyanDB schema fence on the operator REST + * apply path (default 3 min). Carried onto the deferred-fence opt; see + * {@link #newDeferredFenceOpt()} and {@code RuntimeRuleModuleConfig.deferredFenceTimeoutSeconds}. */ + @Getter + private final long deferredFenceTimeoutMs; + /** Lock-observability wrapper. Owned by the DSLManager; the REST handler borrows via * {@link #getLockMetrics()} so every lock acquire path reports to the same histograms. 
*/ @Getter @@ -183,12 +189,14 @@ public final class DSLManager { private final RuleEngineRegistry engineRegistry; public DSLManager(final ModuleManager moduleManager, - final long selfHealThresholdMs) { + final long selfHealThresholdMs, + final long deferredFenceTimeoutMs) { this.moduleManager = Objects.requireNonNull(moduleManager, "moduleManager"); this.engineRegistry = new RuleEngineRegistry(); this.engineRegistry.register(new MalRuleEngine(this.rules, this.moduleManager)); this.engineRegistry.register(new LalRuleEngine(this.rules, this.moduleManager)); this.selfHealThresholdMs = selfHealThresholdMs; + this.deferredFenceTimeoutMs = deferredFenceTimeoutMs; this.lockMetrics = new LockMetrics(moduleManager); this.suspendCoord = new SuspendResumeCoordinator( diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 846d1f44ffa4..7d1e821111a8 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -42,6 +42,9 @@ import java.util.Objects; import java.util.Optional; import java.util.Set; +import java.time.Duration; +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; import java.util.concurrent.locks.ReentrantLock; import java.util.regex.Pattern; import java.util.zip.GZIPOutputStream; @@ -57,6 +60,7 @@ import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; +import 
org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; @@ -183,6 +187,34 @@ private Set validCatalogs() { */ private volatile AdminClusterChannelManager peerChannelManager; + /** + * Off-REST-thread runner for the best-effort peer {@code NotifyApplied} fan-out. The fan-out is + * sequential with a per-peer deadline, so an unreachable peer would otherwise add + * {@code peerCount × deadline} to the operator's apply latency — the notify is a convergence + * optimization (peers self-converge on their next tick if it's lost), so it must never hold the + * HTTP response. Single daemon thread: applies are serialized per file upstream and the fan-out + * is idempotent, so ordering across applies doesn't matter; daemon so it never blocks JVM + * shutdown. Shut down by {@link #shutdown()}. + */ + private final ExecutorService notifyExecutor = Executors.newSingleThreadExecutor(r -> { + final Thread t = new Thread(r, "runtime-rule-notify-broadcast"); + t.setDaemon(true); + return t; + }); + + /** + * Runs the runtime-rule deferred schema fence in the background after a structural apply's + * durable commit (see {@link #scheduleBackgroundFence}). A small fixed daemon pool so several + * concurrent applies (different files) can each confirm cluster-wide propagation without one + * file's up-to-3-min fence blocking another's status; daemon so it never blocks JVM shutdown. + * Shut down by {@link #shutdown()}. 
+ */ + private final ExecutorService fenceExecutor = Executors.newFixedThreadPool(4, r -> { + final Thread t = new Thread(r, "runtime-rule-bg-fence"); + t.setDaemon(true); + return t; + }); + public RuntimeRuleService(final ModuleManager moduleManager, final DSLManager dslManager, final RuntimeRuleClusterClient clusterClient, @@ -291,18 +323,81 @@ public HttpResponse queryApplyStatus(final String catalog, final String name, ? SchemaApplyCoordinator.INSTANCE.getLatestByFile( catalog, name, contentHash.isEmpty() ? null : contentHash) : SchemaApplyCoordinator.INSTANCE.get(applyId); - return applyStatusJson(local); + if (local != null) { + return applyStatusJson(local); + } + // Live status gone (apply-id evicted / never began on this main). Fall back to the + // durable rule row for a content-derived answer; only UNKNOWN if even that is absent. + final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + return fromDao != null ? fromDao : applyStatusJson(null); } final ApplyStatusResponse remote = clusterClient.getApplyStatus(ApplyStatusRequest.newBuilder() .setApplyId(applyId).setCatalog(catalog).setName(name).setContentHash(contentHash).build()); if (remote == null) { + // Main unreachable, or a mixed-version main answered UNIMPLEMENTED. Degrade to the + // durable rule row (shared storage) before giving up with a transport error. + final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + if (fromDao != null) { + return fromDao; + } return HttpResponse.of(HttpStatus.BAD_GATEWAY, MediaType.JSON_UTF_8, jsonBody("status_unavailable", catalog, name, "could not reach the cluster main for apply status; retry shortly")); } + if (!remote.getFound()) { + // The main reached us but no longer holds a live status (evicted / restarted). The + // durable row is the same truth the main would derive — read it locally. 
+ final HttpResponse fromDao = applyStatusFromDao(catalog, name, contentHash); + if (fromDao != null) { + return fromDao; + } + } return applyStatusJsonFromProto(remote); } + /** + * Content-derived apply status from the durable rule row, used when the live coordinator status + * is gone (apply-id evicted, main restarted, or the main is unreachable). The persist-is-commit + * invariant means an {@code ACTIVE} row IS the durable record that the apply of its content + * committed, so a matching {@code ACTIVE} row reports {@link ApplyPhase#APPLIED} (flagged + * {@code derivedFrom=durable-dao}). A hash mismatch means the queried content is not the current + * applied content; an {@code INACTIVE} row means it was paused. Returns {@code null} when the DAO + * is unresolvable or the read fails, so the caller keeps its own unavailable/unknown response. + */ + private HttpResponse applyStatusFromDao(final String catalog, final String name, final String contentHash) { + final RuntimeRuleManagementDAO.RuntimeRuleFile row; + try { + row = currentRuleFile(catalog, name); + } catch (final IOException e) { + log.warn("apply-status DAO fallback read failed for {}/{}; reporting unknown", catalog, name, e); + return null; + } + if (row == null) { + return null; + } + final String rowHash = ContentHash.sha256Hex(row.getContent()); + final boolean active = !RuntimeRule.STATUS_INACTIVE.equals(row.getStatus()); + final boolean hashMatches = contentHash.isEmpty() || contentHash.equals(rowHash); + final JsonObject o = new JsonObject(); + o.addProperty("catalog", catalog); + o.addProperty("name", name); + o.addProperty("contentHash", rowHash); + o.addProperty("derivedFrom", "durable-dao"); + if (active && hashMatches) { + o.addProperty("found", true); + o.addProperty("phase", ApplyPhase.APPLIED.name()); + o.addProperty("note", "live apply status unavailable; derived from the durable rule row " + + "(persist-is-commit: an ACTIVE row means this content's apply committed)"); + } else { + 
o.addProperty("found", false); + o.addProperty("phase", ApplyPhase.UNKNOWN.name()); + o.addProperty("note", active + ? "the queried content is not the currently applied content (hash mismatch)" + : "the rule row is INACTIVE (paused)"); + } + return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); + } + private HttpResponse applyStatusJson(final ApplyStatus s) { final JsonObject o = new JsonObject(); if (s == null) { @@ -320,7 +415,15 @@ private HttpResponse applyStatusJson(final ApplyStatus s) { } o.addProperty("startedAtMs", s.getStartedAtMs()); o.addProperty("updatedAtMs", s.getUpdatedAtMs()); + if (!s.getFenceLaggards().isEmpty()) { + final JsonArray laggards = new JsonArray(); + s.getFenceLaggards().forEach(laggards::add); + o.add("fenceLaggards", laggards); + } } + // Schema parity with the routed-from-main path (applyStatusJsonFromProto), which always + // carries servedBy: this node answered locally (self is main / single process). + o.addProperty("servedBy", "self"); return HttpResponse.of(HttpStatus.OK, MediaType.JSON_UTF_8, GSON.toJson(o)); } @@ -338,6 +441,11 @@ private HttpResponse applyStatusJsonFromProto(final ApplyStatusResponse r) { } o.addProperty("startedAtMs", r.getStartedAtMs()); o.addProperty("updatedAtMs", r.getUpdatedAtMs()); + if (r.getFenceLaggardsCount() > 0) { + final JsonArray laggards = new JsonArray(); + r.getFenceLaggardsList().forEach(laggards::add); + o.add("fenceLaggards", laggards); + } } else { o.addProperty("phase", ApplyPhase.UNKNOWN.name()); } @@ -346,10 +454,19 @@ private HttpResponse applyStatusJsonFromProto(final ApplyStatusResponse r) { } /** Strip the proto {@code APPLY_PHASE_} prefix so the routed-from-main JSON phase matches the - * local-path {@link ApplyPhase} names (e.g. {@code APPLY_PHASE_APPLIED} → {@code APPLIED}). */ + * local-path {@link ApplyPhase} names (e.g. {@code APPLY_PHASE_APPLIED} → {@code APPLIED}). 
+ * Any value that isn't a live {@link ApplyPhase} — the proto default {@code UNSPECIFIED}, a + * reserved/removed phase, or a value from a newer-version peer — maps to {@code UNKNOWN} so the + * JSON phase is always a name the client's enum recognizes. */ private static String stripApplyPhasePrefix(final String protoName) { final String prefix = "APPLY_PHASE_"; - return protoName.startsWith(prefix) ? protoName.substring(prefix.length()) : protoName; + final String stripped = protoName.startsWith(prefix) ? protoName.substring(prefix.length()) : protoName; + for (final ApplyPhase p : ApplyPhase.values()) { + if (p.name().equals(stripped)) { + return stripped; + } + } + return ApplyPhase.UNKNOWN.name(); } private AdminClusterChannelManager resolvePeerChannelManager() { @@ -1094,8 +1211,20 @@ private HttpResponse applyStructural(final String catalog, final String name, final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); final DSLRuntimeState postApply; + // Build the deferred-fence opt ourselves so WE own the post-DDL schema fence: it must run + // in the BACKGROUND after the durable commit + peer resume (a 3-min cluster-propagation + // wait must not block this response or hold peers suspended). fenceRunByCaller tells the + // installer to register the fence closure but NOT run it inline; the fence-phase listener + // lets the background run emit the observable FENCING phase the instant it starts blocking. 
+ final StorageManipulationOpt fenceOpt = StorageManipulationOpt.withSchemaChangeDeferredFence( + Duration.ofMillis(dslManager.getDeferredFenceTimeoutMs())); + fenceOpt.setFenceRunByCaller(true); + fenceOpt.setFencePhaseListener(() -> SchemaApplyCoordinator.INSTANCE.markFencing(applyId)); + // The apply call compiles, verifies, and fires the schema-change DDL (the fence is deferred + // to us); mark DDL before it so an in-flight query sees progress past PENDING. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.DDL); try { - postApply = dslManager.applyNowForRuleFile(ruleFile, true); + postApply = dslManager.applyNowForRuleFile(ruleFile, true, fenceOpt); } catch (final LayerConflictException lce) { // Runtime-DSL layer-declaration conflict — operator-actionable, not a server // error. Resume local + peers immediately so the cluster keeps serving the @@ -1196,7 +1325,11 @@ private HttpResponse applyStructural(final String catalog, final String name, return persistError; } - // Persist succeeded — drain the pending commit now that the DB reflects the new + // Persist succeeded — the durable commit point is crossed. Past here we drain the local + // commit and push peers to converge, so the apply is rolling out across the cluster. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); + + // Drain the pending commit now that the DB reflects the new // content. commitCoord.finalizeCommit drops removedMetrics, swaps the applied // pointers, retires the old loader, fires alarm reset, and advances the snapshot. // @@ -1219,6 +1352,10 @@ private HttpResponse applyStructural(final String catalog, final String name, if (commitFailure != null) { // Durable (DB persisted) but this node's commit-tail threw — peers converge from DB, // this node retries on the next tick. Committed-but-unconfirmed = DEGRADED, not FAILED. 
+ // The DB row IS durable, so push peers to reconcile against it now (async, best-effort) + // rather than leaving them on the ~30s tick while this node is degraded — exactly the + // case where fast peer convergence matters most. + broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, "commit-tail deferred: DB persisted, local backend may be stale until next tick"); return serverError("commit_deferred", catalog, name, @@ -1235,16 +1372,65 @@ private HttpResponse applyStructural(final String catalog, final String name, // recover within an RPC round-trip. if (!drained) { broadcastResume(catalog, name, "force_no_change"); - } else { - // Real change committed and durable — push peers to converge NOW (reconcile against - // the just-persisted DB row) instead of waiting up to one ~30s refresh tick. - // Best-effort; peers self-converge on their own tick if the notify is lost. - broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); - } + // No commit drained ⇒ no schema change to confirm — terminal immediately. + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); + return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, + "structural apply succeeded (no change)" + describeDelta(delta)); + } + // Real change committed and durable — push peers to converge NOW (reconcile against + // the just-persisted DB row) instead of waiting up to one ~30s refresh tick. + // Best-effort; peers self-converge on their own tick if the notify is lost. + broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); + // Confirm cluster-wide schema propagation in the BACKGROUND (up to the configured fence + // timeout, default 3 min): runs the deferred fence we owned, driving FENCING → APPLIED, or + // DEGRADED + the laggard node list on timeout. 
The response returns now at ROLLING_OUT with + // the applyId; the operator polls GET /runtime/rule/status to watch the rest. + scheduleBackgroundFence(applyId, fenceOpt, catalog, name); + return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, + "structural apply committed; confirming cluster-wide schema propagation in the " + + "background — poll /runtime/rule/status?applyId=" + applyId + + describeDelta(delta)); + } - SchemaApplyCoordinator.INSTANCE.markApplied(applyId); - return ok(HttpStatus.OK, "structural_applied", catalog, name, - "structural apply succeeded" + describeDelta(delta)); + /** + * Run the runtime-rule deferred schema fence on a background thread after the durable commit, + * driving the apply's terminal status. The fence emits {@link ApplyPhase#FENCING} the instant it + * starts blocking (via the opt's listener), waits up to the configured timeout for every data + * node to apply the new schema revision, then marks {@link ApplyPhase#APPLIED} on confirmation or + * {@link ApplyPhase#DEGRADED} with the laggard node ids on timeout/transport error. Background so + * a long (3-min) wait never blocks the apply response or holds peers suspended — the commit is + * already durable and peers already resumed. 
+ */ + private void scheduleBackgroundFence(final String applyId, final StorageManipulationOpt fenceOpt, + final String catalog, final String name) { + try { + fenceExecutor.submit(() -> { + try { + fenceOpt.runDeferredFence(); + final StorageManipulationOpt.FenceOutcome outcome = fenceOpt.getFenceOutcome(); + if (outcome == null || outcome.isApplied()) { + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); + } else { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence did not confirm cluster-wide propagation within " + + dslManager.getDeferredFenceTimeoutMs() + " ms", + outcome.getLaggardNodeIds()); + } + } catch (final StorageException e) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence transport error: " + e.getMessage(), List.of()); + } catch (final Throwable t) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence failed: " + t.getMessage(), List.of()); + } + }); + } catch (final Throwable t) { + // Executor rejected (shutting down). The commit IS durable, so report APPLIED and skip + // the (best-effort) propagation confirmation rather than leaving the apply at ROLLING_OUT. + log.warn("runtime-rule could not schedule the background schema fence for {}/{}; " + + "reporting applied (commit durable, cluster propagation unconfirmed)", catalog, name, t); + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); + } } /** @@ -1418,14 +1604,32 @@ private void broadcastNotifyApplied(final String catalog, final String name, fin if (clusterClient == null) { return; } + // Fire-and-forget: never block the REST response on a sequential, per-peer-deadline fan-out. + // A lost notify is harmless — the peer self-converges on its next tick. 
try { - clusterClient.broadcastNotifyApplied(catalog, name, contentHash); + notifyExecutor.submit(() -> { + try { + clusterClient.broadcastNotifyApplied(catalog, name, contentHash); + } catch (final Throwable t) { + log.warn("runtime-rule NotifyApplied broadcast failed for {}/{}; peers will " + + "converge on their next tick", catalog, name, t); + } + }); } catch (final Throwable t) { - log.warn("runtime-rule NotifyApplied broadcast failed for {}/{}; peers will converge " + // Executor rejected (shutting down). Drop it — peers self-converge on their next tick. + log.warn("runtime-rule NotifyApplied could not be scheduled for {}/{}; peers converge " + "on their next tick", catalog, name, t); } } + /** Stop the best-effort background executors (notify fan-out + schema fence). Called from the + * module provider's shutdown so tests and clean OAP stops don't leak threads; daemon status + * already prevents them blocking JVM exit. */ + public void shutdown() { + notifyExecutor.shutdownNow(); + fenceExecutor.shutdownNow(); + } + /** * Inspect Suspend acks for the split-brain guard: if any peer responded with REJECTED * (origin conflict — it believes it is the main and is mid-apply), surface that to the @@ -1903,6 +2107,21 @@ private static HttpResponse ok(final HttpStatus status, final String applyStatus return HttpResponse.of(status, MediaType.JSON_UTF_8, jsonBody(applyStatus, catalog, name, message)); } + /** {@link #ok} variant that also carries the apply's {@code applyId} so the caller can poll + * {@code GET /runtime/rule/status?applyId=…} directly. Used by the structural-apply success + * envelope. */ + private static HttpResponse okWithApplyId(final HttpStatus status, final String applyStatus, + final String catalog, final String name, + final String applyId, final String message) { + final JsonObject body = new JsonObject(); + body.addProperty("applyStatus", applyStatus); + body.addProperty("catalog", catalog == null ? 
"" : catalog); + body.addProperty("name", name == null ? "" : name); + body.addProperty("applyId", applyId == null ? "" : applyId); + body.addProperty("message", message == null ? "" : message); + return HttpResponse.of(status, MediaType.JSON_UTF_8, GSON.toJson(body)); + } + private static HttpResponse badRequest(final String applyStatus, final String catalog, final String name, final String message) { return HttpResponse.of(HttpStatus.BAD_REQUEST, MediaType.JSON_UTF_8, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java index 577186192268..c4e0fca394f5 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java @@ -25,24 +25,31 @@ * failed outright ({@link #FAILED}). * *
<p>
Normal progression: - * {@link #PENDING} → {@link #VALIDATING} → {@link #DDL} → {@link #FENCING} → {@link #ROLLING_OUT} - * → {@link #APPLIED}. Two off-ramps: + * {@link #PENDING} → {@link #DDL} → {@link #ROLLING_OUT} → {@link #FENCING} → {@link #APPLIED}. + * These are the phases the main observes from its apply orchestration: {@code PENDING} once the + * apply is accepted, {@code DDL} while the compile/verify/schema-change call runs (a single opaque + * step from the orchestrator's vantage — sub-steps such as validation are not separately + * observable, so they are not modelled), {@code ROLLING_OUT} once the rule row is durably persisted + * and the commit is draining + peers are being resumed/notified, then {@code FENCING} while the + * main waits (in the background, on a generous timeout) for every BanyanDB data node to apply the + * new schema revision. The HTTP response returns at {@code ROLLING_OUT} with the {@code applyId}; + * the operator polls to watch {@code FENCING → APPLIED} (or {@code DEGRADED}). Two off-ramps: *
<ul>
 *   <li>{@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). The change
 *       was rolled back; nothing was committed.</li>
- *   <li>{@link #DEGRADED} — committed and durable, but the post-DDL fence did not confirm
- *       cluster-wide propagation within budget (a node is lagging). Forward-progress: BanyanDB
- *       keeps converging; this is NOT a revert.</li>
+ *   <li>{@link #DEGRADED} — committed and durable, but the local commit-tail threw or the post-DDL
+ *       fence did not confirm cluster-wide propagation within the timeout (one or more data nodes
+ *       lagging — exposed as {@code fenceLaggards}). Forward-progress: peers converge from the
+ *       durable row and BanyanDB keeps converging; this is NOT a revert.</li>
 * </ul>
* {@link #UNKNOWN} is returned for an apply-id the main no longer holds (evicted / main restarted); * callers fall back to a content-hash comparison. */ public enum ApplyPhase { PENDING, - VALIDATING, DDL, - FENCING, ROLLING_OUT, + FENCING, APPLIED, DEGRADED, FAILED, diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java index 15bdad441366..57dbb35ea175 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyStatus.java @@ -18,6 +18,8 @@ package org.apache.skywalking.oap.server.receiver.runtimerule.status; +import java.util.Collections; +import java.util.List; import lombok.Getter; /** @@ -41,14 +43,18 @@ public final class ApplyStatus { private final String contentHash; private final ApplyPhase phase; /** Non-null only for {@link ApplyPhase#FAILED} (pre-commit error) and {@link ApplyPhase#DEGRADED} - * (committed but a node lagging at fence timeout). Null otherwise. */ + * (committed but a node lagging at fence timeout, or local commit-tail threw). Null otherwise. */ private final String failureReason; private final long startedAtMs; private final long updatedAtMs; + /** Data-node ids that had not confirmed the new schema revision when the fence timed out. + * Non-empty only for {@link ApplyPhase#DEGRADED} caused by fence non-confirmation; always a + * non-null (possibly empty) immutable list. 
*/ + private final List<String> fenceLaggards; public ApplyStatus(final String applyId, final String catalog, final String name, final String contentHash, final ApplyPhase phase, final String failureReason, - final long startedAtMs, final long updatedAtMs) { + final long startedAtMs, final long updatedAtMs, final List<String> fenceLaggards) { this.applyId = applyId; this.catalog = catalog; this.name = name; @@ -57,16 +63,29 @@ public ApplyStatus(final String applyId, final String catalog, final String name this.failureReason = failureReason; this.startedAtMs = startedAtMs; this.updatedAtMs = updatedAtMs; + this.fenceLaggards = fenceLaggards == null + ? Collections.emptyList() + : Collections.unmodifiableList(fenceLaggards); } - /** A copy advanced to {@code newPhase}, clearing any prior failure reason (forward progress). */ + /** A copy advanced to {@code newPhase}, clearing any prior failure reason + laggards (forward + * progress). */ public ApplyStatus withPhase(final ApplyPhase newPhase, final long nowMs) { - return new ApplyStatus(applyId, catalog, name, contentHash, newPhase, null, startedAtMs, nowMs); + return new ApplyStatus(applyId, catalog, name, contentHash, newPhase, null, startedAtMs, nowMs, + Collections.emptyList()); } /** A copy moved to a non-success terminal ({@link ApplyPhase#FAILED} / {@link ApplyPhase#DEGRADED}) - * carrying {@code reason}. */ + * carrying {@code reason} and no laggards. */ public ApplyStatus withFailure(final ApplyPhase terminalPhase, final String reason, final long nowMs) { - return new ApplyStatus(applyId, catalog, name, contentHash, terminalPhase, reason, startedAtMs, nowMs); + return withFailure(terminalPhase, reason, Collections.emptyList(), nowMs); + } + + /** A copy moved to a non-success terminal carrying {@code reason} and the {@code laggards} that + * caused a fence-non-confirmation {@link ApplyPhase#DEGRADED}. 
*/ + public ApplyStatus withFailure(final ApplyPhase terminalPhase, final String reason, + final List<String> laggards, final long nowMs) { + return new ApplyStatus(applyId, catalog, name, contentHash, terminalPhase, reason, startedAtMs, nowMs, + laggards); } } diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java index 2ecf09d497ec..c1fffe185f60 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinator.java @@ -19,6 +19,7 @@ package org.apache.skywalking.oap.server.receiver.runtimerule.status; import java.util.HashSet; +import java.util.List; import java.util.Map; import java.util.Set; import java.util.UUID; @@ -79,7 +80,7 @@ public String begin(final String catalog, final String name, final String conten final String applyId = UUID.randomUUID().toString(); final long now = clock.getAsLong(); byApplyId.put(applyId, new ApplyStatus( - applyId, catalog, name, contentHash, ApplyPhase.PENDING, null, now, now)); + applyId, catalog, name, contentHash, ApplyPhase.PENDING, null, now, now, null)); latestApplyIdByFile.put(fileKey(catalog, name), applyId); if (log.isDebugEnabled()) { log.debug("apply [{}] begin: {}/{} hash={}", applyId, catalog, name, contentHash); @@ -98,12 +99,24 @@ public void markApplied(final String applyId) { transition(applyId, ApplyPhase.APPLIED); } + /** Move an apply to {@link ApplyPhase#FENCING}: the background wait for cluster-wide schema + * propagation is in flight (after the durable commit + peer resume). No-op for an unknown id. 
*/ + public void markFencing(final String applyId) { + transition(applyId, ApplyPhase.FENCING); + } + /** Terminal: committed and durable, but cluster-wide propagation unconfirmed within budget * (a node is lagging). Not a revert — a background re-check may flip it to APPLIED later. */ public void markDegraded(final String applyId, final String reason) { update(applyId, s -> s.withFailure(ApplyPhase.DEGRADED, reason, clock.getAsLong())); } + /** {@link #markDegraded(String, String)} carrying the data-node ids that had not confirmed the + * schema revision at fence timeout, surfaced to the operator on the status. */ + public void markDegraded(final String applyId, final String reason, final List<String> laggards) { + update(applyId, s -> s.withFailure(ApplyPhase.DEGRADED, reason, laggards, clock.getAsLong())); + } + /** Terminal: a pre-commit error (compile / verify / DDL RPC / persist); the change was rolled * back. */ public void markFailed(final String applyId, final String reason) { diff --git a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto index d6d15c5c0e32..79531b0492e7 100644 --- a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto +++ b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto @@ -193,8 +193,15 @@ message ApplyStatusRequest { enum ApplyStatusPhase { APPLY_PHASE_UNSPECIFIED = 0; APPLY_PHASE_PENDING = 1; - APPLY_PHASE_VALIDATING = 2; + // 2 (VALIDATING) was removed: it is a sub-step inside the single compile/verify/schema-change + // call and is not separately observable by the main's orchestration. Reserved to keep the wire + // number stable. + reserved 2; + reserved "APPLY_PHASE_VALIDATING"; APPLY_PHASE_DDL = 3; + // The main waits (in the background, on a generous timeout) for every data node to apply the + // new schema revision. 
Emitted AFTER ROLLING_OUT in the async model — the apply is already + // durable + peers resumed; this only confirms cluster-wide propagation. APPLY_PHASE_FENCING = 4; APPLY_PHASE_ROLLING_OUT = 5; APPLY_PHASE_APPLIED = 6; @@ -221,6 +228,9 @@ message ApplyStatusResponse { int64 updated_at_ms = 9; // Instance id of the node that produced this status (the main). string node_id = 10; + // Data-node ids that had not confirmed the schema revision when the fence timed out. + // Non-empty only on APPLY_PHASE_DEGRADED (fence non-confirmation). + repeated string fence_laggards = 11; } // Sent by the main after a successful structural commit so peers reconcile immediately. diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java index 7eccc122c234..d7a7af0b76ec 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/GuardrailIntegrationTest.java @@ -146,8 +146,9 @@ void setUp() { .thenReturn(mock(StructuralCommitCoordinator.class)); when(dslManager.getDslRuntimeDelete()) .thenReturn(mock(DSLRuntimeDelete.class)); - // Stub both overloads — the REST handler calls the single-arg form on the - // FILTER_ONLY path and the two-arg form (deferCommit=true) on STRUCTURAL. + // Stub all overloads — the REST handler calls the single-arg form on the FILTER_ONLY + // path, and the three-arg form (deferCommit=true + the orchestrator-owned fence opt) on + // STRUCTURAL; the two-arg form is kept stubbed for any legacy caller. 
when(dslManager.applyNowForRuleFile(any())).thenAnswer(inv -> { final Object arg = inv.getArgument(0); if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { @@ -166,6 +167,15 @@ void setUp() { } return null; }); + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean(), any())).thenAnswer(inv -> { + final Object arg = inv.getArgument(0); + if (arg instanceof RuntimeRuleManagementDAO.RuntimeRuleFile) { + final RuntimeRuleManagementDAO.RuntimeRuleFile file = + (RuntimeRuleManagementDAO.RuntimeRuleFile) arg; + return DSLRuntimeState.running(file.getCatalog(), file.getName(), "h", 0L); + } + return null; + }); handler = new RuntimeRuleRestHandler(moduleManager, dslManager, clusterClient, 30_000L); } @@ -199,7 +209,7 @@ void malScopeChangeAcceptedWithAllowStorageChangeTrue() throws Exception { assertHttp(resp, HttpStatus.OK); // STRUCTURAL path uses the two-arg overload (deferCommit=true) so row-persist // failure can cleanly roll back. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -212,7 +222,7 @@ void malScopeChangeAcceptedThroughFixRoute() throws Exception { HttpData.ofUtf8(INSTANCE_YAML)); assertHttp(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -249,7 +259,7 @@ void malAddedMetricNeverTripsGuardrail() throws Exception { assertHttp(resp, HttpStatus.OK); // Non-empty addedMetrics makes this STRUCTURAL (NEW classification on first apply // or STRUCTURAL on update) — goes through the deferred-commit path. 
- verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } // ---- LAL scenarios -------------------------------------------------------------------- @@ -300,7 +310,7 @@ void lalBodyOnlyEditAccepted() throws Exception { assertHttp(resp, HttpStatus.OK); // LAL always routes through the STRUCTURAL path (classifyLal reports STRUCTURAL on // every content change), so the two-arg overload fires. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -315,7 +325,7 @@ void lalRuleAddedAcceptedWithAllowStorageChangeTrue() throws Exception { HttpData.ofUtf8(twoRules)); assertHttp(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } // ---- helpers -------------------------------------------------------------------------- diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java index ac1fd1880b1c..ee0089f0af95 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java @@ -183,10 +183,10 @@ void addOrUpdateBypassesNoChangeOnInactiveRow() throws Exception { // Reactivation pushes through the STRUCTURAL/NEW path — expect 200 with a status // other than no_change. 
We don't assert on the exact applyStatus string here (that - // depends on classifier output); the key assertion is that the two-arg deferred- - // commit form of applyNowForRuleFile ran (STRUCTURAL path signature). + // depends on classifier output); the key assertion is that the three-arg deferred- + // commit + fence-opt form of applyNowForRuleFile ran (STRUCTURAL path signature). assertHttpStatus(resp, HttpStatus.OK); - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -205,8 +205,8 @@ void fixBypassesNoChangeEvenOnByteIdenticalActiveRow() throws Exception { assertHttpStatus(resp, HttpStatus.OK); // /addOrUpdate?force=true with byte-identical content → classifier returns // NO_CHANGE, handler falls through to applyStructural (not applyFilterOnly since - // NO_CHANGE != FILTER_ONLY) which uses the two-arg deferred-commit form. - verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true)); + // NO_CHANGE != FILTER_ONLY) which uses the three-arg deferred-commit + fence-opt form. + verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } @Test @@ -534,6 +534,8 @@ private void whenReconcilerApplySucceeds(final String catalog, final String name // NEW paths use the two-arg form (deferCommit=true). when(dslManager.applyNowForRuleFile(any())).thenReturn(state); when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean())).thenReturn(state); + // STRUCTURAL apply now passes the orchestrator-owned fence opt → the three-arg overload. + when(dslManager.applyNowForRuleFile(any(), Mockito.anyBoolean(), any())).thenReturn(state); // Apply path needs SUSPENDED on localSuspend so the workflow proceeds. The other // subsystem getters were stubbed in setUp() with default mocks; here we just // override localSuspend to return SUSPENDED instead of the default null. 
diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java index 14dfa98ad641..38a16e842401 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/SchemaApplyCoordinatorTest.java @@ -60,10 +60,10 @@ void transitionsAdvancePhaseAndStampUpdatedAt() { clock.set(2_000L); coord.transition(applyId, ApplyPhase.DDL); clock.set(3_000L); - coord.transition(applyId, ApplyPhase.FENCING); + coord.transition(applyId, ApplyPhase.ROLLING_OUT); final ApplyStatus s = coord.get(applyId); - assertEquals(ApplyPhase.FENCING, s.getPhase()); + assertEquals(ApplyPhase.ROLLING_OUT, s.getPhase()); assertEquals(1_000L, s.getStartedAtMs(), "startedAt is fixed at begin"); assertEquals(3_000L, s.getUpdatedAtMs(), "updatedAt advances with each transition"); } diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index 5ef07ff7ba35..f5afaf52873d 100644 --- a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -18,12 +18,14 @@ package org.apache.skywalking.oap.server.core.storage.model; +import java.time.Duration; import java.util.Collections; import java.util.List; import java.util.concurrent.CopyOnWriteArrayList; import java.util.concurrent.atomic.AtomicLong; 
import lombok.Builder; import lombok.Getter; +import lombok.Setter; import org.apache.skywalking.oap.server.core.storage.StorageException; /** @@ -323,6 +325,22 @@ public static StorageManipulationOpt withSchemaChangeDeferredFence() { return opt; } + /** + * {@link #withSchemaChangeDeferredFence()} with an explicit batched-fence timeout. Used by the + * runtime-rule operator apply, which fences on a generous cluster-propagation budget (default + * 3 min, configurable) instead of the installer's short inline default — the apply is async + + * progress-queryable, so a long single wait is affordable. The inline/static/delete fence paths + * (which never set a timeout here) keep the installer's short constant. + * + * @param timeout the batched-fence wait; passed to the backend via {@link #getFenceTimeoutMs()}. + * @return a deferred-fence opt carrying {@code timeout}. + */ + public static StorageManipulationOpt withSchemaChangeDeferredFence(final Duration timeout) { + final StorageManipulationOpt opt = withSchemaChangeDeferredFence(); + opt.fenceTimeoutMs = timeout == null ? 0L : timeout.toMillis(); + return opt; + } + /** * True for {@link Mode#WITH_SCHEMA_CHANGE}. The on-demand operator workflow — drops, * updates, and reshapes are permitted because the caller explicitly asked for them. @@ -419,6 +437,64 @@ public interface DeferredFence { private volatile DeferredFence deferredFence; + /** + * Batched-fence wait in millis, or {@code 0} (the default) meaning "use the backend installer's + * own short constant". Set only by {@link #withSchemaChangeDeferredFence(Duration)} on the + * runtime-rule operator path; the backend reads it when running the deferred fence so the + * inline/static/delete paths (which leave it {@code 0}) keep the short timeout. 
+ */ + @Getter + private long fenceTimeoutMs = 0L; + + /** + * True when the CALLER (the apply orchestrator) runs {@link #runDeferredFence()} itself — + * typically on a background thread after the durable commit — rather than the backend installer + * running it inline at the end of the apply. The runtime-rule REST apply sets this so the long + * (3-min) fence does not block the apply / hold peers suspended; the reconciler tick leaves it + * {@code false} so the installer keeps fencing inline with the short timeout. + */ + @Getter + @Setter + private boolean fenceRunByCaller = false; + + /** + * Notified by the backend the instant before the deferred fence starts blocking, so the apply + * orchestrator can mark a {@code FENCING} progress phase that is observable while the (long) + * wait is in flight. Null on paths that don't observe progress (tick, peer, non-BanyanDB). + */ + @FunctionalInterface + public interface FencePhaseListener { + void onFenceStart(); + } + + /** + * Outcome of a deferred fence, recorded by the backend so the orchestrator can mark + * {@code APPLIED} vs {@code DEGRADED}-with-laggards after {@link #runDeferredFence()} returns. + */ + public static final class FenceOutcome { + @Getter + private final boolean applied; + @Getter + private final List<String> laggardNodeIds; + + public FenceOutcome(final boolean applied, final List<String> laggardNodeIds) { + this.applied = applied; + this.laggardNodeIds = laggardNodeIds == null + ? Collections.emptyList() + : Collections.unmodifiableList(laggardNodeIds); + } + } + + @Getter + @Setter + private volatile FencePhaseListener fencePhaseListener; + + /** Recorded by the backend during {@link #runDeferredFence()}; read by the orchestrator after. + * Null when no deferred fence ran (no DDL) or the backend records no outcome. */ + @Getter + @Setter + private volatile FenceOutcome fenceOutcome; + /** * Register the single fence to run after the batched apply completes. 
Idempotent — the * installer may call it once per resource; the latest (equivalent) closure wins. No-op @@ -440,6 +516,11 @@ public void setDeferredFence(final DeferredFence fence) { * revision — each file fences on its own DDL only. The reset happens in a {@code finally} * so a fence transport failure still isolates the next file. The closure reads * {@link #getMaxModRevision()} during {@code await()}, so it is reset only after. + * + *
<p>
{@link #fenceOutcome} is cleared before the fence runs (so a shared tick opt + * starts each file clean) and the backend sets it during the run; it is intentionally + * NOT cleared afterward so the caller can read it (the runtime-rule orchestrator reads it to + * decide {@code APPLIED} vs {@code DEGRADED}). */ public void runDeferredFence() throws StorageException { final DeferredFence fence = this.deferredFence; @@ -447,6 +528,7 @@ public void runDeferredFence() throws StorageException { return; } this.deferredFence = null; + this.fenceOutcome = null; try { fence.await(); } finally { diff --git a/oap-server/server-starter/src/main/resources/application.yml b/oap-server/server-starter/src/main/resources/application.yml index ec2d3d7bae4f..cc85d99bf34b 100644 --- a/oap-server/server-starter/src/main/resources/application.yml +++ b/oap-server/server-starter/src/main/resources/application.yml @@ -718,6 +718,11 @@ receiver-runtime-rule: refreshRulesPeriod: ${SW_RECEIVER_RUNTIME_RULE_REFRESH_RULES_PERIOD:30} # SUSPENDED state self-heal threshold (seconds). selfHealThresholdSeconds: ${SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS:60} + # Timeout (seconds) for the runtime-rule deferred BanyanDB schema fence on the operator REST + # apply: the background wait for every data node to apply a new measure's schema revision. + # The apply returns immediately at rolling-out; poll GET /runtime/rule/status for fencing -> + # applied/degraded. Inline/static/delete fences keep their short 2s constant. + deferredFenceTimeoutSeconds: ${SW_RECEIVER_RUNTIME_RULE_DEFERRED_FENCE_TIMEOUT_SECONDS:180} # DSL Debug API — sampling debugger for MAL / LAL / OAL rules. Read-only OAL # listing under /runtime/oal/*; session control plane under /dsl-debugging/*. 
diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
index ea9e2b169073..f613fe185809 100644
--- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
+++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java
@@ -296,7 +296,7 @@ private void fenceOnRevision(final BanyanDBClient client, final StorageManipulat
             // time, after every resource has recorded its revision.
             opt.setDeferredFence(() -> {
                 try {
-                    doFenceOnRevision(client, opt, "batched apply");
+                    doDeferredFence(client, opt, "batched apply");
                 } catch (final BanyanDBException e) {
                     throw new StorageException("batched schema fence failed", e);
                 }
@@ -311,6 +311,42 @@ private void doFenceOnRevision(final BanyanDBClient client, final StorageManipul
         doFenceOnRevisionValue(client, opt.getMaxModRevision(), context);
     }
 
+    /**
+     * The deferred (batched) fence the runtime-rule apply runs once after all DDL. Unlike the inline
+     * {@link #doFenceOnRevisionValue}, this (1) emits the apply's {@code FENCING} progress phase the
+     * instant before it blocks (via the opt's {@link StorageManipulationOpt.FencePhaseListener}),
+     * (2) honors the opt's configured timeout ({@link StorageManipulationOpt#getFenceTimeoutMs()},
+     * the runtime-rule 3-min budget) instead of the short inline {@link #FENCE_TIMEOUT}, and
+     * (3) records the outcome (applied + laggard node ids) on the opt so the orchestrator can mark
+     * {@code APPLIED} vs {@code DEGRADED}. A laggard timeout is still a non-fatal WARN.
+     */
+    private void doDeferredFence(final BanyanDBClient client, final StorageManipulationOpt opt,
+                                 final String context) throws BanyanDBException {
+        final long rev = opt.getMaxModRevision();
+        if (rev <= 0L) {
+            return;
+        }
+        final StorageManipulationOpt.FencePhaseListener listener = opt.getFencePhaseListener();
+        if (listener != null) {
+            listener.onFenceStart();
+        }
+        final Duration timeout = opt.getFenceTimeoutMs() > 0L
+            ? Duration.ofMillis(opt.getFenceTimeoutMs())
+            : FENCE_TIMEOUT;
+        final SchemaWatcher.Result result = client.getSchemaWatcher().awaitRevisionApplied(rev, timeout);
+        if (!result.isApplied()) {
+            log.warn("BanyanDB schema-watch fence did NOT confirm revision {} within {} ms for {}; "
+                + "proceeding anyway. Laggards: {}", rev, timeout.toMillis(), context, result.getLaggards());
+            final List<String> laggardIds = result.getLaggards().stream()
+                .map(l -> l.getNode())
+                .collect(Collectors.toList());
+            opt.setFenceOutcome(new StorageManipulationOpt.FenceOutcome(false, laggardIds));
+        } else {
+            log.debug("BanyanDB schema-watch fence confirmed revision {} for {}", rev, context);
+            opt.setFenceOutcome(new StorageManipulationOpt.FenceOutcome(true, List.of()));
+        }
+    }
+
     private void doFenceOnRevisionValue(final BanyanDBClient client, final long rev, final String context)
         throws BanyanDBException {
         if (rev <= 0L) {

From 8e602b476bad4adb94d2c0a1be10a413642d4999 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Mon, 15 Jun 2026 20:06:05 +0800
Subject: [PATCH 13/19] Add tests for the runtime-rule query/notify/fence surfaces + fix javadoc link
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Closes the review-flagged coverage gap on the new apply-status surfaces, and fixes a dangling
javadoc reference the whole-project javadoc build caught.

- RuntimeRuleClusterServiceImplTest (new): getApplyStatus maps every ApplyPhase to proto along
  the happy path (PENDING→DDL→ROLLING_OUT→FENCING→APPLIED), surfaces DEGRADED with the laggard
  node ids, maps FAILED, returns UNKNOWN/found=false when nothing is tracked, and resolves by
  (catalog,name) when applyId is absent; notifyApplied suppresses a node's own broadcast (no
  reconcile) and schedules an off-thread reconcile for a peer notify.
- MalFileApplierTest: a deferred-fence transport failure rolls the apply back and carries the
  metrics registered before the fence for the caller's rollback set.
- RuntimeRuleRestHandlerTest: queryApplyStatus degrades to the durable rule row when the live
  status is gone (ACTIVE row → APPLIED, derivedFrom=durable-dao), and returns
  UNKNOWN/found=false when neither a live status nor a row exists.
- DSLManager: drop the stale {@link #newDeferredFenceOpt()} (method removed when the REST
  orchestrator took over building the deferred-fence opt) so javadoc resolves.

Verified: server-core 306/306, runtime-rule 139/139 (+ the 10 new cases above), whole-project
checkstyle + compile + javadoc + license all clean.

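For readers of this message, the status fallback that RuntimeRuleRestHandlerTest covers can be pictured roughly as below. This is only a sketch under stated assumptions, not the handler's code: the class `ApplyStatusFallbackSketch`, the `Status` holder, the `resolve` method, and the `"live"` marker are hypothetical names, and treating a non-ACTIVE durable row as UNKNOWN is an assumption the message does not confirm.

```java
public class ApplyStatusFallbackSketch {
    /** Hypothetical, trimmed-down view of the status JSON (found / phase / derivedFrom). */
    public static final class Status {
        public final boolean found;
        public final String phase;
        public final String derivedFrom;

        public Status(final boolean found, final String phase, final String derivedFrom) {
            this.found = found;
            this.phase = phase;
            this.derivedFrom = derivedFrom;
        }
    }

    /**
     * Prefer the live in-memory phase; if it is gone, fall back to the durable rule row.
     * Both arguments are nullable: null means "no live status" / "no durable row".
     */
    public static Status resolve(final String livePhase, final String durableRowStatus) {
        if (livePhase != null) {
            // "live" is an illustrative marker, not a value taken from the patch.
            return new Status(true, livePhase, "live");
        }
        if ("ACTIVE".equals(durableRowStatus)) {
            // Persist-is-commit: an ACTIVE row is reported as APPLIED.
            return new Status(true, "APPLIED", "durable-dao");
        }
        // Assumption: anything else (no row, or a non-ACTIVE row) is reported as UNKNOWN.
        return new Status(false, "UNKNOWN", null);
    }

    public static void main(final String[] args) {
        final Status fallback = resolve(null, "ACTIVE");
        System.out.println(fallback.found + " " + fallback.phase + " " + fallback.derivedFrom);
        final Status none = resolve(null, null);
        System.out.println(none.found + " " + none.phase);
    }
}
```

The two calls in `main` mirror the two new REST handler test cases: a durable ACTIVE row without a live status, and neither a live status nor a row.
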
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../runtimerule/reconcile/DSLManager.java     |   5 +-
 .../runtimerule/apply/MalFileApplierTest.java |  32 +++
 .../RuntimeRuleClusterServiceImplTest.java    | 183 ++++++++++++++++++
 .../rest/RuntimeRuleRestHandlerTest.java      |  32 +++
 4 files changed, 250 insertions(+), 2 deletions(-)
 create mode 100644 oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java

diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
index 8266668cbca0..c9d2ece6fb6f 100644
--- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
+++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/reconcile/DSLManager.java
@@ -151,8 +151,9 @@ public final class DSLManager {
     private final long selfHealThresholdMs;
 
     /** Timeout for the runtime-rule deferred/batched BanyanDB schema fence on the operator REST
-     * apply path (default 3 min). Carried onto the deferred-fence opt; see
-     * {@link #newDeferredFenceOpt()} and {@code RuntimeRuleModuleConfig.deferredFenceTimeoutSeconds}. */
+     * apply path (default 3 min). Read via {@link #getDeferredFenceTimeoutMs()} by the REST
+     * orchestrator, which builds the deferred-fence opt with it
+     * ({@code RuntimeRuleModuleConfig.deferredFenceTimeoutSeconds}).
*/ @Getter private final long deferredFenceTimeoutMs; diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java index dc4222b49185..66f147557c71 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/apply/MalFileApplierTest.java @@ -24,6 +24,7 @@ import javassist.ClassPool; import org.apache.skywalking.oap.server.core.analysis.Layer; import org.apache.skywalking.oap.server.core.analysis.meter.MeterSystem; +import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.LayerConflictException; import org.apache.skywalking.oap.server.receiver.runtimerule.layer.RuntimeLayerRegistry; @@ -134,6 +135,37 @@ void successfulApplyRegistersDerivedMetricNames() throws Exception { any(StorageManipulationOpt.class)); } + @Test + void deferredFenceFailureRollsBackAndCarriesPartialMetricsForCallerRollback() throws Exception { + // The DDL for every metric fires first; the batched schema fence runs once at the end. A + // barrier transport error there must abort the apply as an ApplyException (so the REST + // caller rolls back), carrying the metric names registered before the fence so the caller + // knows what to unwind. Inline path (fenceRunByCaller=false) so the applier runs the fence. 
+ final String yaml = + "metricPrefix: meter_vm\n" + + "expSuffix: service(['host'], Layer.OS_LINUX)\n" + + "metricsRules:\n" + + " - name: cpu_total_percentage\n" + + " exp: node_cpu_seconds_total.sum(['host']).rate('PT1M')\n" + + " - name: mem_total_used_percentage\n" + + " exp: node_memory_MemTotal_bytes.sum(['host'])\n"; + final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.setDeferredFence(() -> { + throw new StorageException("barrier transport down"); + }); + + final MalFileApplier.ApplyException ex = assertThrows( + MalFileApplier.ApplyException.class, + () -> applier.apply(yaml, "otel-rules/vm-fence", "hashF", opt)); + + assertTrue(ex.getMessage().contains("schema fence failed"), + "fence failure must surface as an apply error, not be swallowed"); + assertEquals( + setOf("meter_vm_cpu_total_percentage", "meter_vm_mem_total_used_percentage"), + ex.getPartiallyRegistered(), + "metrics registered before the fence must be carried for the caller's rollback"); + } + @Test void removeCallsMeterSystemPerName() { // The inverse side of the contract: on unregister every metric name the prior apply diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java new file mode 100644 index 000000000000..309821f6ae86 --- /dev/null +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + * + */ + +package org.apache.skywalking.oap.server.receiver.runtimerule.cluster; + +import io.grpc.stub.StreamObserver; +import java.util.Arrays; +import java.util.concurrent.CountDownLatch; +import java.util.concurrent.TimeUnit; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.ApplyStatusResponse; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedAck; +import org.apache.skywalking.oap.server.receiver.runtimerule.cluster.v1.NotifyAppliedRequest; +import org.apache.skywalking.oap.server.receiver.runtimerule.reconcile.DSLManager; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.ApplyPhase; +import org.apache.skywalking.oap.server.receiver.runtimerule.status.SchemaApplyCoordinator; +import org.junit.jupiter.api.Test; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; +import static org.junit.jupiter.api.Assertions.assertTrue; +import static org.mockito.Mockito.doAnswer; +import static org.mockito.Mockito.mock; +import static org.mockito.Mockito.never; +import static org.mockito.Mockito.verify; + +/** + * Unit coverage for the admin-internal gRPC handlers the 
cluster main serves: + * {@code getApplyStatus} (the apply-status query, including the {@link ApplyPhase} → proto mapping + * and the laggard list) and {@code notifyApplied} (self-suppression + the off-thread reconcile + * nudge). Status flows through the {@link SchemaApplyCoordinator#INSTANCE} singleton, so each test + * opens its own apply (a fresh UUID applyId) to stay isolated from the others. + */ +class RuntimeRuleClusterServiceImplTest { + + private static final String SELF = "self-node_17129"; + + /** Minimal StreamObserver that captures the single onNext value the handlers emit. */ + private static final class Capturing implements StreamObserver { + private T value; + + @Override + public void onNext(final T v) { + this.value = v; + } + + @Override + public void onError(final Throwable t) { + } + + @Override + public void onCompleted() { + } + } + + private ApplyStatusResponse query(final RuntimeRuleClusterServiceImpl impl, final String applyId) { + final Capturing obs = new Capturing<>(); + impl.getApplyStatus(ApplyStatusRequest.newBuilder().setApplyId(applyId).build(), obs); + return obs.value; + } + + @Test + void getApplyStatusMapsEachPhaseToProtoAlongTheHappyPath() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-phase", "h1"); + + assertEquals(ApplyStatusPhase.APPLY_PHASE_PENDING, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.DDL); + assertEquals(ApplyStatusPhase.APPLY_PHASE_DDL, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.ROLLING_OUT); + assertEquals(ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.markFencing(id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_FENCING, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.markApplied(id); + + final 
ApplyStatusResponse applied = query(impl, id); + assertTrue(applied.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_APPLIED, applied.getPhase()); + assertEquals("otel-rules", applied.getCatalog()); + assertEquals("vm-phase", applied.getName()); + assertEquals(SELF, applied.getNodeId()); + assertTrue(applied.getFenceLaggardsList().isEmpty()); + } + + @Test + void getApplyStatusSurfacesDegradedWithLaggardNodeIds() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-degraded", "h2"); + SchemaApplyCoordinator.INSTANCE.markDegraded(id, "fence did not confirm", + Arrays.asList("data-1_17912", "data-2_17912")); + + final ApplyStatusResponse resp = query(impl, id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_DEGRADED, resp.getPhase()); + assertEquals("fence did not confirm", resp.getFailureReason()); + assertEquals(Arrays.asList("data-1_17912", "data-2_17912"), resp.getFenceLaggardsList()); + } + + @Test + void getApplyStatusMapsFailed() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-failed", "h3"); + SchemaApplyCoordinator.INSTANCE.markFailed(id, "ddl_verify_failed"); + + final ApplyStatusResponse resp = query(impl, id); + assertEquals(ApplyStatusPhase.APPLY_PHASE_FAILED, resp.getPhase()); + assertEquals("ddl_verify_failed", resp.getFailureReason()); + assertTrue(resp.getFenceLaggardsList().isEmpty()); + } + + @Test + void getApplyStatusReturnsUnknownWhenNothingTracked() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + + final ApplyStatusResponse resp = query(impl, "no-such-apply-id"); + assertFalse(resp.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_UNKNOWN, resp.getPhase()); + // node_id is still stamped so the 
caller knows which node answered. + assertEquals(SELF, resp.getNodeId()); + } + + @Test + void getApplyStatusResolvesByCatalogNameWhenApplyIdAbsent() { + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(mock(DSLManager.class), SELF); + final String id = SchemaApplyCoordinator.INSTANCE.begin("otel-rules", "vm-byname", "hbyname"); + SchemaApplyCoordinator.INSTANCE.markApplied(id); + + final Capturing obs = new Capturing<>(); + impl.getApplyStatus(ApplyStatusRequest.newBuilder() + .setCatalog("otel-rules").setName("vm-byname").setContentHash("hbyname").build(), obs); + + assertTrue(obs.value.getFound()); + assertEquals(ApplyStatusPhase.APPLY_PHASE_APPLIED, obs.value.getPhase()); + assertEquals(id, obs.value.getApplyId()); + } + + @Test + void notifyAppliedSuppressesSelfBroadcastAndDoesNotReconcile() { + final DSLManager dslManager = mock(DSLManager.class); + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(dslManager, SELF); + final Capturing obs = new Capturing<>(); + + impl.notifyApplied(NotifyAppliedRequest.newBuilder() + .setSenderNodeId(SELF).setCatalog("otel-rules").setName("vm").build(), obs); + + assertFalse(obs.value.getAccepted(), "a node's own broadcast must be suppressed"); + verify(dslManager, never()).tick(); + } + + @Test + void notifyAppliedFromPeerSchedulesAReconcileTick() throws Exception { + final DSLManager dslManager = mock(DSLManager.class); + final CountDownLatch ticked = new CountDownLatch(1); + doAnswer(inv -> { + ticked.countDown(); + return null; + }).when(dslManager).tick(); + final RuntimeRuleClusterServiceImpl impl = new RuntimeRuleClusterServiceImpl(dslManager, SELF); + final Capturing obs = new Capturing<>(); + + impl.notifyApplied(NotifyAppliedRequest.newBuilder() + .setSenderNodeId("other-node_17129").setCatalog("otel-rules").setName("vm").build(), obs); + + assertTrue(obs.value.getAccepted(), "a peer notify must be accepted"); + assertTrue(ticked.await(3, TimeUnit.SECONDS), + 
"the reconcile tick must run off the gRPC thread"); + } +} diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java index ee0089f0af95..fcb72b3c97d8 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleRestHandlerTest.java @@ -58,6 +58,7 @@ import org.mockito.Mockito; import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; import static org.mockito.ArgumentMatchers.any; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.never; @@ -209,6 +210,37 @@ void fixBypassesNoChangeEvenOnByteIdenticalActiveRow() throws Exception { verify(dslManager).applyNowForRuleFile(any(), Mockito.eq(true), any()); } + @Test + void applyStatusDegradesToDurableRowWhenLiveStatusGone() throws Exception { + // No live coordinator status for this (catalog,name) and no apply-id supplied — the query + // must fall back to the durable rule row. An ACTIVE row reports APPLIED derived from the + // content hash (persist-is-commit), so a page refresh after the applyId is gone still + // resolves instead of returning an opaque UNKNOWN. 
+ final String yaml = minimalMalYaml(); + whenDaoHasRow("otel-rules", "vm-statusfallback", yaml, RuntimeRule.STATUS_ACTIVE); + + final HttpResponse resp = handler.applyStatus("otel-rules", "vm-statusfallback", "", ""); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertTrue(body.contains("\"found\":true"), body); + assertTrue(body.contains("\"phase\":\"APPLIED\""), body); + assertTrue(body.contains("\"derivedFrom\":\"durable-dao\""), body); + } + + @Test + void applyStatusUnknownWhenNoLiveStatusAndNoDurableRow() throws Exception { + // Neither a live status nor a durable row for the queried apply → found=false / UNKNOWN. + when(dao.getAll()).thenReturn(Arrays.asList()); + + final HttpResponse resp = handler.applyStatus("otel-rules", "vm-none", "", "no-such-apply-id"); + + assertHttpStatus(resp, HttpStatus.OK); + final String body = resp.aggregate().toCompletableFuture().join().contentUtf8(); + assertTrue(body.contains("\"found\":false"), body); + assertTrue(body.contains("UNKNOWN"), body); + } + @Test void addOrUpdateReturnsCompileFailedOnMalformedYaml() throws Exception { // compile_failed is the guaranteed pre-persist error: the classifier's AST walk From a6fd9f9af7d040fc63e5e093b51342059dc21a7e Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 20:49:57 +0800 Subject: [PATCH 14/19] Make the runtime-rule schema fence gate dispatch resume (write-safety) + review fixes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Second review round on the apply-status branch. HIGH — write-safety regression fixed. The async fence resumed dispatch (finalizeCommit + peer notify) BEFORE confirming schema propagation, but an un-propagated write is silently dropped at the data node (CLAUDE.md tip #16), so this lost data during the propagation window. 
The fence now GATES the resume: after persist the apply marks FENCING and returns the applyId
immediately, but the background task (fenceThenResume) fences FIRST, then finalizes the local
commit + resumes/notifies peers. Dispatch stays suspended through FENCING — a clean collection
pause, not dropped writes. On a genuine laggard it resumes anyway after the budget and marks
DEGRADED + the laggard ids (a stuck node must not park the metric forever). Phase order corrected
to PENDING → DDL → FENCING → ROLLING_OUT → APPLIED across the enum, proto, status docs, and
changelog. The executor-rejected fallback runs the fence inline so the suspend bracket never leaks.

MEDIUM — shared-tick opt isolation. runDeferredFence now resets the accumulated revision
unconditionally (even a no-DDL file that registered no closure), so a shared tick opt isolates
each file; documented that a later commit-tail's drop revisions are inline-fenced and benign (the
next file's own create revision is monotonically higher and dominates).

LOW — executor shutdown. Added RuntimeRuleClusterServiceImpl.shutdown() (parity with
RuntimeRuleService); corrected both shutdown() docs to state the framework's ModuleProvider has
no stop hook, so they are daemon-thread + best-effort + for test teardown — not the prior false
"called from provider shutdown".

MEDIUM — MAL Elvis. Documented the eager-fallback as a known limitation at the codegen site
(Javassist cannot lazy-eval without a Supplier-companion pass; real MAL fallbacks are pure, cheap
reads so it is benign) rather than risk a codegen rewrite.

Removed the now-redundant FencePhaseListener machinery (FENCING is marked synchronously before
scheduling).

Tests: StorageManipulationOptTest no-closure-reset case; the cluster test walks the corrected
phase order.

Verified: server-core 9/9, runtime-rule 149/149, whole-project checkstyle + compile + javadoc +
license all clean.

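The gating order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the RuntimeRuleService implementation: `FenceGatedResumeSketch`, `Fence`, `FenceResult`, and the `events` trace are hypothetical names, only the terminal phases are modeled, and the intermediate FENCING / ROLLING_OUT bookkeeping is left out.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FenceGatedResumeSketch {
    /** Terminal phases only; intermediate phases are omitted from this sketch. */
    public enum Phase { APPLIED, DEGRADED }

    /** Hypothetical stand-in for the schema-watch barrier result. */
    public static final class FenceResult {
        public final boolean applied;
        public final List<String> laggards;

        public FenceResult(final boolean applied, final List<String> laggards) {
            this.applied = applied;
            this.laggards = laggards;
        }
    }

    /** Hypothetical barrier: returns once every data node confirms, or the timeout elapses. */
    public interface Fence {
        FenceResult await(Duration timeout);
    }

    /**
     * Fence first, then resume. Dispatch stays suspended while the fence waits; on a laggard
     * timeout dispatch still resumes, but the outcome is DEGRADED instead of APPLIED.
     */
    public static Phase fenceThenResume(final Fence fence, final Duration timeout,
                                        final Runnable resumeDispatch, final List<String> events) {
        events.add("fence");
        final FenceResult result = fence.await(timeout);
        events.add("resume");
        resumeDispatch.run();
        return result.applied ? Phase.APPLIED : Phase.DEGRADED;
    }

    public static void main(final String[] args) {
        final List<String> events = new ArrayList<>();
        final Phase confirmed = fenceThenResume(
            t -> new FenceResult(true, Collections.emptyList()),
            Duration.ofSeconds(180), () -> { }, events);
        final Phase lagging = fenceThenResume(
            t -> new FenceResult(false, Arrays.asList("data-1")),
            Duration.ofSeconds(180), () -> { }, events);
        System.out.println(confirmed + " " + lagging + " " + events);
    }
}
```

The point of the ordering is that `resumeDispatch` can never run before `fence.await` returns, which is the write-safety property this commit restores; the laggard branch resumes as well so a stuck node cannot park the metric indefinitely.
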
Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 4 +- .../setup/backend/admin-api/runtime-rule.md | 2 +- .../v2/compiler/MALClosureCodegen.java | 11 + .../RuntimeRuleClusterServiceImpl.java | 8 + .../runtimerule/rest/RuntimeRuleService.java | 195 +++++++++--------- .../runtimerule/status/ApplyPhase.java | 25 ++- .../src/main/proto/runtime-rule-cluster.proto | 7 +- .../RuntimeRuleClusterServiceImplTest.java | 4 +- .../storage/model/StorageManipulationOpt.java | 35 +--- .../model/StorageManipulationOptTest.java | 14 ++ .../banyandb/BanyanDBIndexInstaller.java | 15 +- 11 files changed, 164 insertions(+), 156 deletions(-) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index b33afa82e74e..4b4735bfeee6 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,8 +242,8 @@ admin-host only" entry above for the public REST retirement. #### OAP Server -* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. 
Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background after the durable commit (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). -* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → rolling-out → fencing → applied, with `degraded` for a committed-but-unconfirmed apply — local commit-tail threw, or the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) in the **background** after the durable commit + peer resume, so a slow cluster never blocks the apply or holds peers suspended: the `POST /addOrUpdate` returns its `applyId` immediately at `rolling-out`, and the operator polls `GET /runtime/rule/status` to watch `fencing → applied` (or `degraded`). A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. 
When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (persist-is-commit). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. +* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background after the durable persist but **before** dispatch resumes — it gates the local commit + peer resume so writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). 
+* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates the dispatch resume**: because an un-propagated write is silently dropped at the data node, the apply persists the rule row, then waits for cluster-wide schema propagation BEFORE finalizing the local commit and resuming/notifying peers. That wait runs in the background so it never blocks the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (persist-is-commit). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. * Push runtime-rule convergence to peers on commit. 
After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. * Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. 
The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. diff --git a/docs/en/setup/backend/admin-api/runtime-rule.md b/docs/en/setup/backend/admin-api/runtime-rule.md index a42f54ff33e7..6ccd66722192 100644 --- a/docs/en/setup/backend/admin-api/runtime-rule.md +++ b/docs/en/setup/backend/admin-api/runtime-rule.md @@ -111,7 +111,7 @@ server returns `400 compile_failed`. | GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to true; `false` omits each YAML body. Each item includes whether an operator override exists. | | GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. 
Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. | | GET | `/runtime/rule/dump[/]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. | -| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `ROLLING_OUT`, `FENCING`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. The `/addOrUpdate` response returns at `ROLLING_OUT` (durable + peers resumed) with the `applyId`; the schema fence then confirms cluster-wide propagation in the background, advancing to `FENCING` then `APPLIED`, or `DEGRADED` with `fenceLaggards` (the data-node ids that did not confirm within `deferredFenceTimeoutSeconds`). When the live status is gone (or the main is unreachable) it degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. | +| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `FENCING`, `ROLLING_OUT`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. 
The `/addOrUpdate` response returns at `FENCING` (rule row durably persisted) with the `applyId`; dispatch for that rule stays paused while the background fence confirms cluster-wide schema propagation — because an un-propagated write is silently dropped — and only then does it advance to `ROLLING_OUT` (finalize commit, resume + notify peers) and `APPLIED`. On a genuine laggard it advances to `DEGRADED` with `fenceLaggards` (the data-node ids that did not confirm within `deferredFenceTimeoutSeconds`) and resumes dispatch anyway. When the live status is gone (or the main is unreachable) it degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. | ### `/delete` storage semantics — per backend diff --git a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java index bc2806e5e393..e12a6b7fdcf3 100644 --- a/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java +++ b/oap-server/analyzer/meter-analyzer/src/main/java/org/apache/skywalking/oap/meter/analyzer/v2/compiler/MALClosureCodegen.java @@ -304,6 +304,17 @@ void generateClosureExpr(final StringBuilder sb, // empty string/container, numeric zero, false), not only when null. // Keep the primary single-evaluated so expressions such as tags.remove(...) // do not observe different values between the truth check and result. + // + // KNOWN LIMITATION (eager fallback): elvis(primary, fallback) is a plain method call, + // so Java evaluates BOTH arguments before elvis() runs — the fallback is always + // computed, even when the primary is truthy. This matches Groovy's falsy-SELECTION but + // not its lazy EVALUATION: a fallback that mutates state or is expensive still runs. 
+ // True laziness would need the fallback wrapped in a Supplier (or an inlined ternary + // over a generated temp), which Javassist cannot emit (no lambdas; expression context + // has no statement slot for a temp). Accepted because real MAL fallbacks are pure, + // cheap reads (`tags['x'] ?: tags['y']`, `metricA ?: metricB`) — side-effecting/expensive + // fallbacks do not occur in practice. Revisit via a Supplier-companion codegen pass if + // that ever changes. sb.append(MALCodegenHelper.RUNTIME_HELPER_FQCN).append(".elvis("); generateClosureExpr(sb, elvis.getPrimary(), paramName, beanMode); sb.append(", "); diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java index de155f051d35..0924bd7ad41f 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImpl.java @@ -129,6 +129,14 @@ public RuntimeRuleClusterServiceImpl(final DSLManager dslManager, final String s this.selfNodeId = selfNodeId; } + /** Stop the off-thread reconcile-nudge executor. The framework's {@code ModuleProvider} has no + * stop lifecycle hook, so in production this is not auto-invoked — the executor is a daemon + * thread that never blocks JVM exit. Provided for clean test teardown and for a future module + * shutdown hook; mirrors {@code RuntimeRuleService.shutdown()}. */ + public void shutdown() { + reconcileNudgeExecutor.shutdownNow(); + } + /** * Push-notify from the main after a successful commit: converge NOW rather than on the next * ~30s tick. 
Runs a full reconcile off the gRPC thread (idempotent, per-file-locked — unchanged diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 7d1e821111a8..e97245b7ddc6 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -60,7 +60,6 @@ import org.apache.skywalking.oap.server.telemetry.api.HistogramMetrics; import org.apache.skywalking.oap.server.core.management.runtimerule.RuntimeRule; import org.apache.skywalking.oap.server.core.storage.model.StorageManipulationOpt; -import org.apache.skywalking.oap.server.core.storage.StorageException; import org.apache.skywalking.oap.server.core.rule.ext.StaticRuleRegistry; import org.apache.skywalking.oap.server.core.storage.StorageModule; import org.apache.skywalking.oap.server.core.storage.management.RuntimeRuleManagementDAO; @@ -1211,15 +1210,15 @@ private HttpResponse applyStructural(final String catalog, final String name, final RuntimeRuleManagementDAO.RuntimeRuleFile ruleFile = new RuntimeRuleManagementDAO.RuntimeRuleFile( catalog, name, content, RuntimeRule.STATUS_ACTIVE, updateTime); final DSLRuntimeState postApply; - // Build the deferred-fence opt ourselves so WE own the post-DDL schema fence: it must run - // in the BACKGROUND after the durable commit + peer resume (a 3-min cluster-propagation - // wait must not block this response or hold peers suspended). 
fenceRunByCaller tells the - // installer to register the fence closure but NOT run it inline; the fence-phase listener - // lets the background run emit the observable FENCING phase the instant it starts blocking. + // Build the deferred-fence opt ourselves so WE own the post-DDL schema fence. It must run + // AFTER persist but BEFORE dispatch resumes: an un-propagated write is silently dropped at + // the data node (CLAUDE.md tip #16), so the local commit (which swaps the bundle + unparks + // dispatch) and the peer resume/notify must wait for the fence. fenceRunByCaller tells the + // installer to register the fence closure but NOT run it inline — we run it (in the + // background) right before resuming, so the HTTP response is not held for the wait. final StorageManipulationOpt fenceOpt = StorageManipulationOpt.withSchemaChangeDeferredFence( Duration.ofMillis(dslManager.getDeferredFenceTimeoutMs())); fenceOpt.setFenceRunByCaller(true); - fenceOpt.setFencePhaseListener(() -> SchemaApplyCoordinator.INSTANCE.markFencing(applyId)); // The apply call compiles, verifies, and fires the schema-change DDL (the fence is deferred // to us); mark DDL before it so an in-flight query sees progress past PENDING. SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.DDL); @@ -1325,110 +1324,99 @@ private HttpResponse applyStructural(final String catalog, final String name, return persistError; } - // Persist succeeded — the durable commit point is crossed. Past here we drain the local - // commit and push peers to converge, so the apply is rolling out across the cluster. - SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); - - // Drain the pending commit now that the DB reflects the new - // content. commitCoord.finalizeCommit drops removedMetrics, swaps the applied - // pointers, retires the old loader, fires alarm reset, and advances the snapshot. 
- // - // Commit-tail failure handling: the DB row is durable (persist already succeeded), - // so peers converge from the DB — but on THIS node the local drop+recreate may - // not have fully landed. Return 500 commit_deferred so the operator sees a clear - // "DB row flipped, local commit threw" signal and can retry. Returning 200 would - // tell the operator "done" while the backend schema on this node may still be - // stale — that's the failure mode the review flagged. - Throwable commitFailure = null; - boolean drained = false; + // Persist succeeded — the durable commit point is crossed. But writes must NOT resume until + // the new schema has propagated to every data node: an un-propagated write is silently + // dropped (CLAUDE.md tip #16). So we fence FIRST, then finalize the local commit (swap the + // bundle + unpark local dispatch) and resume/notify peers. That runs on the fence executor + // so the (up to 3-min) wait does not block this response: we return now at FENCING with the + // applyId; the operator polls GET /runtime/rule/status to watch FENCING → ROLLING_OUT → + // APPLIED (or DEGRADED + the laggard node list). Peers + local stay suspended — a clean + // collection pause, not dropped writes — until the fence confirms. + SchemaApplyCoordinator.INSTANCE.markFencing(applyId); + boolean scheduled = true; try { - drained = dslManager.getCommitCoord().finalizeCommit(catalog, name); + fenceExecutor.submit(() -> fenceThenResume(applyId, fenceOpt, catalog, name, content)); } catch (final Throwable t) { - commitFailure = t; - log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist " - + "succeeded — DB is authoritative, peers will converge. Operator action: " - + "inspect log for the underlying cause.", catalog, name, t); + // Executor rejected (shutting down). Run inline (blocking) so the suspend bracket does + // not leak — write-safety + a non-leaked suspend win over a non-blocking response here. 
+ log.warn("runtime-rule could not schedule the background fence for {}/{}; running it " + + "inline so dispatch is not left suspended", catalog, name, t); + scheduled = false; } - if (commitFailure != null) { - // Durable (DB persisted) but this node's commit-tail threw — peers converge from DB, - // this node retries on the next tick. Committed-but-unconfirmed = DEGRADED, not FAILED. - // The DB row IS durable, so push peers to reconcile against it now (async, best-effort) - // rather than leaving them on the ~30s tick while this node is degraded — exactly the - // case where fast peer convergence matters most. - broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); - SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, - "commit-tail deferred: DB persisted, local backend may be stale until next tick"); - return serverError("commit_deferred", catalog, name, - "DB row persisted, but local commit-tail threw — backend shape on this " - + "node may not have fully landed. Peers converge from DB; this node " - + "will retry on the next dslManager tick. Cause: " - + commitFailure.getMessage()); + if (!scheduled) { + fenceThenResume(applyId, fenceOpt, catalog, name, content); } - - // No commit was drained — typical for {@code force=true} re-applies on byte- - // identical content (engine returned NO_CHANGE so nothing was stashed). Peers are - // still PEER-suspended from our earlier broadcast and would only converge via the - // 60 s self-heal window without an explicit Resume. Send the Resume now so peers - // recover within an RPC round-trip. - if (!drained) { - broadcastResume(catalog, name, "force_no_change"); - // No commit drained ⇒ no schema change to confirm — terminal immediately. 
- SchemaApplyCoordinator.INSTANCE.markApplied(applyId); - return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, - "structural apply succeeded (no change)" + describeDelta(delta)); - } - // Real change committed and durable — push peers to converge NOW (reconcile against - // the just-persisted DB row) instead of waiting up to one ~30s refresh tick. - // Best-effort; peers self-converge on their own tick if the notify is lost. - broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); - // Confirm cluster-wide schema propagation in the BACKGROUND (up to the configured fence - // timeout, default 3 min): runs the deferred fence we owned, driving FENCING → APPLIED, or - // DEGRADED + the laggard node list on timeout. The response returns now at ROLLING_OUT with - // the applyId; the operator polls GET /runtime/rule/status to watch the rest. - scheduleBackgroundFence(applyId, fenceOpt, catalog, name); return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, - "structural apply committed; confirming cluster-wide schema propagation in the " - + "background — poll /runtime/rule/status?applyId=" + applyId - + describeDelta(delta)); + "structural apply persisted; fencing cluster-wide schema propagation before resuming " + + "dispatch — poll /runtime/rule/status?applyId=" + applyId + describeDelta(delta)); } /** - * Run the runtime-rule deferred schema fence on a background thread after the durable commit, - * driving the apply's terminal status. The fence emits {@link ApplyPhase#FENCING} the instant it - * starts blocking (via the opt's listener), waits up to the configured timeout for every data - * node to apply the new schema revision, then marks {@link ApplyPhase#APPLIED} on confirmation or - * {@link ApplyPhase#DEGRADED} with the laggard node ids on timeout/transport error. 
Background so - * a long (3-min) wait never blocks the apply response or holds peers suspended — the commit is - * already durable and peers already resumed. + * Post-persist tail: wait for the new schema to propagate to every data node BEFORE any write + * resumes (an un-propagated write is silently dropped — CLAUDE.md tip #16), then finalize the + * local commit (drop removed metrics, swap the bundle, unpark local dispatch) and resume/notify + * peers. Runs on the fence executor (or inline if the executor is gone) so a long wait never + * blocks the HTTP response while still gating resume on the fence. + * + *
<p>
Drives the terminal status: {@link ApplyPhase#APPLIED} when the fence confirms and the + * commit lands; {@link ApplyPhase#DEGRADED} (with the laggard node ids) when the fence does not + * confirm within the budget — there we resume dispatch ANYWAY rather than park the metric forever + * on one stuck node, accepting that the laggard drops writes until it catches up — or when the + * local commit-tail throws (DB is durable; peers converge from it; this node retries on the next + * tick). */ - private void scheduleBackgroundFence(final String applyId, final StorageManipulationOpt fenceOpt, - final String catalog, final String name) { + private void fenceThenResume(final String applyId, final StorageManipulationOpt fenceOpt, + final String catalog, final String name, final String content) { + // 1. Fence the CREATE schema propagation before any write resumes. + StorageManipulationOpt.FenceOutcome fenceOutcome = null; + Throwable fenceError = null; try { - fenceExecutor.submit(() -> { - try { - fenceOpt.runDeferredFence(); - final StorageManipulationOpt.FenceOutcome outcome = fenceOpt.getFenceOutcome(); - if (outcome == null || outcome.isApplied()) { - SchemaApplyCoordinator.INSTANCE.markApplied(applyId); - } else { - SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, - "schema fence did not confirm cluster-wide propagation within " - + dslManager.getDeferredFenceTimeoutMs() + " ms", - outcome.getLaggardNodeIds()); - } - } catch (final StorageException e) { - SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, - "schema fence transport error: " + e.getMessage(), List.of()); - } catch (final Throwable t) { - SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, - "schema fence failed: " + t.getMessage(), List.of()); - } - }); + fenceOpt.runDeferredFence(); + fenceOutcome = fenceOpt.getFenceOutcome(); } catch (final Throwable t) { - // Executor rejected (shutting down). 
The commit IS durable, so report APPLIED and skip - // the (best-effort) propagation confirmation rather than leaving the apply at ROLLING_OUT. - log.warn("runtime-rule could not schedule the background schema fence for {}/{}; " - + "reporting applied (commit durable, cluster propagation unconfirmed)", catalog, name, t); + fenceError = t; + log.warn("runtime-rule schema fence for {}/{} errored; resuming dispatch anyway " + + "(commit is durable, peers converge from the DB row)", catalog, name, t); + } + + // 2. Schema is now visible (or we gave up after the budget) — safe to resume writes. + SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); + Throwable commitFailure = null; + boolean drained = false; + try { + // finalizeCommit drops removedMetrics, swaps the applied pointers, retires the old + // loader, fires alarm reset, advances the snapshot, and unparks local dispatch. + drained = dslManager.getCommitCoord().finalizeCommit(catalog, name); + } catch (final Throwable t) { + commitFailure = t; + log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist + fence " + + "— DB is authoritative, peers converge from it; this node retries on the next tick.", + catalog, name, t); + } + if (drained) { + // Real change committed — peers reconcile against the just-persisted DB row now instead + // of waiting up to one ~30s refresh tick. Best-effort; a lost notify self-heals on tick. + broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); + } else { + // No commit drained (force re-apply on byte-identical content) — peers are still + // PEER-suspended from our earlier broadcast; un-suspend them within an RPC round-trip. + broadcastResume(catalog, name, "structural_resume"); + } + + // 3. Terminal status. 
+ if (commitFailure != null) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "commit-tail deferred: DB persisted, local backend may be stale until the next tick: " + + commitFailure.getMessage()); + } else if (fenceError != null) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence error (committed + durable; cluster converges): " + fenceError.getMessage()); + } else if (fenceOutcome != null && !fenceOutcome.isApplied()) { + SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, + "schema fence did not confirm cluster-wide propagation within " + + dslManager.getDeferredFenceTimeoutMs() + " ms", + fenceOutcome.getLaggardNodeIds()); + } else { SchemaApplyCoordinator.INSTANCE.markApplied(applyId); } } @@ -1622,9 +1610,10 @@ private void broadcastNotifyApplied(final String catalog, final String name, fin } } - /** Stop the best-effort background executors (notify fan-out + schema fence). Called from the - * module provider's shutdown so tests and clean OAP stops don't leak threads; daemon status - * already prevents them blocking JVM exit. */ + /** Stop the best-effort background executors (notify fan-out + schema fence). The framework's + * {@code ModuleProvider} has no stop lifecycle hook, so this is not auto-invoked in production — + * both executors are daemon threads that never block JVM exit. Provided for clean test teardown + * and for a future module shutdown hook; mirrors {@code RuntimeRuleClusterServiceImpl.shutdown()}. 
*/ public void shutdown() { notifyExecutor.shutdownNow(); fenceExecutor.shutdownNow(); diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java index c4e0fca394f5..e0f0d170ed53 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java @@ -25,22 +25,25 @@ * failed outright ({@link #FAILED}). * *
<p>
Normal progression: - * {@link #PENDING} → {@link #DDL} → {@link #ROLLING_OUT} → {@link #FENCING} → {@link #APPLIED}. + * {@link #PENDING} → {@link #DDL} → {@link #FENCING} → {@link #ROLLING_OUT} → {@link #APPLIED}. * These are the phases the main observes from its apply orchestration: {@code PENDING} once the * apply is accepted, {@code DDL} while the compile/verify/schema-change call runs (a single opaque * step from the orchestrator's vantage — sub-steps such as validation are not separately - * observable, so they are not modelled), {@code ROLLING_OUT} once the rule row is durably persisted - * and the commit is draining + peers are being resumed/notified, then {@code FENCING} while the - * main waits (in the background, on a generous timeout) for every BanyanDB data node to apply the - * new schema revision. The HTTP response returns at {@code ROLLING_OUT} with the {@code applyId}; - * the operator polls to watch {@code FENCING → APPLIED} (or {@code DEGRADED}). Two off-ramps: + * observable, so they are not modelled), then — after the rule row is durably persisted — the HTTP + * response returns with the {@code applyId} at {@code FENCING} while the main waits (in the + * background, on a generous timeout) for every BanyanDB data node to apply the new schema revision. + * Dispatch stays suspended through {@code FENCING} because an un-propagated write is silently + * dropped. Only once the fence confirms does the apply reach {@code ROLLING_OUT} — finalize the + * local commit, unpark dispatch, resume/notify peers — and then {@code APPLIED}. The operator polls + * to watch {@code FENCING → ROLLING_OUT → APPLIED} (or {@code DEGRADED}). Two off-ramps: *

<ul>
 *   <li>{@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). The change
 *       was rolled back; nothing was committed.</li>
- *   <li>{@link #DEGRADED} — committed and durable, but the local commit-tail threw or the post-DDL
- *       fence did not confirm cluster-wide propagation within the timeout (one or more data nodes
- *       lagging — exposed as {@code fenceLaggards}). Forward-progress: peers converge from the
- *       durable row and BanyanDB keeps converging; this is NOT a revert.</li>
+ *   <li>{@link #DEGRADED} — committed and durable, but the fence did not confirm cluster-wide
+ *       propagation within the timeout (one or more data nodes lagging — exposed as
+ *       {@code fenceLaggards}; dispatch is resumed anyway so a stuck node doesn't park the metric
+ *       forever) or the local commit-tail threw. Forward-progress: peers converge from the durable
+ *       row and BanyanDB keeps converging; this is NOT a revert.</li>
 * </ul>
* {@link #UNKNOWN} is returned for an apply-id the main no longer holds (evicted / main restarted); * callers fall back to a content-hash comparison. @@ -48,8 +51,8 @@ public enum ApplyPhase { PENDING, DDL, - ROLLING_OUT, FENCING, + ROLLING_OUT, APPLIED, DEGRADED, FAILED, diff --git a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto index 79531b0492e7..54acb2032ca8 100644 --- a/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto +++ b/oap-server/server-admin/runtime-rule/src/main/proto/runtime-rule-cluster.proto @@ -199,10 +199,11 @@ enum ApplyStatusPhase { reserved 2; reserved "APPLY_PHASE_VALIDATING"; APPLY_PHASE_DDL = 3; - // The main waits (in the background, on a generous timeout) for every data node to apply the - // new schema revision. Emitted AFTER ROLLING_OUT in the async model — the apply is already - // durable + peers resumed; this only confirms cluster-wide propagation. + // After persist, the main waits (in the background, on a generous timeout) for every data node + // to apply the new schema revision BEFORE resuming dispatch — an un-propagated write is dropped. + // The HTTP response returns here with the apply_id; dispatch stays suspended through FENCING. APPLY_PHASE_FENCING = 4; + // Fence confirmed — finalize the local commit, unpark dispatch, resume/notify peers. 
APPLY_PHASE_ROLLING_OUT = 5; APPLY_PHASE_APPLIED = 6; APPLY_PHASE_DEGRADED = 7; diff --git a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java index 309821f6ae86..3d1746f38d1f 100644 --- a/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java +++ b/oap-server/server-admin/runtime-rule/src/test/java/org/apache/skywalking/oap/server/receiver/runtimerule/cluster/RuntimeRuleClusterServiceImplTest.java @@ -83,10 +83,10 @@ void getApplyStatusMapsEachPhaseToProtoAlongTheHappyPath() { assertEquals(ApplyStatusPhase.APPLY_PHASE_PENDING, query(impl, id).getPhase()); SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.DDL); assertEquals(ApplyStatusPhase.APPLY_PHASE_DDL, query(impl, id).getPhase()); - SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.ROLLING_OUT); - assertEquals(ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT, query(impl, id).getPhase()); SchemaApplyCoordinator.INSTANCE.markFencing(id); assertEquals(ApplyStatusPhase.APPLY_PHASE_FENCING, query(impl, id).getPhase()); + SchemaApplyCoordinator.INSTANCE.transition(id, ApplyPhase.ROLLING_OUT); + assertEquals(ApplyStatusPhase.APPLY_PHASE_ROLLING_OUT, query(impl, id).getPhase()); SchemaApplyCoordinator.INSTANCE.markApplied(id); final ApplyStatusResponse applied = query(impl, id); diff --git a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java index f5afaf52873d..01f526e53a65 100644 --- 
a/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java +++ b/oap-server/server-core/src/main/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOpt.java @@ -457,16 +457,6 @@ public interface DeferredFence { @Setter private boolean fenceRunByCaller = false; - /** - * Notified by the backend the instant before the deferred fence starts blocking, so the apply - * orchestrator can mark a {@code FENCING} progress phase that is observable while the (long) - * wait is in flight. Null on paths that don't observe progress (tick, peer, non-BanyanDB). - */ - @FunctionalInterface - public interface FencePhaseListener { - void onFenceStart(); - } - /** * Outcome of a deferred fence, recorded by the backend so the orchestrator can mark * {@code APPLIED} vs {@code DEGRADED}-with-laggards after {@link #runDeferredFence()} returns. @@ -485,10 +475,6 @@ public FenceOutcome(final boolean applied, final List laggardNodeIds) { } } - @Getter - @Setter - private volatile FencePhaseListener fencePhaseListener; - /** Recorded by the backend during {@link #runDeferredFence()}; read by the orchestrator after. * Null when no deferred fence ran (no DDL) or the backend records no outcome. */ @Getter @@ -510,12 +496,14 @@ public void setDeferredFence(final DeferredFence fence) { * applies, or non-BanyanDB backends). * *
<p>
One-shot. A single reconciler tick reuses ONE opt across every rule - * file ({@code RuleSync#runOnce}), calling this once per file. After the fence runs, the - * closure is cleared and {@link #maxModRevision} is reset so the next file neither re-runs - * this file's stale fence (when that file performed no DDL) nor waits on this file's - * revision — each file fences on its own DDL only. The reset happens in a {@code finally} - * so a fence transport failure still isolates the next file. The closure reads - * {@link #getMaxModRevision()} during {@code await()}, so it is reset only after. + * file ({@code RuleSync#runOnce}), calling this once per file. The closure + accumulated + * {@link #maxModRevision} are always reset (even when this file performed no + * DDL and registered no closure), so the next file neither re-runs this file's stale fence nor + * inherits this file's revision — each file fences on its own DDL only. (Drop revisions that a + * later commit-tail records on a shared opt are inline-fenced at drop time and benign here: the + * next file's own create revision is monotonically higher, so it dominates the fence.) The reset + * is in a {@code finally} so a fence transport failure still isolates the next file; the closure + * reads {@link #getMaxModRevision()} during {@code await()}, so it is reset only after. * *

{@link #fenceOutcome} is cleared before the fence runs (so a shared tick opt * starts each file clean) and the backend sets it during the run; it is intentionally @@ -524,13 +512,12 @@ public void setDeferredFence(final DeferredFence fence) { */ public void runDeferredFence() throws StorageException { final DeferredFence fence = this.deferredFence; - if (fence == null) { - return; - } this.deferredFence = null; this.fenceOutcome = null; try { - fence.await(); + if (fence != null) { + fence.await(); + } } finally { maxModRevision.set(0L); } diff --git a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java index 669dc1c095ab..a6fd54be1474 100644 --- a/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java +++ b/oap-server/server-core/src/test/java/org/apache/skywalking/oap/server/core/storage/model/StorageManipulationOptTest.java @@ -114,6 +114,20 @@ void runDeferredFenceResetsRevisionAfterAwait() throws StorageException { "the revision must reset after the fence so a later file is not over-fenced"); } + @Test + void runDeferredFenceResetsRevisionEvenWithNoClosureRegistered() { + // A no-DDL file on a shared tick opt registers no closure, but a prior file (or its + // commit-tail drops) may have left a revision on the opt. runDeferredFence must still clear + // it so the next file is not over-fenced on a stale revision. 
+ final StorageManipulationOpt opt = StorageManipulationOpt.withSchemaChangeDeferredFence(); + opt.recordModRevision(77L); + + assertDoesNotThrow(opt::runDeferredFence); + + assertEquals(StorageManipulationOpt.DEFAULT_MOD_REVISION, opt.getMaxModRevision(), + "a no-closure flush must still reset the accumulated revision (shared-tick isolation)"); + } + @Test void runDeferredFenceResetsRevisionEvenWhenFenceThrows() { // A barrier transport failure on one file must not leave a stale revision that the next diff --git a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java index f613fe185809..4318806cab14 100644 --- a/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java +++ b/oap-server/server-storage-plugin/storage-banyandb-plugin/src/main/java/org/apache/skywalking/oap/server/storage/plugin/banyandb/BanyanDBIndexInstaller.java @@ -313,12 +313,11 @@ private void doFenceOnRevision(final BanyanDBClient client, final StorageManipul /** * The deferred (batched) fence the runtime-rule apply runs once after all DDL. Unlike the inline - * {@link #doFenceOnRevisionValue}, this (1) emits the apply's {@code FENCING} progress phase the - * instant before it blocks (via the opt's {@link StorageManipulationOpt.FencePhaseListener}), - * (2) honors the opt's configured timeout ({@link StorageManipulationOpt#getFenceTimeoutMs()}, - * the runtime-rule 3-min budget) instead of the short inline {@link #FENCE_TIMEOUT}, and - * (3) records the outcome (applied + laggard node ids) on the opt so the orchestrator can mark - * {@code APPLIED} vs {@code DEGRADED}. A laggard timeout is still a non-fatal WARN. 
+ * {@link #doFenceOnRevisionValue}, this (1) honors the opt's configured timeout + * ({@link StorageManipulationOpt#getFenceTimeoutMs()}, the runtime-rule 3-min budget) instead of + * the short inline {@link #FENCE_TIMEOUT}, and (2) records the outcome (applied + laggard node + * ids) on the opt so the orchestrator can mark {@code APPLIED} vs {@code DEGRADED} and gate the + * dispatch resume on this fence. A laggard timeout is still a non-fatal WARN. */ private void doDeferredFence(final BanyanDBClient client, final StorageManipulationOpt opt, final String context) throws BanyanDBException { @@ -326,10 +325,6 @@ private void doDeferredFence(final BanyanDBClient client, final StorageManipulat if (rev <= 0L) { return; } - final StorageManipulationOpt.FencePhaseListener listener = opt.getFencePhaseListener(); - if (listener != null) { - listener.onFenceStart(); - } final Duration timeout = opt.getFenceTimeoutMs() > 0L ? Duration.ofMillis(opt.getFenceTimeoutMs()) : FENCE_TIMEOUT; From 5b1a5a030bac7b5d3ccecd6d281049ae7fd7af09 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Mon, 15 Jun 2026 22:05:03 +0800 Subject: [PATCH 15/19] Runtime-rule apply: fence before persist so durable implies fenced (crash-safe) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Third review round. Root cause of the crash-recovery holes: the apply persisted the rule row BEFORE the background fence, so "durable" did not imply "schema propagated". A main crash after persist but before the fence left a durable row that peers then converged to via the periodic scan — which uses withoutSchemaChange (no fence) — so they resumed dispatch against a schema no node had confirmed propagated, and un-propagated writes are silently dropped (CLAUDE.md tip #16). Fix: reorder to suspend -> DDL -> fence -> persist -> commit -> resume. persistRuleSync moves into the background tail (fenceThenPersistThenResume), AFTER the fence. 
Now any durable rule row is guaranteed fence-confirmed, so peer / crash-recovery convergence is always safe; a crash before persist leaves no row (the orphaned measure from the DDL is inert) and the cluster stays on the prior, already-fenced content. The HTTP call returns applyId at FENCING = accepted, not yet durable; the operator polls for the rest. Also fixes the commit-tail bug (#3): finalizeCommit throwing left drained=false and fell into broadcastResume (peers ran the OLD bundle) — but the row is durable, so peers must converge to it. Now `commitFailure != null || drained` -> broadcastNotifyApplied; only a genuine no-change (force re-apply) does Resume. Persist failure -> FAILED status (rolled back); fence laggard / commit-tail -> DEGRADED; all polled (the POST already returned at FENCING). Docs synced to fence-then-persist (#5): the concept doc runtime-rule-hot-update.md (structural-path + schema-fence + failure-handling sections), the admin-API doc, the changelog, application.yml, and the ApplyPhase javadoc. Verified: runtime-rule 149/149, whole-project checkstyle + compile + javadoc + license clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/en/changes/changes.md | 4 +- .../runtime-rule-hot-update.md | 159 +++++++++++------- .../setup/backend/admin-api/runtime-rule.md | 2 +- .../runtimerule/rest/RuntimeRuleService.java | 150 ++++++++--------- .../runtimerule/status/ApplyPhase.java | 22 ++- .../src/main/resources/application.yml | 8 +- 6 files changed, 187 insertions(+), 158 deletions(-) diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 4b4735bfeee6..0de690baf9c7 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -242,8 +242,8 @@ admin-host only" entry above for the public REST retirement. #### OAP Server -* Batch the BanyanDB schema fence per runtime-rule apply. 
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background after the durable persist but **before** dispatch resumes — it gates the local commit + peer resume so writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). -* Add a runtime-rule apply-status query. 
The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates the dispatch resume**: because an un-propagated write is silently dropped at the data node, the apply persists the rule row, then waits for cluster-wide schema propagation BEFORE finalizing the local commit and resuming/notifying peers. That wait runs in the background so it never blocks the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (persist-is-commit). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. +* Batch the BanyanDB schema fence per runtime-rule apply. 
A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). +* Add a runtime-rule apply-status query. 
The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. 
When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. * Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. 
* Fix BanyanDB peer nodes permanently flooding ` is not registered`, and a follow-on case where a peer kept translating writes with a stale schema shape after a runtime-rule reshape, when a node held a live persist worker but its local `MetadataRegistry` schema cache was missing or stale for that model — a `withoutSchemaChange` peer apply or a runtime-rule bundled fall-over rebuilt the dispatch worker but skipped the local-cache populate, and the registry was insert-only (never evicting) while the 30s reconcile only covers runtime-rule rows, so nothing re-derived it. The peer / local-cache-only install path now (re)derives and overwrites the local schema entry from the declared model with zero server RPC — honoring the `inspectBackend=false` contract so the cache can never lag the worker, including across a reshape — and a model removal now evicts its cache entry so a dropped or reshaped model leaves no stale translation behind; the persist DAOs keep an RPC-free re-derivation as a read-side backstop, and the no-init defer poll loop retries a transient backend probe error instead of escaping and crash-looping the pod. * Fix a v2 MAL `CounterWindow` key collision: `rate()` / `increase()` / `irate()` keyed each counter's sliding window on the rule's output metric name (the same for every input metric of a rule) instead of the counter's own name, so two or more counters that reduce to the same label set after `.sum(...)` shared one window and computed rates against each other's values — fabricating non-zero rates from unchanged counters (e.g. the BanyanDB liaison gRPC error rate read a steady non-zero off three frozen error counters). The window is now keyed by the counter's own metric name. 
diff --git a/docs/en/concepts-and-designs/runtime-rule-hot-update.md b/docs/en/concepts-and-designs/runtime-rule-hot-update.md index 4abf49f7dd1f..48ad38202433 100644 --- a/docs/en/concepts-and-designs/runtime-rule-hot-update.md +++ b/docs/en/concepts-and-designs/runtime-rule-hot-update.md @@ -220,29 +220,51 @@ Two paths, picked from the diff between the new content and the current entry: the same fast path. No cluster pause, no backend schema change, no alarm reset. - **Structural path** — anything that moves metric identity (metric set added or removed, scope or downsampling function changed, LAL `(layer, ruleName)` set - changed). The main runs: - 1. **Pause the cluster** — broadcast a pause to every peer over the cluster bus. + changed). The main runs, in order: + 1. **Pause the cluster** — self-suspend first (so a concurrent peer apply is + detected, not merged), then broadcast a pause to every peer over the cluster bus. Peers stop dispatching samples for the affected metrics and drain in-flight batches. Unreachable peers are logged and skipped; they self-recover via the periodic scan. - 2. **Update backend storage on this node**, including the schema-visibility fence on - BanyanDB (see below). - 3. **Persist the entry** — this is the cluster-wide commit point. - 4. **Resume the cluster** — broadcast a resume so peers re-open dispatch. Peers - that missed the resume self-heal within 60 s. - 5. **Reset alarm windows** for any metric whose identity changed, so accumulated - state doesn't carry across the change. - -If any step before persist fails, the entry is **not** advanced, the local node -rolls back to the previous rule state, peers self-heal back to the old content within -60 s, and the operator gets `HTTP 500` with `applyStatus` indicating the failure. - -If persist itself fails, the same rollback happens — the durable state never moved, -so neither does the cluster. 
- -If persist succeeds but the local finishing step fails (a rare path), the operator -gets `HTTP 500 commit_deferred`: storage holds the new content (peers will converge -on it), but this node hasn't fully applied it yet and will retry on its next scan. + 2. **Fire the backend DDL on this node** — create / update the BanyanDB measures + for the metrics this rule produces. This returns the schema revision but does + **not** yet make the rule the cluster's truth. + + At this point the HTTP call returns `applyId` with phase `FENCING` — **accepted, not + yet durable** — and the remaining steps run in the background so a slow data node + never blocks the operator. The operator polls `GET /runtime/rule/status` (see the + admin-API doc) to watch the rest: + 3. **Schema-visibility fence** — wait (up to a configurable budget, default 3 min) + for every BanyanDB data node to apply the new measure schema (see below). + 4. **Persist the entry** — this is the cluster-wide commit point, and it happens + **after** the fence, so a durable entry always implies "schema propagated + cluster-wide". + 5. **Finalize + resume** — finalize the local commit (swap the bundle, reset alarm + windows for any metric whose identity changed, unpark local dispatch) and + resume / notify peers so they converge to the new entry. Peers that missed the + notify self-heal within 60 s. + +Because persist (step 4) is gated behind the fence (step 3), the ordering is +**pause → DDL → fence → persist → commit → resume**, and crash recovery is safe at +every point: a crash before persist leaves no entry — peers and the recovered main +stay on the old content (the orphaned measure from the DDL is inert) — and any entry +that *is* durable was fence-confirmed before it was written, so a peer converging to +it via the periodic scan never resumes dispatch against an unpropagated schema. 
+ +Outcomes (all observed by polling `/runtime/rule/status`, not by blocking the HTTP +call, which already returned at `FENCING`): + +- **Pre-DDL error** (compile / verify) — phase `FAILED`; the entry was never advanced + and the cluster keeps serving the prior rule. The HTTP call returns the error + synchronously (it happens before the `FENCING` return). +- **Persist fails** — phase `FAILED`; the local node rolls back to the prior rule and + resumes peers. The durable state never moved, so neither does the cluster. +- **Fence does not confirm within the budget** — phase `DEGRADED` with the lagging + node ids; dispatch resumes anyway (a stuck node must not park the metric forever), + and the schema converges through BanyanDB's own watcher. +- **Local finalize fails after persist** — phase `DEGRADED`: storage holds the new + content and peers converge to it, but this node will retry the local finish on its + next scan. ### Lifecycle @@ -367,36 +389,43 @@ gate — without it the system still converges. ## Schema-visibility fence (BanyanDB) -BanyanDB's distributed mode propagates registry writes from the meta-server to -every data node asynchronously. A naive flow — register the schema, immediately -resume dispatch — has a race: the registry holds the new measure but a data node -may not yet have caught up, so the first sample after the apply lands on an -unprepared node. - -For runtime hot-updates this would mean the operator's `200 OK` could come back -before the cluster's data boundary actually moved. The runtime-rule install path -narrows the gap on a best-effort basis: every BanyanDB schema write returns an -etcd `mod_revision`, and the installer waits — synchronously, before resuming -dispatch, up to a bounded timeout (default 2s) — for every BanyanDB data node -to catch up to the highest revision the apply produced. 
- -The visible contract for operators is: - -- Between operator request and `200 OK`, all sample dispatch for the affected - metric is paused on every node. In-flight samples are dropped (this is by - design: a structural change means the schema is moving and in-flight data has - no valid landing). -- When all data nodes confirm within the bounded window, the `200 OK` marks the - moment the cluster's data boundary moves: samples written at or after the `200` - use the new shape; samples written before use the old shape. -- When one or more nodes haven't applied within the window, OAP logs a warning - naming the laggards and resumes dispatch anyway. The schema is already - authoritative in etcd, so late nodes apply it asynchronously through their - watcher — until they do, samples landing on those specific nodes for that - metric may be rejected by the local data node briefly. This trades strict - cluster-wide cutover for not wedging an apply behind a single slow node; - operators who need strict behavior should fix the slow node, not loosen the - timeout. +BanyanDB's distributed mode propagates schema writes from the meta-server to every +data node asynchronously. A naive flow — register the schema, immediately resume +dispatch — has a race: the registry holds the new measure but a data node may not +yet have caught up, so a sample written before that node applies the schema is +**silently dropped** at the data node (not retried). The fence is therefore a +write-safety barrier, not just an observability check: it must gate dispatch resume. + +Every BanyanDB schema write returns a `mod_revision`. After firing the DDL the main +waits — on a configurable budget (`deferredFenceTimeoutSeconds`, default 3 min) — for +every data node to catch up to the highest revision the apply produced, and only then +persists the rule entry, finalizes the local commit, and resumes dispatch. 
The wait +runs in the **background**: the HTTP call has already returned `applyId` at phase +`FENCING`, so a slow data node never blocks the operator, yet nothing durable or +visible happens until the schema is confirmed. + +The contract for operators is: + +- The HTTP call returns at `FENCING` (accepted, not yet durable). Dispatch for the + affected metric stays paused on every node from the pause broadcast through the + fence; this is a clean collection gap, not dropped writes (no node is writing the + new shape yet). In-flight samples drained at pause are dropped by design — a + structural change moves the schema and in-flight data has no valid landing. +- When all data nodes confirm within the budget, the entry is persisted and dispatch + resumes — the cluster's data boundary moves at that moment. The status advances + `FENCING → ROLLING_OUT → APPLIED`. +- When one or more nodes haven't applied within the budget, OAP logs a warning naming + the laggards, persists + resumes **anyway**, and reports `DEGRADED` with the laggard + ids. The schema is already authoritative, so late nodes apply it through their own + watcher; until they do, samples landing on those specific nodes may be dropped + briefly. This trades strict cluster-wide cutover for not wedging an apply behind a + single slow node; operators who need strict behavior fix the slow node. + +Because persist is gated behind the fence, the fence's guarantee survives a main +crash: a durable entry is always one whose schema was confirmed propagated, so a peer +converging to it (via the periodic scan, which performs no backend RPC of its own) +never resumes against an unpropagated schema. A crash before persist simply leaves no +entry, and the cluster stays on the prior, already-fenced content. Elasticsearch and JDBC don't have multi-node schema fan-out; their storage change is visible when the call returns, so the fence is a no-op for those backends. 
@@ -413,23 +442,29 @@ recovery path is the same path operators already use. storage_change_requires_explicit_approval`. No pause broadcast, no persist, no side effects. Re-push with `?allowStorageChange=true` if the change is intentional. -- **Backend storage verification failed mid-apply** — `HTTP 500 ddl_verify_failed`. - Newly added metrics are rolled back so the backend doesn't accumulate orphans; the prior +- **Backend storage verification failed mid-apply** — `HTTP 400 ddl_verify_failed` + (this happens during the synchronous DDL step, before the `FENCING` return). Newly + added metrics are rolled back so the backend doesn't accumulate orphans; the prior rule keeps serving every metric that wasn't being added or reshaped. - `lastApplyError` on `/runtime/rule/list` carries the failure message. -- **Persist failed** — `HTTP 500 persist_failed`. Local state is rolled back to - the pre-apply rule; peers self-heal within 60 s. The cluster never advanced - past the failure. -- **Persist succeeded but the local finishing step failed** — `HTTP 500 commit_deferred`. - Storage is authoritative (peers will converge), but this node will retry on - its next periodic scan. - **Cluster routing fail-safe** — `HTTP 421 cluster_view_split` when a forwarded request reaches a node that also doesn't believe it's the main. Wait for the peer-list to settle (seconds) and retry. -`GET /runtime/rule/list` is the canonical operator view of cluster state: persisted -status, per-node `localState`, and `lastApplyError` for any rule whose most recent -apply failed. There is no separate alert channel — `/list` plus the OAP log are +The errors above are returned synchronously, before the call returns `applyId` at +`FENCING`. 
The outcomes of the background tail (fence → persist → commit → resume) are +observed by polling `GET /runtime/rule/status?applyId=…`, not by an HTTP code: + +- **Fence didn't confirm within the budget** — phase `DEGRADED` with the lagging node + ids; dispatch resumed anyway, schema converges via the data nodes' watcher. +- **Persist failed** — phase `FAILED`; local state rolled back to the pre-apply rule, + peers self-heal within 60 s. The cluster never advanced (no durable entry). +- **Local finishing step failed after persist** — phase `DEGRADED`; storage is + authoritative (peers converge), this node retries on its next periodic scan. + +`GET /runtime/rule/list` is the canonical operator view of cluster state (persisted +status, per-node `localState`, `lastApplyError`); `GET /runtime/rule/status` reports a +specific apply's live phase / laggards (and degrades to the durable entry when the +apply-id is gone). There is no separate alert channel — those two plus the OAP log are the entire diagnostic surface. ## Dynamic layers diff --git a/docs/en/setup/backend/admin-api/runtime-rule.md b/docs/en/setup/backend/admin-api/runtime-rule.md index 6ccd66722192..b0a8850ded8e 100644 --- a/docs/en/setup/backend/admin-api/runtime-rule.md +++ b/docs/en/setup/backend/admin-api/runtime-rule.md @@ -111,7 +111,7 @@ server returns `400 compile_failed`. | GET | `/runtime/rule/bundled?catalog=&withContent=false` | Returns bundled rules for one catalog as JSON. `withContent` defaults to true; `false` omits each YAML body. Each item includes whether an operator override exists. | | GET | `/runtime/rule/list[?catalog=]` | Returns a single JSON envelope `{generatedAt, loaderStats, rules}` merged from stored rules and this node's local state. Each row carries `loaderKind`, `loaderName`, `bundled`, and `bundledContentHash` so a UI can render override badges without a second roundtrip. Optional `catalog=` narrows the output; unknown values return `400 invalid_catalog`. 
| | GET | `/runtime/rule/dump[/]` | Downloads a tar.gz of stored runtime rules plus `manifest.yaml`. The server has no bulk import endpoint; the CLI restore command replays individual `addOrUpdate` and `inactivate` calls. | -| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `FENCING`, `ROLLING_OUT`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. The `/addOrUpdate` response returns at `FENCING` (rule row durably persisted) with the `applyId`; dispatch for that rule stays paused while the background fence confirms cluster-wide schema propagation — because an un-propagated write is silently dropped — and only then does it advance to `ROLLING_OUT` (finalize commit, resume + notify peers) and `APPLIED`. On a genuine laggard it advances to `DEGRADED` with `fenceLaggards` (the data-node ids that did not confirm within `deferredFenceTimeoutSeconds`) and resumes dispatch anyway. When the live status is gone (or the main is unreachable) it degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. | +| GET | `/runtime/rule/status?catalog=&name=[&applyId=][&contentHash=]` | Reports the progress of a structural `/addOrUpdate` apply, served by the cluster main. Query by the `applyId` returned in the `structural_applied` response, or by `catalog`+`name` (+ optional `contentHash`) once the handle is gone (page refresh / main restart). 
Always `200` with a JSON status `{found, phase, applyId, contentHash, failureReason?, fenceLaggards?, startedAtMs, updatedAtMs, servedBy}`; `phase` is one of `PENDING`, `DDL`, `FENCING`, `ROLLING_OUT`, `APPLIED`, `DEGRADED`, `FAILED`, `UNKNOWN`. The `/addOrUpdate` response returns at `FENCING` with the `applyId` — **accepted, not yet durable**: the DDL has fired but the rule row is persisted only after the fence confirms. Dispatch for that rule stays paused while the background fence waits for cluster-wide schema propagation (an un-propagated write is silently dropped); once it confirms, the row is persisted (the durable commit point) and the apply advances to `ROLLING_OUT` (finalize commit, resume + notify peers) and `APPLIED`. A persist failure → `FAILED` (rolled back, nothing durable); a laggard → `DEGRADED` with `fenceLaggards` (data-node ids that did not confirm within `deferredFenceTimeoutSeconds`), resuming dispatch anyway. Because the row is persisted only after the fence, any durable row is propagation-confirmed — so when the live status is gone (or the main is unreachable) the query safely degrades to the durable rule row: a matching `ACTIVE` row reports `phase=APPLIED` with `derivedFrom=durable-dao`. 
| ### `/delete` storage semantics — per backend diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index e97245b7ddc6..03d4454941e4 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -1283,91 +1283,55 @@ private HttpResponse applyStructural(final String catalog, final String name, return serverError("apply_failed", catalog, name, err); } - // Apply succeeded + verified. Commit the row — the design's commit point. Retry a - // couple of times on transient failures before giving up; the per-backend - // RuntimeRuleManagementDAO.save can throw on a brief storage outage. A narrow retry - // here avoids turning a blip into a cluster-divergence event. - HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); - if (persistError != null) { - try { - Thread.sleep(100L); - } catch (final InterruptedException ie) { - Thread.currentThread().interrupt(); - } - persistError = persistRuleSync(catalog, name, content, updateTime); - } - if (persistError != null) { - // Persist still failing. The local node has registered added + shape-break - // metrics in MeterSystem (DDL fired, isExists verified) while the DB and peers - // remain on the old content. Discard drains the pending commit by removing only - // the added + shape-break metrics — it does NOT drop removedMetrics (the commit - // was stashed before that step, so those are still alive) and does NOT swap - // appliedMal/appliedContent (still on the pre-apply bundle). 
Net outcome: - // local node converges back to the pre-apply bundle exactly, no divergence from - // what the DB still says is current. - try { - dslManager.getCommitCoord().discardCommit(catalog, name); - } catch (final Throwable rt) { - log.error("runtime-rule CRITICAL: persist-failure discard itself failed for " - + "{}/{}; state is inconsistent and requires operator intervention", - catalog, name, rt); - } - // Peers are still SUSPENDED on our earlier broadcast. The DB didn't advance, - // so self-heal would eventually flip them back, but broadcasting Resume now - // cuts the dispatch gap from 60 s to a single RPC round-trip. - broadcastResume(catalog, name, "persist_failed"); - log.error("runtime-rule CRITICAL: STRUCTURAL persist FAILED after successful apply " - + "for {}/{} — discarded pending commit; local node re-aligned with old " - + "content. Operator action: re-push via /addOrUpdate once storage is healthy.", - catalog, name); - SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "persist failed after apply"); - return persistError; - } - - // Persist succeeded — the durable commit point is crossed. But writes must NOT resume until - // the new schema has propagated to every data node: an un-propagated write is silently - // dropped (CLAUDE.md tip #16). So we fence FIRST, then finalize the local commit (swap the - // bundle + unpark local dispatch) and resume/notify peers. That runs on the fence executor - // so the (up to 3-min) wait does not block this response: we return now at FENCING with the - // applyId; the operator polls GET /runtime/rule/status to watch FENCING → ROLLING_OUT → - // APPLIED (or DEGRADED + the laggard node list). Peers + local stay suspended — a clean - // collection pause, not dropped writes — until the fence confirms. + // Apply succeeded + verified — DDL fired, the schema fence is deferred to us, and the + // pending commit is stashed (deferCommit). 
The write-safe tail runs in the BACKGROUND so the + // (up to 3-min) fence wait doesn't block this response: fence → persist → commit → resume. + // Persist (the durable commit point) happens AFTER the fence, so "durable" implies "schema + // propagated cluster-wide": a main crash before persist leaves NO row — peers/crash-recovery + // safely stay on the old content (the orphaned measure from DDL is inert) — and any durable + // row is guaranteed fence-confirmed, so convergence never resumes dispatch against an + // unpropagated schema. The response returns now at FENCING with the applyId; the operator + // polls GET /runtime/rule/status for FENCING → ROLLING_OUT → APPLIED (or DEGRADED + laggards, + // or FAILED on persist failure). SchemaApplyCoordinator.INSTANCE.markFencing(applyId); boolean scheduled = true; try { - fenceExecutor.submit(() -> fenceThenResume(applyId, fenceOpt, catalog, name, content)); + fenceExecutor.submit( + () -> fenceThenPersistThenResume(applyId, fenceOpt, catalog, name, content, updateTime)); } catch (final Throwable t) { // Executor rejected (shutting down). Run inline (blocking) so the suspend bracket does - // not leak — write-safety + a non-leaked suspend win over a non-blocking response here. + // not leak and the apply still completes — write-safety wins over a non-blocking response. 
log.warn("runtime-rule could not schedule the background fence for {}/{}; running it " + "inline so dispatch is not left suspended", catalog, name, t); scheduled = false; } if (!scheduled) { - fenceThenResume(applyId, fenceOpt, catalog, name, content); + fenceThenPersistThenResume(applyId, fenceOpt, catalog, name, content, updateTime); } return okWithApplyId(HttpStatus.OK, "structural_applied", catalog, name, applyId, - "structural apply persisted; fencing cluster-wide schema propagation before resuming " - + "dispatch — poll /runtime/rule/status?applyId=" + applyId + describeDelta(delta)); + "structural apply accepted; fencing schema propagation, then persisting + resuming in " + + "the background — poll /runtime/rule/status?applyId=" + applyId + describeDelta(delta)); } /** - * Post-persist tail: wait for the new schema to propagate to every data node BEFORE any write - * resumes (an un-propagated write is silently dropped — CLAUDE.md tip #16), then finalize the - * local commit (drop removed metrics, swap the bundle, unpark local dispatch) and resume/notify - * peers. Runs on the fence executor (or inline if the executor is gone) so a long wait never - * blocks the HTTP response while still gating resume on the fence. + * Write-safe post-apply tail, run on the fence executor (or inline if it's gone): + * fence → persist → commit → resume. The fence waits for the new measure schema + * to reach every data node BEFORE anything durable or visible happens (an un-propagated write is + * silently dropped — CLAUDE.md tip #16); only then is the rule row persisted (the durable commit + * point), so a crash before persist leaves no row and peers stay safely on the old content, and + * any durable row is guaranteed fence-confirmed. After persist it finalizes the local commit + * (drop removed metrics, swap the bundle, unpark dispatch) and resumes/notifies peers. * - *
<p>
Drives the terminal status: {@link ApplyPhase#APPLIED} when the fence confirms and the - * commit lands; {@link ApplyPhase#DEGRADED} (with the laggard node ids) when the fence does not - * confirm within the budget — there we resume dispatch ANYWAY rather than park the metric forever - * on one stuck node, accepting that the laggard drops writes until it catches up — or when the - * local commit-tail throws (DB is durable; peers converge from it; this node retries on the next - * tick). + *
<p>
Terminal status: {@link ApplyPhase#FAILED} if persist fails (rolled back — nothing + * committed); {@link ApplyPhase#DEGRADED} if the fence didn't confirm within the budget (laggards + * listed; dispatch resumed anyway so a stuck node can't park the metric forever) or the local + * commit-tail threw (DB durable, peers converge from it); {@link ApplyPhase#APPLIED} otherwise. */ - private void fenceThenResume(final String applyId, final StorageManipulationOpt fenceOpt, - final String catalog, final String name, final String content) { - // 1. Fence the CREATE schema propagation before any write resumes. + private void fenceThenPersistThenResume(final String applyId, final StorageManipulationOpt fenceOpt, + final String catalog, final String name, + final String content, final long updateTime) { + // 1. Fence the new measure schema's propagation BEFORE persisting or resuming. A fence + // error/timeout is non-fatal (best-effort) — we proceed but mark DEGRADED below. StorageManipulationOpt.FenceOutcome fenceOutcome = null; Throwable fenceError = null; try { @@ -1375,35 +1339,59 @@ private void fenceThenResume(final String applyId, final StorageManipulationOpt fenceOutcome = fenceOpt.getFenceOutcome(); } catch (final Throwable t) { fenceError = t; - log.warn("runtime-rule schema fence for {}/{} errored; resuming dispatch anyway " - + "(commit is durable, peers converge from the DB row)", catalog, name, t); + log.warn("runtime-rule schema fence for {}/{} errored; proceeding anyway " + + "(BanyanDB still propagates the schema asynchronously)", catalog, name, t); + } + + // 2. Persist the rule row — THE durable commit point, AFTER the fence so durable ⟹ fenced. + // Narrow retry over a transient storage blip. On failure nothing is committed: roll back + // the local pending commit and resume peers to the old content (the DB never advanced). 
+ HttpResponse persistError = persistRuleSync(catalog, name, content, updateTime); + if (persistError != null) { + try { + Thread.sleep(100L); + } catch (final InterruptedException ie) { + Thread.currentThread().interrupt(); + } + persistError = persistRuleSync(catalog, name, content, updateTime); + } + if (persistError != null) { + try { + dslManager.getCommitCoord().discardCommit(catalog, name); + } catch (final Throwable rt) { + log.error("runtime-rule CRITICAL: persist-failure discard itself failed for {}/{}; " + + "state is inconsistent and requires operator intervention", catalog, name, rt); + } + broadcastResume(catalog, name, "persist_failed"); + log.error("runtime-rule CRITICAL: STRUCTURAL persist FAILED after apply + fence for {}/{} " + + "— discarded the pending commit; local node re-aligned with old content. Operator " + + "action: re-push via /addOrUpdate once storage is healthy.", catalog, name); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "persist failed after apply + fence"); + return; } - // 2. Schema is now visible (or we gave up after the budget) — safe to resume writes. + // 3. Durable — finalize the local commit (swap bundle, unpark dispatch) + resume/notify peers. SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); Throwable commitFailure = null; boolean drained = false; try { - // finalizeCommit drops removedMetrics, swaps the applied pointers, retires the old - // loader, fires alarm reset, advances the snapshot, and unparks local dispatch. 
drained = dslManager.getCommitCoord().finalizeCommit(catalog, name); } catch (final Throwable t) { commitFailure = t; - log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist + fence " - + "— DB is authoritative, peers converge from it; this node retries on the next tick.", + log.error("runtime-rule CRITICAL: finalize commit FAILED for {}/{} after persist — DB is " + + "authoritative, peers converge from it; this node retries on the next tick.", catalog, name, t); } - if (drained) { - // Real change committed — peers reconcile against the just-persisted DB row now instead - // of waiting up to one ~30s refresh tick. Best-effort; a lost notify self-heals on tick. + if (commitFailure != null || drained) { + // The durable row advanced (commit drained, OR the local commit-tail threw but the row IS + // persisted) — peers must reconcile against it NOW rather than wait one ~30s tick. broadcastNotifyApplied(catalog, name, ContentHash.sha256Hex(content)); } else { - // No commit drained (force re-apply on byte-identical content) — peers are still - // PEER-suspended from our earlier broadcast; un-suspend them within an RPC round-trip. + // Nothing changed (force re-apply on byte-identical content) — just un-suspend peers. broadcastResume(catalog, name, "structural_resume"); } - // 3. Terminal status. + // 4. Terminal status. 
if (commitFailure != null) { SchemaApplyCoordinator.INSTANCE.markDegraded(applyId, "commit-tail deferred: DB persisted, local backend may be stale until the next tick: " diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java index e0f0d170ed53..2c0c39c6c151 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/status/ApplyPhase.java @@ -29,16 +29,20 @@ * These are the phases the main observes from its apply orchestration: {@code PENDING} once the * apply is accepted, {@code DDL} while the compile/verify/schema-change call runs (a single opaque * step from the orchestrator's vantage — sub-steps such as validation are not separately - * observable, so they are not modelled), then — after the rule row is durably persisted — the HTTP - * response returns with the {@code applyId} at {@code FENCING} while the main waits (in the - * background, on a generous timeout) for every BanyanDB data node to apply the new schema revision. - * Dispatch stays suspended through {@code FENCING} because an un-propagated write is silently - * dropped. Only once the fence confirms does the apply reach {@code ROLLING_OUT} — finalize the - * local commit, unpark dispatch, resume/notify peers — and then {@code APPLIED}. The operator polls - * to watch {@code FENCING → ROLLING_OUT → APPLIED} (or {@code DEGRADED}). Two off-ramps: + * observable, so they are not modelled). 
Once DDL fires the HTTP response returns with the + * {@code applyId} at {@code FENCING} — the apply is accepted but NOT yet durable — while the main + * waits (in the background, on a generous timeout) for every BanyanDB data node to apply the new + * schema revision. The rule row is persisted (the durable commit point) only AFTER the fence + * confirms, so "durable" implies "schema propagated" and crash recovery never resumes against an + * unpropagated schema. Dispatch stays suspended through {@code FENCING} because an un-propagated + * write is silently dropped. Once the fence confirms and the row is persisted, the apply reaches + * {@code ROLLING_OUT} — finalize the local commit, unpark dispatch, resume/notify peers — and then + * {@code APPLIED}. The operator polls to watch {@code FENCING → ROLLING_OUT → APPLIED} (or + * {@code DEGRADED}/{@code FAILED}). Two off-ramps: *

    - *
  • {@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). The change - * was rolled back; nothing was committed.
  • + *
  • {@link #FAILED} — a pre-commit error (compile / verify / DDL RPC / persist). Nothing was + * committed (a crash before persist likewise leaves no durable row); the change was rolled + * back and peers stay on the prior content.
  • *
  • {@link #DEGRADED} — committed and durable, but the fence did not confirm cluster-wide * propagation within the timeout (one or more data nodes lagging — exposed as * {@code fenceLaggards}; dispatch is resumed anyway so a stuck node doesn't park the metric diff --git a/oap-server/server-starter/src/main/resources/application.yml b/oap-server/server-starter/src/main/resources/application.yml index cc85d99bf34b..e02938cc8e60 100644 --- a/oap-server/server-starter/src/main/resources/application.yml +++ b/oap-server/server-starter/src/main/resources/application.yml @@ -719,9 +719,11 @@ receiver-runtime-rule: # SUSPENDED state self-heal threshold (seconds). selfHealThresholdSeconds: ${SW_RECEIVER_RUNTIME_RULE_SELF_HEAL_THRESHOLD_SECONDS:60} # Timeout (seconds) for the runtime-rule deferred BanyanDB schema fence on the operator REST - # apply: the background wait for every data node to apply a new measure's schema revision. - # The apply returns immediately at rolling-out; poll GET /runtime/rule/status for fencing -> - # applied/degraded. Inline/static/delete fences keep their short 2s constant. + # apply: the background wait for every data node to apply a new measure's schema revision. The + # apply returns immediately at FENCING (accepted, not yet durable); the rule row is persisted + # only after the fence confirms, then dispatch resumes. Poll GET /runtime/rule/status for + # fencing -> rolling-out -> applied (or degraded/failed). Inline/static/delete fences keep the + # short 2s constant. deferredFenceTimeoutSeconds: ${SW_RECEIVER_RUNTIME_RULE_DEFERRED_FENCE_TIMEOUT_SECONDS:180} # DSL Debug API — sampling debugger for MAL / LAL / OAL rules. 
Read-only OAL

From 6982709cf3e446a48b7f317fec67b3a76f43c961 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Mon, 15 Jun 2026 23:00:19 +0800
Subject: [PATCH 16/19] docs(skill): add post-merge branch-cleanup steps to gh-pull-request skill

Documents the after-merge sync + feature-branch deletion flow (prune, ff-only master, verify the squash landed, then -D), with the note that `git branch -d` reporting "not fully merged" is expected for SkyWalking's squash-merges.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude/skills/gh-pull-request/SKILL.md | 32 +++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/.claude/skills/gh-pull-request/SKILL.md b/.claude/skills/gh-pull-request/SKILL.md
index 657cd50c42e3..f26e76ad202d 100644
--- a/.claude/skills/gh-pull-request/SKILL.md
+++ b/.claude/skills/gh-pull-request/SKILL.md
@@ -150,3 +150,35 @@ EOF
 - Add `copilot` as a reviewer: `gh pr edit --add-reviewer copilot`
 - Do NOT add AI assistant as co-author. Code responsibility is on the committer's hands.
 - Return the PR URL when done.
+
+## After the PR is merged
+
+Once the PR is merged, sync the default branch and clean up the feature branch:
+
+```bash
+# 1. Prune stale remote refs. GitHub auto-deletes the PR's branch on merge, so
+#    the remote feature branch is usually already gone; --prune removes the
+#    dangling local tracking ref.
+git fetch origin --prune
+
+# 2. Switch back to the default branch and fast-forward it to include the merge.
+git checkout master
+git pull --ff-only origin master
+
+# 3. Confirm the change actually landed in master before deleting anything —
+#    `git log --oneline -1` should show the merge/squash commit with the PR
+#    number, or grep for a symbol the PR introduced.
+git log --oneline -1
+
+# 4. Delete the local feature branch. SkyWalking SQUASH-merges PRs, so the
+#    feature branch's commit is NOT an ancestor of master (master gets a new
+#    squash commit instead). `git branch -d` therefore reports "not fully
+#    merged" — that is expected, not an error. After confirming the content is
+#    in master (step 3), force-delete:
+git branch -d 2>/dev/null || git branch -D
+```
+
+Notes:
+- `git branch -d` failing with "not fully merged" on a squash-merged PR is normal — the squash commit has a different SHA than the feature commit. Verify via step 3, then `-D`.
+- If the remote branch was not auto-deleted (some repo settings), remove it explicitly: `git push origin --delete `.
+- Do NOT skip step 3. Force-deleting a local branch whose work didn't actually merge loses it.

From a4c26969ba0cfbd79e4e6db64da2f56cd9aa30e2 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Tue, 16 Jun 2026 09:05:47 +0800
Subject: [PATCH 17/19] Track runtime-rule revert-to-bundled as an async apply (applyId + /status)

/delete?mode=revertToBundled now begins a SchemaApplyCoordinator apply, returns 200 reverted_to_bundled + applyId immediately, and runs the revert pipeline + row delete on the fence executor, mapping the orchestrator outcome to a terminal phase. Mirrors the structural /addOrUpdate async model so a UI can poll /runtime/rule/status for the same progress. Precondition rejections (inactivate-first / no_bundled_twin / requires_revert_to_bundled) stay synchronous; revert-pipeline failures surface as /status FAILED. Revert never broadcasts Suspend, so the background tail needs no Resume.
--- .../runtimerule/rest/RuntimeRuleService.java | 133 ++++++++++++------ 1 file changed, 88 insertions(+), 45 deletions(-) diff --git a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java index 03d4454941e4..fb5ad4fb9af9 100644 --- a/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java +++ b/oap-server/server-admin/runtime-rule/src/main/java/org/apache/skywalking/oap/server/receiver/runtimerule/rest/RuntimeRuleService.java @@ -1409,6 +1409,65 @@ private void fenceThenPersistThenResume(final String applyId, final StorageManip } } + /** + * Background tail for {@code /delete?mode=revertToBundled}, run on the fence executor (or + * inline if it's gone). Reinstalls the bundled rule via the apply pipeline — which runs the + * deferred schema fence — then removes the runtime row, and maps the orchestrator outcome to + * the apply's terminal phase so an operator polling {@code GET /runtime/rule/status?applyId} + * sees {@code APPLIED}, or {@code FAILED} with the reason. Unlike a structural /addOrUpdate + * there is no Suspend bracket to release here: revert never broadcast a pause (peers converge + * on the persisted row via the periodic scan), so a failure needs no Resume. 
+ */ + private void revertToBundledTracked(final String applyId, final String catalog, + final String name, final String priorContent, + final RuntimeRuleManagementDAO dao) { + final DSLRuntimeDelete.Result revert; + try { + revert = dslManager.getDslRuntimeDelete().revertToBundled(catalog, name, priorContent); + } catch (final Throwable t) { + log.error("runtime-rule /delete: revertToBundled threw for {}/{}", catalog, name, t); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "revert threw: " + t.getMessage()); + return; + } + switch (revert.status) { + case REFUSED_CONFLICT: + log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, "revert refused: " + revert.error); + return; + case PRECONDITION_FAILED: + log.error("runtime-rule /delete: revertToBundled precondition failed for {}/{}: {}", + catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "revert precondition failed: " + revert.error); + return; + case BUNDLED_APPLY_FAILED: + log.error("runtime-rule /delete: bundled apply failed for {}/{}: {}", + catalog, name, revert.error); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "bundled apply failed (typically a storage-backend DDL/verify issue); the " + + "orchestrator unwound the runtime install so local state matches the " + + "persisted INACTIVE row — retry once storage recovers. Cause: " + revert.error); + return; + case REVERTED: + default: + break; + } + // Bundled is durable cluster-wide (the apply's fence confirmed) — finalize by removing + // the runtime row. A delete failure here leaves bundled applied but the INACTIVE row + // lingering; the reconcile retries it, so report FAILED with that context. 
+ SchemaApplyCoordinator.INSTANCE.transition(applyId, ApplyPhase.ROLLING_OUT); + try { + dao.delete(catalog, name); + } catch (final IOException e) { + log.error("failed to delete runtime rule {}/{} after revert", catalog, name, e); + SchemaApplyCoordinator.INSTANCE.markFailed(applyId, + "bundled was reinstalled but removing the runtime row failed; it will be retried " + + "on the next reconcile: " + e.getMessage()); + return; + } + SchemaApplyCoordinator.INSTANCE.markApplied(applyId); + } + /** * Write the row through {@link RuntimeRuleManagementDAO#save} so a DAO failure is surfaced * to the caller instead of silently swallowed. The earlier ManagementStreamProcessor path @@ -1950,55 +2009,39 @@ private HttpResponse doDeleteLocked(final String catalog, final String name, } if (mode == DeleteMode.REVERT_TO_BUNDLED) { - // Bundled-revert path is the schema-change path: bundled may have a different - // shape than runtime. The orchestrator runs the unified pipeline: - // (1) installRuntime to put prior runtime claims back locally, - // (2) apply(bundled, STRUCTURAL, BUNDLED, withSchemaChange) — engine.commit - // drops runtime-only metrics through the standard delta path, - // (3) reset rules-map state to boot-seeded so gone-keys reconcile leaves - // it alone after dao.delete. - // dao.delete only runs after revertToBundled returns REVERTED — a precondition - // or compile failure aborts the row deletion so the operator can retry. - final DSLRuntimeDelete.Result revert; + // Bundled-revert is the schema-change path: bundled may have a different shape + // than runtime, so — like a structural /addOrUpdate — it is tracked and runs + // ASYNCHRONOUSLY. 
The unified pipeline (installRuntime → apply(bundled, STRUCTURAL, + // withSchemaChange, deferred fence) → reset state → dao.delete) plus its schema + // fence (up to deferredFenceTimeoutMs) run on the background executor, and the + // response returns immediately with an applyId the operator polls via + // GET /runtime/rule/status. The precondition rejections above (inactivate-first, + // no_bundled_twin, requires_revert_to_bundled) are still reported synchronously; + // the revert pipeline's own outcomes (REFUSED_CONFLICT / PRECONDITION_FAILED / + // BUNDLED_APPLY_FAILED / row-delete failure) surface as the apply's terminal phase + // (FAILED, with the reason) on /status rather than an HTTP error. + final String bundled = + StaticRuleRegistry.active().find(catalog, name).orElse(prior.getContent()); + final String applyId = SchemaApplyCoordinator.INSTANCE.begin( + catalog, name, ContentHash.sha256Hex(bundled)); + SchemaApplyCoordinator.INSTANCE.markFencing(applyId); + final String priorContent = prior.getContent(); + boolean scheduled = true; try { - revert = dslManager.getDslRuntimeDelete() - .revertToBundled(catalog, name, prior.getContent()); + fenceExecutor.submit(() -> revertToBundledTracked(applyId, catalog, name, priorContent, dao)); } catch (final Throwable t) { - log.error("runtime-rule /delete: revertToBundled threw for {}/{}", catalog, name, t); - return serverError("revert_to_bundled_failed", catalog, name, t.getMessage()); + // Executor rejected (shutting down) — run inline so the apply still completes. 
+ log.warn("runtime-rule could not schedule the background revert for {}/{}; " + + "running it inline", catalog, name, t); + scheduled = false; } - switch (revert.status) { - case REFUSED_CONFLICT: - log.warn("runtime-rule /delete refused for {}/{}: {}", catalog, name, revert.error); - return HttpResponse.of(HttpStatus.CONFLICT, MediaType.JSON_UTF_8, - jsonBody("delete_refused", catalog, name, revert.error)); - case PRECONDITION_FAILED: - log.error("runtime-rule /delete: revertToBundled precondition failed for {}/{}: {}", - catalog, name, revert.error); - return serverError("revert_to_bundled_precondition_failed", catalog, name, revert.error); - case BUNDLED_APPLY_FAILED: - log.error("runtime-rule /delete: bundled apply failed for {}/{}: {}", - catalog, name, revert.error); - return serverError("revert_to_bundled_failed", catalog, name, - "bundled apply failed (typically a storage-backend DDL/verify " - + "issue — BanyanDB unreachable, shape rejection, or schema-" - + "barrier timeout). The orchestrator unwound the step-1 " - + "runtime install so local state matches the persisted " - + "INACTIVE row. Retry once storage recovers. 
Cause: " - + revert.error); - case REVERTED: - default: - break; - } - try { - dao.delete(catalog, name); - } catch (final IOException e) { - log.error("failed to delete runtime rule {}/{}", catalog, name, e); - return serverError("delete_failed", catalog, name, e.getMessage()); + if (!scheduled) { + revertToBundledTracked(applyId, catalog, name, priorContent, dao); } - return ok(HttpStatus.OK, "reverted_to_bundled", catalog, name, - "runtime row removed; bundled rule installed via apply pipeline (schema " - + "change handled by the standard delta path); peers converge on next tick"); + return okWithApplyId(HttpStatus.OK, "reverted_to_bundled", catalog, name, applyId, + "revert-to-bundled accepted; reinstalling the bundled rule across the cluster " + + "and removing the runtime row in the background — poll " + + "/runtime/rule/status?applyId=" + applyId); } // No-bundled-twin DEFAULT path. /inactivate already tore down local handlers under From 254d763ab7506bf9eb11408221b7b348695f1e11 Mon Sep 17 00:00:00 2001 From: Wu Sheng Date: Tue, 16 Jun 2026 09:07:35 +0800 Subject: [PATCH 18/19] Adopt the async runtime-rule apply mode in the e2e flow scripts A structural addOrUpdate now returns immediately at FENCING and persists the rule row in the BACKGROUND (fence -> persist -> commit -> resume), so the old read-back right after the 2xx raced the background persist and saw an empty /list (the Storage Elasticsearch job failed at Phase 1 CREATE). This is backend-independent: the persist itself is deferred, so it bit ES even though ES has synchronous DDL. 
- mal-storage/runtime-rule-flow.sh: add await_apply_terminal, which polls the new GET /runtime/rule/status?applyId=&catalog=&name= surface to a terminal phase (APPLIED/DEGRADED = durable; FAILED = fail) after every structural post_rule, and give list_row an optional contentHash-advance gate so the Phase 2/3 hash assertions wait for the new content rather than reading the stale pre-apply hash (status stays ACTIVE across an update). swctl has no runtime-rule `status` subcommand, so the poll goes through curl. - lal/lal-flow.sh: make the single-shot list reads poll - await_status for the NEW (async) log-mal and LAL v1 applies, await_hash_changed for the v1->v2 swap. - cluster/cluster-flow.sh: Phase 2's local hash read had the same status-unchanged-ACTIVE stale-hash race; add await_hash_change (which also surfaces lastApplyError on timeout). --- .../runtime-rule/cluster/cluster-flow.sh | 22 ++- .../e2e-v2/cases/runtime-rule/lal/lal-flow.sh | 54 ++++++- .../mal-storage/runtime-rule-flow.sh | 150 ++++++++++++++---- 3 files changed, 188 insertions(+), 38 deletions(-) diff --git a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh index 0740a9947c6b..50499d20c5fc 100755 --- a/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh +++ b/test/e2e-v2/cases/runtime-rule/cluster/cluster-flow.sh @@ -141,6 +141,23 @@ await_hash() { done } +# Poll a node until its row is ACTIVE with a contentHash different from prev, then echo the new hash. +# Needed after a STRUCTURAL apply on an already-ACTIVE row: the status stays ACTIVE across the async +# apply (which returns at FENCING and persists the new content in the background after the schema +# fence), so await_status "ACTIVE" returns on the first iteration with the OLD hash. Gating on the +# contentHash advancing is the only signal the new content is durable on this node. 
+await_hash_change() {
+    local base="$1" prev="$2" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
+    while true; do
+        local got; got="$(list_hash "${base}")"
+        [ -n "${got}" ] && [ "${got}" != "${prev}" ] && { echo "${got}"; return 0; }
+        if [ "$(date +%s)" -ge "${deadline}" ]; then
+            fail "${base} contentHash did not advance past '${prev:0:8}…' within ${CONVERGE_TIMEOUT_S}s (last='${got:0:8}…', applyError='$(list_apply_error "${base}")')"
+        fi
+        sleep 2
+    done
+}
+
 await_absent() {
     local base="$1" deadline=$(( $(date +%s) + CONVERGE_TIMEOUT_S ))
     while true; do
@@ -190,9 +207,10 @@ log "OAP-2 converged to ${hash_initial:0:8}…"
 # --- Phase 2: STRUCTURAL update on OAP-1 — second measure created on no-init --------
 log "=== Phase 2: STRUCTURAL on OAP-1 ==="
 apply_on "${OAP1_BASE}" "${SEED_STRUCT}" "allowStorageChange=true" >/dev/null
-await_status "${OAP1_BASE}" "ACTIVE"
+# Structural apply is async: wait for OAP-1's own contentHash to ADVANCE (the row is ACTIVE before
+# and after, so await_status "ACTIVE" alone would read the stale pre-apply hash) before capturing it.
+hash_struct="$(await_hash_change "${OAP1_BASE}" "${hash_initial}")"
 assert_no_apply_error "${OAP1_BASE}"
-hash_struct="$(list_hash "${OAP1_BASE}")"
 [ "${hash_struct}" != "${hash_initial}" ] || fail "OAP-1 contentHash unchanged after STRUCTURAL apply"
 log "OAP-1 → ACTIVE @ ${hash_struct:0:8}… (was ${hash_initial:0:8}…)"
 await_hash "${OAP2_BASE}" "${hash_struct}"
diff --git a/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh
index aa57fd58bda1..38e964aa343b 100755
--- a/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh
+++ b/test/e2e-v2/cases/runtime-rule/lal/lal-flow.sh
@@ -76,6 +76,49 @@ list_field() {
     list_row "${catalog}" "${name}" | jq -r '."'"${field}"'" // empty'
 }

+# Budget for an async apply to land in /list. A NEW / STRUCTURAL addOrUpdate is async: it returns
+# immediately at FENCING and the rule row is persisted only AFTER a background schema fence, so a
+# single read right after the 2xx can miss the row (or read a stale contentHash). The waiters below
+# poll within this budget. FILTER_ONLY edits persist synchronously, so they return on the first poll.
+APPLY_LAND_S="${APPLY_LAND_S:-200}"
+
+# Poll until the (catalog,name) row reaches expected_status, up to APPLY_LAND_S.
+await_status() {
+    local catalog="$1" name="$2" expected="$3"
+    log " await /list ${catalog}/${name} status=${expected} (budget ${APPLY_LAND_S}s)"
+    local deadline=$(( $(date +%s) + APPLY_LAND_S )) got=""
+    while true; do
+        got="$(list_field "${catalog}" "${name}" status)"
+        [ "${got}" = "${expected}" ] && return 0
+        if [ "$(date +%s)" -ge "${deadline}" ]; then
+            fail "${catalog}/${name} did not reach status='${expected}' within ${APPLY_LAND_S}s (last='${got}')"
+        fi
+        sleep 2
+    done
+}
+
+# Poll until the (catalog,name) row is ACTIVE with a contentHash different from prev, then echo the
+# new hash. The wait-condition for a swap whose status stays ACTIVE before and after the apply, so
+# the contentHash advancing is the only signal the new content actually landed (subsumes the
+# async-persist window and a possible STRUCTURAL classification of the swap).
+await_hash_changed() {
+    local catalog="$1" name="$2" prev="$3"
+    log " await /list ${catalog}/${name} contentHash≠${prev:0:8}… (budget ${APPLY_LAND_S}s)"
+    local deadline=$(( $(date +%s) + APPLY_LAND_S )) status="" hash=""
+    while true; do
+        status="$(list_field "${catalog}" "${name}" status)"
+        hash="$(list_field "${catalog}" "${name}" contentHash)"
+        if [ "${status}" = "ACTIVE" ] && [ -n "${hash}" ] && [ "${hash}" != "${prev}" ]; then
+            echo "${hash}"
+            return 0
+        fi
+        if [ "$(date +%s)" -ge "${deadline}" ]; then
+            fail "${catalog}/${name} contentHash did not advance past '${prev:0:8}…' within ${APPLY_LAND_S}s (last status='${status}' hash='${hash:0:8}…')"
+        fi
+        sleep 2
+    done
+}
+
 apply_rule() {
     local catalog="$1" name="$2" body="$3"
     admin runtime-rule add --catalog "${catalog}" --name "${name}" -f "${body}" >/dev/null \
@@ -155,15 +198,13 @@ log "OAP ready"
 # --- Phase 0: apply log-mal aggregation -----------------------------------------------
 log "=== Phase 0: apply log-mal aggregation rule ==="
 apply_rule "${MAL_CATALOG}" "${MAL_NAME}" "${SEED_MAL}"
-mal_status="$(list_field "${MAL_CATALOG}" "${MAL_NAME}" status)"
-[ "${mal_status}" = "ACTIVE" ] || fail "MAL rule expected ACTIVE, got '${mal_status}'"
+await_status "${MAL_CATALOG}" "${MAL_NAME}" "ACTIVE"
 log "log-mal → ACTIVE"

 # --- Phase 1: apply LAL v1 ------------------------------------------------------------
 log "=== Phase 1: apply LAL v1 (extractor stamps step=v1) ==="
 apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V1}"
-status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)"
-[ "${status}" = "ACTIVE" ] || fail "v1 expected ACTIVE, got '${status}'"
+await_status "${LAL_CATALOG}" "${LAL_NAME}" "ACTIVE"
 hash_v1="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)"
 [ -n "${hash_v1}" ] || fail "v1 contentHash empty"
 log "v1 → ACTIVE @ ${hash_v1:0:8}…"
@@ -172,10 +213,7 @@ await_metric_for_step "v1"
 # --- Phase 2: swap to LAL v2 (same key, step flips to v2) -----------------------------
 log "=== Phase 2: swap to LAL v2 (extractor stamps step=v2) ==="
 apply_rule "${LAL_CATALOG}" "${LAL_NAME}" "${SEED_V2}"
-status="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" status)"
-[ "${status}" = "ACTIVE" ] || fail "v2 expected ACTIVE, got '${status}'"
-hash_v2="$(list_field "${LAL_CATALOG}" "${LAL_NAME}" contentHash)"
-[ "${hash_v2}" != "${hash_v1}" ] || fail "v2 contentHash unchanged from v1 (${hash_v2:0:8}…)"
+hash_v2="$(await_hash_changed "${LAL_CATALOG}" "${LAL_NAME}" "${hash_v1}")"
 log "v2 → ACTIVE @ ${hash_v2:0:8}… (was ${hash_v1:0:8}…) — swap applied"
 await_metric_for_step "v2"

diff --git a/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh
index d9832ec10a33..860ddd2f807f 100755
--- a/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh
+++ b/test/e2e-v2/cases/runtime-rule/mal-storage/runtime-rule-flow.sh
@@ -196,40 +196,123 @@ assert_apply_status() {
         || fail "expected applyStatus=${expected}, not found in: ${actual}"
 }

-# GET /runtime/rule/list and ensure the row matches the expected status. Returns
-# the matching JSON line on stdout for callers that want to inspect contentHash.
+# Budget for the async apply state machine to reach a terminal phase on GET /runtime/rule/status.
+APPLY_TERMINAL_S="${APPLY_TERMINAL_S:-200}"
+
+# Drive the new async apply surface to a terminal phase. A STRUCTURAL addOrUpdate (and a
+# /delete?mode=revertToBundled) returns immediately at FENCING with an applyId; the row is persisted
+# and dispatch resumed in the BACKGROUND, after the schema fence. Given the apply's JSON response,
+# extract its applyId and poll GET /runtime/rule/status until the phase is terminal:
+#   APPLIED / DEGRADED → the durable row was written (DEGRADED == committed-and-durable, only the
+#                        cluster-wide fence confirmation lagged) → return 0
+#   FAILED             → a pre-commit error, nothing was committed → fail
+#   anything else (FENCING / DDL / ROLLING_OUT / PENDING / UNKNOWN) → keep polling
+# A synchronous apply (filter_only / inactivate / default delete) carries no applyId, so this is a
+# no-op — that response is already durable on return. swctl has no runtime-rule `status` subcommand,
+# so this goes through curl (the status endpoint lives on the same REST port). Passing catalog+name
+# lets the main answer from the durable rule row once the live apply-id is TTL-evicted, so a slow
+# poll converges instead of false-timing-out.
+await_apply_terminal() {
+    local resp="$1"
+    local rule_name="${2:-${NAME}}"
+    local apply_id
+    apply_id="$(echo "${resp}" | jq -r '.applyId // empty' 2>/dev/null || true)"
+    if [[ -z "${apply_id}" ]]; then
+        return 0
+    fi
+    log "runtime-rule status → polling (≤${APPLY_TERMINAL_S}s) for apply ${apply_id} of ${CATALOG}/${rule_name} to reach a terminal phase"
+    local deadline=$(( $(date +%s) + APPLY_TERMINAL_S ))
+    local body phase=""
+    while :; do
+        body="$(curl -s "${REST_BASE}/runtime/rule/status?applyId=${apply_id}&catalog=${CATALOG}&name=${rule_name}" 2>/dev/null || true)"
+        phase="$(echo "${body}" | jq -r '.phase // empty' 2>/dev/null || true)"
+        case "${phase}" in
+            APPLIED|DEGRADED)
+                log " ✓ apply ${apply_id} → ${phase} (durable)"
+                return 0
+                ;;
+            FAILED)
+                fail "apply ${apply_id} of ${CATALOG}/${rule_name} reached FAILED: ${body}"
+                ;;
+        esac
+        if (( $(date +%s) >= deadline )); then
+            fail "apply ${apply_id} of ${CATALOG}/${rule_name} did not reach a terminal phase within ${APPLY_TERMINAL_S}s (last phase='${phase}', body: ${body})"
+        fi
+        sleep 2
+    done
+}
+
+# Budget for an async structural apply to land in /list. A structural addOrUpdate returns
+# immediately at FENCING (accepted, not yet durable): the rule row is persisted only AFTER the
+# background schema fence confirms, and BanyanDB's meta→data-node schema sync can take 1-2 minutes,
+# so a single /list read right after the 2xx can miss the row. The /list assertions poll within this
+# budget (covers the fence timeout + the sync). ES/JDBC have no such fence — they land in under a
+# second, so the poll returns on its first iteration there.
+APPLY_LAND_S="${APPLY_LAND_S:-200}"
+
+# Poll GET /runtime/rule/list until the row for (catalog, rule_name) shows the expected status,
+# up to APPLY_LAND_S. Returns the matching JSON line on stdout for callers that inspect contentHash.
+#
+# Optional 3rd arg differ_hash: when set, the poll additionally requires the row's contentHash to
+# differ from it. This is the wait-condition for a STRUCTURAL update of an ALREADY-ACTIVE row — the
+# status is ACTIVE both before and after the async apply, so a status-only poll would return on the
+# first iteration with the OLD (pre-apply) contentHash, before the background fence→persist tail
+# has written the new content. Gating on "status==expected AND contentHash advanced" blocks until
+# the new content is durable and visible.
 list_row() {
     local expected_status="$1"
     local rule_name="${2:-${NAME}}"
-    log "runtime-rule list → looking for ${CATALOG}/${rule_name} status=${expected_status}"
-    local lines
-    lines="$(admin runtime-rule list)" \
-        || fail "runtime-rule list failed"
-    local match
-    match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)"
-    [[ -n "${match}" ]] \
-        || fail "/list has no row for ${CATALOG}/${rule_name} (got: ${lines})"
-    local actual_status
-    actual_status="$(echo "${match}" | jq -r '.status')"
-    [[ "${actual_status}" == "${expected_status}" ]] \
-        || fail "expected /list status=${expected_status}, got '${actual_status}' (row: ${match})"
-    echo "${match}"
+    local differ_hash="${3:-}"
+    log "runtime-rule list → waiting (≤${APPLY_LAND_S}s) for ${CATALOG}/${rule_name} status=${expected_status}${differ_hash:+ contentHash≠${differ_hash:0:8}…}"
+    local deadline=$(( $(date +%s) + APPLY_LAND_S ))
+    local lines match actual_status="" actual_hash=""
+    while :; do
+        lines="$(admin runtime-rule list)" \
+            || fail "runtime-rule list failed"
+        match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)"
+        if [[ -n "${match}" ]]; then
+            actual_status="$(echo "${match}" | jq -r '.status')"
+            actual_hash="$(echo "${match}" | jq -r '.contentHash')"
+            if [[ "${actual_status}" == "${expected_status}" \
+                && ( -z "${differ_hash}" || "${actual_hash}" != "${differ_hash}" ) ]]; then
+                echo "${match}"
+                return 0
+            fi
+        fi
+        if (( $(date +%s) >= deadline )); then
+            if [[ -n "${match}" ]]; then
+                fail "expected /list status=${expected_status}${differ_hash:+ with advanced contentHash}, got status='${actual_status}' hash='${actual_hash}' within ${APPLY_LAND_S}s (row: ${match})"
+            fi
+            fail "/list has no row for ${CATALOG}/${rule_name} within ${APPLY_LAND_S}s (got: ${lines})"
+        fi
+        sleep 2
+    done
 }

-# Assert that /list does NOT have a row for the given (catalog, name).
+# Poll until /list has NO row (or status n/a) for the given (catalog, name), up to APPLY_LAND_S.
+# A /delete?mode=revertToBundled runs the async apply pipeline (the bundled re-apply), so the row's
+# removal can lag the same way a structural apply's appearance does.
 list_no_row() {
     local rule_name="${1:-${NAME}}"
-    log "runtime-rule list → expect NO row for ${CATALOG}/${rule_name}"
-    local lines match
-    lines="$(admin runtime-rule list)" \
-        || fail "runtime-rule list failed"
-    match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)"
-    if [[ -n "${match}" ]]; then
-        local status
+    log "runtime-rule list → waiting (≤${APPLY_LAND_S}s) for NO row for ${CATALOG}/${rule_name}"
+    local deadline=$(( $(date +%s) + APPLY_LAND_S ))
+    local lines match status
+    while :; do
+        lines="$(admin runtime-rule list)" \
+            || fail "runtime-rule list failed"
+        match="$(echo "${lines}" | jq -c ".rules[] | select(.catalog==\"${CATALOG}\" and .name==\"${rule_name}\")" 2>/dev/null || true)"
+        if [[ -z "${match}" ]]; then
+            return 0
+        fi
         status="$(echo "${match}" | jq -r '.status')"
-        [[ "${status}" == "n/a" ]] \
-            || fail "/list still has row for ${CATALOG}/${rule_name} status=${status} (row: ${match})"
-    fi
+        if [[ "${status}" == "n/a" ]]; then
+            return 0
+        fi
+        if (( $(date +%s) >= deadline )); then
+            fail "/list still has row for ${CATALOG}/${rule_name} status=${status} within ${APPLY_LAND_S}s (row: ${match})"
+        fi
+        sleep 2
+    done
 }

 # Per-phase entity scope. SHAPE-BREAK reshapes the metric from SERVICE to
@@ -445,6 +528,7 @@ log "=== Phase 1: CREATE seed-rule.yaml ==="
 step_set "create"
 resp="$(post_rule "${SEED_RULES_DIR}/seed-rule.yaml")"
 assert_apply_status "structural_applied" "${resp}"
+await_apply_terminal "${resp}"
 list_row "ACTIVE" >/dev/null
 hash_initial="$(list_row ACTIVE | jq -r '.contentHash')"
 log " initial contentHash=${hash_initial}"
@@ -455,7 +539,10 @@ log "=== Phase 2: UPDATE-FILTER seed-rule-filter-only.yaml ==="
 step_set "update_filter"
 resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-filter-only.yaml")"
 assert_apply_status "filter_only_applied" "${resp}"
-hash_filter_only="$(list_row ACTIVE | jq -r '.contentHash')"
+await_apply_terminal "${resp}"
+# FILTER_ONLY persists synchronously (no applyId), so the new hash is already durable here; the
+# differ-gate still hardens the read against any list lag and proves the row actually advanced.
+hash_filter_only="$(list_row ACTIVE "${NAME}" "${hash_initial}" | jq -r '.contentHash')"
 [[ "${hash_filter_only}" != "${hash_initial}" ]] \
     || fail "FILTER_ONLY apply did not advance /list contentHash"
 log " contentHash advanced to ${hash_filter_only}"
@@ -466,7 +553,11 @@ log "=== Phase 3: UPDATE-STRUCTURAL seed-rule-structural.yaml ==="
 step_set "structural"
 resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-structural.yaml" "allowStorageChange=true")"
 assert_apply_status "structural_applied" "${resp}"
-hash_structural="$(list_row ACTIVE | jq -r '.contentHash')"
+await_apply_terminal "${resp}"
+# STRUCTURAL update of an already-ACTIVE row: status stays ACTIVE across the async apply, so gate on
+# the contentHash advancing past the filter-only hash, not just on status — otherwise the read races
+# the background fence→persist tail and returns the stale pre-apply hash.
+hash_structural="$(list_row ACTIVE "${NAME}" "${hash_filter_only}" | jq -r '.contentHash')"
 [[ "${hash_structural}" != "${hash_filter_only}" ]] \
     || fail "STRUCTURAL apply did not advance /list contentHash"
 log " contentHash advanced to ${hash_structural}"
@@ -569,6 +660,7 @@ log "=== Phase 5h: HAPPY-PATH + RESTART dynamic-LAYER round-trip ==="
 struct_baseline="$(latest_bucket_id_for_step "e2e_rr_requests" "structural")"
 resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-sibling-with-layer.yaml" "" "${SIBLING_NAME}")"
 assert_apply_status "structural_applied" "${resp}"
+await_apply_terminal "${resp}" "${SIBLING_NAME}"
 list_row "ACTIVE" "${SIBLING_NAME}" >/dev/null
 sleep 2
 layers_after_create="$(swctl --display yaml \
@@ -644,6 +736,7 @@ step_set "shape_break_new"
 log " POST INSTANCE-scope rule v4"
 resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")"
 assert_apply_status "structural_applied" "${resp}"
+await_apply_terminal "${resp}"
 hash_shape_break="$(list_row ACTIVE | jq -r '.contentHash')"
 log " contentHash after shape break = ${hash_shape_break}"
 # Rule v4 is INSTANCE-scope; swctl now needs --instance-name to resolve
@@ -670,6 +763,7 @@ resp="$(post_rule "${SEED_RULES_DIR}/seed-rule-instance.yaml")"
 status="$(echo "${resp}" | jq -r '.applyStatus // empty')"
 [[ "${status}" == "structural_applied" || "${status}" == "no_change" ]] \
     || fail "ACTIVATE: unexpected applyStatus=${status} (full: ${resp})"
+await_apply_terminal "${resp}"
 list_row "ACTIVE" >/dev/null
 await_metric_for_step "e2e_rr_requests" "activate"
 # NOTE: we do NOT re-assert "no step=inactivate rows" here. Phase 7's in-window

From 98389bdb331efd2140b7742bd3a76638d129ab30 Mon Sep 17 00:00:00 2001
From: Wu Sheng
Date: Tue, 16 Jun 2026 10:12:31 +0800
Subject: [PATCH 19/19] Add deferredFenceTimeoutSeconds to the e2e config-dump golden file

RuntimeRuleModuleConfig gained deferredFenceTimeoutSeconds (default 180);
the storage config-dump e2e compares the full config map, so the golden
file needs the new key (sorted before refreshRulesPeriod). Value matches
the code default.
---
 test/e2e-v2/cases/storage/expected/config-dump.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/test/e2e-v2/cases/storage/expected/config-dump.yml b/test/e2e-v2/cases/storage/expected/config-dump.yml
index e2a37ee996cb..960e6efc809b 100644
--- a/test/e2e-v2/cases/storage/expected/config-dump.yml
+++ b/test/e2e-v2/cases/storage/expected/config-dump.yml
@@ -183,6 +183,7 @@
  "receiver-pprof.provider": "default",
  "receiver-profile.provider": "default",
  "receiver-register.provider": "default",
+ "receiver-runtime-rule.default.deferredFenceTimeoutSeconds": "180",
  "receiver-runtime-rule.default.refreshRulesPeriod": "30",
  "receiver-runtime-rule.default.selfHealThresholdSeconds": "60",
  "receiver-runtime-rule.provider": "default",