Another Antithesis Draft by def- · Pull Request #36667 · MaterializeInc/materialize

def- · 2026-05-21T07:55:53Z

No description provided.

Adds an 'antithesis' build flavor to mzbuild and ci-builder, on the same axis as 'coverage' and 'sanitizer'. When enabled, the cargo build for binaries copied into mzbuild images uses the antithesis-sdk rustflags and ships full debug symbols for symbolization. Pulls antithesis-sdk into the workspace and into the storage, catalog, and persist-client member crates so subsequent commits can wrap panic and invariant sites in SDK assertions.

…is-sdk assertions

Adds SDK assertion hooks at known panic / invariant sites that Antithesis fault injection can exercise:

- src/storage/src/source/kafka.rs (kafka source startup and offset-known invariants)
- src/storage/src/source/reclock.rs (reclock mint progress)
- src/storage/src/source/reclock/compat.rs (frontier shape on the remap shard)
- src/storage/src/source/mysql/replication/partitions.rs (GTID monotonicity in the mysql source)
- src/storage/src/upsert/types.rs (upsert state-machine invariants on ancient/tombstoned keys)
- src/storage/src/upsert_continual_feedback*.rs (assert on tombstone removal and key rehydration paths)
- src/persist-client/src/internal/apply.rs (CaS monotonicity)
- src/catalog/src/durable/persist.rs (catalog epoch fencing)

Assertions are no-ops in non-antithesis builds.

Adds a 'pool-backed' execution mode to parallel_workload where the Database wraps a pre-existing cluster (typically bootstrapped by an external compose like Antithesis) instead of allocating its own.

- Database/Cluster gain 'existing_cluster_name' and 'is_pool_backed' so framework-owned actions (CreateCluster, ResizeCluster, ScaleCluster) skip pool clusters they don't own.
- 'name_scope' lets multiple parallel_workload invocations coexist against the same Materialize without colliding on object names.
- mzcompose Clusterd treats scratch_directory=None as a real signal: omit --scratch-directory entirely so clusterd falls back to RocksDB's mem_env, matching production replica shape.
- Drop a handful of feature flags from the random-LD-flag pool that parallel_workload no longer exercises cleanly.

Today, every per-action try/except site under Scenario.Kill / Scenario.ZeroDowntimeDeploy is an unconditional swallow:

    try:
        source.create(exe)
        …
    except:
        if exe.db.scenario not in (Scenario.Kill, Scenario.ZeroDowntimeDeploy):
            raise

Under those scenarios, *every* exception is silently dropped: bare `except:` catches AssertionError, KeyError, TypeError, the whole mess. The intent was 'tolerate connection drops from the kill thread'; the implementation also tolerates real correctness bugs.

This commit adds is_fault_shaped(exc), a predicate that returns True only for messages matching connection-drop / DNS / broker-transport / Mz-restart shapes (the things the kill thread actually produces), and threads it through every swallow site. Non-fault-shaped exceptions re-raise as before.

Affected sites (action.py): SQLsmithAction (subprocess fail), AlterIcebergSinkFromAction, AlterKafkaSinkFromAction, DropRoleAction, DropClusterAction, DropClusterReplicaAction, GrantPrivilegesAction, RevokePrivilegesAction, CreateKafkaSourceAction, CreateMySqlSourceAction, CreatePostgresSourceAction, CreateSqlServerSourceAction, HttpPostAction (2 sites). Plus executor.py's WS-executor connection-error handler.

Affects every parallel_workload consumer, not just the Antithesis driver. Existing CI (`test/parallel-workload/mzcompose.py`) that runs the framework's KillAction worker is the other consumer of these swallows; the predicate's patterns cover the same shapes the KillAction produces (connection drop on materialized restart), so behavior is preserved for them while real exceptions stop being silently dropped.
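The change from an unconditional swallow to a message-based predicate can be sketched as follows. This is a minimal illustration, not the code from the PR: the regular expressions and the `run_tolerating_faults` wrapper are hypothetical stand-ins, and the actual pattern list used by `is_fault_shaped` is not shown in the commit message.

```python
import re

# Hypothetical patterns for illustration only; the PR's real list is not shown.
_FAULT_SHAPES = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"connection (refused|reset|closed)",
        r"server closed the connection unexpectedly",
        r"name or service not known",
        r"temporary failure in name resolution",
        r"broker transport failure",
    )
]


def is_fault_shaped(exc: BaseException) -> bool:
    """Return True only if the exception message looks like an injected fault."""
    msg = str(exc)
    return any(pattern.search(msg) for pattern in _FAULT_SHAPES)


def run_tolerating_faults(action, kill_scenario: bool) -> bool:
    """Run action(); swallow only fault-shaped errors under a kill scenario.

    Returns True if the action completed and False if a fault was swallowed.
    Any other exception propagates to the caller.
    """
    try:
        action()
        return True
    except Exception as exc:
        if kill_scenario and is_fault_shaped(exc):
            return False
        raise
```

With this shape, an `AssertionError` raised inside an action still surfaces under a kill scenario, while a dropped connection does not.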

Adds the build-side scaffolding needed to run an Antithesis test of Materialize:

- test/antithesis/Makefile: local build + push wrappers
- test/antithesis/mzcompose.py: authoritative service graph
- test/antithesis/export-compose.py: mzcompose.py -> docker-compose.yaml
- test/antithesis/export-env.py: fingerprint .env generator
- test/antithesis/push-antithesis.py: image push to the Antithesis registry
- test/antithesis/config/: mzbuild image carrying the exported docker-compose.yaml + image refs
- test/antithesis/workload/: mzbuild image carrying the workload runner, its dependencies, and a lightweight stub of materialize/ mzcompose so workload code can import it inside the container
- test/antithesis/fault-orchestrator/: quiet/active window orchestrator that drives fault injection from inside the test
- test/antithesis/AGENTS.md: orientation for future contributors

No actual workload drivers, helpers, or properties yet; those come in subsequent commits.

Wires the Antithesis build into the nightly Buildkite pipeline:

- ci/test/build-antithesis.sh: builds antithesis-flavored images and pushes them
- ci/nightly/pipeline.template.yml: nightly job entry
- ci/mkpipeline.py / ci/test/build.py: treat CI_ANTITHESIS as a scoped build flavor on the same axis as coverage and sanitizer
- ci/test/lint-main/checks/:
  - check-antithesis-compose.sh: guards the exported docker-compose.yaml against drift from mzcompose.py
  - check-pipeline.sh: adds the new check to lint

Adds the helper library shared by all workload drivers:

- helper_pg / helper_pg_source / helper_pg_upstream: pg-side helpers used both for the Materialize SQL client and for the upstream Postgres CDC source
- helper_mysql / helper_mysql_source: mysql-side upstream helpers and source DDL
- helper_kafka: kafka topic / producer helpers
- helper_none_source / helper_upsert_source: envelope-shape source helpers
- helper_testdrive: in-container testdrive runner
- helper_random / helper_table_mv / helper_source_stats: small utilities used across drivers
- helper_logging: per-invocation correlation IDs and lifecycle lines
- anytime_health_check.sh: always-on Antithesis health probe

Adds the kafka-source and upsert workload drivers and their properties:

- first_select_upsert_implementation: chooses upsert_v1 or upsert_v2 for the invocation
- parallel_driver_kafka_none_envelope: envelope-NONE no-data-loss / no-duplication
- parallel_driver_upsert_latest_value: key reflects latest value under fault
- singleton_driver_upsert_state_rehydration: state rehydrates correctly after a clusterd restart
- anytime_kafka_frontier_monotonic: source frontier never regresses
- anytime_kafka_offset_known_not_below_committed: upstream offset invariant
- anytime_kafka_source_resumes_after_fault: liveness after broker or clusterd fault

Adds mysql-source workload drivers and their properties:

- first_mysql_replica_setup: bootstraps the upstream MySQL primary and replica
- parallel_driver_mysql_cdc: InnoDB CDC source correctness under concurrent DML
- parallel_driver_mysql_myisam: MyISAM table behavior (no-data-loss)
- anytime_mysql_source_no_gtid_errors: GTID monotonicity at the source level

Adds pg-source workload drivers:

- first_pg_cdc_setup: bootstraps the upstream Postgres + replication slot
- parallel_driver_pg_cdc: pg-CDC correctness under concurrent upstream writes
- singleton_driver_pg_cdc_testdrive: runs a pg-cdc testdrive suite once per invocation

- parallel_driver_parallel_workload: drives the existing parallel_workload library against a per-invocation pool-backed cluster
- parallel_driver_upsert_ancient_key_writable: cross-invocation property (ancient keys remain writable after long quiescence)

Drivers that pair workload-side observations with SUT-side assertion anchors introduced in the storage/persist/catalog instrumentation commit:

- singleton_driver_catalog_recovery_consistency: catalog recovery under environmentd fault
- parallel_driver_strict_serializable_reads: persist strict-serializable read property
- parallel_driver_mv_reflects_table_updates: materialized views reflect base-table writes eventually
- anytime_fault_recovery_exercised: liveness signal (the SUT is actually faulting and recovering)

…lizeInc#11200 / MaterializeInc#11224)

Two parallel_driver shapes that collectively cover the two peek-sequencing variants of the read-hold downgrade bug:

- first_explicit_txn_setup: seed-private table bootstrap for the explicit-txn driver
- parallel_driver_explicit_txn_no_since_violation: BEGIN / 8x SELECT alternating table+MV / COMMIT; exercises the in_immediate_multi_stmt_txn path
- first_pw_hot_objects_setup: bootstrap for the hot-objects driver
- helper_pw_hot: shared helpers for the hot-objects driver
- parallel_driver_pw_hot_objects: many drivers racing against a small fixed object pool; exercises the single-statement peek-sequencing path

…figs

Splits the single all-services compose into one configurable composition that emits per-group docker-compose YAMLs:

- kafka: kafka stack + multi-replica clusterd
- pg-cdc: postgres-source + multi-replica clusterd
- mysql-cdc: mysql primary+replica + multi-replica clusterd
- parallel-workload: clusterd pool + multi-replica antithesis_cluster
- catalog: single clusterd, no upstream sources
- combined: kitchen sink (every service, every driver)

Why: Antithesis runs on a single core, so service-count is the main competitive resource. Cutting unused upstreams (kafka, mysql, postgres) out of stacks that don't exercise them gives the relevant workload more hypervisor time and tightens per-property signal.

How:

- test/antithesis/groups.yaml is the single source of truth for which services + setup + drivers + anytime scripts belong to each group. anytime_health_check and anytime_fault_recovery_exercised are auto-added to every group.
- test/antithesis/export-compose.py takes --group=NAME and filters services + workload depends_on, injects ANTITHESIS_WORKLOAD_GROUP on the workload service.
- The workload Dockerfile stages all scripts; the entrypoint reads ANTITHESIS_WORKLOAD_GROUP and copies only the selected scripts into /opt/antithesis/test/v1/materialize/ so Test Composer doesn't see drivers that aren't in scope.
- One mzbuild config image per group (test/antithesis/configs/<group>/); push-antithesis.py and ci/test/build.py iterate over them.

Bookmark dov/antithesis-stack still points at this commit; previous HEAD becomes commit 12. Land it; CI nightly schedule (run all groups every night vs. rotate) is deferred, and the implementation is configurable via the buildkite step's ANTITHESIS_WORKLOAD_GROUPS env var.
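The per-group filtering described under "How" can be pictured with a short sketch. Everything about the data layout here is an assumption made for illustration: the commit message does not show the schema of groups.yaml, so the `services` and `scripts` keys, the list-valued `depends_on`, the service name `workload`, and the function names `export_group` and `scripts_for` are all hypothetical.

```python
import copy

# Scripts the commit message says are auto-added to every group.
AUTO_ANYTIME = ("anytime_health_check", "anytime_fault_recovery_exercised")


def export_group(compose: dict, groups: dict, name: str) -> dict:
    """Return a copy of the compose dict restricted to one group's services.

    Assumes each group lists its services under a "services" key and that the
    workload service's depends_on is a plain list of service names.
    """
    keep = set(groups[name]["services"]) | {"workload"}
    out = copy.deepcopy(compose)
    out["services"] = {k: v for k, v in out["services"].items() if k in keep}
    workload = out["services"]["workload"]
    # Drop dependencies on services that were filtered out of this group.
    workload["depends_on"] = [d for d in workload.get("depends_on", []) if d in keep]
    # Tell the entrypoint which group's scripts to copy into place.
    workload.setdefault("environment", {})["ANTITHESIS_WORKLOAD_GROUP"] = name
    return out


def scripts_for(groups: dict, name: str) -> list[str]:
    """Return the group's scripts plus the auto-added anytime scripts, sorted."""
    return sorted(set(groups[name].get("scripts", [])) | set(AUTO_ANYTIME))
```

The input compose dict is deep-copied so that exporting one group never mutates the shared definition used for the next group.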

…ql GTID driver

…ssion-drain signals

…orkload setup

…ault injection

…tate panic

…l-cdc; pin upsert-stress to v1; shrink upsert-stress to 1×1 with 6 hammers; absorb container-down testdrive errors as transient

Drop the upsert-stress topology to a single clusterd replica running one timely worker to match the Flare 50cc shape that actually panicked in INC-936, and free Antithesis compute budget for exploring more interleavings per timeline. The bug-relevant concurrency comes from the hammer producers and the RocksDB foreground/background thread boundary, not from extra timely workers or replicas. Bump UPSERT_HAMMER_REPLICAS 4 → 6 to spend the freed budget on more producers piling concurrent retract+insert events on the same per-(key, ts) consolidation windows.

* workload-entrypoint.sh branches on ANTITHESIS_WORKLOAD_GROUP: the upsert-stress group provisions only replica1; all other groups keep replica1 + replica2.
* export-compose.py: new apply_group_overrides rewrites clusterd1's CLUSTERD_{COMPUTE,STORAGE}_TIMELY_CONFIG to workers=1 and the workload service's CLUSTERD_WORKERS=1 for the upsert-stress group.
* groups.yaml: drop clusterd2 from upsert-stress services, add upsert-hammer-4/5, expand docstring with the rationale.

Also absorb three previously-unrecognised Antithesis fault-injection shapes the testdrive harness was treating as property violations:

* helper_testdrive.py: add "connection closed" (tokio-postgres bare wording when materialized's TCP stream drops mid-`>` checkpoint; distinct from the pre-existing "connection closed before message completed", which only covers the catalog-dump HTTP path) and "No more packets in the wire" (tiberius's TDS-disconnect wording surfaced via testdrive's `$ sql-server-execute` directive when the sql-server container is killed mid-statement).

Regenerate stale per-group composes (kafka, pg-cdc, parallel-workload, combined) so the SQL_SERVER_* workload env vars and the combined group's sql-server service definition are in sync with mzcompose.py.

@teskje

…VERY MV + PREPARE/EXECUTE-in-txn) and transient-error absorption

Two coverage additions targeting the SQL-261 / database-issues#11224 bug family (insufficient read holds inside explicit txns, see Linear SQL-261):

* helper_explicit_txn_setup.py: add a second MV `antithesis_txn_mv_refresh_every` with REFRESH EVERY '2 seconds' alongside the existing plain MV. The MaterializeInc#11200 CI repro hit a REFRESH EVERY MV specifically: its stepped `since` advance interacts with stored-read-hold re-resolution differently than a streaming MV.
* parallel_driver_explicit_txn_no_since_violation.py: SDK coin flip per invocation picks which MV variant the txn reads, with per-arm sometimes() signals so a coverage regression on either shape is visible in the dashboard.
* parallel_driver_explicit_txn_prepare_execute.py (new): sibling driver that interleaves `EXECUTE` (forces old peek sequencing via `ComputeController::peek`) with direct `SELECT` (new peek sequencing via frontend_peek.rs) inside a single `BEGIN..COMMIT`. Covers the intersection @teskje called out on MaterializeInc#11200: both sequencers operating on the same stored-read-hold context. Wired into the parallel-workload and combined groups.

Absorb four Antithesis fault-injection shapes the parallel-workload driver was treating as worker-death property violations:

* parallel_driver_parallel_workload.py: add "Name or service not known" (EAI_NONAME, sibling of the existing EAI_AGAIN entry), "Failed to execute http request" (polaris's reqwest transport-failure wording when its upstream is gone), and "OperationTimedOut" (librdkafka admin client timeout during CREATE SINK validation when the broker is paused). Also case-fold the substring match via a new `_msg_contains` helper so future patterns can stick to one canonical case rather than hand-curating both.

…mysql-cdc fault shapes

Consolidate the four previously-diverging fault-shape pattern lists (helper_testdrive, parallel_driver_parallel_workload, both explicit-txn drivers, helper_pw_hot) into one canonical list in a new `helper_fault_tolerance` module. Each driver had been collecting its own set of strings independently: helper_testdrive had "Failed to resolve hostname" but not "Name or service not known" (EAI_NONAME), while parallel_driver_parallel_workload had both. As a result, mysql-cdc testdrive faults that hit EAI_NONAME escaped tolerance even though the parallel-workload driver had absorbed the same shape.

Shared module owns:

* FAULT_PATTERNS: the union, with each entry documented by producer crate + in-the-wild context.
* looks_like_fault(msg): case-folded substring matcher.

Migrated callsites (5 files): helper_testdrive.py (re-exports for back-compat + looks_transient delegates), helper_pw_hot.py (is_transient becomes an alias), parallel_driver_parallel_workload.py (_SETUP_FAULT_PATTERNS / _msg_contains removed, keeps local _SETUP_RACE_PATTERNS), and both explicit-txn drivers (_TRANSIENT_PATTERNS removed).

Adds three patterns to the canonical list that were missing from helper_testdrive and surfaced as mysql-cdc property violations:

* "Unable to connect to MySQL server": testdrive's `$ mysql-connect` directive failing because the mysql container is paused.
* "failed to lookup address information": materialized's CREATE CONNECTION validation when libc getaddrinfo() can't find the upstream host.
* "Name or service not known": glibc EAI_NONAME, the sibling of the pre-existing EAI_AGAIN entry, for cases where the resolver has no record at all (container fully gone, not just transiently partitioned).
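A case-folded substring matcher of the kind described above might look like the following sketch. The pattern strings are the ones quoted in these commit messages; the full canonical list in `helper_fault_tolerance` is not shown here, so this tuple is only an illustrative subset, and the precomputed `_FOLDED` tuple is an implementation choice of this sketch.

```python
# Illustrative subset: only strings quoted in the commit messages above.
FAULT_PATTERNS = (
    "connection closed",
    "No more packets in the wire",
    "Failed to resolve hostname",
    "Name or service not known",
    "Unable to connect to MySQL server",
    "failed to lookup address information",
)

# Fold once at import time so each call only folds the incoming message.
_FOLDED = tuple(pattern.casefold() for pattern in FAULT_PATTERNS)


def looks_like_fault(msg: str) -> bool:
    """Return True if msg contains any known fault pattern, ignoring case."""
    folded = msg.casefold()
    return any(pattern in folded for pattern in _FOLDED)
```

Because both sides are folded, each pattern can be listed in a single canonical spelling regardless of how the producing library capitalizes its error text.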

Empirically 9× runs in the May 25 batch never exercised the kinds of weird-shape SELECTs that catch optimizer/planner panics: the framework's own action vocabulary is DDL-heavy and the per-MV reads it issues stay within a narrow surface. SQLsmith generates AST-derived random queries (deeply-nested joins, weird casts, window expressions over MV outputs) and is the canonical tool for this class of fuzzing; the test/sqlsmith image already builds the binary against ubuntu-base.

Bundle the binary + its libpqxx runtime into the workload image via MZFROM sqlsmith, and add parallel_driver_sqlsmith.py that runs a bounded batch per invocation, demoting known-error patterns and looks_like_fault matches.

Properties:

    always(no unexpected error escaped)
    sometimes(>= 1 query executed)
    sometimes(>= 100 queries in a timeline)

The driver lives in the parallel-workload group only; that's the group whose topology already has the source/sink surface SQLsmith needs to generate against.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This reverts commit 9d3cebf.

Both `parallel workload: clusterd pool slots claimed` and `parallel workload: full multi-replica pool claim` were registered in Antithesis's property history but had no `sometimes(...)` call site in the driver. Across the May 25 batch they showed up 8× and 6× as "unfound" because the driver picked a slot, ran against it, and exited without ever telling Antithesis the claim happened.

Wire the first to the post-pick code path so it fires every invocation (liveness anchor: the pool-cluster topology is reachable). Wire the second to a cross-invocation ledger at /tmp/antithesis-pool-claims.txt that records the union of slot indices touched in this timeline; it fires once that union covers every slot, which is the meaningful coverage milestone for a multi-replica pool topology.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
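The cross-invocation ledger can be approximated with an append-only text file that holds one slot index per line. This is a sketch under assumptions: the one-index-per-line format, the `record_claim` name, and the decision to ignore unparsable lines are all hypothetical, and the driver's real file format is not shown in the commit message.

```python
def record_claim(ledger_path: str, slot: int, total_slots: int) -> bool:
    """Append this invocation's slot and report whether all slots are covered.

    Returns True once the union of recorded slot indices covers
    range(total_slots), which is when a "full pool claim" signal would fire.
    """
    with open(ledger_path, "a", encoding="utf-8") as ledger:
        ledger.write(f"{slot}\n")
    with open(ledger_path, encoding="utf-8") as ledger:
        # Skip anything that is not a plain integer, for example a line that
        # another invocation was still writing when this one read the file.
        seen = {int(line) for line in ledger if line.strip().isdigit()}
    return seen >= set(range(total_slots))
```

Appending rather than rewriting keeps concurrent invocations from overwriting each other's claims; the file only ever grows within a timeline.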

Two registered Sometimes-assertions about fault-recovery visibility have been 9× unfound across every May 25 timeline:

    fault recovery: observed antithesis_cluster replica non-online at least once
    fault recovery: SELECT 1 succeeded after a previously-observed connect failure

Both are properties about the workload-side *observability* of fault windows. Antithesis kills / pauses containers freely, but no driver was polling `mz_internal.mz_cluster_replica_statuses` or tracking the "connect failed then later succeeded" pattern, so neither sometimes() ever fired even though faults absolutely landed on the cluster.

Add a continuous probe that polls cluster-replica statuses scoped to the universal `antithesis_cluster` and issues a SELECT 1 round-trip on every poll, recording transitions across the launch. Fires the two legacy assertions plus a liveness anchor and a return-to-online transition observation. Wired under default_anytime so every group gets it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

`workload-replay: succeeded >= 10 queries in at least one timeline` failed in both 12h workload-replay timelines in the May 25 batch. The sample size cap is 60, runtime budget 60s × 4 threads, statement timeout 3s, so there's no obvious reason 10 should be hard.

Two latent overheads, both relevant to the short per-invocation window:

1. Re-issuing 5 SETs per query. Every captured query triggered SETs for transaction_isolation, cluster, database, search_path, and statement_timeout, but the bundled capture has identical isolation/cluster/database for all 212 queries and only search_path varies. Under fault windows where each round-trip can stall, 5 per-query SETs eat the per-invocation budget before any actual query runs. Cache per-connection session state and only emit a SET when the value changes; statement_timeout is applied once per connection rather than per query for the same reason.
2. Connect timeout inherited from helper_pg (30 s), which is half the 60 s per-invocation budget on a single failed connect. Use a tighter 10 s timeout so the worker loop fail-fasts inside a fault window and retries on the next cycle.

Both tunable via WR_* env knobs. Both changes also improve throughput on the healthy path: 5x fewer round-trips per query is a measurable difference on workloads where catalog queries finish in <1 ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
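The per-connection session cache from item 1 can be sketched as a small wrapper around whatever function sends SQL on a connection. The class name `SessionCache` and its `ensure` method are hypothetical, and the sketch assumes the caller passes values that are already valid SQL for the right-hand side of a SET; a new cache is assumed to be created for every new connection.

```python
class SessionCache:
    """Remembers which session variables were already applied on a connection."""

    def __init__(self, execute) -> None:
        # `execute` is any callable that sends one SQL string on the connection.
        self._execute = execute
        self._state: dict[str, str] = {}

    def ensure(self, name: str, value: str) -> bool:
        """Emit SET only when the value differs from the cached one.

        Returns True if a SET was sent. The cache is updated only after the
        statement succeeds, so a failed SET is retried on the next call.
        """
        if self._state.get(name) == value:
            return False
        self._execute(f"SET {name} = {value}")
        self._state[name] = value
        return True
```

For a capture in which only search_path varies between queries, this reduces the steady-state cost to at most one SET round-trip per query instead of five.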

Brings up a second `materialized` instance (`materialized2`) alongside the primary, sharing the same postgres-metadata catalog and minio persist blob store, with `deploy_generation=1` so it starts in read-only mode. This is the topology test/0dt and test/txn-wal-fencing use; the Antithesis port layers fault injection on top of it.

Initial driver `singleton_driver_read_only_safety` asserts two properties:

- always: no write to materialized2 ever succeeds. Any accepted write means the deploy-generation gate failed and the two instances are racing on the same persist state (dual-leader).
- sometimes: a marker INSERT'd on the primary is visible on materialized2 within bounded time. Liveness anchor for the shared-state replication path.
- sometimes: at least one write was rejected with `cannot write in read-only mode`. Liveness anchor for the rejection path so the safety check isn't vacuous when no write reached the target.

Wired into both the focused `deploy` group and the kitchen-sink `combined` group. Future work (full deploy-generation bump → fence on primary, then writes succeed on the new instance) is documented in the group's manifest entry; it needs an in-container hook to bump deploy_generation that today doesn't exist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The existing config groups all commit a placeholder `.env` (overwritten by CI's build-antithesis.sh before each docker build) and have no `.gitignore`. The deploy group I added instead carried a `.gitignore` excluding `.env`, which removed the file from the docker build context even after CI had written it. The COPY in the FROM-scratch Dockerfile then failed:

    failed to calculate checksum of ref … "/.env": not found

The nightly antithesis-flavored build (build x86_64 Antithesis) tripped on this; build state was failing because every downstream job depends on the antithesis image set.

Mirror the parallel-workload pattern: commit `.env` as a placeholder referencing the current `ghcr.io/materializeinc/materialize/*` mzbuild fingerprints; CI regenerates it pointed at the Antithesis GCP registry before the docker build runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… driver

The `subscribe: every committed row eventually appears in the SUBSCRIBE stream` Always-assertion fired false positives in the May 26 24h runs: authoritative=150, received=149, missed=[150], reconnects=0 — an exact off-by-one with no fault even involved. Root cause is a snapshot-skew race in the driver, not a SUT bug. The authoritative set was read with a plain `SELECT counter FROM table`, which under strict serializability picks its own read timestamp that can sit *ahead* of the separately-sampled `mz_now()` used as the consumer catch-up target. The authoritative set then included a row whose commit ts was later than target_ts, while the consumer — correctly caught up only to target_ts — had not yet streamed it. Fix: sample target_ts first, wait for the consumer to catch up to it, then read the authoritative set `AS OF target_ts` so both sides of the missed/spurious comparison are taken at the same logical time. The AS OF read is non-blocking here because the consumer already observed progress past target_ts (table upper is already beyond it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The probe added last commit fired (no longer unfound) but its condition was never true across the May 26 runs (pass=0): it never observed a replica non-online, nor a SELECT-1 recovery after a connect failure. Two self-inflicted bugs:

1. Point-in-time `mz_cluster_replica_statuses` polled every 2s misses a clusterd kill/restart that resolves inside one window. Add a query against the durable `mz_cluster_replica_status_history` (every transition with occurred_at), scoped to events since the probe launch, so a blip the live poll missed still leaves an `offline` row that any later successful poll observes. Also tighten the live poll to 1s.
2. The SELECT-1 liveness check used helper_pg.query_retry, which retries *through* fault windows and almost always eventually succeeds, so the "connect failed then later recovered" transition was never observed. Replace with a short, non-retrying direct connect so the transient failure is actually seen.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…replica, internal-error

Three new SUT-anchored assertions (default drivers/anytime, so every group runs them) to catch wrong-results and panics-adjacent bugs the existing crash/frontier suite misses:

* parallel_driver_differential_query: builds a random table and reads base table vs non-materialized view vs materialized view of the same aggregate inside ONE read transaction (one consistent timestamp); asserts all three agree. Catches incremental-maintenance / optimizer wrong-results, the silent-correctness class with no prior coverage.
* parallel_driver_cross_replica_consistency: reads the same indexed view pinned to each replica of antithesis_cluster (via SET cluster_replica) and asserts identical results. Catches replica divergence / non-deterministic execution. Self-skips single-replica groups.
* anytime_internal_error_scan: runs a rotating set of valid catalog / arithmetic / window / jsonb queries and asserts none ever return Materialize's `internal error` class; also scans source/sink statuses for objects stuck in an internal-error state.

All demote fault-shaped failures via looks_like_fault so fault windows don't trip the safety assertions. Also includes bin/fmt line-wrapping of two unrelated drivers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… primary The deploy group's premise was wrong. A second `materialized` against the *shared* catalog at deploy_generation=1 is not a passive read replica: it's a 0dt deployment successor that auto-promotes after `with_0dt_deployment_max_wait` (30 min) and fences the primary. Both the `deploy` and `combined` 10h runs (nightly 16590) timed out at the 30-minute setup deadline with `setup_error: No 'setup complete' event received` — the workload entrypoint could never provision antithesis_cluster against the fenced/contended primary. Adding materialized2 to `combined` broke the kitchen-sink group entirely. environmentd has no permanent-read-replica flag (read-only mode is purely deploy_generation-driven), so a real 0dt/fencing test needs explicit start→rehydrate→promote→fence orchestration, not an always-on second container. Revert the whole experiment: remove materialized2 from mzcompose.py, the deploy group + combined's materialized2/driver from groups.yaml, antithesis-*-deploy from export-compose.py, deploy from the Makefile, and the workloads/configs/test dirs. Regenerate the combined compose. The 0dt idea is preserved in git history for a future properly-orchestrated attempt. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…erance

Both new oracles fired false always(False) across 7 groups in the 10h runs. These were not actual disagreements (no base≠mv, no replica divergence was ever reported) but fault/transient errors their handlers didn't recognize:

- cross-replica: "target replica failed or was dropped". Antithesis killed the clusterd the read was pinned to via SET cluster_replica.
- differential: "collection 'uNN' was dropped". The scratch MV/view's collection was torn down during fault recovery (or our own CASCADE cleanup raced a retry).

Add "target replica failed or was dropped" to the shared helper_fault_tolerance FAULT_PATTERNS (an unambiguous replica-fault any replica-targeted driver can hit). Handle "was dropped" locally in both oracle drivers: these own prefix-scoped scratch objects, so a drop mid-read is only ever fault-recovery / self-cleanup, never a real bug; keeping it local preserves strictness for drivers that read shared objects (testdrive, etc.). Both now demote to the "completed without a fault" sometimes-anchor instead of the always(False) "unexpected error" path.

The oracles themselves found zero correctness violations in the 10h runs; this only removes the fault-handling noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e + mz_now in one txn

The earlier AS-OF-mz_now() fix didn't fully close the snapshot-skew race; the 10h runs still showed exact off-by-one missed/spurious (authoritative=118 received=119, ambiguous=0, reconnects=0, with no fault involved). Root cause: mz_now() sampled on its own connection can lag the producer's last committed row, so reading the authoritative set AS OF that too-early target missed a row the consumer already streamed.

Read `SELECT counter FROM table` and `SELECT mz_now()` in a SINGLE read transaction. The txn's read timestamp T is >= every committed row, so authoritative_set is complete; and target_ts == T, so once the consumer reaches T it has streamed exactly that set (producer has stopped, nothing commits after T). missed/spurious are then structurally empty unless the SUT genuinely lost or invented a row.

Bail to the existing sometimes(False) anchor on a fault-shaped snapshot failure; drop the now unused query_retry / query_one_retry imports.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
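The single-transaction snapshot can be sketched against a generic DB-API cursor. This is an illustration under assumptions rather than the driver's code: `read_snapshot` and `diff_sets` are hypothetical names, the cursor is assumed to belong to an autocommit connection so that the explicit BEGIN/COMMIT delimit the transaction, and the table name is assumed to come from trusted driver code.

```python
import contextlib


def read_snapshot(cur, table: str):
    """Read the authoritative set and the timestamp in one read transaction."""
    cur.execute("BEGIN")
    try:
        cur.execute(f"SELECT counter FROM {table}")
        authoritative = {row[0] for row in cur.fetchall()}
        # Sampled inside the same transaction as the SELECT above.
        cur.execute("SELECT mz_now()")
        target_ts = cur.fetchone()[0]
    except Exception:
        # Best-effort cleanup; the connection may already be gone.
        with contextlib.suppress(Exception):
            cur.execute("ROLLBACK")
        raise
    cur.execute("COMMIT")
    return authoritative, target_ts


def diff_sets(authoritative: set, received: set):
    """Return (missed, spurious) as sorted lists for the final comparison."""
    return sorted(authoritative - received), sorted(received - authoritative)
```

A caller would wait for the consumer to reach `target_ts` before calling `diff_sets`, and would treat an exception from `read_snapshot` as a possible fault rather than a property violation.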

Adds two SUT-anchored assertions (default drivers, so every group runs them) for bug classes the existing suite doesn't directly test:

* anytime_source_status_convergence: port of test/source-sink-errors (README §4 "Next test frameworks to port"). Polls mz_source_statuses + mz_sink_statuses scoped to user objects, tracks per-(id, kind) healthy/unhealthy transitions across the launch, and asserts no user source/sink ended the launch stuck in `stalled`/`failed`/`ceased`/`paused`/`dropped` without a subsequent healthy observation past MIN_RECOVERY_S (180s, comfortably longer than the fault-orchestrator's MAX_ON=40s). Catches the bug class where the status tables don't converge to ground truth after the upstream disruption clears. Liveness anchors fire on observed stalls + observed recoveries.
* singleton_driver_append_only_monotonic: runs a long-lived monotonic-counter INSERT loop into a self-owned prefix-scoped table (singleton so the in-process committed_count model is unambiguous) and periodically samples count(*) on a fresh connection. Asserts count(*) never decreases across observations and never exceeds the in-process committed_count. Catches the persist / storage data-loss bug class where a committed row silently disappears after a fault or compaction; no other driver directly tests row-count monotonicity on an append-only object.

Both demote fault-shape failures via looks_like_fault, so fault windows don't trip the safety assertions; both validated against a live materialized (source-status query reads cleanly, INSERT/count pattern works).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
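The two checks made by the append-only driver reduce to a small amount of state. The sketch below is a hypothetical model, not the driver itself: `AppendOnlyMonitor` is an invented name, and it assumes the caller passes a `committed` value that is a valid upper bound for the sample being checked.

```python
class AppendOnlyMonitor:
    """Checks successive count(*) samples taken from an append-only table."""

    def __init__(self) -> None:
        # Largest count observed so far in this launch.
        self.high_water = 0

    def observe(self, observed: int, committed: int) -> list[str]:
        """Return a description of each violated property for this sample."""
        violations = []
        if observed < self.high_water:
            violations.append(f"count regressed: {observed} < {self.high_water}")
        if observed > committed:
            violations.append(f"count exceeds committed: {observed} > {committed}")
        # Never lower the high-water mark, so a regression keeps being reported
        # until the count catches up again.
        self.high_water = max(self.high_water, observed)
        return violations
```

In a driver, an empty list would feed the always-style assertion as a pass, and a failed count(*) query would be routed through fault tolerance instead of being passed to `observe`.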

def- force-pushed the pr-antithesis2 branch 4 times, most recently from 280bc49 to e2eac0a Compare May 28, 2026 02:52

DAlperin and others added 26 commits May 28, 2026 02:52

test/antithesis: drop broken recovery anytime drivers + redundant mys…

991bf2f

…ql GTID driver

test/antithesis: drop unused testdrive reset paths

06d7139

test/antithesis: bump parallel-workload runtime + add failure-rate/se…

61edafb

…ssion-drain signals

test/antithesis: wire CancelAction into parallel-workload driver

c8bbfab

test/antithesis: tolerate polaris-side fault errors during parallel-w…

fcbd8af

…orkload setup

testdrive: use Instant for elapsed-time delta, immune to wall-clock f…

ad101f4

…ault injection

test/antithesis: add upsert-stress group for INC-936 invalid-upsert-s…

7cc0d89

…tate panic

test/antithesis: add onboarding README + propose next frameworks to port

810bf1f

Try to make progress on Antithesis testing

944265f

def- and others added 18 commits May 28, 2026 02:52

More drivers

9e7c075

random fixes/improvements

33c695d

random fixes

d5e8e0e

Revert "test/antithesis: add SQLsmith driver to parallel-workload group"

03391cb

This reverts commit 9d3cebf.

test/antithesis: assert timestamp monotonicity in strict-serializable…

1957934

… driver

def- force-pushed the pr-antithesis2 branch from e2eac0a to b7d32d3 Compare May 28, 2026 02:52

def- wants to merge 44 commits into MaterializeInc:main from def-:pr-antithesis2
