Another Antithesis Draft #36667

Draft
def- wants to merge 44 commits into MaterializeInc:main from def-:pr-antithesis2

Contributor

@def- def- commented May 21, 2026

No description provided.

@def- def- force-pushed the pr-antithesis2 branch 4 times, most recently from 280bc49 to e2eac0a Compare May 28, 2026 02:52
DAlperin and others added 26 commits May 28, 2026 02:52
Adds an 'antithesis' build flavor to mzbuild and ci-builder, on the same
axis as 'coverage' and 'sanitizer'. When enabled, the cargo build for
binaries copied into mzbuild images uses the antithesis-sdk rustflags
and ships full debug symbols for symbolization.

Pulls antithesis-sdk into the workspace and into the storage, catalog,
and persist-client member crates so subsequent commits can wrap panic
and invariant sites in SDK assertions.
…is-sdk assertions

Adds SDK assertion hooks at known panic / invariant sites that
Antithesis fault injection can exercise:

  - src/storage/src/source/kafka.rs              (kafka source startup
                                                  and offset-known
                                                  invariants)
  - src/storage/src/source/reclock.rs            (reclock mint progress)
  - src/storage/src/source/reclock/compat.rs     (frontier shape on the
                                                  remap shard)
  - src/storage/src/source/mysql/replication/partitions.rs
                                                 (GTID monotonicity in
                                                  the mysql source)
  - src/storage/src/upsert/types.rs              (upsert state-machine
                                                  invariants on
                                                  ancient/tombstoned
                                                  keys)
  - src/storage/src/upsert_continual_feedback*.rs
                                                 (assert on tombstone
                                                  removal and key
                                                  rehydration paths)
  - src/persist-client/src/internal/apply.rs     (CaS monotonicity)
  - src/catalog/src/durable/persist.rs           (catalog epoch fencing)

Assertions are no-ops in non-antithesis builds.
Adds a 'pool-backed' execution mode to parallel_workload where the
Database wraps a pre-existing cluster (typically bootstrapped by an
external compose like Antithesis) instead of allocating its own.

  - Database/Cluster gain 'existing_cluster_name' and 'is_pool_backed'
    so framework-owned actions (CreateCluster, ResizeCluster,
    ScaleCluster) skip pool clusters they don't own.
  - 'name_scope' lets multiple parallel_workload invocations coexist
    against the same Materialize without colliding on object names.
  - mzcompose Clusterd treats scratch_directory=None as a real signal:
    omit --scratch-directory entirely so clusterd falls back to
    RocksDB's mem_env, matching production replica shape.
  - Drop a handful of feature flags from the random-LD-flag pool that
    parallel_workload no longer exercises cleanly.
Today, every per-action try/except site under Scenario.Kill /
Scenario.ZeroDowntimeDeploy is an unconditional swallow:

  try:
      source.create(exe)
      …
  except:
      if exe.db.scenario not in (Scenario.Kill, Scenario.ZeroDowntimeDeploy):
          raise

Under those scenarios, *every* exception is silently dropped — bare
`except:` catches AssertionError, KeyError, TypeError, the whole
mess. The intent was 'tolerate connection drops from the kill thread';
the implementation also tolerates real correctness bugs.

This commit adds is_fault_shaped(exc) — a predicate that returns True
only for messages matching connection-drop / DNS / broker-transport /
Mz-restart shapes (the things the kill thread actually produces) — and
threads it through every swallow site. Non-fault-shaped exceptions
re-raise as before.
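
A minimal sketch of how such a predicate could gate a swallow site. The pattern strings and the `run_tolerating_faults` wrapper are illustrative assumptions, not the actual contents of action.py; only the name `is_fault_shaped` comes from this commit message.

```python
# Illustrative patterns only; the real list in the commit may differ.
_FAULT_SHAPES = (
    "connection refused",
    "connection reset",
    "server closed the connection unexpectedly",
    "name or service not known",
)


def is_fault_shaped(exc: BaseException) -> bool:
    """Return True only if the message looks like a kill-thread fault."""
    msg = str(exc).lower()
    return any(shape in msg for shape in _FAULT_SHAPES)


def run_tolerating_faults(action, kill_scenario: bool) -> bool:
    """Run action(); swallow only fault-shaped errors under a kill scenario.

    Returns True if the action completed, False if a fault was swallowed.
    Anything that is not fault-shaped (AssertionError, KeyError, ...) re-raises.
    """
    try:
        action()
        return True
    except Exception as exc:
        if kill_scenario and is_fault_shaped(exc):
            return False
        raise
```

Catching `Exception` rather than using a bare `except:` also keeps `KeyboardInterrupt` and `SystemExit` out of the swallow path.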

Affected sites (action.py): SQLsmithAction (subprocess fail),
AlterIcebergSinkFromAction, AlterKafkaSinkFromAction, DropRoleAction,
DropClusterAction, DropClusterReplicaAction, GrantPrivilegesAction,
RevokePrivilegesAction, CreateKafkaSourceAction, CreateMySqlSourceAction,
CreatePostgresSourceAction, CreateSqlServerSourceAction, HttpPostAction
(2 sites). Plus executor.py's WS-executor connection-error handler.

Affects every parallel_workload consumer, not just the Antithesis
driver. Existing CI (`test/parallel-workload/mzcompose.py`) that
runs the framework's KillAction worker is the other consumer of these
swallows; the predicate's patterns cover the same shapes the KillAction
produces (connection drop on materialized restart), so behavior is
preserved for them while real exceptions stop being silently dropped.
Adds the build-side scaffolding needed to run an Antithesis test
of Materialize:

  - test/antithesis/Makefile               local build + push wrappers
  - test/antithesis/mzcompose.py           authoritative service graph
  - test/antithesis/export-compose.py      mzcompose.py -> docker-compose.yaml
  - test/antithesis/export-env.py          fingerprint .env generator
  - test/antithesis/push-antithesis.py     image push to the Antithesis registry
  - test/antithesis/config/                mzbuild image carrying the exported
                                           docker-compose.yaml + image refs
  - test/antithesis/workload/              mzbuild image carrying the workload
                                           runner, its dependencies, and a
                                           lightweight stub of materialize/
                                           mzcompose so workload code can
                                           import it inside the container
  - test/antithesis/fault-orchestrator/    quiet/active window orchestrator
                                           that drives fault injection from
                                           inside the test
  - test/antithesis/AGENTS.md              orientation for future contributors

No actual workload drivers, helpers, or properties yet — those come in
subsequent commits.
Wires the Antithesis build into the nightly Buildkite pipeline:

  - ci/test/build-antithesis.sh                builds antithesis-flavored
                                               images and pushes them
  - ci/nightly/pipeline.template.yml           nightly job entry
  - ci/mkpipeline.py / ci/test/build.py        treat CI_ANTITHESIS as a
                                               scoped build flavor on the
                                               same axis as coverage and
                                               sanitizer
  - ci/test/lint-main/checks/
      check-antithesis-compose.sh              guards the exported
                                               docker-compose.yaml against
                                               drift from mzcompose.py
      check-pipeline.sh                        adds the new check to lint
Adds the helper library shared by all workload drivers:

  - helper_pg / helper_pg_source / helper_pg_upstream  pg-side helpers
                                                       used both for the
                                                       Materialize SQL
                                                       client and for
                                                       the upstream
                                                       Postgres CDC
                                                       source
  - helper_mysql / helper_mysql_source                 mysql-side
                                                       upstream helpers
                                                       and source DDL
  - helper_kafka                                       kafka topic /
                                                       producer helpers
  - helper_none_source / helper_upsert_source          envelope-shape
                                                       source helpers
  - helper_testdrive                                   in-container
                                                       testdrive runner
  - helper_random / helper_table_mv / helper_source_stats
                                                       small utilities
                                                       used across
                                                       drivers
  - helper_logging                                     per-invocation
                                                       correlation IDs
                                                       and lifecycle
                                                       lines
  - anytime_health_check.sh                            always-on
                                                       Antithesis
                                                       health probe
Adds the kafka-source and upsert workload drivers and their properties:

  - first_select_upsert_implementation                   chooses
                                                         upsert_v1
                                                         or upsert_v2
                                                         for the
                                                         invocation
  - parallel_driver_kafka_none_envelope                  envelope-NONE
                                                         no-data-loss /
                                                         no-duplication
  - parallel_driver_upsert_latest_value                  key reflects
                                                         latest value
                                                         under fault
  - singleton_driver_upsert_state_rehydration            state
                                                         rehydrates
                                                         correctly
                                                         after a
                                                         clusterd
                                                         restart
  - anytime_kafka_frontier_monotonic                     source
                                                         frontier
                                                         never
                                                         regresses
  - anytime_kafka_offset_known_not_below_committed       upstream
                                                         offset
                                                         invariant
  - anytime_kafka_source_resumes_after_fault             liveness
                                                         after broker
                                                         or clusterd
                                                         fault
Adds mysql-source workload drivers and their properties:

  - first_mysql_replica_setup                  bootstraps the
                                               upstream MySQL primary
                                               and replica
  - parallel_driver_mysql_cdc                  InnoDB CDC source
                                               correctness under
                                               concurrent DML
  - parallel_driver_mysql_myisam               MyISAM table behavior
                                               (no-data-loss)
  - anytime_mysql_source_no_gtid_errors        GTID monotonicity at
                                               the source level
Adds pg-source workload drivers:

  - first_pg_cdc_setup                  bootstraps the upstream
                                        Postgres + replication slot
  - parallel_driver_pg_cdc              pg-CDC correctness under
                                        concurrent upstream writes
  - singleton_driver_pg_cdc_testdrive   runs a pg-cdc testdrive suite
                                        once per invocation
  - parallel_driver_parallel_workload                drives the
                                                     existing
                                                     parallel_workload
                                                     library against
                                                     a per-invocation
                                                     pool-backed
                                                     cluster
  - parallel_driver_upsert_ancient_key_writable      cross-invocation
                                                     property:
                                                     ancient keys
                                                     remain writable
                                                     after long
                                                     quiescence
Drivers that pair workload-side observations with SUT-side assertion
anchors introduced in the storage/persist/catalog instrumentation
commit:

  - singleton_driver_catalog_recovery_consistency  catalog recovery
                                                   under
                                                   environmentd
                                                   fault
  - parallel_driver_strict_serializable_reads      persist
                                                   strict-serializable
                                                   read property
  - parallel_driver_mv_reflects_table_updates      materialized
                                                   views reflect
                                                   base-table writes
                                                   eventually
  - anytime_fault_recovery_exercised               liveness signal:
                                                   the SUT is
                                                   actually faulting
                                                   and recovering
…lizeInc#11200 / MaterializeInc#11224)

Two parallel_driver shapes that collectively cover the two
peek-sequencing variants of the read-hold downgrade bug:

  - first_explicit_txn_setup                              seed-private
                                                          table
                                                          bootstrap
                                                          for the
                                                          explicit-txn
                                                          driver
  - parallel_driver_explicit_txn_no_since_violation       BEGIN / 8x
                                                          SELECT
                                                          alternating
                                                          table+MV /
                                                          COMMIT;
                                                          exercises the
                                                          in_immediate_multi_stmt_txn
                                                          path
  - first_pw_hot_objects_setup                            bootstrap for
                                                          the
                                                          hot-objects
                                                          driver
  - helper_pw_hot                                         shared helpers
                                                          for the
                                                          hot-objects
                                                          driver
  - parallel_driver_pw_hot_objects                        many drivers
                                                          racing against
                                                          a small fixed
                                                          object pool;
                                                          exercises the
                                                          single-statement
                                                          peek-sequencing
                                                          path
…figs

Splits the single all-services compose into one configurable composition
that emits per-group docker-compose YAMLs:

  - kafka              kafka stack + multi-replica clusterd
  - pg-cdc             postgres-source + multi-replica clusterd
  - mysql-cdc          mysql primary+replica + multi-replica clusterd
  - parallel-workload  clusterd pool + multi-replica antithesis_cluster
  - catalog            single clusterd, no upstream sources
  - combined           kitchen sink (every service, every driver)

Why: Antithesis runs on a single core, so service count is the main
contended resource. Cutting unused upstreams (kafka, mysql, postgres)
out of stacks that don't exercise them gives the relevant workload more
hypervisor time and tightens per-property signal.

How:
  - test/antithesis/groups.yaml is the single source of truth for
    which services + setup + drivers + anytime scripts belong to each
    group. anytime_health_check and anytime_fault_recovery_exercised
    are auto-added to every group.
  - test/antithesis/export-compose.py takes --group=NAME and filters
    services + workload depends_on, injects ANTITHESIS_WORKLOAD_GROUP
    on the workload service.
  - The workload Dockerfile stages all scripts; the entrypoint reads
    ANTITHESIS_WORKLOAD_GROUP and copies only the selected scripts
    into /opt/antithesis/test/v1/materialize/ so Test Composer
    doesn't see drivers that aren't in scope.
  - One mzbuild config image per group (test/antithesis/configs/<group>/);
    push-antithesis.py and ci/test/build.py iterate over them.
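
The per-group filtering described above might look roughly like this. It is a sketch under assumptions: the manifest schema (`groups[name]["services"]`), a workload service literally named `workload`, list-form `depends_on`, and mapping-form `environment` are all hypothetical, and the real export-compose.py may differ.

```python
import copy


def filter_compose_for_group(compose: dict, groups: dict, group: str) -> dict:
    """Return a copy of `compose` restricted to the services of one group."""
    # The workload service is always kept; it is what runs the drivers.
    wanted = set(groups[group]["services"]) | {"workload"}

    out = copy.deepcopy(compose)  # never mutate the caller's compose dict
    out["services"] = {
        name: svc for name, svc in out["services"].items() if name in wanted
    }

    workload = out["services"]["workload"]
    # Drop dependencies on services that were filtered out of this group.
    workload["depends_on"] = [
        dep for dep in workload.get("depends_on", []) if dep in wanted
    ]
    # Tell the entrypoint which group's scripts to copy into place.
    workload.setdefault("environment", {})["ANTITHESIS_WORKLOAD_GROUP"] = group
    return out
```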

Bookmark dov/antithesis-stack still points at this commit; previous
HEAD becomes commit 12.

Land it; CI nightly schedule (run all groups every night vs. rotate)
is deferred — implementation is configurable via the buildkite step's
ANTITHESIS_WORKLOAD_GROUPS env var.
…l-cdc; pin upsert-stress to v1; shrink upsert-stress to 1×1 with 6 hammers; absorb container-down testdrive errors as transient

Drop the upsert-stress topology to a single clusterd replica running one
timely worker to match the Flare 50cc shape that actually panicked in
INC-936, and free Antithesis compute budget for exploring more
interleavings per timeline.  The bug-relevant concurrency comes from
the hammer producers and the RocksDB foreground/background thread
boundary, not from extra timely workers or replicas.  Bump
UPSERT_HAMMER_REPLICAS 4 → 6 to spend the freed budget on more producers
piling concurrent retract+insert events on the same per-(key, ts)
consolidation windows.

  * workload-entrypoint.sh branches on ANTITHESIS_WORKLOAD_GROUP: the
    upsert-stress group provisions only replica1; all other groups keep
    replica1 + replica2.
  * export-compose.py: new apply_group_overrides rewrites clusterd1's
    CLUSTERD_{COMPUTE,STORAGE}_TIMELY_CONFIG to workers=1 and the
    workload service's CLUSTERD_WORKERS=1 for the upsert-stress group.
  * groups.yaml: drop clusterd2 from upsert-stress services, add
    upsert-hammer-4/5, expand docstring with the rationale.

Also absorb three previously-unrecognised Antithesis fault-injection
shapes the testdrive harness was treating as property violations:

  * helper_testdrive.py: add "connection closed" (tokio-postgres bare
    wording when materialized's TCP stream drops mid-`>` checkpoint —
    distinct from the pre-existing "connection closed before message
    completed" which only covers the catalog-dump HTTP path) and
    "No more packets in the wire" (tiberius's TDS-disconnect wording
    surfaced via testdrive's `$ sql-server-execute` directive when the
    sql-server container is killed mid-statement).

Regenerate stale per-group composes (kafka, pg-cdc, parallel-workload,
combined) so the SQL_SERVER_* workload env vars and the combined
group's sql-server service definition are in sync with mzcompose.py.
…VERY MV + PREPARE/EXECUTE-in-txn) and transient-error absorption

Two coverage additions targeting the SQL-261 / database-issues#11224 bug
family (insufficient read holds inside explicit txns, see Linear SQL-261):

  * helper_explicit_txn_setup.py: add a second MV
    `antithesis_txn_mv_refresh_every` with REFRESH EVERY '2 seconds'
    alongside the existing plain MV.  The MaterializeInc#11200 CI repro hit a REFRESH
    EVERY MV specifically — its stepped `since` advance interacts with
    stored-read-hold re-resolution differently than a streaming MV.
  * parallel_driver_explicit_txn_no_since_violation.py: SDK coin flip
    per invocation picks which MV variant the txn reads, with per-arm
    sometimes() signals so a coverage regression on either shape is
    visible in the dashboard.
  * parallel_driver_explicit_txn_prepare_execute.py (new): sibling
    driver that interleaves `EXECUTE` (forces old peek sequencing
    via `ComputeController::peek`) with direct `SELECT` (new peek
    sequencing via frontend_peek.rs) inside a single `BEGIN..COMMIT`.
    Covers the intersection @teskje called out on MaterializeInc#11200 — both
    sequencers operating on the same stored-read-hold context.
    Wired into the parallel-workload and combined groups.

Absorb four Antithesis fault-injection shapes the parallel-workload
driver was treating as worker-death property violations:

  * parallel_driver_parallel_workload.py: add
    "Name or service not known" (EAI_NONAME, sibling of the existing
    EAI_AGAIN entry), "Failed to execute http request" (polaris's
    reqwest transport-failure wording when its upstream is gone),
    and "OperationTimedOut" (librdkafka admin client timeout during
    CREATE SINK validation when the broker is paused).  Also
    case-fold the substring match via a new `_msg_contains` helper
    so future patterns can stick to one canonical case rather than
    hand-curating both.
…mysql-cdc fault shapes

Consolidate the four previously-diverging fault-shape pattern lists
(helper_testdrive, parallel_driver_parallel_workload, both explicit-txn
drivers, helper_pw_hot) into one canonical list in a new
`helper_fault_tolerance` module.  Each driver had been collecting its
own set of strings independently — helper_testdrive had
"Failed to resolve hostname" but not "Name or service not known"
(EAI_NONAME); parallel_driver_parallel_workload had both — so mysql-cdc
testdrive faults that hit EAI_NONAME escaped tolerance even though the
parallel-workload driver had absorbed the same shape.

Shared module owns:
  * FAULT_PATTERNS — the union, with each entry documented by producer
    crate + in-the-wild context.
  * looks_like_fault(msg) — case-folded substring matcher.
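
As a rough sketch, the shared module could be shaped like this. The pattern strings are the ones quoted in these commit messages; whether the real `FAULT_PATTERNS` contains exactly these entries, and how it documents them, is an assumption.

```python
# Each entry notes the producer and the context it was seen in.
FAULT_PATTERNS = (
    # tokio-postgres: materialized's TCP stream dropped mid-checkpoint.
    "connection closed",
    # tiberius: TDS disconnect when the sql-server container is killed.
    "No more packets in the wire",
    # Hostname resolution failed while the upstream was unreachable.
    "Failed to resolve hostname",
    # glibc EAI_NONAME: the resolver has no record at all.
    "Name or service not known",
    # reqwest (via polaris): transport failure when the upstream is gone.
    "Failed to execute http request",
    # librdkafka admin client: broker paused during CREATE SINK validation.
    "OperationTimedOut",
    # testdrive `$ mysql-connect`: mysql container paused.
    "Unable to connect to MySQL server",
    # getaddrinfo() failure during CREATE CONNECTION validation.
    "failed to lookup address information",
)


def looks_like_fault(msg: str) -> bool:
    """Case-folded substring match against the canonical pattern list."""
    folded = msg.casefold()
    return any(pattern.casefold() in folded for pattern in FAULT_PATTERNS)
```

Folding both sides means each pattern can be written once in whatever case its producer emits.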

Migrated callsites (5 files): helper_testdrive.py (re-exports for
back-compat + looks_transient delegates), helper_pw_hot.py
(is_transient becomes an alias), parallel_driver_parallel_workload.py
(_SETUP_FAULT_PATTERNS / _msg_contains removed, keeps local
_SETUP_RACE_PATTERNS), and both explicit-txn drivers (_TRANSIENT_PATTERNS
removed).

Adds three patterns to the canonical list that were missing from
helper_testdrive and surfaced as mysql-cdc property violations:

  * "Unable to connect to MySQL server" — testdrive's `$ mysql-connect`
    directive failing because the mysql container is paused.
  * "failed to lookup address information" — materialized's
    CREATE CONNECTION validation when libc getaddrinfo() can't find
    the upstream host.
  * "Name or service not known" — glibc EAI_NONAME, the sibling of the
    pre-existing EAI_AGAIN entry, for cases where the resolver has no
    record at all (container fully gone, not just transiently
    partitioned).
def- and others added 18 commits May 28, 2026 02:52
Empirically, the 9 runs in the May 25 batch never exercised the kinds of
weird-shape SELECTs that catch optimizer/planner panics — the framework's
own action vocabulary is DDL-heavy and the per-MV reads it issues stay
within a narrow surface.  SQLsmith generates AST-derived random queries
(deeply-nested joins, weird casts, window expressions over MV outputs)
and is the canonical tool for this class of fuzzing; the test/sqlsmith
image already builds the binary against ubuntu-base.

Bundle the binary + its libpqxx runtime into the workload image via
MZFROM sqlsmith, and add parallel_driver_sqlsmith.py that runs a bounded
batch per invocation, demoting known-error patterns and looks_like_fault
matches.  Properties:

  always(no unexpected error escaped)
  sometimes(>= 1 query executed)
  sometimes(>= 100 queries in a timeline)

The driver lives in the parallel-workload group only — that's the group
whose topology already has the source/sink surface SQLsmith needs to
generate against.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both `parallel workload: clusterd pool slots claimed` and
`parallel workload: full multi-replica pool claim` were registered in
Antithesis's property history but had no `sometimes(...)` call site in
the driver — across the May 25 batch they showed up 8× and 6× as
"unfound" because the driver picked a slot, ran against it, and exited
without ever telling Antithesis the claim happened.

Wire the first to the post-pick code path so it fires every invocation
(liveness anchor: the pool-cluster topology is reachable).  Wire the
second to a cross-invocation ledger at /tmp/antithesis-pool-claims.txt
that records the union of slot indices touched in this timeline — fires
once that union covers every slot, which is the meaningful coverage
milestone for a multi-replica pool topology.
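
One possible shape for the ledger, assuming a one-index-per-line file format and a known pool size (both assumptions; the commit only names the path and the "union covers every slot" condition). `record_claim` is a hypothetical helper name.

```python
from pathlib import Path


def record_claim(ledger: Path, slot: int, pool_size: int) -> bool:
    """Append this invocation's slot and report whether the pool is covered.

    Returns True once the union of recorded slots includes every index in
    range(pool_size); the caller would pass that as the sometimes() condition.
    """
    # Append mode creates the file on first use and preserves earlier claims.
    with ledger.open("a") as f:
        f.write(f"{slot}\n")

    seen = {int(token) for token in ledger.read_text().split()}
    return seen >= set(range(pool_size))
```

Because each invocation only appends, repeated claims of the same slot are harmless and the set union deduplicates them on read.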

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two registered Sometimes-assertions about fault-recovery visibility
have been 9× unfound across every May 25 timeline:

  fault recovery: observed antithesis_cluster replica non-online at least once
  fault recovery: SELECT 1 succeeded after a previously-observed connect failure

Both are properties about the workload-side *observability* of fault
windows.  Antithesis kills / pauses containers freely, but no driver
was polling `mz_internal.mz_cluster_replica_statuses` or tracking the
"connect failed then later succeeded" pattern, so neither sometimes()
ever fired even though faults absolutely landed on the cluster.

Add a continuous probe that polls cluster-replica statuses scoped to
the universal `antithesis_cluster` and issues a SELECT 1 round-trip
on every poll, recording transitions across the launch.  Fires the
two legacy assertions plus a liveness anchor and a return-to-online
transition observation.  Wired under default_anytime so every group
gets it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`workload-replay: succeeded >= 10 queries in at least one timeline`
failed in both 12h workload-replay timelines in the May 25 batch. The
sample size cap is 60, runtime budget 60s × 4 threads, statement
timeout 3s — there's no obvious reason 10 should be hard.

Two latent overheads, both relevant to the short per-invocation window:

1. Re-issuing 5 SETs per query.  Every captured query triggered SETs
   for transaction_isolation, cluster, database, search_path, and
   statement_timeout — but the bundled capture has identical
   isolation/cluster/database for all 212 queries and only search_path
   varies.  Under fault windows where each round-trip can stall, 5
   per-query SETs eat the per-invocation budget before any actual
   query runs.  Cache per-connection session state and only emit a
   SET when the value changes; statement_timeout is applied once per
   connection rather than per query for the same reason.

2. Connect timeout inherited from helper_pg (30 s) — half the 60 s
   per-invocation budget on a single failed connect.  Use a tighter
   10 s timeout so the worker loop fail-fasts inside a fault window
   and retries on the next cycle.  Both tunable via WR_* env knobs.
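
The per-connection cache from item 1 can be sketched as follows. `SessionState` and its `execute` callback are hypothetical names for illustration; the actual driver's structure is not shown in this commit message.

```python
class SessionState:
    """Tracks session variables already applied on one connection."""

    def __init__(self, execute):
        # `execute` is any callable that sends one SQL string on the
        # connection this cache belongs to.
        self._execute = execute
        self._current: dict[str, str] = {}

    def ensure(self, name: str, value: str) -> bool:
        """Emit a SET only if the cached value differs. Returns True if sent."""
        if self._current.get(name) == value:
            return False
        quoted = value.replace("'", "''")  # minimal literal escaping
        self._execute(f"SET {name} = '{quoted}'")
        # Cache only after the SET succeeded, so a failed round-trip is retried.
        self._current[name] = value
        return True
```

A fresh `SessionState` is needed after every reconnect, since a new connection starts with default session variables.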

Both changes also improve throughput on the healthy path: saving up to 5
round-trips per query is a measurable difference on workloads where
catalog queries finish in <1 ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Brings up a second `materialized` instance (`materialized2`) alongside
the primary, sharing the same postgres-metadata catalog and minio
persist blob store, with `deploy_generation=1` so it starts in
read-only mode.  This is the topology test/0dt and test/txn-wal-fencing
use; the Antithesis port layers fault injection on top of it.

Initial driver `singleton_driver_read_only_safety` asserts two
properties:

  always — no write to materialized2 ever succeeds.  Any accepted
    write means the deploy-generation gate failed and the two
    instances are racing on the same persist state (dual-leader).

  sometimes — a marker INSERT'd on the primary is visible on
    materialized2 within bounded time.  Liveness anchor for the
    shared-state replication path.

  sometimes — at least one write was rejected with `cannot write in
    read-only mode`.  Liveness anchor for the rejection path so the
    safety check isn't vacuous when no write reached the target.
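
A sketch of how a write attempt against materialized2 could be sorted into the three cases these properties care about. The enum and `classify_write` are hypothetical; only the error text is taken from the message above.

```python
from enum import Enum


class WriteOutcome(Enum):
    ACCEPTED = "accepted"                      # violates the always property
    REJECTED_READ_ONLY = "rejected_read_only"  # feeds the rejection sometimes
    OTHER_ERROR = "other_error"                # e.g. a fault; proves nothing


READ_ONLY_ERROR = "cannot write in read-only mode"


def classify_write(attempt) -> WriteOutcome:
    """Run one write attempt (a zero-argument callable) and classify it."""
    try:
        attempt()
    except Exception as exc:
        if READ_ONLY_ERROR in str(exc).lower():
            return WriteOutcome.REJECTED_READ_ONLY
        return WriteOutcome.OTHER_ERROR
    return WriteOutcome.ACCEPTED
```

Keeping `OTHER_ERROR` separate is what stops a connection drop from being counted as either a safety violation or a successful rejection.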

Wired into both the focused `deploy` group and the kitchen-sink
`combined` group.  Future work (full deploy-generation bump → fence
on primary, then writes succeed on the new instance) is documented in
the group's manifest entry; it needs an in-container hook to bump
deploy_generation that today doesn't exist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The existing config groups all commit a placeholder `.env` (overwritten
by CI's build-antithesis.sh before each docker build) and have no
`.gitignore`.  The deploy group I added instead carried a `.gitignore`
excluding `.env`, which removed the file from the docker build context
even after CI had written it — the COPY in the FROM-scratch Dockerfile
then failed:

  failed to calculate checksum of ref … "/.env": not found

The nightly antithesis-flavored build (build x86_64 Antithesis) tripped
on this; build state was failing because every downstream job depends
on the antithesis image set.

Mirror the parallel-workload pattern: commit `.env` as a placeholder
referencing the current `ghcr.io/materializeinc/materialize/*` mzbuild
fingerprints; CI regenerates it pointed at the Antithesis GCP registry
before the docker build runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The `subscribe: every committed row eventually appears in the SUBSCRIBE
stream` Always-assertion fired false positives in the May 26 24h runs:
authoritative=150, received=149, missed=[150], reconnects=0 — an exact
off-by-one with no fault even involved.

Root cause is a snapshot-skew race in the driver, not a SUT bug.  The
authoritative set was read with a plain `SELECT counter FROM table`,
which under strict serializability picks its own read timestamp that
can sit *ahead* of the separately-sampled `mz_now()` used as the
consumer catch-up target.  The authoritative set then included a row
whose commit ts was later than target_ts, while the consumer — correctly
caught up only to target_ts — had not yet streamed it.

Fix: sample target_ts first, wait for the consumer to catch up to it,
then read the authoritative set `AS OF target_ts` so both sides of the
missed/spurious comparison are taken at the same logical time.  The
AS OF read is non-blocking here because the consumer already observed
progress past target_ts (table upper is already beyond it).
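
The corrected ordering can be sketched like this. All three callables, the table name, and the `mz_now()::text` cast are assumptions made for illustration; the real driver's plumbing is not shown here.

```python
import time


def missed_rows(query, received_up_to, consumer_frontier,
                table="subscribe_counter", timeout_s=30.0):
    """Return counters committed by target_ts that the consumer never saw.

    query(sql)            -> list of row tuples
    received_up_to(ts)    -> set of counters streamed with timestamp <= ts
    consumer_frontier()   -> latest progress timestamp the consumer observed
    Returns None if the consumer did not catch up in time (inconclusive).
    """
    # 1. Sample the target first.
    target_ts = int(query("SELECT mz_now()::text")[0][0])

    # 2. Wait until the consumer's progress is strictly past the target.
    deadline = time.monotonic() + timeout_s
    while consumer_frontier() <= target_ts:
        if time.monotonic() > deadline:
            return None
        time.sleep(0.05)

    # 3. Read the authoritative set at exactly the same logical time.
    rows = query(f"SELECT counter FROM {table} AS OF {target_ts}")
    authoritative = {row[0] for row in rows}
    return sorted(authoritative - received_up_to(target_ts))
```

With the plain `SELECT` replaced by the `AS OF` read, a row committed after `target_ts` can no longer appear on only one side of the comparison.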

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The probe added in the last commit fired (no longer unfound) but its condition
was never true across the May 26 runs (pass=0): it never observed a
replica non-online, nor a SELECT-1 recovery after a connect failure.
Two self-inflicted bugs:

1. Point-in-time `mz_cluster_replica_statuses` polled every 2s misses a
   clusterd kill/restart that resolves inside one window.  Add a query
   against the durable `mz_cluster_replica_status_history` (every
   transition with occurred_at), scoped to events since the probe
   launch — a blip the live poll missed still leaves an `offline` row
   that any later successful poll observes.  Also tighten the live poll
   to 1s.

2. The SELECT-1 liveness check used helper_pg.query_retry, which retries
   *through* fault windows and almost always eventually succeeds — so the
   "connect failed then later recovered" transition was never observed.
   Replace with a short, non-retrying direct connect so the transient
   failure is actually seen.
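
Why the retrying helper hid the transition can be shown with a small
simulation. Everything here is a hypothetical stand-in (`RecoveryTracker`,
`attempt_once`, `attempt_with_retry`, `scripted_connect`, `run_probe`);
it does not reproduce helper_pg.query_retry, only the general shape of
retry-until-success versus a single attempt.

```python
from typing import Callable, Iterable


class RecoveryTracker:
    """Records whether a success was ever observed after a failure."""

    def __init__(self) -> None:
        self.saw_failure = False
        self.recovered = False

    def observe(self, ok: bool) -> None:
        if not ok:
            self.saw_failure = True
        elif self.saw_failure:
            self.recovered = True


def attempt_once(connect: Callable[[], None]) -> bool:
    """Single, non-retrying attempt: a transient failure is reported."""
    try:
        connect()
        return True
    except ConnectionError:
        return False


def attempt_with_retry(connect: Callable[[], None], attempts: int) -> bool:
    """Retries until one attempt succeeds, hiding earlier failures."""
    return any(attempt_once(connect) for _ in range(attempts))


def scripted_connect(outcomes: Iterable[bool]) -> Callable[[], None]:
    """Build a fake connect() that succeeds or fails following a script."""
    remaining = iter(outcomes)

    def connect() -> None:
        if not next(remaining):
            raise ConnectionError("simulated fault window")

    return connect


def run_probe(poll: Callable[[Callable[[], None]], bool],
              outcomes: Iterable[bool], polls: int) -> RecoveryTracker:
    connect = scripted_connect(outcomes)
    tracker = RecoveryTracker()
    for _ in range(polls):
        tracker.observe(poll(connect))
    return tracker
```

Against the same scripted fault window, polling with `attempt_once`
records both the failure and the recovery, whereas polling through
`attempt_with_retry` never records a failure at all.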

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…replica, internal-error

Three new SUT-anchored assertions (default drivers/anytime, so every
group runs them) to catch wrong-results and panic-adjacent bugs the
existing crash/frontier suite misses:

* parallel_driver_differential_query — builds a random table and reads
  base table vs non-materialized view vs materialized view of the same
  aggregate inside ONE read transaction (one consistent timestamp);
  asserts all three agree.  Catches incremental-maintenance / optimizer
  wrong-results — the silent-correctness class with no prior coverage.

* parallel_driver_cross_replica_consistency — reads the same indexed
  view pinned to each replica of antithesis_cluster (via
  SET cluster_replica) and asserts identical results.  Catches replica
  divergence / non-deterministic execution.  Self-skips single-replica
  groups.

* anytime_internal_error_scan — runs a rotating set of valid catalog /
  arithmetic / window / jsonb queries and asserts none ever return
  Materialize's `internal error` class; also scans source/sink statuses
  for objects stuck in an internal-error state.

All demote fault-shaped failures via looks_like_fault so fault windows
don't trip the safety assertions.  Also includes bin/fmt line-wrapping
of two unrelated drivers.
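
The agreement check shared by the differential and cross-replica
oracles can be sketched as a multiset comparison. `all_agree` is a
hypothetical helper, not the drivers' actual code, and it assumes each
result set has already been fetched (inside one read transaction for
the differential case, or once per pinned replica for the cross-replica
case).

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping


def all_agree(results: Mapping[str, Iterable[Hashable]]) -> bool:
    """True when every named result set holds the same rows.

    Rows are compared as multisets, so row order is ignored but
    duplicate rows still count.
    """
    bags = [Counter(rows) for rows in results.values()]
    return all(bag == bags[0] for bag in bags[1:])
```

Comparing as multisets rather than lists keeps the check independent of
row order, which matters when the queries carry no ORDER BY.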

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… primary

The deploy group's premise was wrong.  A second `materialized` against
the *shared* catalog at deploy_generation=1 is not a passive read
replica: it's a 0dt deployment successor that auto-promotes after
`with_0dt_deployment_max_wait` (30 min) and fences the primary.  Both
the `deploy` and `combined` 10h runs (nightly 16590) timed out at the
30-minute setup deadline with `setup_error: No 'setup complete' event
received` — the workload entrypoint could never provision
antithesis_cluster against the fenced/contended primary.  Adding
materialized2 to `combined` broke the kitchen-sink group entirely.

environmentd has no permanent-read-replica flag (read-only mode is
purely deploy_generation-driven), so a real 0dt/fencing test needs
explicit start→rehydrate→promote→fence orchestration, not an always-on
second container.  Revert the whole experiment: remove materialized2
from mzcompose.py, the deploy group + combined's materialized2/driver
from groups.yaml, antithesis-*-deploy from export-compose.py, deploy
from the Makefile, and the workloads/configs/test dirs.  Regenerate the
combined compose.  The 0dt idea is preserved in git history for a
future properly-orchestrated attempt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erance

Both new oracles fired spurious always(False) failures across 7 groups
in the 10h runs, not on actual disagreements (no base≠mv, no replica
divergence was ever reported), but on fault/transient errors their
handlers didn't recognize:

  cross-replica: "target replica failed or was dropped" — Antithesis
    killed the clusterd the read was pinned to via SET cluster_replica.
  differential:  "collection 'uNN' was dropped" — the scratch MV/view's
    collection was torn down during fault recovery (or our own CASCADE
    cleanup raced a retry).

Add "target replica failed or was dropped" to the shared
helper_fault_tolerance FAULT_PATTERNS (an unambiguous replica-fault any
replica-targeted driver can hit).  Handle "was dropped" locally in both
oracle drivers — these own prefix-scoped scratch objects, so a drop
mid-read is only ever fault-recovery / self-cleanup, never a real bug;
keeping it local preserves strictness for drivers that read shared
objects (testdrive, etc.).  Both now demote to the
"completed without a fault" sometimes-anchor instead of the
always(False) "unexpected error" path.
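
The shared-versus-local split can be sketched as follows. The names
`SHARED_FAULT_PATTERNS`, `SCRATCH_ONLY_PATTERNS` and `classify` are
hypothetical, the shared tuple is reduced to the one pattern named
above, and the real helper_fault_tolerance module may be structured
differently.

```python
from typing import Iterable

# Reduced stand-in for the shared list: only the pattern added here.
SHARED_FAULT_PATTERNS = ("target replica failed or was dropped",)

# Patterns that only drivers owning prefix-scoped scratch objects opt into.
SCRATCH_ONLY_PATTERNS = ("was dropped",)


def classify(message: str, local_patterns: Iterable[str] = ()) -> str:
    """Return "fault" for a recognised fault-shaped error, else "unexpected"."""
    text = message.lower()
    patterns = SHARED_FAULT_PATTERNS + tuple(local_patterns)
    if any(pattern in text for pattern in patterns):
        return "fault"
    return "unexpected"
```

A driver reading shared objects calls `classify(msg)` and still treats a
dropped collection as unexpected; the two oracle drivers would call
`classify(msg, SCRATCH_ONLY_PATTERNS)`.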

The oracles themselves found zero correctness violations in the 10h
runs — this only removes the fault-handling noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e + mz_now in one txn

The earlier AS-OF-mz_now() fix didn't fully close the snapshot-skew
race; the 10h runs still showed exact off-by-one missed/spurious
(authoritative=118 received=119, ambiguous=0, reconnects=0 — no fault
involved).  Root cause: mz_now() sampled on its own connection can lag
the producer's last committed row, so reading the authoritative set
AS OF that too-early target missed a row the consumer already streamed.

Read `SELECT counter FROM table` and `SELECT mz_now()` in a SINGLE read
transaction.  The txn's read timestamp T is >= every committed row, so
authoritative_set is complete; and target_ts == T, so once the consumer
reaches T it has streamed exactly that set (producer has stopped, nothing
commits after T).  missed/spurious are then structurally empty unless the
SUT genuinely lost or invented a row.  Bail to the existing
sometimes(False) anchor on a fault-shaped snapshot failure; drop the now
unused query_retry / query_one_retry imports.
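
A minimal sketch of the single-transaction read, assuming a DB-API
style cursor on an autocommit connection so the explicit BEGIN/COMMIT
are sent as written. The function names and the `table` parameter are
hypothetical, and the fault-shaped bail-out described above is omitted.

```python
from typing import Any, Set, Tuple


def snapshot_with_timestamp(cur: Any, table: str) -> Tuple[Set[Any], Any]:
    """Read the authoritative set and mz_now() inside one read transaction."""
    cur.execute("BEGIN")
    cur.execute(f"SELECT counter FROM {table}")
    authoritative = {row[0] for row in cur.fetchall()}
    cur.execute("SELECT mz_now()")
    target_ts = cur.fetchone()[0]
    cur.execute("COMMIT")
    return authoritative, target_ts


def diff_sets(authoritative: Set[Any], received: Set[Any]) -> Tuple[Set[Any], Set[Any]]:
    """Return (missed, spurious) relative to the authoritative set."""
    return authoritative - received, received - authoritative
```

Once the consumer has reached `target_ts`, `diff_sets` is expected to
return two empty sets unless a row was genuinely lost or invented.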

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two SUT-anchored assertions (default drivers, so every group runs
them) for bug classes the existing suite doesn't directly test:

* anytime_source_status_convergence — port of test/source-sink-errors
  (README §4 "Next test frameworks to port").  Polls
  mz_source_statuses + mz_sink_statuses scoped to user objects,
  tracks per-(id, kind) healthy/unhealthy transitions across the
  launch, and asserts no user source/sink ended the launch stuck in
  `stalled`/`failed`/`ceased`/`paused`/`dropped` without a subsequent
  healthy observation past MIN_RECOVERY_S (180s, comfortably longer
  than the fault-orchestrator's MAX_ON=40s).  Catches the bug class
  where the status tables don't converge to ground truth after the
  upstream disruption clears.  Liveness anchors fire on observed
  stalls + observed recoveries.

* singleton_driver_append_only_monotonic — runs a long-lived
  monotonic-counter INSERT loop into a self-owned prefix-scoped table
  (singleton so the in-process committed_count model is unambiguous)
  and periodically samples count(*) on a fresh connection.  Asserts
  count(*) never decreases across observations and never exceeds the
  in-process committed_count.  Catches the persist / storage data-loss
  bug class where a committed row silently disappears after a fault
  or compaction — no other driver directly tests row-count
  monotonicity on an append-only object.
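
The two append-only checks can be sketched as a small in-process model.
`AppendOnlyModel` and its method names are hypothetical; the real
driver's bookkeeping and assertion messages may differ.

```python
from typing import List


class AppendOnlyModel:
    """Tracks committed INSERTs and checks sampled count(*) values."""

    def __init__(self) -> None:
        self.committed_count = 0  # INSERTs this process saw commit
        self.last_observed = 0    # highest count(*) sampled so far

    def record_commit(self) -> None:
        self.committed_count += 1

    def check(self, observed: int) -> List[str]:
        """Return the violations implied by one count(*) sample."""
        violations = []
        if observed < self.last_observed:
            violations.append("count decreased")
        if observed > self.committed_count:
            violations.append("count exceeds committed")
        self.last_observed = max(self.last_observed, observed)
        return violations
```

Running as a singleton is what makes `committed_count` an upper bound:
no other process inserts into the same table.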

Both demote fault-shaped failures via looks_like_fault, so fault
windows don't trip the safety assertions; both were validated against a
live materialized (source-status query reads cleanly, INSERT/count
pattern works).
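
One possible reading of the convergence rule, as a sketch: an object
counts as stuck if it has been continuously unhealthy for at least
MIN_RECOVERY_S by the end of the launch. `stuck_objects` and the
observation tuple layout are hypothetical, and the driver's exact
bookkeeping may differ.

```python
from typing import Dict, Iterable, List, Tuple

UNHEALTHY = {"stalled", "failed", "ceased", "paused", "dropped"}
MIN_RECOVERY_S = 180

Observation = Tuple[float, str, str, str]  # (seconds, object id, kind, status)


def stuck_objects(observations: Iterable[Observation],
                  launch_end: float) -> List[Tuple[str, str]]:
    """Return (id, kind) pairs still unhealthy after MIN_RECOVERY_S.

    Observations must be in time order. Any status outside UNHEALTHY
    counts as a healthy observation and clears the object's record.
    """
    unhealthy_since: Dict[Tuple[str, str], float] = {}
    for ts, object_id, kind, status in observations:
        key = (object_id, kind)
        if status in UNHEALTHY:
            unhealthy_since.setdefault(key, ts)
        else:
            unhealthy_since.pop(key, None)
    return sorted(
        key for key, since in unhealthy_since.items()
        if launch_end - since >= MIN_RECOVERY_S
    )
```

An object that stalls and later reports a healthy status is cleared,
and one that has been unhealthy for less than MIN_RECOVERY_S at the end
of the launch is given the benefit of the doubt.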

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@def- def- force-pushed the pr-antithesis2 branch from e2eac0a to b7d32d3 Compare May 28, 2026 02:52