
test: add e2e replication tests for multi-region validation#394

Draft
WentingWu666666 wants to merge 5 commits into documentdb:main from WentingWu666666:developer/e2e-replication-tests

@WentingWu666666
Collaborator

Summary

Add a new replication test area that validates DocumentDB cross-cluster replication semantics within a single Kind cluster, following the same approach CNPG uses for its replication tests.

Changes

  • Add `ReplicationLabel` to `labels.go` and `allAreaLabels()` in `suite_test.go`
  • Add `ReplicationReady` (10min) and `DataSync` (3min) timeout operations
  • Create replication mixin template for `clusterReplication` config
  • Create `test/e2e/tests/replication/` test area with:
    • Suite bootstrap with Ginkgo SynchronizedBeforeSuite/AfterSuite
    • Helpers including ExternalName bridge services that simulate cross-cluster DNS resolution within a single cluster
    • Deploy test: deploys primary + replica, verifies CNPG ReplicaCluster config, pg_basebackup source, and ExternalClusters
    • Data replication test: validates bulk insert count, content fidelity, and update replication via MongoDB wire protocol

Approach

Since DocumentDB replication is designed for multi-cluster deployments (with service mesh handling DNS), testing within a single cluster requires ExternalName bridge services to CNAME the expected DNS names to actual service FQDNs. This is test-only scaffolding — it does not change production code.

Testing

All 4 tests pass on a local Kind cluster (~204 seconds):

  • Deploy primary and replica with CNPG config verification (2 specs)
  • Data replication: bulk insert count, content fidelity, update propagation (3 ordered cases)

```bash
cd test/e2e
ginkgo -r --label-filter=replication ./tests/...
```

Add a new replication test area that validates DocumentDB cross-cluster
replication semantics within a single Kind cluster, following the same
approach CNPG uses for its replication tests.

Changes:
- Add ReplicationLabel to labels.go and allAreaLabels() in suite_test.go
- Add ReplicationReady (10min) and DataSync (3min) timeout operations
- Create replication mixin template for clusterReplication config
- Create test/e2e/tests/replication/ test area with:
  - Suite bootstrap with Ginkgo SynchronizedBeforeSuite/AfterSuite
  - Helpers including ExternalName bridge services that simulate
    cross-cluster DNS resolution within a single cluster
  - Deploy test: deploys primary + replica, verifies CNPG
    ReplicaCluster config, pg_basebackup source, and ExternalClusters
  - Data replication test: validates bulk insert count, content
    fidelity, and update replication via MongoDB wire protocol

All 4 tests pass on a local Kind cluster (~204 seconds).

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 force-pushed the developer/e2e-replication-tests branch from b8ac2e1 to 4387913 on June 1, 2026 17:55
wentingwu000 and others added 3 commits June 1, 2026 14:20
Add failover_test.go that validates:
- Pre-failover data seeding and replication to replica
- Promotion via spec.clusterReplication.primary patch
- CNPG cluster role swap (replica→primary, primary→replica)
- Pre-existing data accessibility on the new primary
- New primary accepts writes after promotion

Also adds Failover timeout (10min) to the timeouts package.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
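
The promotion step in the failover commit above patches `spec.clusterReplication.primary`. A minimal sketch of building such a patch body follows. It assumes the field takes the name of the cluster to promote and that the patch is sent as a JSON merge patch; neither detail is confirmed by this PR, and `promotionPatch` and the cluster name are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// promotionPatch builds a JSON merge patch body that sets
// spec.clusterReplication.primary to newPrimary.
// Assumption: the field holds the name of the cluster to promote.
func promotionPatch(newPrimary string) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{
			"clusterReplication": map[string]any{
				"primary": newPrimary,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := promotionPatch("replica-cluster") // hypothetical cluster name
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

A merge patch touches only the named field, so the rest of the resource spec is left as deployed.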
Consolidate deploy_replication_test.go into data_replication_test.go to
avoid duplicate primary+replica deployments. CNPG config assertions now
run in the BeforeAll of the data replication spec, cutting total test
time by eliminating a redundant ~2 min setup phase.

Move findCNPGCluster() helper to helpers_test.go so it is shared by both
the data replication and failover test files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Add a 5th failover test case that writes data on the new primary and
verifies it replicates to the demoted replica. This confirms the
replication pipeline remains functional after a promotion, with data
flowing from the new primary to the demoted node.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 force-pushed the developer/e2e-replication-tests branch from 6d006e9 to d279473 on June 1, 2026 19:09
…ivergence)

Add timeline_divergence_test.go with tests that reproduce three sub-issues
from issue documentdb#375:

- Sub-issue 2: promotionToken not cleared after successful promotion
  (CNPG cluster reports 'Cluster is unrecoverable')
- Second rapid failover: cluster becomes unrecoverable after A→B→A
  switchback before replication is healthy
- Sub-issue 1: replication broken after rapid back-to-back failover
  (writes fail with 'Exceeded time limit waiting for primary')

All three tests are designed to FAIL against the current operator,
confirming the bugs exist. They will pass once issue documentdb#375 is fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@documentdb-triage-tool (bot) added the enhancement and test labels on Jun 1, 2026
@documentdb-triage-tool

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, enhancement
Project fields suggested: Component test · Priority P2 · Effort L · Status In Progress
Confidence: 0.88 (mixed)

Reasoning

component from path globs (test); effort from diff stats (1208+3 LOC, 9 files); LLM: Adds a new e2e test area for multi-region replication validation across multiple files with new labels, timeouts, helpers, and test suites — touches test infrastructure broadly.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.
