Skip to content

feat: long-haul (canary) test driver for the operator (#220)#348

Open
WentingWu666666 wants to merge 52 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/long-haul-tests
Open

feat: long-haul (canary) test driver for the operator (#220)#348
WentingWu666666 wants to merge 52 commits into
documentdb:mainfrom
WentingWu666666:developer/wentingwu/long-haul-tests

Conversation

@WentingWu666666
Copy link
Copy Markdown
Collaborator

@WentingWu666666 WentingWu666666 commented Apr 20, 2026

Closes #220.

What this PR adds

A standalone long-haul (canary) test driver for the DocumentDB Kubernetes Operator, plus the GitHub Actions plumbing to keep it running on a long-lived AKS cluster.

The driver lives under test/longhaul/ and is its own Go module. It is not a Ginkgo suite — it is a long-running binary (cmd/longhaul) deployed as a Deployment into a namespace alongside a real DocumentDB CR. It exercises the operator with a continuous, realistic workload while a weighted-random scheduler injects disruptive operations, and it journals every event so failures can be reproduced.

Pieces that ship in this PR

  • Continuous data plane (workload/): writer + verifier goroutines, sequence / checksum gap detection, lag-aware reads — the durability oracle for FM1/FM3/FM7.
  • Operation scheduler (operations/): weighted random selection, per-category cooldowns, steady-state gating, and an outage-policy verdict per operation. Operations implemented today: ScaleUp, ScaleDown, UpgradeDocumentDB (auto-skips when instancesPerNode < 2 because rolling restart on a single instance is unavoidable downtime).
  • Real cluster client (monitor/k8sclient.go): reads the DocumentDB CR, watches pods, pulls metrics-server samples — no fakes.
  • Health monitor + leak detector (monitor/health.go, monitor/leakdetect.go): steady-state detection plus linear-regression slope on memory/CPU samples.
  • Append-only journal (journal/): every disruption window, every workload event, every verdict. The disruption-window oracle is what makes verdicts reproducible.
  • Reporter + alerter (report/): writes the longhaul-report ConfigMap so a human can kubectl get cm longhaul-report -o yaml for a single source of truth on pass/fail. Alerts surface in the monitor workflow.
  • In-cluster packaging (Dockerfile, deploy/longhaul.yaml): runs as a Deployment (not a Job) — if the process exits, it restarts; the ConfigMap is the durable verdict, not the pod's exit code.
  • Two GHA workflows chained via workflow_run:
    • LONGHAUL - Build Test Driver Image — builds and pushes to GHCR.
    • LONGHAUL - Deploy Test Driver to AKS — uses a long-lived ServiceAccount-token kubeconfig (LONGHAUL_KUBECONFIG repo secret) to roll the Deployment.
  • LONGHAUL - Monitor — hourly cron that pulls the report ConfigMap and posts a workflow summary; alerts on stale or failing reports.
  • Tier-1 unit tests across config, journal, monitor, report, operations — no cluster needed, run with go test ./... from test/longhaul/.
  • Docsdocs/designs/long-haul-test-design.md (problem statement, eight failure modes, design principles, architecture, operations catalog with status, env-var table, phase tracker) and test/longhaul/README.md (how to run, what every env var does, why Deployment over Job).

Why this is structured the way it is

  • Long-haul is what happens when the failure modes you need to detect require more time and operations than fit in any bounded test run. Memory leaks, CR-history drift, upgrade-under-state failures — these don't speed up. The eight failure modes in the design doc are why every component exists.
  • No drain before disruption. Workload runs through scale, upgrade, and (future) failover ops — that's exactly the bug class we're trying to find.
  • Deployment, not Job. A Job that finishes is a Job no one looks at. The longhaul-report ConfigMap is the durable verdict, and the monitor cron makes sure a human sees stale runs.
  • Single cluster today, two clusters later. A baseline DocumentDB cluster (control group) is in the design but not in this PR — the single-cluster driver already exercises the primary failure modes.

Live canary

I've stood up a long-running cluster — the driver Deployment lives there and gets re-rolled by the deploy workflow on every successful image build:

  • AKS cluster: longhaul-aks (longhaul-rg, sub 81901d5e…)
  • Everything (DocumentDB CR + driver Deployment + longhaul-report ConfigMap) runs in documentdb-test-ns.
  • Useful pokes:
    • kubectl logs deploy/longhaul-test -n documentdb-test-ns
    • kubectl get cm longhaul-report -n documentdb-test-ns -o yaml

What's intentionally out of scope

Tracked in the design doc's phase table:

  • Two-cluster topology (Phase 3): primary + baseline, so signals become attributable.
  • Backup / restore operations (Phase 2a) and chaos primitives beyond scale (Phase 2d): drain node, kill primary pod, operator restart.
  • Operator version upgrade operation: we exercise DocumentDB version upgrades today; operator upgrades will land once the release workflow is wired to bump the canary's target.
  • OutagePolicy.AllowedDowntime enforcement: the field is reserved on the struct; current verdicts use AllowedWriteFailures + MustRecoverWithin. Called out explicitly in the design doc.

How to review

If you only want to skim, three files give you the shape of the PR:

  1. docs/designs/long-haul-test-design.md — the why.
  2. test/longhaul/cmd/longhaul/main.go — the wiring.
  3. test/longhaul/operations/scheduler.go — the heart of the disruption loop.

If you want to run it locally:

cd test/longhaul
LONGHAUL_CLUSTER_NAME=documentdb-sample LONGHAUL_NAMESPACE=documentdb-test-ns \
  LONGHAUL_MAX_DURATION=10m \
  go run ./cmd/longhaul

Happy to walk through any area on a call.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 8 times, most recently from ca9522c to d1b3462 Compare April 22, 2026 16:38
@WentingWu666666 WentingWu666666 changed the title docs: add long haul test design document feat: add long-haul test design and Phase 1a skeleton (#220) Apr 22, 2026
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from d1b3462 to 6bd4274 Compare April 22, 2026 16:41
@WentingWu666666 WentingWu666666 marked this pull request as ready for review April 22, 2026 16:50
Copilot AI review requested due to automatic review settings April 22, 2026 16:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.

Changes:

  • Introduces a long-haul Ginkgo suite with a BeforeSuite enablement gate (LONGHAUL_ENABLED) and a placeholder canary spec.
  • Adds a test/longhaul/config package for env-based configuration loading/validation with unit tests.
  • Documents the long-haul test architecture and usage via a new design doc and package README.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
operator/src/test/longhaul/suite_test.go Adds the Ginkgo suite entry point for long-haul tests.
operator/src/test/longhaul/longhaul_test.go Adds gated BeforeSuite config loading/validation and a placeholder canary spec.
operator/src/test/longhaul/config/suite_test.go Adds the Ginkgo suite entry point for config unit tests.
operator/src/test/longhaul/config/config_test.go Adds unit tests for config defaults, env parsing, validation, and enablement gating.
operator/src/test/longhaul/config/config.go Implements long-haul test configuration (env vars, defaults, validation, enablement).
operator/src/test/longhaul/README.md Adds usage docs, configuration reference, and CI-safety notes for long-haul tests.
docs/designs/long-haul-test-design.md Adds the long-haul canary design document and phased implementation plan.

Comment thread operator/src/test/longhaul/README.md Outdated
Comment thread operator/src/test/longhaul/README.md Outdated
Comment thread operator/src/test/longhaul/config/config.go Outdated
Comment thread test/longhaul/config/config_test.go
Comment thread docs/designs/long-haul-test-design.md Outdated
Comment thread docs/designs/long-haul-test-design.md
Comment thread docs/designs/long-haul-test-design.md Outdated
Copy link
Copy Markdown
Collaborator

@xgerman xgerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.

Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅


🟠 Major

M1. Module placement will pollute operator/src/go.modoperator/src/test/longhaul/ (entire tree)

The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.

For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:

  • Phase 1b adds go.mongodb.org/mongo-driver (design line 75)
  • Phase 1c adds a file-backed journal
  • Phase 1d adds OTel / Prometheus client libs
  • Phase 2f adds k8s Job manifests + likely controller-runtime client

…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.

M2. Design does not engage with the existing e2e suitedocs/designs/long-haul-test-design.md throughout

Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:

  • Ginkgo v2 harness with SynchronizedBeforeSuite for cluster bootstrapping (test/e2e/pkg/testenv)
  • Label taxonomy (test/e2e/labels.go)
  • Run-ID tagging for artifact separation (test/e2e/runid.go)
  • envsubst-based manifest rendering
  • A mongo-driver v2 wire-protocol client (same driver the design proposes)
  • .github/actions/setup-test-environment composite action for cluster provisioning

The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.

M3. Four design gaps that will cost rework, not captured in §Open Questionslong-haul-test-design.md:443-445

  1. Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
  2. Workload resumption idempotency. Writers use a unique index on (writer_id, seq) (line 80). If a writer or the Job pod restarts, where does seq restart from? Reset to 0 → writes collide. Bootstrap from max(seq) per writer_id → the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described.
  3. Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
  4. Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.

These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.

M4. MaxDuration=30m default contradicts "run until failure"config.go:594, long-haul-test-design.md:37,188

The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.


🟡 Minor (cheap in-PR fixes)

  • m1 IsEnabled()config.go:637-640 — doesn't trim whitespace. LONGHAUL_ENABLED=" true " → false (silent skip). Add strings.TrimSpace. Silent-failure mode; worth fixing in-PR.
  • m2 Validate()config.go:625-633 — accepts negative MaxDuration. time.ParseDuration("-5m") succeeds. Add if c.MaxDuration < 0 { return fmt.Errorf(...) }.
  • m3 Error strings capitalized — config.go:627,630ST1005 / Go convention: lowercase. Matching tests use ContainSubstring("Namespace") / "ClusterName" and will need adjustment.
  • m4 Path inconsistency — long-haul-test-design.md:340 says kubectl apply -f test/longhaul/deploy/job.yaml but the directory structure roots at operator/src/test/longhaul/. Fix path, or resolve M1 first.
  • m5 README overstates CI safety — README.md:543-547 — phrase as "no CI workflow currently sets LONGHAUL_ENABLED; do not set it in any PR-gated workflow" rather than as a property of the code.
  • m6 config is a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Consider longhaulcfg or cfg. Not blocking.
  • m7 Missing IsEnabled test cases — config_test.go:727-757 — given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case ("True", "YES"), "0" / "no" / "false" → false.

🟢 Nit

  • longhaul_test.go:799 — package-level mutable testConfig. Fine for a single-file suite; revisit if Phase 1b splits files.
  • DefaultConfig() by value vs. Validate() pointer receiver — slight asymmetry. Leave it.
  • README line 509's -timeout 0 guidance is correct and rarely documented — nice touch.

Positive feedback

  • Skip gate is correctly placed in BeforeSuite, not BeforeEach — single skipped line instead of N per spec. Exactly right.
  • LoadFromEnv cleanly separates parsing from validation, so Validate() can be called on a programmatically-built Config in Phase 1b unit tests without fighting env vars.
  • The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
  • Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
  • Test coverage on the config package is thorough for the surface area.

Questions for the author

  1. (M1) Any reason longhaul can't live at test/longhaul/ as a sibling module to test/e2e/? If there is one, please capture it in the design.
  2. (M2) Have you looked at test/e2e/pkg/ for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation?
  3. (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
  4. Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
  5. Should CONTRIBUTING.md or AGENTS.md get a one-liner pointing at operator/src/test/longhaul/README.md so contributors aren't left wondering "what is this tree?"

Reviewed with the GitHub Copilot code-review agent.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 3 times, most recently from f3ffbef to 8459cae Compare April 22, 2026 18:47
@WentingWu666666
Copy link
Copy Markdown
Collaborator Author

WentingWu666666 commented Apr 22, 2026

Response to @xgerman's Review

Thank you for the thorough review! I've addressed all items. Here's a summary:


Major Items

M1 (Module placement): Agreed extracted long-haul tests to test/longhaul/ at the repo root as a separate Go module (github.com/documentdb/documentdb-operator/test/longhaul). This keeps Ginkgo/Gomega and any future test-only deps completely out of the operator's runtime go.mod. Follows the three-directory layout described in the updated design doc.

M2 (E2E engagement): Designed a three-directory layout for test code reuse:

Each directory has its own go.mod with replace directives. The design doc now includes a package-to-use mapping table showing which shared utils each long-haul phase will reuse (mongo client, assertions, DocumentDB CR helpers, etc.). Extraction of shared packages from test/e2e/pkg/e2eutils/ to test/utils/ will be coordinated with @xgerman in a follow-up.

M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc:

  1. Journal durability: PVC-backed file at /data/journal/, structured JSON lines, on-startup reconstruction from existing file.
  2. Writer seq resumption: Bootstrap from max(seq) per writer_id in DB. Oracle tolerates crash-recovery gaps (logged, not flagged as data loss).
  3. Teardown on abort: Signal handler for SIGTERM/SIGINT cancel contexts flush journal write final status to ConfigMap exit. Leftover-run detector on startup.
  4. Latency baseline: Kept as Open Question establish P99 baseline during first 30min warmup, alert on >2 sustained regression.

These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs.

M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets LONGHAUL_MAX_DURATION=0s. The 30m default remains as a safety net for local development.


Minor Items All Fixed

  • m1: Added strings.TrimSpace to IsEnabled()
  • m2: Added negative MaxDuration validation in Validate()
  • m3: Lowercased error strings (Go convention ST1005)
  • m4: Fixed path: test/longhaul/deploy/job.yaml
  • m5: Rephrased CI safety: "No CI workflow currently sets this variable do not add it to any PR-gated workflow"
  • m6: Valid concern. Keeping config for Phase 1a (no shadowing risk yet). Will revisit if the operator's own config needs importing in later phases.
  • m7: Added 8 new test cases: whitespace (" true ", " yes "), whitespace-only (" "), mixed case ("True", "YES"), explicit false values ("0", "no", "false"). Total: 23 specs (was 15).

Questions

  1. (M1) Resolved extracted to separate Go module at test/longhaul/.
  2. (M2) Resolved three-directory layout with shared utils plan documented.
  3. (M3.1) PVC-backed file journal documented as provisional design decision.
  4. (Issue Set up long haul test cluster #220) The design's Problem Statement already quotes the three requirements from Set up long haul test cluster #220. SLO targets remain as Open Question Adding Microsoft SECURITY.MD #2 they need team input.
  5. (CONTRIBUTING.md/AGENTS.md) Good idea. Will add a one-liner in a follow-up PR to avoid scope creep here.

Test results: 23 config specs + 1 canary spec (skipped) = all passing. go vet clean.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 3 times, most recently from 9676d24 to 158a393 Compare April 23, 2026 17:31
@hossain-rayhan
Copy link
Copy Markdown
Collaborator

The plan looks good.
I have one concern about the test restart policy.

We should keep running parallel tests for different combination. Like, operator n version with gateway/extension n-1 version. Operator n-2 version with keeps working fine with latest non-schema changes.

At least, different operator version tests, not only the latest release restarts the tests.

@WentingWu666666
Copy link
Copy Markdown
Collaborator Author

Thanks for the feedback @hossain-rayhan! Great point about version compatibility.

Since the operator and database images follow independent version tracks, we agree cross-version testing matters e.g., upgrading the operator while the database is still on the previous version, and vice versa.

That said, we think this is better suited for the E2E test suite rather than long-haul. Cross-version compat bugs surface immediately on upgrade (deterministic, finite), while long-haul focuses on time-dependent issues memory leaks, connection pool exhaustion, replication drift that only appear after hours/days of sustained operation.

We'll make sure the E2E suite covers both operator-first and database-first upgrade paths with version N/N-1 combinations.

Comment thread docs/designs/long-haul-test-design.md Outdated
Comment thread docs/designs/long-haul-test-design.md Outdated
Comment thread docs/designs/long-haul-test-design.md Outdated
Comment thread docs/designs/long-haul-test-design.md Outdated
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from 158a393 to 1d63aea Compare April 24, 2026 15:35
wentingwu000 and others added 3 commits May 7, 2026 13:08
…dDowntime TODO (documentdb#220)

Sweep of xgerman's lower-severity findings on PR documentdb#348:

[documentdb#8] journal.DisruptionWindow.ExceededPolicy() ignored Policy.AllowedDowntime even though every operation's OutagePolicy() sets it. Document the gap with a TODO(longhaul, documentdb#220) explaining what enforcement would require (per-write timestamps fed into the active window from writer.go) so the dead field is not mistaken for actual policy enforcement.

[documentdb#9] UpgradeDocumentDB.waitForImage was 'continue'-ing on every status read error during the rolling restart, which made a sustained apiserver outage indistinguishable from a slow upgrade until the recovery timeout fired. Plumb the journal into the upgrade operation and emit j.Warn on each transient read error so a real outage shows up in the report immediately while one-off blips still retry on the next 10s tick. Constructor signature changes; main.go updated.

[nit] report.EmitAnnotation called Duration.Round(1) which rounds to 1ns -- a no-op. Use Round(time.Second) to match GenerateMarkdown so annotations read '2h17m' instead of '2h16m59.834561287s'.

[nit] monitor.ClusterHealth.WritesHealthy declared but never set or read. Removed; can be added back alongside its populator if/when the writer's healthy/unhealthy signal is wired in.

[documentdb#7 deferred] Pinning go.mod to 1.25.x is non-trivial: indirect k8s.io/* deps were tidied against pre-release versions (dated 20260120) that require go 1.26 features. Downgrading them needs its own change (release-pinned k8s.io/* + transitive resolution) rather than just editing the directive. Tracking separately.

Verified: go vet ./..., go test ./... (config: 19 specs) clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220)

[documentdb#6] writer.writeOne now treats mongo.IsDuplicateKeyError as a successful, idempotent ACK. With retryable writes on by default in the v2 driver, a network blip during a disruption window can produce: server commits insert, ACK is dropped, driver auto-retries the same _id, server returns code 11000. The data is durably committed in that case, so counting it as a write failure (which feeds AllowedWriteFailures and the policy gate) was turning successful writes into spurious FAIL verdicts during the very disruption windows the harness exists to measure.

[documentdb#5] verifier.verifyWriter now resumes from a per-writer nextSeq map instead of re-scanning the entire writer history every 10s. A 100ms-per-write writer accumulates ~864k docs/day, so the original full scan grew to ~75M doc-reads/hour per writer over a 3-day run, both saturating the cluster and turning the verifier's own load into a confounding signal in the report. Filter is now {writer_id, seq: \$gte\: nextSeq}; nextSeq is advanced to lastSeen+1 each cycle. Gaps are still detected because we always start the scan at the first unverified seq -- a future cycle that finds a gap-fill below nextSeq is intentionally not re-checked (an in-flight delayed write that lands below a previously-seen seq would already have produced a gap on the cycle that observed the higher seq).

Verified: go vet ./..., go test ./... clean. Mongo-dependent unit tests for both behaviors land in the next commit alongside the rest of the unit-test sweep.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220)

Adds the per-package unit-test sweep xgerman flagged in PR documentdb#348 [#3]. Workload writer/verifier remain integration-tested only because their meaningful behavior (mongo retryable writes, dup-key handling, incremental scan filters) requires a real or mocked mongo collection that is out of scope for this PR.

journal/policy_test.go covers DisruptionWindow.{IsActive, Duration, ExceededPolicy} including boundary cases (write failures equal to budget allowed; active window evaluated against MustRecoverWithin).

journal/journal_test.go covers Record/Info/Warn/Error append semantics, EventsSince cutoff, the OpenDisruptionWindow auto-close-previous behavior, RecordWriteFailure no-op when no active window, HasPolicyViolation across active and closed windows, and a concurrent-append correctness check.

monitor/health_test.go uses a fakeClusterClient stub to exercise IsSteadyState transitions: not-yet-healthy, becoming-healthy after the configured wait, loss-of-health resets steadySince, poll-error resets steadySince, and WaitForSteadyState's reached/cancelled paths.

monitor/leakdetect_test.go covers the linear-regression slope: below min-samples returns zeroed result, flat memory produces ~0 slope, growing memory above threshold flags HasLeak with the expected slope (~3600 MB/h for 1 MB/s growth), sub-threshold growth does not flag, and the minSamples<3 floor is enforced.

operations/scheduler_test.go uses a fakeOp stub to validate selectOperation: nil when no precondition passes, nil when total weight is zero, only-eligible filtering, weighted-distribution within tolerance over 2000 trials, and executeOp's window-open/close + ERROR-event-on-failure behavior.

report/report_test.go snapshots GenerateMarkdown invariants: PASS vs FAIL header, duration rounded to seconds, data-plane metrics table always present, disruption-windows and leak-analysis sections gated on data presence, and the recent-events tail truncated to 20.

Verified: go test ./... clean. -race not run locally (CGO unavailable on this Windows env); the concurrent-append test still validates final-count correctness without it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from e3a5846 to d6f9adc Compare May 7, 2026 17:11
wentingwu000 and others added 3 commits May 7, 2026 13:13
Adds a section to test/longhaul/README.md explaining how the long haul
harness relates to the Ginkgo-based e2e suite landed in PR documentdb#346:

- Different shapes (daemon vs test binary) and lifetimes (days vs minutes)
- Different operator-API access patterns (dynamic vs typed client)
- Concrete e2e helpers (mongo, operatorhealth, clusterprobe) that long
  haul will likely consume once it grows beyond raw URI / no-TLS

Explains why a shared test/shared/ module is deliberately not introduced
yet: today's only duplicated surface (raw mongo connect + ping) is too
small to justify a third Go module given the version drift between the
two test modules. Revisit when the driver adopts e2e's connection model.

No code changes.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 8 new *_test.go files using plain `testing` (matching the
in-package dominant pattern; only config/ uses Ginkgo for envtest-style
suite setup).

Coverage:
- workload/metrics_test.go: Snapshot, WriteSuccessRate, HasDataLoss,
  concurrent counter aggregation
- workload/writer_test.go: computeChecksum determinism + sensitivity
  + 16-char hex format; constructor wiring
- workload/verifier_test.go: nextSeq resume-point semantics; mongo
  I/O paths skipped (integration-only)
- operations/scale_test.go: clamping, OutagePolicy values, full
  Precondition table for ScaleUp/ScaleDown
- operations/upgrade_test.go: readDesiredVersion (NotFound/found/empty),
  Precondition gating (no-desired/match/ipn<2/eligible)
- report/alert_test.go: all 5 EmitAnnotation branches (PASS/FAIL/
  warning-on-leak) using os.Pipe stdout capture
- report/checkpoint_test.go: create-on-first-emit, update-on-second
  via fake clientset; RUNNING-for-PASS translation; FAIL preserved
- monitor/k8sclient_test.go: K8sClusterClient end-to-end via fake
  clientset + fake dynamic client; isPodReady, GetClusterHealth,
  GetInstancesPerNode, ScaleCluster, GetCurrentDocumentDBImageTag
  (incl. registry-with-port edge), UpgradeDocumentDB, MetricsAvailable

The mongo I/O paths (writer.writeOne, verifier.verifyAll) and
cmd/longhaul/main.go remain integration-only -- documented inline.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolves xgerman review item: test/longhaul/go.mod declared go 1.26.0
while operator/src/go.mod and test/e2e/go.mod both pin go 1.25.9.

Root cause: longhaul required k8s.io/{api,apimachinery,client-go,metrics}
v0.36.0 which transitively requires Go 1.26+. Downgraded those deps to
v0.35.0 (matches operator and e2e), which allowed the go directive to
drop back to 1.25.9.

Verified with go mod tidy + go build ./... + go test ./... from
test/longhaul/ -- all 6 packages pass.

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@WentingWu666666
Copy link
Copy Markdown
Collaborator Author

@xgerman thanks again for the detailed review all items from your latest review are now addressed. Quick disposition summary:

Critical

  1. ScaleUp/Down now patch spec.instancesPerNode (not nodeCount).
  2. Env vars renamed to LONGHAUL_MIN_INSTANCES / LONGHAUL_MAX_INSTANCES everywhere (code, README, workflow).
  3. Unit tests added for all 6 packages (config, journal, monitor, operations, report, workload) commit 463c0ae.
  4. Upgrade auto-skips when instancesPerNode < 2 (operations/upgrade.go Precondition).

Major
5. Verifier resumes from nextSeq instead of full re-scan every cycle.
6. Writer now treats mongo.IsDuplicateKeyError as success (idempotent retry).
7. test/longhaul/go.mod aligned to go 1.25.9 to match operator/e2e needed downgrading k8s.io deps from v0.36.0 v0.35.0 since v0.36 requires Go 1.26+ (commit e17fcf3).
8. OutagePolicy.AllowedDowntime now has a TODO + enforcement skeleton in journal/policy.go.

Suggestions / Nitpicks
9. waitForImage now logs polling errors instead of swallowing them.
10. Acknowledged scheduler poll cadence kept as-is.
11. Out of scope for this PR (cross-module API pin).
12. Removed unused ClusterHealth.WritesHealthy.
13. setup.yaml header now references deployment.yaml.
14. setup.yaml defaults to instancesPerNode: 2.
15. EmitAnnotation uses Round(time.Second).
16. Declined keeping timestamp-based payload to preserve write-time observability.
17. Suggestion only workflow secret format unchanged.

Verified locally: go build ./... and go test ./... from test/longhaul/ are green. Ready for re-review when you have a moment.

Aligns long-haul unit tests with the repo convention (operator and e2e both use Ginkgo/Gomega per AGENTS.md). Replaces stdlib testing patterns with Describe/Context/It/DescribeTable and Expect matchers across the journal, monitor, operations, report, and workload packages.

Adds suite_test.go per package. Behavior unchanged; coverage identical. Verified with go build + go test from test/longhaul/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@xgerman
Copy link
Copy Markdown
Collaborator

xgerman commented May 14, 2026

Review: long-haul harness vs. existing test/e2e/pkg/e2eutils/

Verdict: the PR does not leverage the existing e2e helpers and ships meaningful duplication of the DocumentDB CR contract and Mongo connection plumbing.

Root cause

test/longhaul/go.mod is a standalone module with no replace github.com/documentdb/documentdb-operator => ../../operator/src directive and no dependency on the operator or e2e module. As a result, longhaul cannot import previewv1 or test/e2e/pkg/e2eutils/, and reimplements them via raw dynamic.Interface + unstructured. test/e2e/go.mod:110 has the replace directive that makes this work for the e2e suite.

Concrete duplications

1. test/longhaul/monitor/k8sclient.go duplicates test/e2e/pkg/e2eutils/documentdb/documentdb.go

Longhaul (raw dynamic) E2E (typed)
K8sClusterClient.ScaleCluster (k8sclient.go:178) documentdb.PatchInstances (documentdb.go:299)
K8sClusterClient.UpgradeDocumentDB (k8sclient.go:221) documentdb.PatchSpec — generic mutator (documentdb.go:325)
K8sClusterClient.GetInstancesPerNode (k8sclient.go:155) documentdb.Get + field access (documentdb.go:428)
K8sClusterClient.GetCurrentDocumentDBImageTag (k8sclient.go:200) documentdb.Get + field access
K8sClusterClient.GetClusterHealthstatus == "Cluster in healthy state" hardcoded at k8sclient.go:148 documentdb.ReadyStatus constant (documentdb.go:68)
Polling-for-healthy in HealthMonitor.check (health.go:93) documentdb.WaitHealthy (documentdb.go:346)

The hardcoded readiness string is the most concerning item: if the operator ever changes its ReadyStatus, e2e picks it up automatically; longhaul silently breaks.

2. cmd/longhaul/main.go mongo connection duplicates test/e2e/pkg/e2eutils/mongo/

  • main.go does raw mongo.Connect(options.Client().ApplyURI(cfg.MongoURI)) plus a manual ping.
  • e2e already exports mongo.NewClient, mongo.BuildURI, and mongo.NewFromDocumentDB — the last reads the documentdb-credentials Secret, opens a port-forward when needed, applies TLS, and retries pings.
  • Longhaul sidesteps the credentials Secret by requiring an externally-injected LONGHAUL_MONGO_URI env var via a separate longhaul-mongo-credentials Secret (deploy/setup.yaml). That's an operational workaround, not a code-reuse decision — mongo.BuildURI and the credentials-secret reader still apply even when the cluster is LB-exposed.

3. GVR hardcoding

monitor/k8sclient.go:78-82 hardcodes {Group: "documentdb.io", Version: "preview", Resource: "dbs"}. The typed client encodes the same fact via AddToScheme and would never drift. (Note the resource is dbs, not the more obvious documentdbs — exactly the kind of detail you only get wrong once when you bypass the scheme.)

4. Operator pod health monitoring

monitor/health.go ad-hoc counts pod readiness. test/e2e/pkg/e2eutils/operatorhealth/gate.go already encapsulates "snapshot operator pod identity + restart count, detect churn" — the exact primitive needed to distinguish "operator died" from "operator was upgraded mid-run".

Not duplicated (longhaul-specific, fine as-is)

workload/verifier.go, workload/writer.go, monitor/leakdetect.go, journal/, operations/scheduler.go, report/report.go are genuinely longhaul-only — keep them where they are.

Recommendations (ranked)

  1. Add the operator replace directive and an e2eutils dependency to test/longhaul/go.mod, or extract e2eutils/{documentdb,mongo,operatorhealth,portforward} into a test/shared/ module that both test/e2e/ and test/longhaul/ consume. The shared-module option is cleaner long-term because it avoids importing Ginkgo into a non-Ginkgo binary.
  2. Replace the CR operations in monitor/k8sclient.go with documentdb.Get / PatchInstances / PatchSpec / WaitHealthy, and reference documentdb.ReadyStatus instead of the literal string. operations/scale.go and operations/upgrade.go consume the monitor.ClusterClient interface, so this is an internal refactor and won't ripple.
  3. Replace mongo.Connect in cmd/longhaul/main.go with mongo.NewFromDocumentDB (or mongo.NewClient + mongo.BuildURI when an external URI is genuinely required).
  4. Lower priority — even if (1) is rejected as too invasive: at minimum expose ReadyStatus from a small shared constants package and import it. Do not duplicate the magic string.

The harness itself is good work; this is a maintenance-liability call, not a behavioural one. Worth resolving before merge so we don't end up with two source-of-truths for "how to talk to a DocumentDB CR".

WentingWu666666 pushed a commit to WentingWu666666/documentdb-kubernetes-operator that referenced this pull request May 14, 2026
Create a new Go module test/shared/ that holds framework-agnostic

helpers used by both e2e and long-haul suites:

  * documentdb/cr.go - typed CR ops (Get/List/PatchInstances/PatchSpec/

    WaitHealthy/IsHealthy/Delete) plus DefaultWaitPoll and ReadyStatus

    constants (single source of truth for the `Cluster in healthy

    state'' string).

  * mongo/client.go - mongo client constructor (verbatim copy from

    e2e) for in-cluster service-DNS access.

  * mongo/uri.go - NewFromURI helper for callers (long-haul) that

    already hold a fully-formed mongodb:// URI; applies the same

    DefaultConnectTimeout used by the rest of the helpers.

Tests are migrated alongside the code; cd test/shared && go test ./...

is green.

This is preparation for PR documentdb#348 review feedback (xgerman): the long-

haul monitor was duplicating CR-op and mongo-connect logic against

raw dynamic-client / raw mongo.Connect calls. Subsequent commits wire

e2e and long-haul to consume this module.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WentingWu666666 pushed a commit to WentingWu666666/documentdb-kubernetes-operator that referenced this pull request May 14, 2026
Address xgerman's PR documentdb#348 review by removing the long-haul monitor's

duplicated dynamic-client / raw-mongo logic in favor of the typed

helpers in test/shared:

  * monitor/k8sclient.go: switch from dynamic.Interface + hardcoded

    GVR to a controller-runtime client.Client with the previewv1

    scheme. CR reads/patches now go through documentdb.Get,

    documentdb.PatchInstances and documentdb.PatchSpec; CR readiness

    uses documentdb.IsHealthy / documentdb.ReadyStatus instead of

    the previously hardcoded `Cluster in healthy state'' string.

    The instancesPerNode-missing => 1 defaulting is preserved

    explicitly because the typed CR returns 0 (Go zero value) for

    an omitted int, where the unstructured path returned !found.

    HealthMonitor.WaitForSteadyState is intentionally left alone --

    its semantics (pods + CR readiness over time) differ from

    WaitHealthy (CR-only) and are exercised by the long-haul flow.

  * monitor/k8sclient_test.go: switch dynamic-fake -> controller-

    runtime fakeclient with previewv1 scheme; keep kubernetes

    fake.Clientset for pod/metrics fixtures.

  * cmd/longhaul/main.go: replace mongo.Connect(options.Client().

    ApplyURI(uri)) with sharedmongo.NewFromURI so the long-haul

    driver picks up the same DefaultConnectTimeout the rest of the

    helpers use.

  * go.mod: add replace directives for ../../operator/src and

    ../shared and pull in controller-runtime + previewv1 transitively.

Verified: go build ./..., go vet ./..., go test ./... all green in

test/longhaul.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
wentingwu000 and others added 3 commits May 14, 2026 19:42
Create a new Go module test/shared/ that holds framework-agnostic

helpers used by both e2e and long-haul suites:

  * documentdb/cr.go - typed CR ops (Get/List/PatchInstances/PatchSpec/

    WaitHealthy/IsHealthy/Delete) plus DefaultWaitPoll and ReadyStatus

    constants (single source of truth for the `Cluster in healthy

    state'' string).

  * mongo/client.go - mongo client constructor (verbatim copy from

    e2e) for in-cluster service-DNS access.

  * mongo/uri.go - NewFromURI helper for callers (long-haul) that

    already hold a fully-formed mongodb:// URI; applies the same

    DefaultConnectTimeout used by the rest of the helpers.

Tests are migrated alongside the code; cd test/shared && go test ./...

is green.

This is preparation for PR documentdb#348 review feedback (xgerman): the long-

haul monitor was duplicating CR-op and mongo-connect logic against

raw dynamic-client / raw mongo.Connect calls. Subsequent commits wire

e2e and long-haul to consume this module.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Switch test/e2e/pkg/e2eutils/documentdb to consume the typed CR ops

from test/shared/documentdb instead of defining them locally:

  * Add replace directive for test/shared in test/e2e/go.mod.

  * Re-export Get, List, PatchInstances, PatchSpec, WaitHealthy,

    IsHealthy, Delete plus DefaultWaitPoll and ReadyStatus through

    a thin shim so call sites across the e2e suite need no change.

  * Drop the now-duplicated implementations and the migrated tests

    (covered by test/shared/documentdb/cr_test.go).

Renderer + Create-with-mixin code stays put -- those are e2e-specific.

Verified: cd test/e2e && go vet ./... && go test ./pkg/e2eutils/...

is green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Address xgerman's PR documentdb#348 review by removing the long-haul monitor's

duplicated dynamic-client / raw-mongo logic in favor of the typed

helpers in test/shared:

  * monitor/k8sclient.go: switch from dynamic.Interface + hardcoded

    GVR to a controller-runtime client.Client with the previewv1

    scheme. CR reads/patches now go through documentdb.Get,

    documentdb.PatchInstances and documentdb.PatchSpec; CR readiness

    uses documentdb.IsHealthy / documentdb.ReadyStatus instead of

    the previously hardcoded `Cluster in healthy state'' string.

    The instancesPerNode-missing => 1 defaulting is preserved

    explicitly because the typed CR returns 0 (Go zero value) for

    an omitted int, where the unstructured path returned !found.

    HealthMonitor.WaitForSteadyState is intentionally left alone --

    its semantics (pods + CR readiness over time) differ from

    WaitHealthy (CR-only) and are exercised by the long-haul flow.

  * monitor/k8sclient_test.go: switch dynamic-fake -> controller-

    runtime fakeclient with previewv1 scheme; keep kubernetes

    fake.Clientset for pod/metrics fixtures.

  * cmd/longhaul/main.go: replace mongo.Connect(options.Client().

    ApplyURI(uri)) with sharedmongo.NewFromURI so the long-haul

    driver picks up the same DefaultConnectTimeout the rest of the

    helpers use.

  * go.mod: add replace directives for ../../operator/src and

    ../shared and pull in controller-runtime + previewv1 transitively.

Verified: go build ./..., go vet ./..., go test ./... all green in

test/longhaul.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from cf91ba4 to ff01b4d Compare May 14, 2026 23:43
wentingwu000 and others added 6 commits May 14, 2026 20:06
Mirror the long-haul suite's framework so test/shared tests

(documentdb CR helpers + mongo client helpers) use Describe/It

instead of testing.T tables. Adds a per-package suite_test.go

wiring RunSpecs, and DescribeTable for the BuildURI

missing-required cases.

Pinned ginkgo/v2 v2.28.1 and gomega v1.39.1 to match

test/longhaul/go.mod.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
These re-exports in test/e2e/pkg/e2eutils/documentdb had no

external readers besides assertions.go and fixtures.go. Import

shareddoc directly from those two call sites and drop the

const aliases from the e2e shim  the var aliases (Get, List,

PatchInstances, etc.) stay because 14 e2e files depend on them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Adds the standard Microsoft MIT header to:

- test/shared/documentdb/cr.go

- test/shared/mongo/client.go

- test/shared/mongo/uri.go

- .github/workflows/longhaul-deploy.yml

- .github/workflows/longhaul-image-build.yml

Matches the convention used in operator/src/** and the existing

.github/workflows/longhaul-monitor.yaml.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Removes the var (Get = shareddoc.Get, Delete = shareddoc.Delete,

...) re-export shim from test/e2e/pkg/e2eutils/documentdb. The 17

e2e test files that called the aliases now import shareddoc and

use shareddoc.Get / shareddoc.Delete / etc. directly, so each call

site shows where the helper actually lives.

The e2e documentdb package is still imported where Create /

CreateOptions / RenderCR / DropEmptyVarLines / ManifestsFS are

needed (those stay e2e-only); files that only used the dropped

aliases drop the e2e import.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Covers the empty-URI guard, the lazy-connect happy path

(mongo-driver v2 Connect doesn't dial until first op so an

unreachable host is fine), and a malformed-URI case.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The 'cr' suffix was redundant inside a package already named

documentdb. lifecycle.go matches the file's doc comment

(CRUD and lifecycle helpers) and reads better at the call site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from 9084237 to 6b647f7 Compare May 15, 2026 15:46
@WentingWu666666 WentingWu666666 marked this pull request as draft May 15, 2026 15:53
wentingwu000 and others added 5 commits May 15, 2026 11:56
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
NewFromURI is a one-function helper that overlaps heavily with

NewClient (both call mongo.Connect on an ApplyURI options chain

with DefaultConnectTimeout). Folding it into client.go removes a

duplicate package doc comment and keeps both constructors and

their tests next to each other.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Deletes the duplicate test/e2e/pkg/e2eutils/mongo/client.go

(NewClient/BuildURI/ClientOptions/Ping/Seed/Count/DropDatabase

+ buildTLSConfig + DefaultConnectTimeout) and rewires the ten

e2e tests that imported it to call test/shared/mongo directly.

test/e2e/pkg/e2eutils/mongo now contains only the e2e-specific

Handle / NewFromDocumentDB / port-forward glue (connect.go),

which itself uses sharedmongo.NewClient under the hood.

Earns the keep of test/shared/mongo (previously consumed only

by the long-haul driver) and gives both suites a single source

of truth for URI construction, TLS config and CRUD wrappers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
shareddoc reads as 'shared document' which is misleading -- the

package is the shared DocumentDB lifecycle helper layer. shareddb

is symmetric with the existing sharedmongo alias and unambiguous.

Pure rename across 24 files (e2e tests, e2e pkg helpers, longhaul

monitor). No behaviour change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Drop the Project Structure tree (and the Phase 1 status note):

  it bit-rotted -- 7 source files now exist that aren't listed

  (workload/metrics.go, monitor/k8sclient.go, operations/upgrade.go,

  journal/policy.go, report/{alert,report}.go), and IDE/ls show

  layout faster than the diagram anyway.

- Rewrite Run Locally with the actual prerequisites that the old

  example silently skipped: port-forward to the gateway service,

  reading credentials out of the documentdb-credentials secret,

  and a note that gateway-port reachability is required (otherwise

  use the in-cluster deployment path).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 marked this pull request as ready for review May 15, 2026 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Set up long haul test cluster

5 participants