feat: long-haul (canary) test driver for the operator (#220)#348
feat: long-haul (canary) test driver for the operator (#220)#348WentingWu666666 wants to merge 52 commits into
Conversation
ca9522c to
d1b3462
Compare
d1b3462 to
6bd4274
Compare
There was a problem hiding this comment.
Pull request overview
Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.
Changes:
- Introduces a long-haul Ginkgo suite with a
BeforeSuiteenablement gate (LONGHAUL_ENABLED) and a placeholder canary spec. - Adds a
test/longhaul/configpackage for env-based configuration loading/validation with unit tests. - Documents the long-haul test architecture and usage via a new design doc and package README.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/src/test/longhaul/suite_test.go | Adds the Ginkgo suite entry point for long-haul tests. |
| operator/src/test/longhaul/longhaul_test.go | Adds gated BeforeSuite config loading/validation and a placeholder canary spec. |
| operator/src/test/longhaul/config/suite_test.go | Adds the Ginkgo suite entry point for config unit tests. |
| operator/src/test/longhaul/config/config_test.go | Adds unit tests for config defaults, env parsing, validation, and enablement gating. |
| operator/src/test/longhaul/config/config.go | Implements long-haul test configuration (env vars, defaults, validation, enablement). |
| operator/src/test/longhaul/README.md | Adds usage docs, configuration reference, and CI-safety notes for long-haul tests. |
| docs/designs/long-haul-test-design.md | Adds the long-haul canary design document and phased implementation plan. |
xgerman
left a comment
There was a problem hiding this comment.
Review
Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.
Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅
🟠 Major
M1. Module placement will pollute operator/src/go.mod — operator/src/test/longhaul/ (entire tree)
The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.
For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:
- Phase 1b adds
go.mongodb.org/mongo-driver(design line 75) - Phase 1c adds a file-backed journal
- Phase 1d adds OTel / Prometheus client libs
- Phase 2f adds k8s Job manifests + likely controller-runtime client
…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.
M2. Design does not engage with the existing e2e suite — docs/designs/long-haul-test-design.md throughout
Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:
- Ginkgo v2 harness with
SynchronizedBeforeSuitefor cluster bootstrapping (test/e2e/pkg/testenv) - Label taxonomy (
test/e2e/labels.go) - Run-ID tagging for artifact separation (
test/e2e/runid.go) - envsubst-based manifest rendering
- A mongo-driver v2 wire-protocol client (same driver the design proposes)
.github/actions/setup-test-environmentcomposite action for cluster provisioning
The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.
M3. Four design gaps that will cost rework, not captured in §Open Questions — long-haul-test-design.md:443-445
- Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
- Workload resumption idempotency. Writers use a unique index on
(writer_id, seq)(line 80). If a writer or the Job pod restarts, where doesseqrestart from? Reset to 0 → writes collide. Bootstrap frommax(seq)perwriter_id→ the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described. - Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
- Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.
These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.
M4. MaxDuration=30m default contradicts "run until failure" — config.go:594, long-haul-test-design.md:37,188
The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.
🟡 Minor (cheap in-PR fixes)
- m1
IsEnabled()—config.go:637-640— doesn't trim whitespace.LONGHAUL_ENABLED=" true "→ false (silent skip). Addstrings.TrimSpace. Silent-failure mode; worth fixing in-PR. - m2
Validate()—config.go:625-633— accepts negativeMaxDuration.time.ParseDuration("-5m")succeeds. Addif c.MaxDuration < 0 { return fmt.Errorf(...) }. - m3 Error strings capitalized —
config.go:627,630—ST1005/ Go convention: lowercase. Matching tests useContainSubstring("Namespace")/"ClusterName"and will need adjustment. - m4 Path inconsistency —
long-haul-test-design.md:340sayskubectl apply -f test/longhaul/deploy/job.yamlbut the directory structure roots atoperator/src/test/longhaul/. Fix path, or resolve M1 first. - m5 README overstates CI safety —
README.md:543-547— phrase as "no CI workflow currently setsLONGHAUL_ENABLED; do not set it in any PR-gated workflow" rather than as a property of the code. - m6
configis a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Considerlonghaulcfgorcfg. Not blocking. - m7 Missing
IsEnabledtest cases —config_test.go:727-757— given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case ("True","YES"),"0"/"no"/"false"→ false.
🟢 Nit
longhaul_test.go:799— package-level mutabletestConfig. Fine for a single-file suite; revisit if Phase 1b splits files.DefaultConfig()by value vs.Validate()pointer receiver — slight asymmetry. Leave it.- README line 509's
-timeout 0guidance is correct and rarely documented — nice touch.
Positive feedback
- Skip gate is correctly placed in
BeforeSuite, notBeforeEach— single skipped line instead of N per spec. Exactly right. LoadFromEnvcleanly separates parsing from validation, soValidate()can be called on a programmatically-builtConfigin Phase 1b unit tests without fighting env vars.- The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
- Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
- Test coverage on the config package is thorough for the surface area.
Questions for the author
- (M1) Any reason longhaul can't live at
test/longhaul/as a sibling module totest/e2e/? If there is one, please capture it in the design. - (M2) Have you looked at
test/e2e/pkg/for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation? - (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
- Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
- Should
CONTRIBUTING.mdorAGENTS.mdget a one-liner pointing atoperator/src/test/longhaul/README.mdso contributors aren't left wondering "what is this tree?"
Reviewed with the GitHub Copilot code-review agent.
f3ffbef to
8459cae
Compare
Response to @xgerman's ReviewThank you for the thorough review! I've addressed all items. Here's a summary: Major ItemsM1 (Module placement): Agreed extracted long-haul tests to M2 (E2E engagement): Designed a three-directory layout for test code reuse:
Each directory has its own M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc:
These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs. M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets Minor Items All Fixed
Questions
Test results: 23 config specs + 1 canary spec (skipped) = all passing. |
9676d24 to
158a393
Compare
|
The plan looks good. We should keep running parallel tests for different combination. Like, operator n version with gateway/extension n-1 version. Operator n-2 version with keeps working fine with latest non-schema changes. At least, different operator version tests, not only the latest release restarts the tests. |
|
Thanks for the feedback @hossain-rayhan! Great point about version compatibility. Since the operator and database images follow independent version tracks, we agree cross-version testing matters e.g., upgrading the operator while the database is still on the previous version, and vice versa. That said, we think this is better suited for the E2E test suite rather than long-haul. Cross-version compat bugs surface immediately on upgrade (deterministic, finite), while long-haul focuses on time-dependent issues memory leaks, connection pool exhaustion, replication drift that only appear after hours/days of sustained operation. We'll make sure the E2E suite covers both operator-first and database-first upgrade paths with version N/N-1 combinations. |
158a393 to
1d63aea
Compare
…dDowntime TODO (documentdb#220) Sweep of xgerman's lower-severity findings on PR documentdb#348: [documentdb#8] journal.DisruptionWindow.ExceededPolicy() ignored Policy.AllowedDowntime even though every operation's OutagePolicy() sets it. Document the gap with a TODO(longhaul, documentdb#220) explaining what enforcement would require (per-write timestamps fed into the active window from writer.go) so the dead field is not mistaken for actual policy enforcement. [documentdb#9] UpgradeDocumentDB.waitForImage was 'continue'-ing on every status read error during the rolling restart, which made a sustained apiserver outage indistinguishable from a slow upgrade until the recovery timeout fired. Plumb the journal into the upgrade operation and emit j.Warn on each transient read error so a real outage shows up in the report immediately while one-off blips still retry on the next 10s tick. Constructor signature changes; main.go updated. [nit] report.EmitAnnotation called Duration.Round(1) which rounds to 1ns -- a no-op. Use Round(time.Second) to match GenerateMarkdown so annotations read '2h17m' instead of '2h16m59.834561287s'. [nit] monitor.ClusterHealth.WritesHealthy declared but never set or read. Removed; can be added back alongside its populator if/when the writer's healthy/unhealthy signal is wired in. [documentdb#7 deferred] Pinning go.mod to 1.25.x is non-trivial: indirect k8s.io/* deps were tidied against pre-release versions (dated 20260120) that require go 1.26 features. Downgrading them needs its own change (release-pinned k8s.io/* + transitive resolution) rather than just editing the directive. Tracking separately. Verified: go vet ./..., go test ./... (config: 19 specs) clean. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220) [documentdb#6] writer.writeOne now treats mongo.IsDuplicateKeyError as a successful, idempotent ACK. With retryable writes on by default in the v2 driver, a network blip during a disruption window can produce: server commits insert, ACK is dropped, driver auto-retries the same _id, server returns code 11000. The data is durably committed in that case, so counting it as a write failure (which feeds AllowedWriteFailures and the policy gate) was turning successful writes into spurious FAIL verdicts during the very disruption windows the harness exists to measure. [documentdb#5] verifier.verifyWriter now resumes from a per-writer nextSeq map instead of re-scanning the entire writer history every 10s. A 100ms-per-write writer accumulates ~864k docs/day, so the original full scan grew to ~75M doc-reads/hour per writer over a 3-day run, both saturating the cluster and turning the verifier's own load into a confounding signal in the report. Filter is now {writer_id, seq: \$gte\: nextSeq}; nextSeq is advanced to lastSeen+1 each cycle. Gaps are still detected because we always start the scan at the first unverified seq -- a future cycle that finds a gap-fill below nextSeq is intentionally not re-checked (an in-flight delayed write that lands below a previously-seen seq would already have produced a gap on the cycle that observed the higher seq). Verified: go vet ./..., go test ./... clean. Mongo-dependent unit tests for both behaviors land in the next commit alongside the rest of the unit-test sweep. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
…ocumentdb#220) Adds the per-package unit-test sweep xgerman flagged in PR documentdb#348 [#3]. Workload writer/verifier remain integration-tested only because their meaningful behavior (mongo retryable writes, dup-key handling, incremental scan filters) requires a real or mocked mongo collection that is out of scope for this PR. journal/policy_test.go covers DisruptionWindow.{IsActive, Duration, ExceededPolicy} including boundary cases (write failures equal to budget allowed; active window evaluated against MustRecoverWithin). journal/journal_test.go covers Record/Info/Warn/Error append semantics, EventsSince cutoff, the OpenDisruptionWindow auto-close-previous behavior, RecordWriteFailure no-op when no active window, HasPolicyViolation across active and closed windows, and a concurrent-append correctness check. monitor/health_test.go uses a fakeClusterClient stub to exercise IsSteadyState transitions: not-yet-healthy, becoming-healthy after the configured wait, loss-of-health resets steadySince, poll-error resets steadySince, and WaitForSteadyState's reached/cancelled paths. monitor/leakdetect_test.go covers the linear-regression slope: below min-samples returns zeroed result, flat memory produces ~0 slope, growing memory above threshold flags HasLeak with the expected slope (~3600 MB/h for 1 MB/s growth), sub-threshold growth does not flag, and the minSamples<3 floor is enforced. operations/scheduler_test.go uses a fakeOp stub to validate selectOperation: nil when no precondition passes, nil when total weight is zero, only-eligible filtering, weighted-distribution within tolerance over 2000 trials, and executeOp's window-open/close + ERROR-event-on-failure behavior. report/report_test.go snapshots GenerateMarkdown invariants: PASS vs FAIL header, duration rounded to seconds, data-plane metrics table always present, disruption-windows and leak-analysis sections gated on data presence, and the recent-events tail truncated to 20. Verified: go test ./... clean. -race not run locally (CGO unavailable on this Windows env); the concurrent-append test still validates final-count correctness without it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
e3a5846 to
d6f9adc
Compare
Adds a section to test/longhaul/README.md explaining how the long haul harness relates to the Ginkgo-based e2e suite landed in PR documentdb#346: - Different shapes (daemon vs test binary) and lifetimes (days vs minutes) - Different operator-API access patterns (dynamic vs typed client) - Concrete e2e helpers (mongo, operatorhealth, clusterprobe) that long haul will likely consume once it grows beyond raw URI / no-TLS Explains why a shared test/shared/ module is deliberately not introduced yet: today's only duplicated surface (raw mongo connect + ping) is too small to justify a third Go module given the version drift between the two test modules. Revisit when the driver adopts e2e's connection model. No code changes. Signed-off-by: Wenting Wu <wentingwu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 8 new *_test.go files using plain `testing` (matching the in-package dominant pattern; only config/ uses Ginkgo for envtest-style suite setup). Coverage: - workload/metrics_test.go: Snapshot, WriteSuccessRate, HasDataLoss, concurrent counter aggregation - workload/writer_test.go: computeChecksum determinism + sensitivity + 16-char hex format; constructor wiring - workload/verifier_test.go: nextSeq resume-point semantics; mongo I/O paths skipped (integration-only) - operations/scale_test.go: clamping, OutagePolicy values, full Precondition table for ScaleUp/ScaleDown - operations/upgrade_test.go: readDesiredVersion (NotFound/found/empty), Precondition gating (no-desired/match/ipn<2/eligible) - report/alert_test.go: all 5 EmitAnnotation branches (PASS/FAIL/ warning-on-leak) using os.Pipe stdout capture - report/checkpoint_test.go: create-on-first-emit, update-on-second via fake clientset; RUNNING-for-PASS translation; FAIL preserved - monitor/k8sclient_test.go: K8sClusterClient end-to-end via fake clientset + fake dynamic client; isPodReady, GetClusterHealth, GetInstancesPerNode, ScaleCluster, GetCurrentDocumentDBImageTag (incl. registry-with-port edge), UpgradeDocumentDB, MetricsAvailable The mongo I/O paths (writer.writeOne, verifier.verifyAll) and cmd/longhaul/main.go remain integration-only -- documented inline. Signed-off-by: Wenting Wu <wentingwu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolves xgerman review item: test/longhaul/go.mod declared go 1.26.0
while operator/src/go.mod and test/e2e/go.mod both pin go 1.25.9.
Root cause: longhaul required k8s.io/{api,apimachinery,client-go,metrics}
v0.36.0 which transitively requires Go 1.26+. Downgraded those deps to
v0.35.0 (matches operator and e2e), which allowed the go directive to
drop back to 1.25.9.
Verified with go mod tidy + go build ./... + go test ./... from
test/longhaul/ -- all 6 packages pass.
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@xgerman thanks again for the detailed review all items from your latest review are now addressed. Quick disposition summary: Critical
Major Suggestions / Nitpicks Verified locally: |
Aligns long-haul unit tests with the repo convention (operator and e2e both use Ginkgo/Gomega per AGENTS.md). Replaces stdlib testing patterns with Describe/Context/It/DescribeTable and Expect matchers across the journal, monitor, operations, report, and workload packages. Adds suite_test.go per package. Behavior unchanged; coverage identical. Verified with go build + go test from test/longhaul/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Review: long-haul harness vs. existing
|
| Longhaul (raw dynamic) | E2E (typed) |
|---|---|
K8sClusterClient.ScaleCluster (k8sclient.go:178) |
documentdb.PatchInstances (documentdb.go:299) |
K8sClusterClient.UpgradeDocumentDB (k8sclient.go:221) |
documentdb.PatchSpec — generic mutator (documentdb.go:325) |
K8sClusterClient.GetInstancesPerNode (k8sclient.go:155) |
documentdb.Get + field access (documentdb.go:428) |
K8sClusterClient.GetCurrentDocumentDBImageTag (k8sclient.go:200) |
documentdb.Get + field access |
K8sClusterClient.GetClusterHealth — status == "Cluster in healthy state" hardcoded at k8sclient.go:148 |
documentdb.ReadyStatus constant (documentdb.go:68) |
Polling-for-healthy in HealthMonitor.check (health.go:93) |
documentdb.WaitHealthy (documentdb.go:346) |
The hardcoded readiness string is the most concerning item: if the operator ever changes its ReadyStatus, e2e picks it up automatically; longhaul silently breaks.
2. cmd/longhaul/main.go mongo connection duplicates test/e2e/pkg/e2eutils/mongo/
- main.go does raw
mongo.Connect(options.Client().ApplyURI(cfg.MongoURI))plus a manual ping. - e2e already exports
mongo.NewClient,mongo.BuildURI, andmongo.NewFromDocumentDB— the last reads thedocumentdb-credentialsSecret, opens a port-forward when needed, applies TLS, and retries pings. - Longhaul sidesteps the credentials Secret by requiring an externally-injected
LONGHAUL_MONGO_URIenv var via a separatelonghaul-mongo-credentialsSecret (deploy/setup.yaml). That's an operational workaround, not a code-reuse decision —mongo.BuildURIand the credentials-secret reader still apply even when the cluster is LB-exposed.
3. GVR hardcoding
monitor/k8sclient.go:78-82 hardcodes {Group: "documentdb.io", Version: "preview", Resource: "dbs"}. The typed client encodes the same fact via AddToScheme and would never drift. (Note the resource is dbs, not the more obvious documentdbs — exactly the kind of detail you only get wrong once when you bypass the scheme.)
4. Operator pod health monitoring
monitor/health.go ad-hoc counts pod readiness. test/e2e/pkg/e2eutils/operatorhealth/gate.go already encapsulates "snapshot operator pod identity + restart count, detect churn" — the exact primitive needed to distinguish "operator died" from "operator was upgraded mid-run".
Not duplicated (longhaul-specific, fine as-is)
workload/verifier.go, workload/writer.go, monitor/leakdetect.go, journal/, operations/scheduler.go, report/report.go are genuinely longhaul-only — keep them where they are.
Recommendations (ranked)
- Add the operator replace directive and an e2eutils dependency to
test/longhaul/go.mod, or extracte2eutils/{documentdb,mongo,operatorhealth,portforward}into atest/shared/module that bothtest/e2e/andtest/longhaul/consume. The shared-module option is cleaner long-term because it avoids importing Ginkgo into a non-Ginkgo binary. - Replace the CR operations in
monitor/k8sclient.gowithdocumentdb.Get/PatchInstances/PatchSpec/WaitHealthy, and referencedocumentdb.ReadyStatusinstead of the literal string.operations/scale.goandoperations/upgrade.goconsume themonitor.ClusterClientinterface, so this is an internal refactor and won't ripple. - Replace
mongo.Connectincmd/longhaul/main.gowithmongo.NewFromDocumentDB(ormongo.NewClient+mongo.BuildURIwhen an external URI is genuinely required). - Lower priority — even if (1) is rejected as too invasive: at minimum expose
ReadyStatusfrom a small shared constants package and import it. Do not duplicate the magic string.
The harness itself is good work; this is a maintenance-liability call, not a behavioural one. Worth resolving before merge so we don't end up with two source-of-truths for "how to talk to a DocumentDB CR".
Create a new Go module test/shared/ that holds framework-agnostic
helpers used by both e2e and long-haul suites:
* documentdb/cr.go - typed CR ops (Get/List/PatchInstances/PatchSpec/
WaitHealthy/IsHealthy/Delete) plus DefaultWaitPoll and ReadyStatus
constants (single source of truth for the `Cluster in healthy
state'' string).
* mongo/client.go - mongo client constructor (verbatim copy from
e2e) for in-cluster service-DNS access.
* mongo/uri.go - NewFromURI helper for callers (long-haul) that
already hold a fully-formed mongodb:// URI; applies the same
DefaultConnectTimeout used by the rest of the helpers.
Tests are migrated alongside the code; cd test/shared && go test ./...
is green.
This is preparation for PR documentdb#348 review feedback (xgerman): the long-
haul monitor was duplicating CR-op and mongo-connect logic against
raw dynamic-client / raw mongo.Connect calls. Subsequent commits wire
e2e and long-haul to consume this module.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address xgerman's PR documentdb#348 review by removing the long-haul monitor's duplicated dynamic-client / raw-mongo logic in favor of the typed helpers in test/shared: * monitor/k8sclient.go: switch from dynamic.Interface + hardcoded GVR to a controller-runtime client.Client with the previewv1 scheme. CR reads/patches now go through documentdb.Get, documentdb.PatchInstances and documentdb.PatchSpec; CR readiness uses documentdb.IsHealthy / documentdb.ReadyStatus instead of the previously hardcoded `Cluster in healthy state'' string. The instancesPerNode-missing => 1 defaulting is preserved explicitly because the typed CR returns 0 (Go zero value) for an omitted int, where the unstructured path returned !found. HealthMonitor.WaitForSteadyState is intentionally left alone -- its semantics (pods + CR readiness over time) differ from WaitHealthy (CR-only) and are exercised by the long-haul flow. * monitor/k8sclient_test.go: switch dynamic-fake -> controller- runtime fakeclient with previewv1 scheme; keep kubernetes fake.Clientset for pod/metrics fixtures. * cmd/longhaul/main.go: replace mongo.Connect(options.Client(). ApplyURI(uri)) with sharedmongo.NewFromURI so the long-haul driver picks up the same DefaultConnectTimeout the rest of the helpers use. * go.mod: add replace directives for ../../operator/src and ../shared and pull in controller-runtime + previewv1 transitively. Verified: go build ./..., go vet ./..., go test ./... all green in test/longhaul. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create a new Go module test/shared/ that holds framework-agnostic
helpers used by both e2e and long-haul suites:
* documentdb/cr.go - typed CR ops (Get/List/PatchInstances/PatchSpec/
WaitHealthy/IsHealthy/Delete) plus DefaultWaitPoll and ReadyStatus
constants (single source of truth for the `Cluster in healthy
state'' string).
* mongo/client.go - mongo client constructor (verbatim copy from
e2e) for in-cluster service-DNS access.
* mongo/uri.go - NewFromURI helper for callers (long-haul) that
already hold a fully-formed mongodb:// URI; applies the same
DefaultConnectTimeout used by the rest of the helpers.
Tests are migrated alongside the code; cd test/shared && go test ./...
is green.
This is preparation for PR documentdb#348 review feedback (xgerman): the long-
haul monitor was duplicating CR-op and mongo-connect logic against
raw dynamic-client / raw mongo.Connect calls. Subsequent commits wire
e2e and long-haul to consume this module.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Switch test/e2e/pkg/e2eutils/documentdb to consume the typed CR ops
from test/shared/documentdb instead of defining them locally:
* Add replace directive for test/shared in test/e2e/go.mod.
* Re-export Get, List, PatchInstances, PatchSpec, WaitHealthy,
IsHealthy, Delete plus DefaultWaitPoll and ReadyStatus through
a thin shim so call sites across the e2e suite need no change.
* Drop the now-duplicated implementations and the migrated tests
(covered by test/shared/documentdb/cr_test.go).
Renderer + Create-with-mixin code stays put -- those are e2e-specific.
Verified: cd test/e2e && go vet ./... && go test ./pkg/e2eutils/...
is green.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Address xgerman's PR documentdb#348 review by removing the long-haul monitor's duplicated dynamic-client / raw-mongo logic in favor of the typed helpers in test/shared: * monitor/k8sclient.go: switch from dynamic.Interface + hardcoded GVR to a controller-runtime client.Client with the previewv1 scheme. CR reads/patches now go through documentdb.Get, documentdb.PatchInstances and documentdb.PatchSpec; CR readiness uses documentdb.IsHealthy / documentdb.ReadyStatus instead of the previously hardcoded `Cluster in healthy state'' string. The instancesPerNode-missing => 1 defaulting is preserved explicitly because the typed CR returns 0 (Go zero value) for an omitted int, where the unstructured path returned !found. HealthMonitor.WaitForSteadyState is intentionally left alone -- its semantics (pods + CR readiness over time) differ from WaitHealthy (CR-only) and are exercised by the long-haul flow. * monitor/k8sclient_test.go: switch dynamic-fake -> controller- runtime fakeclient with previewv1 scheme; keep kubernetes fake.Clientset for pod/metrics fixtures. * cmd/longhaul/main.go: replace mongo.Connect(options.Client(). ApplyURI(uri)) with sharedmongo.NewFromURI so the long-haul driver picks up the same DefaultConnectTimeout the rest of the helpers use. * go.mod: add replace directives for ../../operator/src and ../shared and pull in controller-runtime + previewv1 transitively. Verified: go build ./..., go vet ./..., go test ./... all green in test/longhaul. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
cf91ba4 to
ff01b4d
Compare
Mirror the long-haul suite's framework so test/shared tests (documentdb CR helpers + mongo client helpers) use Describe/It instead of testing.T tables. Adds a per-package suite_test.go wiring RunSpecs, and DescribeTable for the BuildURI missing-required cases. Pinned ginkgo/v2 v2.28.1 and gomega v1.39.1 to match test/longhaul/go.mod. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
These re-exports in test/e2e/pkg/e2eutils/documentdb had no external readers besides assertions.go and fixtures.go. Import shareddoc directly from those two call sites and drop the const aliases from the e2e shim the var aliases (Get, List, PatchInstances, etc.) stay because 14 e2e files depend on them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Adds the standard Microsoft MIT header to: - test/shared/documentdb/cr.go - test/shared/mongo/client.go - test/shared/mongo/uri.go - .github/workflows/longhaul-deploy.yml - .github/workflows/longhaul-image-build.yml Matches the convention used in operator/src/** and the existing .github/workflows/longhaul-monitor.yaml. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Removes the var (Get = shareddoc.Get, Delete = shareddoc.Delete, ...) re-export shim from test/e2e/pkg/e2eutils/documentdb. The 17 e2e test files that called the aliases now import shareddoc and use shareddoc.Get / shareddoc.Delete / etc. directly, so each call site shows where the helper actually lives. The e2e documentdb package is still imported where Create / CreateOptions / RenderCR / DropEmptyVarLines / ManifestsFS are needed (those stay e2e-only); files that only used the dropped aliases drop the e2e import. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Covers the empty-URI guard, the lazy-connect happy path (mongo-driver v2 Connect doesn't dial until first op so an unreachable host is fine), and a malformed-URI case. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
The 'cr' suffix was redundant inside a package already named documentdb. lifecycle.go matches the file's doc comment (CRUD and lifecycle helpers) and reads better at the call site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
9084237 to
6b647f7
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
NewFromURI is a one-function helper that overlaps heavily with NewClient (both call mongo.Connect on an ApplyURI options chain with DefaultConnectTimeout). Folding it into client.go removes a duplicate package doc comment and keeps both constructors and their tests next to each other. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Deletes the duplicate test/e2e/pkg/e2eutils/mongo/client.go (NewClient/BuildURI/ClientOptions/Ping/Seed/Count/DropDatabase + buildTLSConfig + DefaultConnectTimeout) and rewires the ten e2e tests that imported it to call test/shared/mongo directly. test/e2e/pkg/e2eutils/mongo now contains only the e2e-specific Handle / NewFromDocumentDB / port-forward glue (connect.go), which itself uses sharedmongo.NewClient under the hood. Earns the keep of test/shared/mongo (previously consumed only by the long-haul driver) and gives both suites a single source of truth for URI construction, TLS config and CRUD wrappers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
shareddoc reads as 'shared document' which is misleading -- the package is the shared DocumentDB lifecycle helper layer. shareddb is symmetric with the existing sharedmongo alias and unambiguous. Pure rename across 24 files (e2e tests, e2e pkg helpers, longhaul monitor). No behaviour change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
- Drop the Project Structure tree (and the Phase 1 status note):
it bit-rotted -- 7 source files now exist that aren't listed
(workload/metrics.go, monitor/k8sclient.go, operations/upgrade.go,
journal/policy.go, report/{alert,report}.go), and IDE/ls show
layout faster than the diagram anyway.
- Rewrite Run Locally with the actual prerequisites that the old
example silently skipped: port-forward to the gateway service,
reading credentials out of the documentdb-credentials secret,
and a note that gateway-port reachability is required (otherwise
use the in-cluster deployment path).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
Closes #220.
What this PR adds
A standalone long-haul (canary) test driver for the DocumentDB Kubernetes Operator, plus the GitHub Actions plumbing to keep it running on a long-lived AKS cluster.
The driver lives under
test/longhaul/and is its own Go module. It is not a Ginkgo suite — it is a long-running binary (cmd/longhaul) deployed as aDeploymentinto a namespace alongside a real DocumentDB CR. It exercises the operator with a continuous, realistic workload while a weighted-random scheduler injects disruptive operations, and it journals every event so failures can be reproduced.Pieces that ship in this PR
workload/): writer + verifier goroutines, sequence / checksum gap detection, lag-aware reads — the durability oracle for FM1/FM3/FM7.operations/): weighted random selection, per-category cooldowns, steady-state gating, and an outage-policy verdict per operation. Operations implemented today:ScaleUp,ScaleDown,UpgradeDocumentDB(auto-skips wheninstancesPerNode < 2because rolling restart on a single instance is unavoidable downtime).monitor/k8sclient.go): reads the DocumentDB CR, watches pods, pulls metrics-server samples — no fakes.monitor/health.go,monitor/leakdetect.go): steady-state detection plus linear-regression slope on memory/CPU samples.journal/): every disruption window, every workload event, every verdict. The disruption-window oracle is what makes verdicts reproducible.report/): writes thelonghaul-reportConfigMap so a human cankubectl get cm longhaul-report -o yamlfor a single source of truth on pass/fail. Alerts surface in the monitor workflow.Dockerfile,deploy/longhaul.yaml): runs as aDeployment(not aJob) — if the process exits, it restarts; the ConfigMap is the durable verdict, not the pod's exit code.workflow_run:LONGHAUL - Build Test Driver Image— builds and pushes to GHCR.LONGHAUL - Deploy Test Driver to AKS— uses a long-lived ServiceAccount-token kubeconfig (LONGHAUL_KUBECONFIGrepo secret) to roll the Deployment.LONGHAUL - Monitor— hourly cron that pulls the report ConfigMap and posts a workflow summary; alerts on stale or failing reports.config,journal,monitor,report,operations— no cluster needed, run withgo test ./...fromtest/longhaul/.docs/designs/long-haul-test-design.md(problem statement, eight failure modes, design principles, architecture, operations catalog with status, env-var table, phase tracker) andtest/longhaul/README.md(how to run, what every env var does, whyDeploymentoverJob).Why this is structured the way it is
Deployment, notJob. AJobthat finishes is aJobno one looks at. Thelonghaul-reportConfigMap is the durable verdict, and the monitor cron makes sure a human sees stale runs.Live canary
I've stood up a long-running cluster — the driver Deployment lives there and gets re-rolled by the deploy workflow on every successful image build:
longhaul-reportConfigMap) runs indocumentdb-test-ns.kubectl logs deploy/longhaul-test -n documentdb-test-nskubectl get cm longhaul-report -n documentdb-test-ns -o yamlWhat's intentionally out of scope
Tracked in the design doc's phase table:
OutagePolicy.AllowedDowntimeenforcement: the field is reserved on the struct; current verdicts useAllowedWriteFailures+MustRecoverWithin. Called out explicitly in the design doc.How to review
If you only want to skim, three files give you the shape of the PR:
docs/designs/long-haul-test-design.md— the why.test/longhaul/cmd/longhaul/main.go— the wiring.test/longhaul/operations/scheduler.go— the heart of the disruption loop.If you want to run it locally:
cd test/longhaul LONGHAUL_CLUSTER_NAME=documentdb-sample LONGHAUL_NAMESPACE=documentdb-test-ns \ LONGHAUL_MAX_DURATION=10m \ go run ./cmd/longhaulHappy to walk through any area on a call.