diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..b8f5d418 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,208 @@ +# AGENTS.md + +## Docs Index + +Detailed domain-specific guidelines are in these files -- read them before working in the corresponding area: + +- [docs/security-guidelines.md](docs/security-guidelines.md) -- Container security, RBAC, TLS, network policies, webhook security +- [docs/performance-guidelines.md](docs/performance-guidelines.md) -- Cache architecture, watch predicates, reconciliation patterns, drift detection +- [docs/error-handling-guidelines.md](docs/error-handling-guidelines.md) -- ReconcileError types, retry logic, status conditions, logging +- [docs/api-contracts-guidelines.md](docs/api-contracts-guidelines.md) -- CRD types, kubebuilder markers, CEL validation, code generation +- [docs/testing-guidelines.md](docs/testing-guidelines.md) -- Unit/API/E2E test tiers, frameworks, test helpers, CI integration +- [docs/integration-guidelines.md](docs/integration-guidelines.md) -- Bindata pattern, cert-manager, OpenShift platform, Bitwarden plugin + +## Project Overview + +This is a Kubernetes operator (built with kubebuilder/controller-runtime) that deploys and manages the upstream [external-secrets](https://github.com/openshift/external-secrets) application on OpenShift clusters. The operator does not embed upstream code -- it manages upstream resources as static YAML manifests (bindata) applied imperatively. + +Three controllers run in a single binary: + +| Controller | Package | Watches | Purpose | +|---|---|---|---| +| `external-secrets-controller` | `pkg/controller/external_secrets/` | `ExternalSecretsConfig` CR + `ExternalSecretsManager` (spec) + managed resources (Deployments, RBAC, Services, Secrets metadata, ConfigMaps, NetworkPolicies, Webhooks) + conditionally `cert-manager.io/Certificate` | Installs/reconciles operand deployments, RBAC, services, webhooks, network policies | +| `external-secrets-manager` | `pkg/controller/external_secrets_manager/` | `ExternalSecretsManager` CR + `ExternalSecretsConfig` status | Aggregates controller statuses into a global status CR | +| `crd-annotator` | `pkg/controller/crd_annotator/` | ESO CRDs (metadata, label-filtered) + `ExternalSecretsConfig` | Adds cert-manager CA injection annotations to operand CRDs (conditional; only registered when cert-manager is installed) | + +## Project Structure + +``` +api/v1alpha1/ CRD type definitions, conditions, shared types, deepcopy +api/v1alpha1/tests/ Declarative YAML test suites for CRD validation +bindata/ Static operand YAML manifests (compiled into Go via go-bindata) +bundle/ OLM bundle manifests +cmd/external-secrets-operator/ Operator entrypoint (main.go, separate Go module) +docs/ Guideline docs (security, performance, error handling, etc.) +config/ Kustomize manifests (CRDs, RBAC, manager, samples, bundle) +hack/ Shell scripts (codegen, verification, CI helpers) +images/ci/ CI Dockerfiles (coverage-instrumented builds) +pkg/controller/ Controller implementations + client/ CtrlClient interface + counterfeiter fakes + common/ Shared utilities (errors, constants, decode helpers, drift detection) + commontest/ Shared test fixtures (TestExternalSecretsConfig, TestExternalSecretsManager) + crd_annotator/ CRD annotation controller + external_secrets/ Main operand reconciliation controller + external_secrets_manager/ Status aggregation controller +pkg/operator/ Manager setup, controller registration + assets/ Generated bindata (bindata.go) +pkg/version/ Build-time version info (ldflags) +test/apis/ API integration tests (Ginkgo + envtest) +test/e2e/ End-to-end tests (Ginkgo + live cluster) +test/utils/ E2E test helpers +tools/ Go module for build-time tool dependencies +vendor/ Workspace-level vendoring (go work vendor) +``` + +## Go Workspace and Module Layout + +The repo uses `go.work` with four modules: `.`, `./cmd/external-secrets-operator`, `./test`, `./tools`. This means: + +- `GOFLAGS` is cleared for test and fmt targets to avoid `-mod=vendor` conflicting with `go.work`. +- When adding a dependency, use `make update-dep PKG=pkg@version` to update across all modules, then `make update-vendor`. +- Vendoring is workspace-level (`go work vendor`), not per-module. +- All build-time tools (controller-gen, golangci-lint, ginkgo, etc.) are vendored and built from source via `go build -mod=vendor`. + +## Build System (Key Makefile Targets) + +| Target | What it does | +|---|---| +| `make build` | Full build: generate + manifests + fmt + vet + compile binary | +| `make build-operator` | Compile binary only (no codegen, fastest iteration) | +| `make test` | Run unit + API integration tests (no cluster needed) | +| `make test-unit` | Unit tests only | +| `make test-apis` | API validation tests via envtest | +| `make test-e2e` | E2E tests against a live cluster | +| `make lint` | Run golangci-lint with all configured linters | +| `make lint-fix` | Run golangci-lint with auto-fix | +| `make update` | Full regeneration: codegen + manifests + operand manifests + bindata + bundle + docs | +| `make verify` | Run vet + fmt + verify-deps + verify-bindata + verify-bindata-assets + verify-generated + govulncheck + check-git-diff | +| `make update-vendor` | Update vendor directory across all workspace modules | +| `make update-dep PKG=x@v` | Update a single dependency across all modules | +| `make manifests` | Regenerate CRD YAML, RBAC, webhook configs from kubebuilder markers | +| `make generate` | Regenerate DeepCopy methods | +| `make docs` | Regenerate API reference docs | +| `make clean` | Remove bin/, _output/, cover.out, dist/ | + +After any code change, run `make update && make verify` to ensure all generated files are consistent. CI runs `check-git-diff` which fails if generated files are stale. + +## Code Style and Formatting + +### Import Order + +Imports must follow the order enforced by `gci` in `.golangci.yml`: + +1. Standard library +2. Third-party packages +3. `github.com/openshift/external-secrets-operator` (project-local) +4. Blank imports, dot imports, aliases, local module + +### Linting + +The repo uses golangci-lint v2 with a comprehensive set of linters (see `.golangci.yml`). Key rules: + +- **depguard** blocks `math/rand` (use `math/rand/v2`), `github.com/sirupsen/logrus`, and `github.com/pkg/errors` (use `errors`/`fmt`). +- **kubeapilinter** runs only on `api/v1alpha1/*` files. Use `//nolint:kubeapilinter` with a comment for intentional deviations. +- **golines** max line length is 200 characters. +- **gofmt** rewrites `interface{}` to `any` and `a[b:len(a)]` to `a[b:]`. +- Generated files are excluded in `lax` mode. +- Test files have relaxed rules for `gocyclo`, `errcheck`, `gosec`, `forcetypeassert`, and others. + +### File Headers + +All `.go` files must include the Apache 2.0 license header from `hack/boilerplate.go.txt` (year 2025). + +### FIPS Build + +Production builds use `hack/go-fips.sh` which enables `GOEXPERIMENT=strictfipsruntime` and build tags `strictfipsruntime,openssl` when the Go compiler supports it. Local dev builds without FIPS work but cannot be used in CI/production. + +## Naming Conventions + +### Go Packages + +Controller packages use `snake_case`: `external_secrets`, `external_secrets_manager`, `crd_annotator`. This matches kubebuilder conventions. + +### Constants + +- Controller names: `"external-secrets-controller"`, `"external-secrets-manager"`, `"crd-annotator"` (kebab-case in strings). +- Finalizer format: `./` (e.g., `externalsecretsconfigs.operator.openshift.io/external-secrets-controller`). +- Asset name constants: `_AssetName` in camelCase (e.g., `controllerDeploymentAssetName`). +- Environment variable constants: all-caps with suffix `EnvVarName` (e.g., `externalSecretsImageEnvVarName`). + +### Bindata YAML Files + +Files in `bindata/external-secrets/resources/` follow the pattern: `_.yml` (e.g., `deployment_external-secrets.yml`, `clusterrole_external-secrets-controller.yml`). Network policies in `bindata/external-secrets/` use `.yaml` extension. + +### CRD Object Names + +Both `ExternalSecretsConfig` and `ExternalSecretsManager` are singletons named `"cluster"` (enforced by CEL). Constants `ExternalSecretsConfigObjectName` and `ExternalSecretsManagerObjectName` in `pkg/controller/common/constants.go` hold this value. + +## Architectural Patterns + +### Reconciler Structure + +Every controller follows the same pattern: + +1. A `Reconciler` struct embedding `operatorclient.CtrlClient`. +2. A `New(ctx, mgr)` constructor that builds the reconciler and client(s). +3. A `SetupWithManager(mgr)` method that wires watches, predicates, and map functions. +4. A `Reconcile(ctx, req)` method that fetches the primary CR and delegates to `processReconcileRequest`. +5. An `updateStatus(ctx, obj)` method using `retry.RetryOnConflict`. + +### CtrlClient Interface + +All controllers interact with Kubernetes through `pkg/controller/client.CtrlClient`, not the raw `controller-runtime client.Client`. This interface adds `UpdateWithRetry`, `StatusUpdate`, and `Exists` methods. Unit tests use counterfeiter-generated fakes (`pkg/controller/client/fakes/`). + +To regenerate fakes after changing the interface: `go generate ./pkg/controller/client/...` + +### Resource Reconciliation Pattern + +For each resource type (deployments, services, RBAC, etc.), the same flow is used: + +1. Decode static YAML from bindata (`common.Decode*ObjBytes(assets.MustAsset(...))`). +2. Mutate the decoded object (set namespace, labels, annotations, images, env vars, security context). +3. Check existence with `r.Exists()`. +4. If new: `r.Create()`. If existing: compare with `common.HasObjectChanged()`, then `r.UpdateWithRetry()` if drifted. +5. Record Kubernetes events for create/update operations. +6. Wrap errors with `common.FromClientError()`. + +### Conditional Resources + +Resources that depend on CR configuration use a slice of `{assetName string, condition bool}` structs. Only assets with `condition: true` are applied. Follow this pattern when adding new conditionally-created resources. + +## Common Pitfalls + +1. **Never return both `RequeueAfter` and a non-nil error** from `Reconcile`. Return one or the other. +2. **Never edit generated files by hand**: `zz_generated.deepcopy.go`, `config/crd/bases/*.yaml`, `pkg/operator/assets/bindata.go`, `config/rbac/role.yaml`, `docs/api_reference.md`. Always use `make update`. +3. **Always run `make update && make verify`** after any code change. CI will reject PRs with stale generated files. +4. **Use the cached client for managed resources** (those with `app: external-secrets` label). Use the uncached client only for external resources like cert-manager Issuers or user-provided Secrets. +5. **Add new watched resources to both `controllerManagedResources` and `buildCacheObjectList()`** in `pkg/controller/external_secrets/controller.go`. +6. **Add new resource types to `HasObjectChanged`** in `pkg/controller/common/utils.go` with field-level comparison, not full `DeepEqual`. +7. **Decode helpers panic on failure** -- this is intentional for build-time-constant assets. Do not add error handling around `Decode*ObjBytes` calls. +8. **RBAC markers** (`+kubebuilder:rbac`) go in `pkg/controller/external_secrets/controller.go` for the operator's own permissions. Operand RBAC is in static YAML under `bindata/`. +9. **The `go.work` workspace** means you cannot use `-mod=vendor` for `go test` or `go fmt`. The Makefile already clears `GOFLAGS` for affected targets. +10. **Container tool defaults to `podman`** (`CONTAINER_TOOL ?= podman`), not Docker. Override with `CONTAINER_TOOL=docker` if needed. + +## PR and Contribution Expectations + +- Run `make lint` and `make test` locally before submitting. +- Run `make verify` to ensure generated files are in sync. +- Add unit tests for new reconciliation logic using table-driven tests and `FakeCtrlClient`. +- Add API test cases in `api/v1alpha1/tests//` (e.g., `externalsecretsconfig.operator.openshift.io/`) for any new CRD field or validation rule. +- Add E2E test cases with appropriate Ginkgo labels for platform-specific tests. +- Follow existing error wrapping patterns: `common.FromClientError` for API calls, `common.NewIrrecoverableError` for config validation failures, `common.NewUserConfigurationError` for invalid user-provided configuration. +- Commit messages should reference the relevant Jira ticket (e.g., `OCPBUGS-12345: description`). +- PR reviewers/approvers are listed in `OWNERS`. + +## Environment Variables + +The operator reads these at runtime (typically set by OLM/CSV): + +| Variable | Purpose | +|---|---| +| `RELATED_IMAGE_EXTERNAL_SECRETS` | Container image for the external-secrets operand | +| `RELATED_IMAGE_BITWARDEN_SDK_SERVER` | Container image for the Bitwarden SDK server | +| `OPERAND_EXTERNAL_SECRETS_IMAGE_VERSION` | Version label for operand resources | +| `BITWARDEN_SDK_SERVER_IMAGE_VERSION` | Version label for Bitwarden resources | +| `OPERATOR_IMAGE_VERSION` | Operator version string | +| `HTTP_PROXY`, `HTTPS_PROXY`, `NO_PROXY` | Proxy fallback from OLM environment | + +For local development, set at minimum `RELATED_IMAGE_EXTERNAL_SECRETS` via `t.Setenv()` in tests or shell export for `make run`. diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 00000000..2aa6da03 --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,248 @@ +# Architecture + +This document describes the high-level architecture of the External Secrets Operator for Red Hat OpenShift. +If you want to familiarize yourself with the codebase, you are in the right place! + +For detailed guidelines on specific areas, see the files listed in `AGENTS.md`. +For contribution process, see `CONTRIBUTING.md`. + +## Bird's Eye View + +On the highest level, this is a Kubernetes operator that installs and manages the upstream [external-secrets](https://github.com/external-secrets/external-secrets) application on OpenShift clusters. The upstream project provides the actual secret-syncing logic (ExternalSecret, SecretStore, etc.). This operator does **not** embed or fork that code. Instead, it manages upstream resources as a set of static YAML manifests that are compiled into the binary, decoded at runtime, mutated with operator-controlled configuration, and applied imperatively to the cluster. + +``` + ┌─────────────────────────────────────────────┐ + │ ExternalSecretsConfig CR │ + │ (singleton "cluster", user-facing API) │ + └──────────────────┬──────────────────────────┘ + │ + ┌──────────────────────┼──────────────────────┐ + │ │ │ + ┌─────────▼──────────┐ ┌───────▼────────┐ ┌──────────▼─────────┐ + │ external-secrets │ │ external- │ │ crd-annotator │ + │ controller │ │ secrets-manager│ │ controller │ + │ │ │ │ │ │ + │ Decodes bindata, │ │ Aggregates │ │ Patches ESO CRDs │ + │ mutates, creates │ │ controller │ │ with cert-manager │ + │ operand resources │ │ statuses into │ │ CA injection │ + │ (Deployments, │ │ a global │ │ annotations │ + │ RBAC, Services, │ │ status CR │ │ │ + │ NetworkPolicies, │ │ │ │ (conditional: only │ + │ Webhooks, etc.) │ │ │ │ when cert-manager │ + │ │ │ │ │ is installed) │ + └────────────────────┘ └────────────────┘ └────────────────────┘ + │ + ▼ + ┌────────────────────┐ + │ Upstream operand │ + │ (external-secrets │ + │ deployments in │ + │ "external-secrets"│ + │ namespace) │ + └────────────────────┘ +``` + +Two singleton CRs drive the system: + +- **ExternalSecretsConfig** (name: `cluster`): the primary user-facing CR. Controls operand installation, cert-manager toggle, Bitwarden plugin, proxy, network policies, per-component overrides, and custom trusted CA bundles. +- **ExternalSecretsManager** (name: `cluster`): global operator-level config, auto-created at install. Provides lower-priority defaults and optional features (e.g., `UnsafeAllowGenericTargets`). + +The precedence chain for shared fields is: `ExternalSecretsConfig > ExternalSecretsManager > OLM environment variables` (proxy only). + +## Code Map + +This section describes important directories and data structures. +Pay attention to the **Architecture Invariant** sections. + +### `api/v1alpha1/` + +CRD type definitions. This is where `ExternalSecretsConfig` and `ExternalSecretsManager` structs live, along with shared types (`CommonConfigs`, `ProxyConfig`, `Mode`, etc.), condition constants, and deepcopy code. + +Validation is entirely CEL-based via `+kubebuilder:validation:XValidation` markers on the types. There are no admission webhooks for the operator's own CRDs. + +`api/v1alpha1/tests/` contains declarative YAML test suites (`.testsuite.yaml` files) organized by CRD API group domain. These are automatically picked up by the test generator. + +**Architecture Invariant:** both CRDs are cluster-scoped singletons enforced by CEL (`self.metadata.name == 'cluster'`). There is exactly one instance of each, ever. + +**Architecture Invariant:** several fields on `CertManagerConfig` (`mode`, `issuerRef`, `injectAnnotations`) are immutable once set, enforced by CEL `self == oldSelf` rules. This prevents configuration drift that would leave the cluster in an inconsistent state. + +### `bindata/` + +Static YAML manifests for the upstream operand. These are the Deployments, Services, RBAC, NetworkPolicies, Certificates, and ValidatingWebhookConfigurations that the operator creates in the cluster. + +Files in `bindata/external-secrets/resources/` follow the pattern `_.yml`. Network policies in `bindata/external-secrets/` use `.yaml`. + +These files are compiled into Go via go-bindata and live as `pkg/operator/assets/bindata.go` at build time. + +**Architecture Invariant:** bindata manifests are templates, not final resources. The controller decodes them, mutates them (namespace, labels, annotations, images, env vars, security context), and then creates or updates them. Never deploy bindata YAML directly to a cluster. + +**Architecture Invariant:** do not hand-edit `bindata.go`. Run `make update` to regenerate it from the YAML sources. + +### `cmd/external-secrets-operator/` + +The operator binary entrypoint. This is a separate Go module. It sets up the controller-runtime manager, configures metrics/webhook servers (with HTTP/2 disabled by default), wires the cache builder, and starts the controllers. + +**Architecture Invariant:** HTTP/2 is disabled for metrics and webhook servers to mitigate known HTTP/2 vulnerabilities. The `--enable-http2` flag defaults to `false`. + +### `pkg/controller/` + +The heart of the operator. Three controller packages plus shared infrastructure: + +#### `pkg/controller/external_secrets/` + +The main reconciliation controller. Watches `ExternalSecretsConfig`, `ExternalSecretsManager`, and all managed operand resources. The reconciliation flow in `install_external_secrets.go` creates resources in strict dependency order: + +``` +Namespace → NetworkPolicies → ServiceAccounts → Certificates → Secrets +→ TrustedCA ConfigMap → RBAC → Services → Deployments +→ ValidatingWebhooks → CR annotation tracking +``` + +Each resource follows the same pattern: decode from bindata → mutate → check existence → create or update if drifted → record events. + +`controller.go` is the entry point. It defines `controllerManagedResources` (the set of resource types the controller owns), `NewCacheBuilder` (manager-level label-filtered cache), and `SetupWithManager` (watch/predicate wiring). + +`constants.go` contains all asset path constants, label keys, env var names, and network policy prefixes. + +**Architecture Invariant:** the reconciliation order must not change. CR annotation tracking is patched last to ensure obsolete annotations are removed before tracking is updated. + +**Architecture Invariant:** never return both `RequeueAfter` and a non-nil error from `Reconcile`. Return one or the other. + +#### `pkg/controller/external_secrets_manager/` + +Status aggregation controller. Watches `ExternalSecretsManager` and `ExternalSecretsConfig` status. Copies per-controller conditions into the manager's `controllerStatuses` list. Also handles auto-creation of the default `ExternalSecretsManager` CR at startup. + +#### `pkg/controller/crd_annotator/` + +Conditionally registered (only when cert-manager is installed). Watches ESO CRD metadata and `ExternalSecretsConfig`. When cert-manager is enabled with `injectAnnotations: "true"`, patches all ESO CRDs with the `cert-manager.io/inject-ca-from` annotation. + +Builds its own custom cache (`BuildCustomClient`) because it watches a disjoint resource set from the main controller. + +#### `pkg/controller/client/` + +The `CtrlClient` interface. All controllers interact with Kubernetes through this interface, not the raw controller-runtime `client.Client`. It adds `UpdateWithRetry`, `StatusUpdate`, and `Exists` methods. Unit tests use counterfeiter-generated fakes in `client/fakes/`. + +**Architecture Invariant:** all resource updates go through `UpdateWithRetry`, which wraps `retry.RetryOnConflict` with a read-modify-write pattern. This prevents silent data loss from stale resource versions. + +#### `pkg/controller/common/` + +Shared utilities used by all controllers: + +- `errors.go` — `ReconcileError` type with three reasons: `IrrecoverableError` (no retry), `RetryRequiredError` (30s requeue), `UserConfigurationError` (no retry, watches drive recovery). `FromClientError` auto-classifies Kubernetes API errors. +- `constants.go` — singleton object names, annotation keys, version strings. +- `utils.go` — `HasObjectChanged` (type-specific drift detection), `Decode*ObjBytes` (bindata decoders that panic on failure), `ApplyResourceMetadata`, finalizer helpers, the `Now` resettable-once type. + +**Architecture Invariant:** `Decode*ObjBytes` helpers panic on failure. This is intentional — they are called only with compile-time-known static assets, so a panic indicates a build-time bug, not a runtime error. + +#### `pkg/controller/commontest/` + +Shared test fixtures: `TestExternalSecretsConfig()`, `TestExternalSecretsManager()`, standard test constants and a sentinel error. + +### `pkg/operator/` + +Manager setup and controller registration. `setup_manager.go` wires all three controllers and handles the default ESM resource creation. `assets/bindata.go` is the generated bindata. + +### `config/` + +Kustomize manifests for CRDs, RBAC, manager deployment, samples, and OLM bundle inputs. CRDs in `config/crd/bases/` and RBAC in `config/rbac/role.yaml` are generated — do not hand-edit them. + +### `test/` + +A separate Go module (`./test` in `go.work`) containing: + +- `test/apis/` — API integration tests using Ginkgo + envtest. A generator in `generator.go` auto-creates Ginkgo specs from the YAML test suites in `api/v1alpha1/tests/`. +- `test/e2e/` — End-to-end tests using Ginkgo against a live cluster. Filtered by Ginkgo labels (`Platform:AWS`, `Provider:Bitwarden`, etc.). +- `test/utils/` — Shared E2E helpers (dynamic resource loading, pod readiness polling, artifact dumping). + +### `hack/` + +Shell scripts for codegen, verification, and CI. Notable scripts: +- `go-fips.sh` — enables FIPS build tags for production. +- `test-apis.sh` — runs API tests with Ginkgo flags. +- `e2e-coverage.sh` — manages coverage-instrumented E2E builds. + +### `tools/` + +A separate Go module for build-time tool dependencies (controller-gen, golangci-lint, ginkgo, etc.). All tools are vendored and built from source. + +## Cross-Cutting Concerns + +### Two Clients + +The operator maintains two Kubernetes clients: + +- **Cached client** (`r.CtrlClient`): reads from the manager cache, which is configured with label selectors (`app=external-secrets`) via `NewCacheBuilder`. Used for all managed operand resources. +- **Uncached client** (`r.UncachedClient`): reads directly from the API server. Used only for objects not tracked by the cache — cert-manager Issuers, user-provided Secrets (Bitwarden `secretRef`), and one-time bootstrap operations. + +**Architecture Invariant:** never use the uncached client for objects in `controllerManagedResources`. The cache is designed for those. + +### Drift Detection + +The operator continuously reconciles all managed resources back to desired state. `HasObjectChanged` in `pkg/controller/common/utils.go` uses type-specific field comparison (not full `DeepEqual`) to detect drift: + +- Deployments: containers, init containers, volumes, affinity, tolerations, env vars, security context, revision history limit, etc. +- RBAC: Rules (Roles) or RoleRef+Subjects (Bindings) +- NetworkPolicies: PodSelector, PolicyTypes, Ingress, Egress +- Webhooks: individual webhook entries by name +- Certificates: full Spec via DeepEqual + +Annotations are compared using managed-key tracking, which only checks keys the operator manages, ignoring annotations set by external controllers. This prevents infinite reconcile loops. + +**Architecture Invariant:** if someone manually modifies a managed ClusterRole, Deployment, or NetworkPolicy, the operator will detect the change and revert it. This is a critical security property. + +### Error Classification + +Reconciliation errors flow through three channels: + +| Reason | Requeue? | Status | +|---|---|---| +| `IrrecoverableError` | No | Degraded=True | +| `UserConfigurationError` | No (except NotFound → 30s) | Degraded=True | +| `RetryRequiredError` | Yes (30s) | Ready=False/Progressing | + +`FromClientError` auto-classifies Kubernetes API errors: `Unauthorized`, `Forbidden`, `Invalid`, and `BadRequest` become irrecoverable; everything else becomes retry-required. + +### Code Generation + +Several artifacts are generated and must be committed: + +- `zz_generated.deepcopy.go` — from `make generate` +- `config/crd/bases/*.yaml` — from `make manifests` +- `pkg/operator/assets/bindata.go` — from `make update-bindata` +- `config/rbac/role.yaml` — from `make manifests` +- `docs/api_reference.md` — from `make docs` + +`make update` runs the full pipeline. `make verify` checks that generated files are fresh via `check-git-diff`. CI will reject PRs with stale generated files. + +**Architecture Invariant:** never edit generated files by hand. Always use `make update`. + +### Container Security + +All operand containers are hardened programmatically via `updateContainerSecurityContext()`: + +- `AllowPrivilegeEscalation: false` +- `Capabilities.Drop: ["ALL"]` +- `ReadOnlyRootFilesystem: true` +- `RunAsNonRoot: true` +- `SeccompProfile.Type: RuntimeDefault` + +All Dockerfiles run as `USER 65534:65534` (nobody). The reconciler drift-detects container security context changes and reverts them. + +### Network Policy Architecture + +The operator deploys a **deny-all** base NetworkPolicy, then layers specific allow-policies (prefixed `eso-sys-`) for API server egress, webhook traffic, DNS, and conditionally cert-controller and Bitwarden. User-defined policies (prefixed `eso-user-`) are restricted to egress only. + +### Go Workspace + +The repo uses `go.work` with four modules: `.`, `./cmd/external-secrets-operator`, `./test`, `./tools`. This means: + +- `GOFLAGS` is cleared for test and fmt targets to avoid `-mod=vendor` conflicting with `go.work`. +- Dependencies are updated across all modules via `make update-dep PKG=pkg@version`. +- Vendoring is workspace-level (`go work vendor`). +- All build-time tools are vendored and built from source. + +### Relationship to Upstream + +This operator manages — but does not contain — the upstream [external-secrets](https://github.com/external-secrets/external-secrets) project. The upstream project defines the CRDs that end users interact with (`ExternalSecret`, `SecretStore`, `ClusterSecretStore`, `PushSecret`, etc.) and the controllers that sync secrets from external providers. This operator's job is to deploy, configure, and lifecycle-manage those upstream components on OpenShift, providing an opinionated, security-hardened, and OLM-integrated installation. + +The upstream operand version is controlled by the `RELATED_IMAGE_EXTERNAL_SECRETS` environment variable, typically set by OLM/CSV. The operator has no compile-time dependency on the upstream Go code. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 00000000..588f94e4 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,51 @@ +# CLAUDE.md + +@AGENTS.md + +## Build and Test Commands + +When working in this repository, use these commands: + +### Before Committing +```bash +make update && make verify +``` +This regenerates all code, manifests, and bindata, then runs verification checks. CI will reject PRs with stale generated files. + +### Development Workflow +```bash +make build # Full build with codegen +make build-operator # Fast rebuild (binary only, no codegen) +make test # Run unit + API tests (no cluster needed) +make lint # Run golangci-lint +make lint-fix # Auto-fix linting issues +``` + +### Testing +```bash +make test-unit # Unit tests only +make test-apis # API validation tests (envtest) +make test-e2e # E2E tests (requires cluster) +``` + +### Dependency Management +```bash +make update-vendor # Sync vendor across all workspace modules +make update-dep PKG=package@version # Update a dependency in all modules +``` + +## Claude Code Preferences + +- Always run `make update && make verify` after code changes that affect: + - CRD definitions (`api/v1alpha1/`) + - Kubebuilder markers (`+kubebuilder:*`) + - Bindata YAML (`bindata/`) + - Generated code triggers + +- Use `make lint-fix` to automatically fix formatting and linting issues before suggesting manual fixes + +- The repository uses a Go workspace (`go.work`) with 4 modules. Never manually edit `GOFLAGS` or use `-mod=vendor` in test/fmt commands — the Makefile handles this + +- Container builds default to `podman`. Override with `CONTAINER_TOOL=docker` if needed + +- All build-time tools are vendored. Do not suggest installing tools globally diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..efc84d71 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,191 @@ +# Contributing to External Secrets Operator for Red Hat OpenShift + +Thank you for your interest in contributing! This guide covers the PR process, review expectations, coding conventions, and testing requirements. + +## Getting Started + +1. Fork the repository and clone your fork. +2. Create a feature branch from `main`. +3. Set up your environment (Go version matching `go.mod`, `kubectl` v1.32.1+, access to a Kubernetes cluster for E2E tests). + +## Development Workflow + +### Build and Verify + +```sh +make build # Full build: codegen + manifests + fmt + vet + compile +make build-operator # Fast rebuild (binary only, no codegen) +``` + +After **any** code change, regenerate and verify before committing: + +```sh +make update && make verify +``` + +This regenerates all code, manifests, bindata, and bundle artifacts, then runs verification checks including `check-git-diff`. CI will reject PRs with stale generated files. + +### Linting + +```sh +make lint # Run golangci-lint with all configured linters +make lint-fix # Auto-fix linting issues +``` + +### Testing + +Run the full local test suite (no cluster required): + +```sh +make test # Unit + API integration tests +``` + +Individual test targets: + +| Target | Description | +|---|---| +| `make test-unit` | Unit tests (excludes `test/e2e`, `test/apis`, `test/utils`) | +| `make test-apis` | API validation tests via envtest | +| `make test-e2e` | End-to-end tests against a live cluster | + +For E2E tests with label filters: + +```sh +make test-e2e E2E_GINKGO_LABEL_FILTER="" +``` + +See [test/e2e/README.md](test/e2e/README.md) for E2E prerequisites and suite-specific instructions. + +### Pre-Submission Checklist + +Before opening a PR, always run: + +```sh +make lint +make test +make update && make verify +``` + +## Coding Conventions + +### Import Order + +Imports must follow the order enforced by `gci` in `.golangci.yml`: + +1. Standard library +2. Third-party packages +3. `github.com/openshift/external-secrets-operator` (project-local) +4. Blank imports, dot imports, aliases, local module + +### File Headers + +All `.go` files must include the Apache 2.0 license header from [`hack/boilerplate.go.txt`](hack/boilerplate.go.txt). + +### Naming Conventions + +- **Go packages**: Controller packages use `snake_case` (`external_secrets`, `crd_annotator`). +- **Controller names**: kebab-case in strings (`"external-secrets-controller"`, `"crd-annotator"`). +- **Finalizers**: `./`. +- **Asset name constants**: camelCase with pattern `_AssetName`. +- **Bindata YAML files**: `_.yml` in `bindata/external-secrets/resources/`. + +### Linter Rules + +The repo uses golangci-lint v2 (see [`.golangci.yml`](.golangci.yml)). Key rules to be aware of: + +- **depguard** blocks `math/rand` (use `math/rand/v2`), `github.com/sirupsen/logrus`, and `github.com/pkg/errors` (use `errors`/`fmt`). +- **golines** enforces a max line length of 200 characters. +- **gofmt** rewrites `interface{}` to `any`. + +### Error Handling + +- Wrap API call errors with `common.FromClientError`. +- Use `common.NewIrrecoverableError` for config validation failures. +- Never return both `RequeueAfter` and a non-nil error from `Reconcile`. + +### Generated Files + +Never edit these files by hand — always use `make update`: + +- `zz_generated.deepcopy.go` +- `config/crd/bases/*.yaml` +- `pkg/operator/assets/bindata.go` +- `config/rbac/role.yaml` +- `docs/api_reference.md` + +## Testing Requirements + +### Unit Tests + +- Add unit tests for new reconciliation logic using table-driven tests and `FakeCtrlClient`. +- Unit tests use counterfeiter-generated fakes from `pkg/controller/client/fakes/`. +- Regenerate fakes after changing the `CtrlClient` interface: `go generate ./pkg/controller/client/...` + +### API Tests + +- Add test cases in `api/v1alpha1/tests//` for any new CRD field or validation rule. +- API tests run via envtest (Ginkgo). + +### E2E Tests + +- Add E2E test cases with appropriate Ginkgo labels for platform-specific tests. +- E2E tests require a live cluster with the operator deployed and operands stable. + +## Pull Request Process + +### Creating a PR + +1. Keep diffs small and focused. One logical change per PR. +2. Write a clear, descriptive PR title. +3. Reference the relevant Jira ticket in your commit messages (e.g., `OCPBUGS-12345: description`). +4. Describe your changes and the motivation behind them in the PR body. + +### What to Include + +- **Code changes** with corresponding tests (unit, API, or E2E as appropriate). +- **Generated file updates** — run `make update` and commit the results. CI checks for staleness via `check-git-diff`. +- **Documentation updates** — update `README.md` or files under `docs/` when behavior changes are user-visible. + +### Review Process + +- PRs are reviewed and approved by maintainers listed in the [OWNERS](OWNERS) file. +- All CI checks must pass before merge. +- Reviewers will check for adherence to the architectural patterns documented in [AGENTS.md](AGENTS.md), including reconciler structure, resource reconciliation flow, and proper use of the `CtrlClient` interface. + +### CI Expectations + +CI runs the following checks (among others): + +- `make verify` — vet, fmt, dependency verification, bindata verification, generated file verification, govulncheck, and `check-git-diff`. +- `make lint` — full golangci-lint suite. +- `make test` — unit and API integration tests. + +Ensure all of these pass locally before pushing. + +## Go Workspace + +The repo uses `go.work` with four modules (`.`, `./cmd/external-secrets-operator`, `./test`, `./tools`). Key implications: + +- Do **not** use `-mod=vendor` for `go test` or `go fmt` — the Makefile handles `GOFLAGS` clearing. +- To add a dependency: `make update-dep PKG=pkg@version`, then `make update-vendor`. +- All build-time tools are vendored and built from source. Do not install tools globally. + +## Container Builds + +The default container tool is `podman`. Override with: + +```sh +CONTAINER_TOOL=docker make docker-build +``` + +## Additional Resources + +- [AGENTS.md](AGENTS.md) — Architecture, controller details, and common pitfalls +- [docs/testing-guidelines.md](docs/testing-guidelines.md) — Detailed testing framework and patterns +- [docs/api-contracts-guidelines.md](docs/api-contracts-guidelines.md) — CRD types, kubebuilder markers, CEL validation +- [docs/error-handling-guidelines.md](docs/error-handling-guidelines.md) — ReconcileError types, retry logic, status conditions +- [docs/security-guidelines.md](docs/security-guidelines.md) — Container security, RBAC, TLS, network policies + +## License + +By contributing, you agree that your contributions will be licensed under the [Apache License 2.0](LICENSE). diff --git a/GLOSSARY.md b/GLOSSARY.md new file mode 100644 index 00000000..21e1698a --- /dev/null +++ b/GLOSSARY.md @@ -0,0 +1,191 @@ +# Glossary + +This glossary defines domain-specific terminology used throughout the External Secrets Operator codebase and documentation. + +## Core Concepts + +**Bindata** +Static assets (YAML manifests) compiled into the Go binary via the `go-bindata` tool. Located in `bindata/` directory and compiled to `pkg/operator/assets/bindata.go`. These are templates that are decoded and mutated at runtime. + +**ESC** +Short for `ExternalSecretsConfig`, the primary user-facing CRD. A cluster-scoped singleton CR named "cluster" that controls operand installation, cert-manager integration, Bitwarden plugin, proxy configuration, and network policies. + +**ESM** +Short for `ExternalSecretsManager`, the global operator-level configuration CRD. A cluster-scoped singleton CR named "cluster" that provides lower-priority defaults and optional features. Auto-created at operator install. + +**ESO** +Ambiguous term that can mean: +1. **External Secrets Operator** - This project (the OpenShift operator that manages the operand) +2. **External-Secrets.io** - The upstream open-source project we deploy as the operand + +Context determines which meaning applies. In code comments and docs, prefer "operator" or "operand" for clarity. + +**Operand** +The upstream [external-secrets](https://github.com/external-secrets/external-secrets) application that this operator manages. The operand consists of deployments, services, RBAC, webhooks, and CRDs that end users interact with (`ExternalSecret`, `SecretStore`, `ClusterSecretStore`, etc.). + +**Operator** +This codebase (`github.com/openshift/external-secrets-operator`). The operator manages the operand's lifecycle (installation, upgrades, configuration, drift detection). + +## Resource Types + +**Managed Resources** +Kubernetes resources owned and reconciled by the operator, labeled with `app=external-secrets`. Includes operand Deployments, Services, RBAC, NetworkPolicies, Secrets, ConfigMaps, Certificates, and ValidatingWebhookConfigurations. The operator drift-detects and reverts manual changes to these resources. + +**Singleton CR** +A custom resource limited to a single instance cluster-wide. Both `ExternalSecretsConfig` and `ExternalSecretsManager` are singletons enforced via CEL validation (`self.metadata.name == 'cluster'`). + +**Watched Resources** +Kubernetes resources that trigger reconciliation when changed, but are not owned by the operator. Examples: user-provided ConfigMaps (trusted CA bundle), cert-manager Issuers/ClusterIssuers, upstream ESO CRDs. + +## Architectural Patterns + +**Drift Detection** +The process of comparing actual cluster state against desired state and reverting manual changes. Implemented via `common.HasObjectChanged()` which uses type-specific field comparison (not full `DeepEqual`). + +**Reconciliation Order** +The strict dependency order in which resources are created: +`Namespace → NetworkPolicies → ServiceAccounts → Certificates → Secrets → TrustedCA ConfigMap → RBAC → Services → Deployments → ValidatingWebhooks → CR annotation tracking` + +**Conditional Resources** +Resources created only when specific configuration is enabled. Examples: cert-controller deployment (when cert-manager is NOT enabled), Bitwarden deployment (when Bitwarden IS enabled), Certificates (when cert-manager IS enabled). + +**Three-Controller Architecture** +The operator runs three separate controllers in a single binary: +- `external-secrets-controller` - main operand reconciliation +- `external-secrets-manager` - status aggregation +- `crd-annotator` - cert-manager CA injection (conditional) + +## Code Generation + +**Generated Files** +Files created by code generators that must NOT be edited by hand: +- `zz_generated.deepcopy.go` - DeepCopy methods (from `make generate`) +- `config/crd/bases/*.yaml` - CRD manifests (from `make manifests`) +- `pkg/operator/assets/bindata.go` - Compiled YAML assets (from `make update-bindata`) +- `config/rbac/role.yaml` - Operator RBAC (from `make manifests`) +- `docs/api_reference.md` - API documentation (from `make docs`) + +Always use `make update` after changes that affect these files. + +**Kubebuilder Markers** +Special Go comments (`// +kubebuilder:...`) that control code generation. Examples: +- `+kubebuilder:validation:XValidation` - CEL validation rules +- `+kubebuilder:rbac` - RBAC role generation +- `+kubebuilder:printcolumn` - kubectl output columns +- `+kubebuilder:resource` - CRD metadata + +## Error Handling + +**ReconcileError** +Custom error type in `pkg/controller/common/errors.go` with three reasons: +- `IrrecoverableError` - No retry (config validation failures, permission errors) +- `RetryRequiredError` - Requeue after 30s (transient failures, conflicts) +- `UserConfigurationError` - No retry, recovery driven by watches (invalid user config) + +**FromClientError** +Helper function that auto-classifies Kubernetes API errors into the appropriate `ReconcileError` reason. `Unauthorized`, `Forbidden`, `Invalid`, and `BadRequest` become irrecoverable; everything else becomes retry-required. + +## Testing + +**envtest** +Kubernetes API server testing environment used for API integration tests. Runs a real API server (etcd + kube-apiserver) without kubelet. Requires Kubernetes >= 1.25 for CEL validation support. + +**FakeCtrlClient** +Counterfeiter-generated mock implementation of the `CtrlClient` interface used in unit tests. Located in `pkg/controller/client/fakes/`. + +**Table-Driven Tests** +Unit test pattern where test cases are defined as a slice of structs, then iterated with `t.Run()`. Standard pattern across all `*_test.go` files in `pkg/`. + +**Test Tiers** +Three levels of testing: +1. **Unit** - `pkg/**/*_test.go`, stdlib testing, no cluster +2. **API** - `test/apis/`, Ginkgo + envtest, declarative YAML test suites +3. **E2E** - `test/e2e/`, Ginkgo + live cluster, labeled by platform/provider + +## Integrations + +**cert-manager** +Optional dependency for automated TLS certificate provisioning. When installed, the operator can create `Certificate` resources instead of using the built-in cert-controller. Detected via CRD discovery at startup. + +**Bitwarden Plugin** +The only currently supported external secrets provider plugin. Runs as a separate `bitwarden-sdk-server` deployment when enabled via `spec.plugins.bitwardenSecretManagerProvider.mode: Enabled`. + +**CNO (Cluster Network Operator)** +OpenShift component that injects cluster CA certificates into ConfigMaps labeled with `config.openshift.io/inject-trusted-cabundle: "true"`. Used for proxy CA bundle injection. + +**OLM (Operator Lifecycle Manager)** +OpenShift framework for managing operator installation, upgrades, and permissions. Sets environment variables like `RELATED_IMAGE_EXTERNAL_SECRETS` on the operator pod. + +## Network & Security + +**Network Policy Architecture** +Default-deny base policy with layered allow-policies: +- System policies (prefixed `eso-sys-`) for API server, webhook, DNS, cert-controller, Bitwarden +- User policies (prefixed `eso-user-`) for custom egress rules +- Deny-all base applied first, allow-policies layered on top + +**Hardened Security Context** +Programmatically enforced container security settings: +- `AllowPrivilegeEscalation: false` +- `Capabilities.Drop: ["ALL"]` +- `ReadOnlyRootFilesystem: true` +- `RunAsNonRoot: true` +- `SeccompProfile.Type: RuntimeDefault` + +Applied via `updateContainerSecurityContext()` to all operand containers. + +**HTTP/2 Disabled** +Default configuration (`--enable-http2=false`) to mitigate known HTTP/2 vulnerabilities. Applies to operator metrics and webhook servers. + +## Tooling + +**go.work (Go Workspace)** +Multi-module workspace configuration with four modules: `.`, `./cmd/external-secrets-operator`, `./test`, `./tools`. Requires special handling in Makefile (clearing `GOFLAGS` for test/fmt targets). + +**controller-runtime** +Kubernetes controller framework providing manager, cache, client, reconciler primitives. The operator uses version specified in `go.mod`. + +**Ginkgo** +BDD-style testing framework used for API and E2E tests. Provides `Describe`, `Context`, `It`, `BeforeEach`, `AfterEach`, labels, and table-driven specs. + +**counterfeiter** +Mock/fake code generator used to create `FakeCtrlClient` from the `CtrlClient` interface. Run via `go generate ./pkg/controller/client/...`. + +## Abbreviations + +- **API**: Application Programming Interface (or API integration tests) +- **CA**: Certificate Authority +- **CEL**: Common Expression Language (Kubernetes validation) +- **CR**: Custom Resource +- **CRD**: Custom Resource Definition +- **CSV**: ClusterServiceVersion (OLM bundle metadata) +- **E2E**: End-to-End (integration tests on live cluster) +- **FIPS**: Federal Information Processing Standards +- **RBAC**: Role-Based Access Control +- **TLS**: Transport Layer Security +- **YAML**: YAML Ain't Markup Language + +## Field Naming Conventions + +**Asset Name Constants** +Pattern: `_AssetName` in camelCase +Example: `controllerDeploymentAssetName`, `webhookServiceAssetName` + +**Environment Variable Constants** +Pattern: all-caps with suffix `EnvVarName` +Example: `externalSecretsImageEnvVarName`, `bitwardenSDKServerImageEnvVarName` + +**Finalizer Format** +Pattern: `./` +Example: `externalsecretsconfigs.operator.openshift.io/external-secrets-controller` + +**Bindata YAML Files** +Pattern: `_.yml` in `bindata/external-secrets/resources/` +Example: `deployment_external-secrets.yml`, `clusterrole_external-secrets-controller.yml` + +## See Also + +- **AGENTS.md** - Entry point for AI agents and developers +- **ARCHITECTURE.md** - High-level system architecture +- **docs/decisions/** - Architecture Decision Records explaining why things are designed as they are +- **CONTRIBUTING.md** - Contribution process and coding conventions diff --git a/docs/CONTEXT_MAP.md b/docs/CONTEXT_MAP.md new file mode 100644 index 00000000..a648abb9 --- /dev/null +++ b/docs/CONTEXT_MAP.md @@ -0,0 +1,253 @@ +# Context Documentation Map + +This document maps all documentation in this repository, organized by audience and purpose. + +## For Track 1: Contextualization + +This repository implements **prescriptive context registration** following the Track 1 approach: + +- **Prescriptive context** (official organizational direction): architecture, decisions, guidelines +- **Process context**: contributing workflows + +When the Cyborg schema supports context registration (Track 1 Phase 2), this repo can register: +- **Entry point**: AGENTS.md +- **Architecture**: ARCHITECTURE.md + docs/decisions/ +- **Domain guidelines**: docs/*-guidelines.md +- **Terminology**: GLOSSARY.md +- **Process**: CONTRIBUTING.md + +--- + +## Entry Points by Audience + +### 🤖 For AI Agents + +**Start here**: [AGENTS.md](../AGENTS.md) + +AGENTS.md provides: +- Quick index to domain-specific guidelines +- Project structure and build system +- Architectural patterns +- Common pitfalls + +**Then read**: +- [GLOSSARY.md](../GLOSSARY.md) - Terminology definitions +- [docs/decisions/](decisions/) - Architecture Decision Records (why decisions were made) +- Domain-specific guidelines linked from AGENTS.md + +### 👨‍💻 For New Contributors + +**Start here**: [CONTRIBUTING.md](../CONTRIBUTING.md) + +**Then read**: +- [AGENTS.md](../AGENTS.md) - Architecture and patterns +- [GLOSSARY.md](../GLOSSARY.md) - Terminology +- [docs/testing-guidelines.md](testing-guidelines.md) - Testing framework details + +**Before first PR**: +- Run `make lint && make test && make update && make verify` + +### 🏗️ For Architects & Tech Leads + +**Start here**: [ARCHITECTURE.md](../ARCHITECTURE.md) + +**Then read**: +- [docs/decisions/](decisions/) - ADRs explaining "why" +- Domain guidelines: + - [api-contracts-guidelines.md](api-contracts-guidelines.md) - CRD design + - [performance-guidelines.md](performance-guidelines.md) - Cache, reconciliation patterns + - [security-guidelines.md](security-guidelines.md) - RBAC, network policies, hardening + +--- + +## Documentation Hierarchy + +### Organizational Level (Future - Track 1 Phase 2) + +When registered in Cyborg, this repo will inherit context from: +- **Hybrid Platforms org-level** - Development standards, escalation procedures +- **Area/Group/Pillar-level** - Operational context + +*Currently: Not yet registered (waiting for Cyborg schema changes)* + +### Repo-Level Documentation + +#### Entry Points + +| Document | Audience | Purpose | +|---|---|---| +| [AGENTS.md](../AGENTS.md) | AI agents, developers | Technical entry point with architecture, patterns | +| [README.md](../README.md) | All | General overview | +| [CONTRIBUTING.md](../CONTRIBUTING.md) | Contributors | PR process and coding conventions | +| [ARCHITECTURE.md](../ARCHITECTURE.md) | Architects | High-level system design | + +#### Reference + +| Document | Purpose | +|---|---| +| [GLOSSARY.md](../GLOSSARY.md) | Domain-specific terminology | +| [CLAUDE.md](../CLAUDE.md) | Claude Code quick reference | + +### Domain-Specific Technical Docs + +#### Guidelines (Prescriptive) + +Located in `docs/`: + +| Document | Domain | +|---|---| +| [api-contracts-guidelines.md](api-contracts-guidelines.md) | CRD types, kubebuilder markers, CEL validation | +| [error-handling-guidelines.md](error-handling-guidelines.md) | ReconcileError types, retry logic, status conditions | +| [integration-guidelines.md](integration-guidelines.md) | Bindata pattern, cert-manager, OpenShift platform | +| [performance-guidelines.md](performance-guidelines.md) | Cache architecture, watch predicates, reconciliation | +| [security-guidelines.md](security-guidelines.md) | Container security, RBAC, TLS, network policies | +| [testing-guidelines.md](testing-guidelines.md) | Unit/API/E2E test tiers, frameworks | + +#### Architecture Decision Records + +Located in `docs/decisions/`: + +| ADR | Decision | +|---|---| +| [001-bindata-over-helm.md](decisions/001-bindata-over-helm.md) | Why static YAML + bindata instead of Helm | +| [002-three-controller-design.md](decisions/002-three-controller-design.md) | Why three controllers instead of monolith | +| [003-cel-validation-only.md](decisions/003-cel-validation-only.md) | Why CEL validation instead of webhooks | +| [004-singleton-cr-pattern.md](decisions/004-singleton-cr-pattern.md) | Why singleton CRs named "cluster" | +| [README.md](decisions/README.md) | ADR index | + +### Generated Documentation + +| Document | Generated From | Purpose | +|---|---|---| +| [api_reference.md](api_reference.md) | `api/v1alpha1/` types | API field documentation | +| `config/crd/bases/*.yaml` | Kubebuilder markers | CRD manifests | +| `config/rbac/role.yaml` | `+kubebuilder:rbac` markers | Operator RBAC | + +**Never edit generated files manually** - use `make update`. + +--- + +## Documentation Type Classification + +For **Track 1 context registration**: + +### Prescriptive Context (Authoritative) + +Official decisions and architectural direction: +- AGENTS.md (patterns) +- ARCHITECTURE.md (design) +- docs/*-guidelines.md (how to implement correctly) +- docs/decisions/ (ADRs - why decisions were made) +- CONTRIBUTING.md (process requirements) +- api_reference.md (API contracts) + +### Meta-Documentation + +Navigation and terminology: +- GLOSSARY.md (defines terms) +- CONTEXT_MAP.md (this file - navigation) +- docs/decisions/README.md (ADR index) + +--- + +## Finding Information + +### "How do I...?" + +| Question | Start Here | +|---|---| +| Understand the architecture? | [ARCHITECTURE.md](../ARCHITECTURE.md) → [AGENTS.md](../AGENTS.md) | +| Add a new CRD field? | [api-contracts-guidelines.md](api-contracts-guidelines.md) | +| Write tests? | [testing-guidelines.md](testing-guidelines.md) | +| Understand error handling? | [error-handling-guidelines.md](error-handling-guidelines.md) | +| Configure RBAC? | [security-guidelines.md](security-guidelines.md) | + +### "Why is it designed this way?" + +All "why" questions → [docs/decisions/](decisions/) + +| Topic | ADR | +|---|---| +| Why bindata instead of Helm? | [ADR-001](decisions/001-bindata-over-helm.md) | +| Why three controllers? | [ADR-002](decisions/002-three-controller-design.md) | +| Why CEL validation? | [ADR-003](decisions/003-cel-validation-only.md) | +| Why singleton CRs? | [ADR-004](decisions/004-singleton-cr-pattern.md) | + +### "What does X mean?" + +[GLOSSARY.md](../GLOSSARY.md) - Comprehensive terminology reference + +--- + +## Cross-References + +### Most Referenced Docs + +1. **AGENTS.md** - Referenced by README, CONTRIBUTING, CLAUDE.md, all guidelines +2. **ARCHITECTURE.MD** - Referenced by AGENTS, ADRs, guidelines +3. **GLOSSARY.md** - Referenced by technical docs for terminology +4. **CONTRIBUTING.md** - Referenced by README, AGENTS + +### Documentation Dependencies + +``` +README.md + ├── AGENTS.md (AI entry point) + ├── CONTRIBUTING.md (process) + └── ARCHITECTURE.md (design) + +AGENTS.md + ├── docs/*-guidelines.md (domain docs) + ├── GLOSSARY.md (terminology) + └── ARCHITECTURE.md (system design) + +ARCHITECTURE.md + └── docs/decisions/ (why decisions) + +CONTRIBUTING.md + ├── AGENTS.md (architecture context) + └── docs/testing-guidelines.md (testing) +``` + +--- + +## Version History + +**Track 1 Contextualization** (PR #150 + enhancements): +- **PR #150**: AGENTS.md, ARCHITECTURE.md, CONTRIBUTING.md, 6 guideline docs +- **Enhancements**: ADRs, GLOSSARY, CONTEXT_MAP + +--- + +## Related: Track 1 Contextualization + +This repository implements **prescriptive context registration** as described in the Agentic SDLC Strategy Track 1. + +**What we've done**: +- ✅ Written prescriptive context (ADRs, guidelines, architecture) +- ✅ Centralized context in structured format +- ✅ Provided entry points for AI agents and developers +- ✅ Defined terminology via GLOSSARY + +**What's next (Track 1 Phase 2)**: +- ⏳ Cyborg schema extension for context registration +- ⏳ Register this repo's context (AGENTS.md, decisions/, guidelines/) +- ⏳ Service account access for AI agents + +**Context registry example**: +```yaml +# Future Cyborg YAML +team: external-secrets-operator +context: + entry_point: github.com/.../AGENTS.md + architecture: github.com/.../ARCHITECTURE.md + decisions: github.com/.../docs/decisions/ + domain_guidelines: + - github.com/.../docs/api-contracts-guidelines.md + - github.com/.../docs/error-handling-guidelines.md + # ... etc + terminology: github.com/.../GLOSSARY.md + process: github.com/.../CONTRIBUTING.md +``` + +This documentation is already usable by AI tools - registration makes it **discoverable** across the org. diff --git a/docs/api-contracts-guidelines.md b/docs/api-contracts-guidelines.md new file mode 100644 index 00000000..71e947d8 --- /dev/null +++ b/docs/api-contracts-guidelines.md @@ -0,0 +1,157 @@ +# API Contracts Guidelines + +## API Group and Versioning + +- Group: `operator.openshift.io`, Version: `v1alpha1`. All types live in `api/v1alpha1/`. +- Only one version exists. The package-level marker `+groupName=operator.openshift.io` in `groupversion_info.go` sets the group. +- Every new type must be registered via `SchemeBuilder.Register()` in its `init()` function within the types file. + +## CRD Resources + +Two cluster-scoped singletons exist: + +| Kind | Plural | Short Names | Purpose | +|---|---|---|---| +| ExternalSecretsConfig | externalsecretsconfigs | esc, externalsecretsconfig, esconfig | Configures the external-secrets operand | +| ExternalSecretsManager | externalsecretsmanagers | esm, externalsecretsmanager, esmanager | Global operator-level config, auto-created at install | + +Both are enforced singletons via CEL: `self.metadata.name == 'cluster'`. Both use `+genclient:nonNamespaced` and `scope=Cluster`. + +## Required Kubebuilder Markers on CRD Types + +Every root CRD type must carry these markers: +``` ++genclient ++genclient:nonNamespaced ++k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object ++kubebuilder:object:root=true ++kubebuilder:subresource:status ++kubebuilder:resource:path=,scope=Cluster,categories={external-secrets-operator, external-secrets},shortName= ++kubebuilder:metadata:labels={"app.kubernetes.io/name=", "app.kubernetes.io/part-of=external-secrets-operator"} ++operator-sdk:csv:customresourcedefinitions:displayName="" +``` + +List types need `+kubebuilder:object:root=true` and `+k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object`. + +## Field-Level Validation Conventions + +### String Fields +- Set `+kubebuilder:validation:MinLength` and `+kubebuilder:validation:MaxLength` on spec string fields (e.g., 1-253 for names, 0-2048 for URLs, 0-4096 for noProxy). +- Status fields (e.g., `ExternalSecretsImage`, `BitwardenSDKServerImage`) and custom condition message fields typically omit validation markers. +- Use `+kubebuilder:validation:Pattern` for key-format constraints (e.g., ConfigMap keys: `^[-._a-zA-Z0-9]+$`). + +### Map Fields +- Set `+kubebuilder:validation:MinProperties` and `+kubebuilder:validation:MaxProperties` on spec maps (typical max: 20 for labels/annotations, 50 for nodeSelector). +- Set `+mapType=granular` for strategic-merge-patchable maps; `+mapType=atomic` for replace-on-update maps. + +### List Fields +- Set `+kubebuilder:validation:MinItems` and `+kubebuilder:validation:MaxItems` on spec lists. +- Status lists (e.g., `Conditions`) typically omit MinItems/MaxItems validation. +- For merge-patchable lists: use `+listType=map` with `+listMapKey=`. Composite keys are supported (e.g., networkPolicies uses both `name` and `componentName`). +- For replace-on-update lists: use `+listType=atomic`. +- Status subresource lists use `+patchMergeKey=` and `+patchStrategy=merge` in addition to listType markers for proper merge behavior. + +### Numeric Fields +- Use `+kubebuilder:validation:Minimum` and `+kubebuilder:validation:Maximum` (e.g., logLevel: 1-5, revisionHistoryLimit: 1-50). + +### Enum Fields +- Use `+kubebuilder:validation:Enum` for closed sets. Define a named Go type with exported constants (e.g., `Mode` with `Enabled`/`Disabled`, `ManagementState` with `Managed`/`Unmanaged`, `ComponentName` with four values). + +## CEL Validation Rules (XValidation) + +CEL rules via `+kubebuilder:validation:XValidation` are the primary mechanism for cross-field and immutability validation. Patterns used: + +- **Singleton enforcement**: `rule="self.metadata.name == 'cluster'"` on the root type. +- **Immutability**: `rule="self == oldSelf"` on individual fields (e.g., certManager `mode`, `issuerRef`, `injectAnnotations`). +- **List key immutability**: `rule="oldSelf.all(op, self.exists(p, p.name == op.name && p.componentName == op.componentName))"` to prevent renaming list entries. +- **Cross-field dependencies**: Complex rules on spec structs to enforce "if X is enabled, then Y or Z must be configured." +- **Annotation domain blocklists**: Regex-based `self.all(key, !key.matches(...))` rules blocking `kubernetes.io/`, `openshift.io/`, `k8s.io/`, `cert-manager.io/` prefixes. +- **Reserved env var names**: `self.all(e, !['PREFIX_'].exists(p, e.name.startsWith(p)) && e.name != 'EXACT_NAME')`. + +Always provide a `message` on every XValidation rule. + +## kubeapilinter Nolint Directives + +The repo uses `//nolint:kubeapilinter` comments for intentional deviations. Document the reason inline. Known patterns: +- `listMapKey` fields must NOT have `omitempty` (for proper patch merge identification). Apply `//nolint:kubeapilinter // is a listMapKey and must not have omitempty`. +- Custom `Condition` type (on ExternalSecretsManager) that intentionally omits some `metav1.Condition` fields. +- `metav1.Duration` fields retained despite linter preference, annotated with `// Duration type retained to avoid breaking API change`. + +## Status Subresource Patterns + +### ExternalSecretsConfig Status +- Embeds `ConditionalStatus` (from `meta.go`) which holds `[]metav1.Condition` with standard merge markers (`+listType=map`, `+listMapKey=type`, `+patchMergeKey=type`, `+patchStrategy=merge`). +- Uses standard `metav1.Condition` with `ObservedGeneration` set to `esc.GetGeneration()`. +- Condition types: `Ready`, `Degraded`, and `UpdateAnnotation` (all defined as constants in `conditions.go`). The `UpdateAnnotation` condition is set by the `crd-annotator` controller on the `ExternalSecretsConfig` status. +- Reasons: `Failed`, `Ready`, `Progressing`, `Completed` (also constants, with `Progressing` defined as constant `ReasonInProgress`). +- Both conditions are set atomically before a single status update call. +- Additional status fields: `externalSecretsImage`, `bitwardenSDKServerImage`. + +### ExternalSecretsManager Status +- Uses a custom `ControllerStatus` list (`+listType=map`, `+listMapKey=name`) tracking per-controller conditions. +- Each `ControllerStatus` contains a custom `[]Condition` type (Type, Status, Message -- no Reason or LastTransitionTime) and `ObservedGeneration`. +- Top-level `lastTransitionTime` is set on every condition change. + +### Status Update Pattern +All controllers follow the same retry-on-conflict pattern: +1. Fetch current object from API server. +2. DeepCopy the changed status into the fetched object. +3. Call `StatusUpdate` (status subresource update). +4. Wrap in `retry.RetryOnConflict(retry.DefaultRetry, ...)`. + +## Defaults + +Use `+kubebuilder:default` for server-side defaulting: +- `logLevel`: 1 +- `mode`: `Disabled` +- `injectAnnotations`: `"false"` +- `certificateCheckInterval`: `"5m"` +- `certificateDuration`: `"8760h"` +- `certificateRenewBefore`: `"30m"` +- `revisionHistoryLimit`: 10 +- `networkPolicyProvisioning`: `Managed` +- `key` (ConfigMapKeyReference): `"ca-bundle.crt"` + +## Shared Types and Composition + +- `CommonConfigs` (in `meta.go`) is embedded via `json:",inline"` in both `ApplicationConfig` and `GlobalConfig`. It provides `logLevel`, `resources`, `affinity`, `tolerations`, `nodeSelector`, `proxy`. +- `ObjectReference`, `SecretReference`, `ConfigMapKeyReference`, `ConditionalStatus` are reusable building blocks in `meta.go`. +- Named types (`Mode`, `ManagementState`, `ComponentName`, `FeatureName`) with constants must be used instead of raw strings for enum fields. + +## Code Generation Pipeline + +After any API type change, run: +1. `make generate` -- regenerates `zz_generated.deepcopy.go` via controller-gen. +2. `make manifests` -- regenerates CRD YAMLs in `config/crd/bases/`, RBAC, and webhook configs. +3. `make update` -- full pipeline: generate + manifests + operand manifests + bindata + bundle + docs. +4. `make docs` -- regenerates `docs/api_reference.md` from `api/v1alpha1/` using `crd-ref-docs`. +5. `make verify` -- runs `check-git-diff` to ensure all generated files are committed. + +Never edit `zz_generated.deepcopy.go` or CRD YAML files by hand. + +## API Integration Tests + +API validation is tested via declarative YAML test suites in `api/v1alpha1/tests//` (e.g., `externalsecretsconfig.operator.openshift.io/`, `externalsecretsmanager.operator.openshift.io/`). Each `.testsuite.yaml` file defines `onCreate` and `onUpdate` test cases specifying `initial`, `expected`, and `expectedError` YAML. The test framework (in `test/apis/`) installs CRDs into a real envtest API server (requires Kube >= 1.25 for CEL). + +When adding a new CEL validation rule or field constraint, add corresponding test cases to the relevant `.testsuite.yaml` covering: +- Valid creation (initial + expected) +- Invalid creation (initial + expectedError with substring match) +- Immutability on update (initial + updated + expectedError) +- Boundary values (min/max lengths, min/max items) + +Run API tests with `make test-apis` or via the general `make test`. + +## Webhook Architecture + +This operator does NOT use kubebuilder-generated admission webhooks for its own CRDs. CEL-based CRD validation handles all admission validation. The webhook code in `pkg/controller/external_secrets/` manages the upstream external-secrets operand's `ValidatingWebhookConfiguration` resources (not the operator's own API validation). + +## Adding New API Fields Checklist + +1. Add the Go field to the appropriate struct in `api/v1alpha1/`. +2. Add validation markers (min/max, enum, XValidation) for spec fields; status fields typically omit validation. +3. Use pointer types for optional sub-structs; value types for required scalars with defaults. +4. Add `+optional` or `+required` marker. Use `json:"fieldName,omitempty"` for optional fields. +5. Exception: listMapKey fields must NOT have `omitempty`; add `//nolint:kubeapilinter` with reason. +6. Write test cases in the corresponding `.testsuite.yaml` for both valid and invalid inputs. +7. Run `make update && make verify` to regenerate and validate all artifacts. +8. If the field adds a new condition type or reason, add constants in `conditions.go`. diff --git a/docs/decisions/001-bindata-over-helm.md b/docs/decisions/001-bindata-over-helm.md new file mode 100644 index 00000000..b70b8a86 --- /dev/null +++ b/docs/decisions/001-bindata-over-helm.md @@ -0,0 +1,97 @@ +# ADR-001: Use Bindata Pattern Instead of Helm Charts + +## Status + +Accepted + +## Context + +The External Secrets Operator needs to manage the deployment and lifecycle of the upstream [external-secrets](https://github.com/external-secrets/external-secrets) application on OpenShift clusters. Several approaches were considered for how to package and deploy the operand resources (Deployments, Services, RBAC, etc.): + +1. **Helm charts**: Use upstream Helm charts with values overrides +2. **Embedded Go code**: Define all Kubernetes resources as Go structs in code +3. **Static YAML manifests (bindata)**: Store YAML templates, compile into binary, decode and mutate at runtime +4. **CRDs only**: Define a CRD that generates resources dynamically + +Each approach has different trade-offs around: +- Control over resource configuration +- Ability to track upstream changes +- Type safety and validation +- Operational complexity +- Drift detection capability + +## Decision + +We will use **static YAML manifests compiled via go-bindata** (option 3). All operand Kubernetes resources are stored as YAML files in `bindata/external-secrets/` and `bindata/external-secrets/resources/`. These are: + +1. Compiled into Go binary via `go-bindata` → `pkg/operator/assets/bindata.go` +2. Decoded at reconcile time using typed decoders (`DecodeDeploymentObjBytes`, etc.) +3. Mutated with operator-controlled configuration (images, env vars, labels, security context) +4. Applied imperatively to the cluster via `Create`/`Update` + +## Rationale + +**Why not Helm?** +- Helm adds operational complexity (Tiller/3-way merge, release tracking) +- Difficult to implement precise drift detection with Helm-managed resources +- Values overrides don't provide complete control over all resource fields +- Helm templating language limits type safety +- OLM (OpenShift's operator lifecycle manager) doesn't require Helm + +**Why not embedded Go code?** +- Harder to review resource definitions (Go structs vs YAML) +- Difficult to track upstream changes (must manually translate YAML → Go) +- Loses the benefits of YAML as the canonical format for Kubernetes resources +- More verbose for large resource definitions + +**Why not pure CRD generation?** +- Upstream resources have many fields that would need to be exposed as CRD spec +- Would create a non-standard API that differs from upstream +- Harder to track upstream changes when they add new resource types + +**Why bindata works for us:** +- ✅ Full control over every field in every resource +- ✅ YAML manifests are easy to review and compare against upstream +- ✅ Type-safe decoding catches schema mismatches at reconcile time +- ✅ Precise drift detection via field-level comparison (`HasObjectChanged`) +- ✅ No external dependencies (Helm, Kustomize) at runtime +- ✅ OLM-compatible pattern used by other OpenShift operators +- ✅ Clear separation between template (bindata YAML) and runtime configuration (mutation) + +## Consequences + +### Positive + +- **Predictable behavior**: Resources are defined explicitly, no hidden templating logic +- **Strong drift detection**: We can detect and revert manual changes to managed resources +- **Build-time validation**: Decode failures panic, indicating build-time bugs +- **Upstream tracking**: Easy to diff bindata YAML against upstream releases +- **Type safety**: Typed decoders ensure schema compatibility + +### Negative + +- **Manual YAML maintenance**: Must manually update bindata when upstream changes +- **Binary size**: Compiled YAML increases binary size (acceptable trade-off) +- **Regeneration step**: Requires `make update-bindata` after YAML changes (enforced by CI) +- **No templating**: Cannot use Helm helpers or Kustomize overlays (we mutate in Go instead) + +### Mitigations + +- CI enforces that `bindata.go` is regenerated via `make verify` → `check-git-diff` +- Decode helpers panic on failure to catch manifest corruption early +- `HasObjectChanged` provides type-specific field comparison for precise drift detection +- Asset name constants in `constants.go` catch typos at compile time + +## Alternatives Considered + +1. **Helm with custom controller** - Rejected due to operational complexity and drift detection limitations +2. **Kustomize** - Rejected for similar reasons to Helm (no runtime templating, harder drift detection) +3. **Hybrid (CRD + bindata)** - Considered but adds complexity without clear benefit + +## References + +- Bindata implementation: `pkg/operator/assets/bindata.go` +- Decode helpers: `pkg/controller/common/utils.go` +- Resource reconciliation pattern: `pkg/controller/external_secrets/install_external_secrets.go` +- Drift detection: `pkg/controller/common/utils.go:HasObjectChanged` +- Upstream project: https://github.com/external-secrets/external-secrets diff --git a/docs/decisions/002-three-controller-design.md b/docs/decisions/002-three-controller-design.md new file mode 100644 index 00000000..d5cfa2dc --- /dev/null +++ b/docs/decisions/002-three-controller-design.md @@ -0,0 +1,131 @@ +# ADR-002: Three-Controller Architecture + +## Status + +Accepted + +## Context + +The External Secrets Operator needs to: +1. Install and manage the upstream external-secrets operand (Deployments, RBAC, Services, etc.) +2. Provide a global status view for operators/admins +3. Conditionally integrate with cert-manager when available + +We needed to decide whether to implement this as: +- A single monolithic controller +- Multiple specialized controllers +- A controller with multiple reconcilers +- An event-driven architecture + +## Decision + +We implement **three separate controllers** running in a single binary: + +| Controller | Package | Primary Watch | Purpose | +|---|---|---|---| +| `external-secrets-controller` | `pkg/controller/external_secrets/` | `ExternalSecretsConfig` CR | Installs/reconciles operand deployments, RBAC, services, webhooks, network policies | +| `external-secrets-manager` | `pkg/controller/external_secrets_manager/` | `ExternalSecretsManager` CR | Aggregates controller statuses into a global status CR | +| `crd-annotator` | `pkg/controller/crd_annotator/` | ESO CRDs (metadata only) | Adds cert-manager CA injection annotations to operand CRDs (conditional) | + +All three controllers are registered in `pkg/operator/setup_manager.go` with a single `controller-runtime` manager. + +## Rationale + +**Why separate controllers instead of a monolith?** + +1. **Separation of concerns** + - Each controller has a distinct purpose and lifecycle + - `crd-annotator` is conditionally registered only when cert-manager is installed + - Status aggregation (`external-secrets-manager`) is independent of resource reconciliation + +2. **Different watch patterns** + - `external-secrets-controller`: Watches `ExternalSecretsConfig` + managed resources (labeled `app=external-secrets`) + - `external-secrets-manager`: Watches `ExternalSecretsManager` + `ExternalSecretsConfig` status + - `crd-annotator`: Watches ESO CRDs (metadata only, label-filtered) + `ExternalSecretsConfig` + +3. **Independent cache configurations** + - Main controller uses manager-level label-filtered cache + - `crd-annotator` builds its own custom cache because it watches a disjoint resource set + +4. **Testability** + - Each controller can be tested independently + - Fakes for `CtrlClient` interface work across all controllers + - Unit tests don't need to mock the entire system + +5. **Conditional registration** + - `crd-annotator` is only registered when cert-manager CRD exists + - Avoids startup failures when cert-manager is not installed + - Clean conditional logic in `setup_manager.go` + +**Why not event-driven or pub/sub?** +- Kubernetes watch-based reconciliation is simpler and more reliable +- No need for message queues or complex event routing +- Controller-runtime provides battle-tested reconciliation primitives + +**Why not multiple binaries?** +- Single binary simplifies deployment and versioning +- All controllers share the same manager and cache (where appropriate) +- Easier to coordinate status updates and configuration changes + +## Consequences + +### Positive + +- **Clear boundaries**: Each controller has well-defined responsibilities +- **Conditional features**: `crd-annotator` can be disabled cleanly when cert-manager is absent +- **Independent evolution**: Controllers can change independently within their interface contracts +- **Testable**: Each controller has focused unit tests with minimal mocking +- **Debuggable**: Controller-specific logs and metrics make troubleshooting easier + +### Negative + +- **Coordination complexity**: Must ensure controllers don't conflict (e.g., status update races) +- **More code**: Three controller packages instead of one (acceptable trade-off) +- **Watch overhead**: Three sets of watches on some resources (mitigated by predicates) + +### Mitigations + +- **Status coordination**: `external-secrets-manager` aggregates status from `ExternalSecretsConfig`, avoiding races +- **Watch predicates**: Each controller uses predicates to filter events (`GenerationChangedPredicate`, `LabelChangedPredicate`, etc.) +- **Shared interfaces**: `CtrlClient` interface used by all controllers, fakes shared across tests +- **Single manager**: All controllers share the same cache and rate limiter + +## Implementation Notes + +### Reconciliation Order +Resources are created in strict dependency order within the main controller: +``` +Namespace → NetworkPolicies → ServiceAccounts → Certificates → Secrets +→ TrustedCA ConfigMap → RBAC → Services → Deployments +→ ValidatingWebhooks → CR annotation tracking +``` + +### Status Flow +1. `external-secrets-controller` updates `ExternalSecretsConfig.status.conditions` (Ready, Degraded) +2. `external-secrets-manager` watches `ExternalSecretsConfig.status` changes +3. `external-secrets-manager` aggregates into `ExternalSecretsManager.status.controllerStatuses` +4. `crd-annotator` updates `ExternalSecretsConfig.status` with `UpdateAnnotation` condition + +### Conditional Registration Pattern +```go +if isCRDInstalled(ctx, mgr, certmanagerv1.SchemeGroupVersion.WithResource("certificates")) { + if err := (&crd_annotator.Reconciler{}).SetupWithManager(mgr); err != nil { + return err + } +} +``` + +## Alternatives Considered + +1. **Single monolithic controller** - Rejected due to mixing concerns and complex conditional logic +2. **Separate binaries** - Rejected due to deployment complexity and coordination overhead +3. **Plugin architecture** - Overengineered for current requirements +4. **Kubernetes aggregated API server** - Too heavyweight for operator use case + +## References + +- Controller registration: `pkg/operator/setup_manager.go` +- Main controller: `pkg/controller/external_secrets/controller.go` +- Status aggregator: `pkg/controller/external_secrets_manager/controller.go` +- CRD annotator: `pkg/controller/crd_annotator/controller.go` +- Cache builder: `pkg/controller/external_secrets/controller.go:NewCacheBuilder` diff --git a/docs/decisions/003-cel-validation-only.md b/docs/decisions/003-cel-validation-only.md new file mode 100644 index 00000000..878107e7 --- /dev/null +++ b/docs/decisions/003-cel-validation-only.md @@ -0,0 +1,165 @@ +# ADR-003: Use CEL Validation Instead of Admission Webhooks + +## Status + +Accepted + +## Context + +The operator's own CRDs (`ExternalSecretsConfig`, `ExternalSecretsManager`) need validation to ensure: +- Fields meet length/range constraints +- Cross-field dependencies are satisfied +- Immutability rules are enforced (e.g., cert-manager mode cannot change once set) +- Singleton pattern is enforced (only one CR named "cluster" allowed) + +Kubernetes provides two primary validation mechanisms: +1. **CEL (Common Expression Language)** via `+kubebuilder:validation:XValidation` markers +2. **Admission webhooks** via kubebuilder-generated ValidatingWebhookConfiguration + +We needed to decide which approach to use for the operator's own API validation. + +## Decision + +We use **CEL-based CRD validation exclusively** for the operator's own CRDs. No admission webhooks are implemented for `ExternalSecretsConfig` or `ExternalSecretsManager`. + +All validation rules are expressed as `+kubebuilder:validation:XValidation` markers in `api/v1alpha1/external_secrets_config_types.go` and `external_secrets_manager_types.go`. + +## Rationale + +**Why CEL?** + +1. **No additional infrastructure** + - CEL validation runs in the API server, no webhook endpoint needed + - No need to manage webhook certificates, service, or deployment + - One less point of failure (webhook unavailable → admission blocked) + +2. **Simpler deployment** + - No webhook configuration to reconcile + - No network policies for webhook traffic (beyond operand webhooks) + - Fewer RBAC rules (no webhook configuration management for operator CRDs) + +3. **Better performance** + - CEL evaluates in-process in the API server (nanoseconds) + - Webhooks add network round-trip latency (milliseconds) + - No webhook pod startup time or scaling concerns + +4. **Declarative validation** + - Validation rules live alongside field definitions in Go types + - `make manifests` generates CRD YAML with validation embedded + - No separate webhook implementation code to maintain + +5. **API server guarantees** + - CEL validation is atomic with admission + - No race conditions between webhook calls and API writes + - Consistent behavior across all API server replicas + +6. **Testing simplicity** + - API tests use envtest with real API server + CEL support + - Declarative test cases in `.testsuite.yaml` files + - No need to mock webhook server in tests + +**Why not webhooks?** + +Webhooks add operational complexity we don't need: +- Webhook certificate rotation (even with cert-manager) +- Webhook failure modes (timeout, network issues, pod crashes) +- Version skew between webhook and API server +- Conversion webhooks not needed (single version `v1alpha1`) + +**CEL limitations we accept:** +- Cannot call external systems (fine - our validation is self-contained) +- Cannot mutate objects (fine - we use defaults via `+kubebuilder:default`, not defaulting webhooks) +- Requires Kubernetes >= 1.25 (acceptable - OpenShift 4.x supports this) + +## Consequences + +### Positive + +- **Operational simplicity**: No webhook infrastructure to manage +- **Faster admission**: No network round-trip for validation +- **Declarative**: Validation rules defined in Go type markers +- **Reliable**: No webhook failure modes (timeout, pod crash, network partition) +- **Easier testing**: envtest provides real CEL validation + +### Negative + +- **Kubernetes version dependency**: Requires API server >= 1.25 for CEL +- **CEL learning curve**: Team must learn CEL syntax (mitigated by good examples) +- **Limited expressiveness**: Cannot validate against cluster state (e.g., "issuerRef must exist") - these checks happen in controller code instead + +### Mitigations + +- **Minimum Kubernetes version**: Documented in `go.mod` and enforced by CI envtest version +- **CEL examples**: Comprehensive validation rules in types serve as reference +- **Controller-level validation**: Complex checks (issuerRef existence) in reconciler with clear error messages +- **API test coverage**: `.testsuite.yaml` files cover all validation rules + +## Implementation Notes + +### CEL Patterns Used + +**Singleton enforcement:** +```go +// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'cluster'",message="..." +``` + +**Immutability:** +```go +// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="..." +``` + +**Cross-field dependencies:** +```go +// +kubebuilder:validation:XValidation:rule="!has(self.certManager) || has(self.certManager.issuerRef)",message="..." +``` + +**List key immutability:** +```go +// +kubebuilder:validation:XValidation:rule="oldSelf.all(op, self.exists(p, p.name == op.name))",message="..." +``` + +**Domain blocklists:** +```go +// +kubebuilder:validation:XValidation:rule="self.all(key, !key.matches('^(kubernetes\\.io|k8s\\.io|openshift\\.io|cert-manager\\.io)/'))",message="..." +``` + +### Validation Responsibilities Split + +| Validation Type | Where It Happens | +|---|---| +| Field constraints (length, range, enum) | CEL (`+kubebuilder:validation:*`) | +| Cross-field logic | CEL (`XValidation`) | +| Immutability | CEL (`self == oldSelf`) | +| Singleton pattern | CEL (`self.metadata.name == 'cluster'`) | +| External resource existence | Controller (`assertIssuerRefExists`, `assertSecretRefExists`) | +| Configuration validity | Controller (`ReconcileError` with `UserConfigurationError`) | + +### Test Coverage + +API validation tests are in `api/v1alpha1/tests//` as `.testsuite.yaml` files: +- Valid creation +- Invalid creation (CEL rejection) +- Immutability on update +- Boundary values + +Run via `make test-apis` (Ginkgo + envtest with real API server). + +## Alternatives Considered + +1. **Admission webhooks** - Rejected due to operational complexity +2. **No validation** (rely on controller error handling) - Rejected due to poor user experience +3. **OpenAPI schema validation only** (no CEL) - Rejected due to inability to express cross-field rules +4. **Hybrid (CEL + webhooks)** - Rejected as overengineered; CEL covers our needs + +## References + +- CEL specification: https://github.com/google/cel-spec +- Kubebuilder CEL docs: https://book.kubebuilder.io/reference/markers/crd-validation.html +- Type definitions: `api/v1alpha1/external_secrets_config_types.go` +- API tests: `api/v1alpha1/tests/` +- Test framework: `test/apis/generator.go` +- Envtest setup: `test/apis/suite_test.go` + +## Note on Operand Webhooks + +This decision applies ONLY to the operator's own CRDs. The operator **does** manage ValidatingWebhookConfigurations for the upstream operand (external-secrets), but those are bindata resources reconciled like any other operand resource. That webhook validates user-created `ExternalSecret`, `SecretStore`, etc. resources, not the operator's configuration CRs. diff --git a/docs/decisions/004-singleton-cr-pattern.md b/docs/decisions/004-singleton-cr-pattern.md new file mode 100644 index 00000000..33b40fb0 --- /dev/null +++ b/docs/decisions/004-singleton-cr-pattern.md @@ -0,0 +1,188 @@ +# ADR-004: Singleton Cluster-Scoped CRs Named "cluster" + +## Status + +Accepted + +## Context + +The operator provides two CRDs for configuration: +- `ExternalSecretsConfig`: User-facing configuration for operand installation +- `ExternalSecretsManager`: Global operator-level configuration and status aggregation + +We needed to decide: +1. Should these be namespace-scoped or cluster-scoped? +2. Should we allow multiple instances or enforce a singleton? +3. What should the singleton instance be named? +4. How do we enforce the singleton constraint? + +## Decision + +Both `ExternalSecretsConfig` and `ExternalSecretsManager` are: +- **Cluster-scoped** (not namespaced) +- **Singletons** enforced via CEL validation +- **Named "cluster"** as the only valid instance name + +The singleton constraint is enforced at admission time via CEL: +```go +// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'cluster'",message="..." +``` + +Constants in `pkg/controller/common/constants.go`: +```go +const ( + ExternalSecretsConfigObjectName = "cluster" + ExternalSecretsManagerObjectName = "cluster" +) +``` + +## Rationale + +### Why Cluster-Scoped? + +1. **Global scope of operand**: The external-secrets operand runs in a single namespace (`external-secrets`) but manages cluster-scoped resources (ClusterRoles, ClusterRoleBindings, ClusterSecretStores, ValidatingWebhookConfigurations) + +2. **Consistency with OpenShift patterns**: Many OpenShift operators use cluster-scoped singleton CRs for global configuration (e.g., `Image.config.openshift.io`, `Ingress.config.openshift.io`, `Proxy.config.openshift.io`) + +3. **Single source of truth**: One cluster should have one external-secrets installation with one configuration + +4. **Simpler RBAC**: No namespace-level permission complexity; cluster-admin or dedicated ClusterRole grants access + +### Why Singleton? + +1. **Prevents conflicts**: Multiple configs would create ambiguity about which should apply + +2. **Simplifies reconciliation**: Controller always knows which CR to watch (no need for leader election or config merging) + +3. **Clear ownership**: One CR, one operator instance managing it + +4. **Security boundary**: Prevents privilege escalation via conflicting configurations + +5. **Status clarity**: One status object for the entire cluster's external-secrets state + +### Why Named "cluster"? + +1. **OpenShift convention**: Matches naming pattern of OpenShift config CRs: + - `Image.config.openshift.io/cluster` + - `Ingress.config.openshift.io/cluster` + - `Proxy.config.openshift.io/cluster` + - `DNS.config.openshift.io/cluster` + +2. **Semantic clarity**: The name "cluster" signals scope (cluster-wide configuration) + +3. **Hard to mistype**: Simple, memorable name + +4. **Avoids magic strings**: Could have used "default", "main", "global", but "cluster" aligns with ecosystem + +### Why Enforce via CEL? + +1. **Admission-time rejection**: Invalid names fail immediately at creation, not during reconciliation + +2. **Clear error message**: CEL can provide specific validation error to user + +3. **No controller code needed**: Validation happens before controller sees the object + +4. **Immutable by design**: Cannot rename the object (renaming would fail CEL validation) + +## Consequences + +### Positive + +- **Clear semantics**: One configuration per cluster, named predictably +- **Security**: No ambiguity about which config applies +- **Consistent with platform**: Follows OpenShift config CR patterns +- **Simple controller logic**: Controllers hardcode the reconcile key to `"cluster"` +- **Prevents user errors**: Cannot create multiple conflicting configs + +### Negative + +- **No multi-tenancy**: Cannot have different configurations per namespace (acceptable - not a use case) +- **Must delete/recreate to rename**: Cannot rename (acceptable - renaming a singleton makes no sense) +- **Cluster-admin required**: Creating these CRs requires cluster-scoped permissions (acceptable - this is platform-level config) + +### Mitigations + +- **Default creation**: `ExternalSecretsManager` is auto-created by operator on install, so users typically only interact with `ExternalSecretsConfig` +- **Clear documentation**: AGENTS.md, ARCHITECTURE.md, and API docs all explain the singleton pattern +- **Helpful error messages**: CEL validation provides clear message if wrong name used + +## Implementation Notes + +### Controller Reconciliation Pattern + +All controllers hardcode the reconcile key: +```go +func (r *Reconciler) mapToExternalSecretsConfig(ctx context.Context, obj client.Object) []reconcile.Request { + return []reconcile.Request{ + {NamespacedName: types.NamespacedName{Name: common.ExternalSecretsConfigObjectName}}, + } +} +``` + +### Default Resource Creation + +The `ExternalSecretsManager` CR is auto-created in `pkg/operator/setup_manager.go:CreateDefaultESMResource`: +```go +esm := &operatorv1alpha1.ExternalSecretsManager{ + ObjectMeta: metav1.ObjectMeta{ + Name: common.ExternalSecretsManagerObjectName, + }, + Spec: operatorv1alpha1.ExternalSecretsManagerSpec{ + ManagementState: operatorv1alpha1.Managed, + }, +} +``` + +### CEL Validation + +Both CRDs include this validation marker: +```go +// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'cluster'",message="metadata.name must be 'cluster'" +``` + +Enforced at: +- Creation (cannot create with wrong name) +- Update (cannot rename to wrong name) +- Validation is server-side (API server enforces before controller sees it) + +### Testing + +API tests in `api/v1alpha1/tests//invalid-name.testsuite.yaml` verify: +```yaml +tests: + onCreate: + - name: "reject non-cluster name" + initial: | + apiVersion: operator.openshift.io/v1alpha1 + kind: ExternalSecretsConfig + metadata: + name: wrong-name + spec: + managementState: Managed + expectedError: "metadata.name must be 'cluster'" +``` + +## Alternatives Considered + +1. **Namespace-scoped CRs** - Rejected; operand is cluster-scoped +2. **Multiple instances with leader election** - Overengineered; no multi-tenancy requirement +3. **Name it "default"** - Less semantic; "cluster" is clearer about scope +4. **No name enforcement** - Rejected; allows user errors and conflicts +5. **Enforce in controller code** - Rejected; CEL enforcement is cleaner and earlier + +## Future Considerations + +If multi-tenancy is ever required (different configs per namespace), we could: +- Introduce namespace-scoped `ExternalSecretsNamespaceConfig` CRD +- Keep cluster-scoped `ExternalSecretsConfig` for global defaults +- Namespace configs override cluster config for their scope + +This would be a new CRD, not a change to the existing singleton pattern. + +## References + +- CEL validation: `api/v1alpha1/external_secrets_config_types.go` +- Constants: `pkg/controller/common/constants.go` +- Default creation: `pkg/operator/setup_manager.go:CreateDefaultESMResource` +- OpenShift config CRs: https://docs.openshift.com/container-platform/latest/rest_api/config_apis/ +- API tests: `api/v1alpha1/tests/externalsecretsconfig.operator.openshift.io/` diff --git a/docs/decisions/README.md b/docs/decisions/README.md new file mode 100644 index 00000000..0483f32a --- /dev/null +++ b/docs/decisions/README.md @@ -0,0 +1,67 @@ +# Architecture Decision Records (ADRs) + +This directory contains Architecture Decision Records (ADRs) documenting significant architectural and design decisions made for the External Secrets Operator. + +## What is an ADR? + +An Architecture Decision Record captures an important architectural decision along with its context and consequences. ADRs help: +- New contributors understand why things are the way they are +- Prevent re-litigating settled decisions +- Document alternatives that were considered +- Explain trade-offs and consequences + +## ADR Format + +Each ADR follows this structure: +- **Status**: Accepted, Proposed, Deprecated, Superseded +- **Context**: The problem or situation requiring a decision +- **Decision**: What we decided to do +- **Rationale**: Why this decision was made (the "why" behind the "what") +- **Consequences**: Positive and negative outcomes, mitigations +- **Alternatives Considered**: Other options and why they were rejected +- **References**: Links to relevant code, docs, or external resources + +## Index + +| ADR | Title | Status | +|-----|-------|--------| +| [001](001-bindata-over-helm.md) | Use Bindata Pattern Instead of Helm Charts | Accepted | +| [002](002-three-controller-design.md) | Three-Controller Architecture | Accepted | +| [003](003-cel-validation-only.md) | Use CEL Validation Instead of Admission Webhooks | Accepted | +| [004](004-singleton-cr-pattern.md) | Singleton Cluster-Scoped CRs Named "cluster" | Accepted | + +## When to Write an ADR + +Write an ADR when making decisions about: +- Architecture patterns (controller design, resource management) +- API design (CRD structure, validation approach) +- Technology choices (libraries, frameworks, tools) +- Security architecture (RBAC model, certificate management) +- Performance trade-offs (caching strategy, reconciliation patterns) +- Operational patterns (deployment model, upgrade strategy) + +**Do NOT write an ADR for:** +- Implementation details that don't affect architecture +- Temporary workarounds or tactical fixes +- Decisions that are easily reversible +- Standard practices with no alternatives considered + +## How to Propose a New ADR + +1. Copy the template: `cp 000-template.md XXX-your-decision.md` +2. Fill in the sections (focus on "why" not just "what") +3. Get feedback from team/reviewers +4. Update status to "Accepted" when decision is made +5. Update this README.md index + +## ADR Lifecycle + +- **Proposed**: Decision under discussion +- **Accepted**: Decision has been made and is active +- **Deprecated**: Decision is no longer recommended but still in use +- **Superseded**: Decision has been replaced (link to superseding ADR) + +## References + +- ADR concept: https://adr.github.io/ +- OpenShift operator patterns: https://docs.openshift.com/container-platform/latest/operators/ diff --git a/docs/error-handling-guidelines.md b/docs/error-handling-guidelines.md new file mode 100644 index 00000000..6d7b7ed7 --- /dev/null +++ b/docs/error-handling-guidelines.md @@ -0,0 +1,145 @@ +# Error Handling Guidelines + +## Custom Error Type: `ReconcileError` + +Resource reconciliation errors flow through the `ReconcileError` type defined in `pkg/controller/common/errors.go`. It carries a `Reason` (`ErrorReason` string) that determines requeue behavior, a human-readable `Message`, and the underlying `Err`. Setup errors (finalizers, client construction, CR fetching) use plain `fmt.Errorf` wrapping as described below. + +### Three Error Reasons + +| Reason | Constant | Requeue? | When to use | +|---|---|---|---| +| `IrrecoverableError` | `common.IrrecoverableError` | No | Invalid config, missing env vars, permission errors, bad requests | +| `RetryRequiredError` | `common.RetryRequiredError` | Yes (30s) | Transient network issues, resource conflicts, timeouts, not-found | +| `UserConfigurationError` | `common.UserConfigurationError` | No (unless NotFound) | Invalid or incomplete user-provided configuration (e.g., missing referenced Secret or Issuer). Sets Degraded. Recovery is driven by watches on the affected resource; NotFound errors still use periodic requeue (30s) until the referenced object exists. | + +### Constructor Functions + +- `common.NewIrrecoverableError(err, messageFmt, args...)` -- for errors that cannot be fixed by retrying. +- `common.NewRetryRequiredError(err, messageFmt, args...)` -- for transient errors worth retrying. +- `common.NewUserConfigurationError(err, messageFmt, args...)` -- for errors caused by invalid or incomplete user configuration. The operator sets Degraded but generally does not requeue (relying on watches instead), except when the underlying error is NotFound. +- `common.FromClientError(err, messageFmt, args...)` -- auto-classifies Kubernetes API errors: `Unauthorized`, `Forbidden`, `Invalid`, and `BadRequest` become irrecoverable; everything else becomes retry-required. +- All four return `nil` when the input `err` is `nil`. Always check this at call sites when the constructor is the last expression before return. + +### Checking Error Type + +- `common.IsIrrecoverableError(err)` -- checks whether an error is irrecoverable. Uses `errors.As` to traverse wrapped error chains. +- `common.IsRetryRequiredError(err)` -- checks whether an error is retry-required. +- `common.IsUserConfigurationError(err)` -- checks whether an error is a user configuration error. +- `common.IsUserConfigurationNotFound(err)` -- checks whether a user configuration error is caused by a missing object (NotFound). Used to decide whether to requeue. + +## Error-to-Reconcile-Result Mapping + +The main reconcile dispatcher in `processReconcileRequest` (in `pkg/controller/external_secrets/controller.go`) maps errors to `ctrl.Result` as follows: + +``` +err == nil -> Result{}, nil (success, no requeue) +IsIrrecoverableError -> Result{}, errUpdate (no requeue; only status update error propagates) +IsUserConfigurationError -> Result{}, errUpdate (no requeue; Degraded=True; watches drive recovery) + IsUserConfigurationNotFound -> Result{RequeueAfter: 30s}, nil (requeue because watches won't fire for unmanaged objects) +RetryRequiredError -> Result{RequeueAfter: 30s}, nil (manual requeue, no error to controller-runtime) +status update failure -> Result{}, errUpdate (let controller-runtime handle backoff) +NotFound on primary CR -> Result{}, nil (skip reconciliation silently) +``` + +Key rule: never return both `RequeueAfter` and a non-nil error. For recoverable errors, return `RequeueAfter` with `nil` error. For irrecoverable and user-configuration errors, return empty `Result` so no requeue happens (except user-configuration NotFound, which requeues). + +The default requeue interval is `common.DefaultRequeueTime` (30 seconds), defined in `pkg/controller/common/constants.go`. + +## Wrapping Errors from Kubernetes API Calls + +For any Kubernetes client operation (Get, Create, Update, Delete, Patch, Exists), wrap the error with `common.FromClientError`: + +```go +if err := r.Create(r.ctx, obj); err != nil { + return common.FromClientError(err, "failed to create %s deployment resource", name) +} +``` + +This is the standard pattern across all resource reconciliation files (rbacs.go, deployments.go, services.go, secret.go, certificate.go, configmap.go, networkpolicy.go, validatingwebhook.go, serviceaccounts.go). `FromClientError` auto-classifies the API error so callers do not manually pick irrecoverable vs retryable. + +For non-API errors that are definitively unrecoverable (e.g., missing environment variable, invalid configuration, failed validation), use `common.NewIrrecoverableError` directly: + +```go +return common.NewIrrecoverableError( + fmt.Errorf("ENV_VAR not set"), + "failed to update image in %s deployment object", name, +) +``` + +## `fmt.Errorf` with `%w` for Non-Reconcile Errors + +For errors outside the reconcile-result classification path (setup, client construction, utility functions), use standard `fmt.Errorf` with `%w` wrapping: + +```go +return fmt.Errorf("failed to create uncached client: %w", err) +``` + +Do not wrap these in `ReconcileError` -- they propagate as plain errors up through controller setup or `Reconcile` return. + +## Retry Logic + +### `UpdateWithRetry` (Client-Level Retry) + +All resource updates that may hit conflicts use `r.UpdateWithRetry(ctx, obj)` instead of `r.Update(ctx, obj)`. This method (in `pkg/controller/client/client.go`) wraps `retry.RetryOnConflict(retry.DefaultRetry, ...)`, re-fetching the latest resource version before each attempt. + +Use `UpdateWithRetry` for updating managed resources (Deployments, RBAC, Services, etc.) and for finalizer updates. + +### `retry.RetryOnConflict` (Status Updates) + +Status subresource updates use `retry.RetryOnConflict(retry.DefaultRetry, ...)` directly. The pattern is: fetch latest, deep-copy desired status into it, then call `StatusUpdate`. This appears identically in all three controllers. + +### `retry.OnError` with Custom Predicate + +The ESM default resource creation (`pkg/controller/external_secrets_manager/externalsecretsmanager.go`) uses `retry.OnError` with a custom `shouldRetryOnError` predicate that stops retrying on `AlreadyExists`, `Conflict`, `Invalid`, `BadRequest`, `Unauthorized`, `Forbidden`, and `TooManyRequests`. + +## Status Condition Updates + +### Condition Types + +| Type | Defined In | Used By | +|---|---|---| +| `Degraded` | `api/v1alpha1/conditions.go` | external_secrets controller | +| `Ready` | `api/v1alpha1/conditions.go` | external_secrets controller | +| `UpdateAnnotation` | `api/v1alpha1/conditions.go` | crd_annotator controller | + +### Condition Update Rules + +1. On irrecoverable error: set `Degraded=True/Failed` and `Ready=False/Failed`. +2. On user configuration error: set `Degraded=True/Failed` and `Ready=False/Failed` with message `"user configuration is invalid: ..."`. +3. On retryable error: set `Degraded=False/Ready` and `Ready=False/Progressing` with the error message. +4. On success: set `Degraded=False/Ready` and `Ready=True/Ready`. +5. Set both conditions atomically via `apimeta.SetStatusCondition` before calling `updateStatus`. +6. Only call `updateStatus` when `SetStatusCondition` returns `true` (condition actually changed). +7. Always include `ObservedGeneration` from the CR's current `.metadata.generation`. + +### Error Aggregation on Status Update Failure + +When the status update itself fails alongside a reconciliation error, aggregate both using `utilerrors.NewAggregate([]error{err, errUpdate})`. This pattern exists in both the external_secrets and crd_annotator controllers. + +## Kubernetes Event Recording + +- Use `corev1.EventTypeNormal` + reason `"Reconciled"` for successful create/update of resources. +- Use `corev1.EventTypeWarning` + reason `"ResourceAlreadyExists"` when a resource exists during initial creation reconciliation. +- Use `corev1.EventTypeWarning` + reason `"RemoveDeployment"` on CR deletion. +- Use `corev1.EventTypeWarning` + reason `"Read"` on fetch failures (ESM controller). +- Event messages must include the resource name (namespace/name format). + +## Logging Conventions + +- `r.log.V(1).Info(...)` -- operational events (reconcile start, resource modifications, label/annotation changes). +- `r.log.V(4).Info(...)` -- detailed debug (resource already in expected state, status update attempts, metadata builds). +- `r.log.Error(err, "message")` -- always log the error at the point it occurs in `reconcileExternalSecretsDeployment`, before returning it upward. The top-level `processReconcileRequest` also logs the error again with `r.log.Error(err, "failed to reconcile external-secrets deployment")`. +- `ctrl.Log.V(1).WithName("subsystem")` -- for setup-time logging outside reconciler instances (cache setup, CRD discovery). +- Use structured key-value pairs (`"name"`, `"request"`, `"key"`, `"namespace"`, `"error"`, `"installed"`), never string interpolation in log messages. + +## `Exists` Helper + +The `CtrlClient.Exists(ctx, key, obj)` method converts `NotFound` errors to `(false, nil)`, passing all other errors through. Use this instead of raw `Get` + manual `IsNotFound` checks when you only need existence. + +## Panics (Decode Helpers Only) + +The `Decode*ObjBytes` functions in `pkg/controller/common/utils.go` panic on decode failure or type assertion failure. These are called only with compile-time-known static assets (`assets.MustAsset`), so panics indicate a build-time bug, not a runtime error. Never use panic for runtime error handling elsewhere. + +## NotFound Handling in Reconcile + +When the primary CR (`ExternalSecretsConfig` or `ExternalSecretsManager`) is not found, return `ctrl.Result{}, nil` -- do not requeue. Log at V(1) and skip reconciliation. When a secondary/dependent CR is not found, either skip gracefully (ESM looking for ESC) or wrap in `fmt.Errorf` and return for requeue. diff --git a/docs/integration-guidelines.md b/docs/integration-guidelines.md new file mode 100644 index 00000000..32ab0768 --- /dev/null +++ b/docs/integration-guidelines.md @@ -0,0 +1,96 @@ +# Integration Guidelines + +## Architecture Overview + +This operator does NOT embed or fork the upstream external-secrets project. It manages the upstream operand as a set of static YAML manifests (bindata) that are decoded at runtime, mutated with operator-controlled labels/annotations/env/images, and applied to the cluster. There is no Helm chart integration; resource creation is fully imperative via controller-runtime `Create`/`Update`. + +## CRD Hierarchy and Singleton Pattern + +- **ExternalSecretsConfig** (`externalsecretsconfigs.operator.openshift.io`): primary CR, singleton named `cluster`. Controls operand installation, cert-manager toggle, Bitwarden plugin, proxy, network policies, and per-component overrides. +- **ExternalSecretsManager** (`externalsecretsmanagers.operator.openshift.io`): global config CR, also singleton named `cluster`. Provides lower-priority defaults (labels, resources, proxy, tolerations, affinity, nodeSelector, logLevel) that ExternalSecretsConfig overrides. Also provides optional `features` (e.g., `UnsafeAllowGenericTargets`) that apply across managed deployments. +- Precedence chain for shared fields: `ExternalSecretsConfig > ExternalSecretsManager > OLM environment variables` (proxy only). + +## Static Manifest (Bindata) Pattern + +All operand Kubernetes resources live as YAML in `bindata/external-secrets/` and `bindata/external-secrets/resources/`. They are compiled into Go via `pkg/operator/assets/bindata.go` (generated with `go-bindata`). At reconcile time: +1. Decode bytes with typed decoders in `pkg/controller/common/utils.go` (e.g., `DecodeDeploymentObjBytes`, `DecodeServiceObjBytes`). +2. Mutate the decoded object: set namespace, apply labels/annotations via `common.ApplyResourceMetadata`, update container images/args, inject proxy env vars, mount CA bundles. +3. Check existence with `r.Exists()`, compare with `common.HasObjectChanged()`, then `Create` or `UpdateWithRetry`. + +When adding a new resource: add the YAML to `bindata/`, add a constant in `constants.go` mapping the asset path, regenerate bindata, and follow the existing `createOrApply*` pattern. + +## Reconciliation Order + +Resources are created in strict dependency order within `reconcileExternalSecretsDeployment`: +Namespace -> NetworkPolicies -> ServiceAccounts -> Certificates -> Secrets -> TrustedCA ConfigMap -> RBAC -> Services -> Deployments -> ValidatingWebhooks -> CR annotation tracking. + +Never reorder. CR annotation tracking (managed-annotations) is patched last to ensure obsolete annotations are removed from resources before tracking is updated. + +## Conditional Resource Creation + +Many resources are conditionally created based on CR spec. The pattern uses a slice of `{assetName, condition}` structs: +- **cert-controller deployment/service**: created when cert-manager is NOT enabled (`!isCertManagerConfigEnabled(esc)`) +- **bitwarden deployment/service/service-account/network-policy**: created when Bitwarden IS enabled (`isBitwardenConfigEnabled(esc)`) +- **bitwarden certificate**: created only when Bitwarden IS enabled AND no `secretRef` is provided AND cert-manager is enabled +- **webhook TLS secret**: created only when cert-manager is NOT enabled (cert-manager manages the secret otherwise, named `external-secrets-webhook-cm` to avoid collision) + +## cert-manager Integration + +- cert-manager is an optional dependency detected at startup via CRD discovery (`isCRDInstalled` checks `cert-manager.io/v1` `certificates` resource). +- When detected, the Certificate CRD informer is registered with the manager cache. When not detected, the cache omits it entirely to prevent startup failures. +- The `CertManagerConfig` in ExternalSecretsConfig spec requires an `issuerRef` (Issuer or ClusterIssuer). The operator validates the issuer exists via `assertIssuerRefExists` using the uncached client before creating Certificate resources. +- When cert-manager is enabled with `injectAnnotations: "true"`, the `crd-annotator` controller patches all ESO CRDs with `cert-manager.io/inject-ca-from: external-secrets/external-secrets-webhook`. +- Webhook ValidatingWebhookConfigurations get the `cert-manager.io/inject-ca-from` annotation via `withCertManagerAnnotation()`, tracked in managed-annotations for proper lifecycle. + +## OpenShift Platform Integrations + +- **Trusted CA bundle**: when proxy is configured, a ConfigMap `external-secrets-trusted-ca-bundle` is created with label `config.openshift.io/inject-trusted-cabundle: "true"`. OpenShift's Cluster Network Operator (CNO) injects cluster CA certs into it. The ConfigMap data is owned by CNO; the operator only manages labels/annotations. Mounted at `/etc/pki/tls/certs` in all containers. +- **Proxy**: both uppercase and lowercase variants (`HTTP_PROXY`/`http_proxy`, etc.) are set on all containers and init containers. Proxy is removed cleanly when config is cleared. +- **Network policies**: default deny-all is always applied. Static allow policies (prefixed `eso-sys-`) for API server egress, webhook, DNS, and optionally cert-controller/bitwarden are applied from bindata. An automatic proxy egress NetworkPolicy (`eso-sys-allow-proxy-egress`) is created when proxy URLs are set and `networkPolicyProvisioning` is `Managed`. Users add custom egress-only policies via `controllerConfig.networkPolicies` (prefixed `eso-user-`) with component targeting. +- **Security context**: all containers get hardened `SecurityContext` (non-root, read-only root FS, drop ALL capabilities, seccomp RuntimeDefault). +- **Console capability annotation**: ConsoleYAMLSample resources require `capability.openshift.io/name: Console` annotation. + +## Bitwarden Plugin Integration + +- Bitwarden is the only currently supported plugin. Enabled via `spec.plugins.bitwardenSecretManagerProvider.mode: Enabled`. +- Requires either a `secretRef` (pre-existing TLS secret with keys `tls.crt`, `tls.key`, `ca.crt`) or cert-manager configuration for automated TLS. +- The bitwarden-sdk-server runs as a separate deployment in the operand namespace. Its image comes from env var `RELATED_IMAGE_BITWARDEN_SDK_SERVER`. +- Network policies must explicitly allow egress from the core controller to bitwarden-sdk-server on port 9998. +- ClusterSecretStore for Bitwarden requires: `bitwardenServerSDKURL`, `caBundle` (base64 CA cert), `organizationID`, `projectID`, and auth `secretRef.credentials` pointing to an access token secret. + +## Image Management + +Container images in static YAML manifests (`bindata/`) have placeholder values that are replaced at runtime. The reconciler reads images from environment variables set on the operator pod (typically by OLM/CSV): +- `RELATED_IMAGE_EXTERNAL_SECRETS`: external-secrets operand image +- `RELATED_IMAGE_BITWARDEN_SDK_SERVER`: bitwarden-sdk-server image +- `OPERAND_EXTERNAL_SECRETS_IMAGE_VERSION`: version label for operand resources +- `BITWARDEN_SDK_SERVER_IMAGE_VERSION`: version label for bitwarden resources +- `OPERATOR_IMAGE_VERSION`: operator version + +The reconciler returns an irrecoverable error if `RELATED_IMAGE_EXTERNAL_SECRETS` or `RELATED_IMAGE_BITWARDEN_SDK_SERVER` is empty. The version env vars (`OPERAND_EXTERNAL_SECRETS_IMAGE_VERSION`, `BITWARDEN_SDK_SERVER_IMAGE_VERSION`, `OPERATOR_IMAGE_VERSION`) may be empty without causing errors -- version labels will simply be blank. + +## Client Architecture + +- **Cached client** (`r.CtrlClient`): reads from manager cache filtered by `app=external-secrets` label selector. Used for all managed resources. +- **Uncached client** (`r.UncachedClient`): reads directly from API server. Used for objects NOT managed by the controller (issuer validation, secret ref existence checks). +- **UpdateWithRetry**: all mutations use `UpdateWithRetry` which fetches latest resourceVersion before update, wrapped in `retry.RetryOnConflict`. + +## Label and Annotation Conventions + +- All managed resources carry `app: external-secrets` label (cache filter key). +- Default labels: `app`, `app.kubernetes.io/version`, `app.kubernetes.io/managed-by: external-secrets-operator`, `app.kubernetes.io/part-of: external-secrets-operator`. +- Disallowed label prefixes (rejected silently): `app.kubernetes.io/`, `external-secrets.io/`, `rbac.authorization.k8s.io/`, `servicebinding.io/controller`, `app`. +- Disallowed annotation domain prefixes (rejected by CRD CEL validation): `kubernetes.io/`, `k8s.io/`, `openshift.io/`, `cert-manager.io/`. +- Managed annotation keys are tracked in a base64-encoded JSON array on the CR annotation `externalsecretsconfig.operator.openshift.io/managed-annotations`. + +## Adding a New Provider Plugin + +Follow the Bitwarden pattern: +1. Add new `Mode`-gated fields to `PluginsConfig` in `api/v1alpha1/external_secrets_config_types.go`. +2. Add static YAML manifests (deployment, service, service account, network policy) to `bindata/external-secrets/resources/`. +3. Add asset name constants to `constants.go`. +4. Add conditional entries to `createOrApplyDeployments`, `createOrApplyServices`, `createOrApplyServiceAccounts`, `createOrApplyNetworkPolicies` using the `{assetName, condition}` struct pattern. +5. Add image env var constants and wire them into `getDeploymentObject`. +6. Register the new component name in the `ComponentName` enum for network policy targeting. +7. Add the container name mapping in `getComponentNameFromAsset`. +8. Add e2e tests with appropriate Ginkgo labels and credential secret conventions. diff --git a/docs/performance-guidelines.md b/docs/performance-guidelines.md new file mode 100644 index 00000000..2dc9bca6 --- /dev/null +++ b/docs/performance-guidelines.md @@ -0,0 +1,100 @@ +# Performance Guidelines + +## Cache Architecture + +### Manager-Level Label-Filtered Cache +The operator uses a single manager-level cache with per-object label selectors configured via `NewCacheBuilder()` in `pkg/controller/external_secrets/controller.go`. All controller-managed resources (Deployments, Services, RBAC, etc.) are filtered by `app=external-secrets` label. Own CRs (`ExternalSecretsConfig`, `ExternalSecretsManager`) are cached without label filters. + +- When adding a new watched resource type, register it in both `controllerManagedResources` (line ~68) and `buildCacheObjectList()` with the appropriate label selector. +- Never create a second manager-level cache for the main controller. The `crd_annotator` controller is the only one that builds its own custom cache (`BuildCustomClient`), because it watches a disjoint resource set (`CustomResourceDefinition` with different label selectors). +- Set `ReaderFailOnMissingInformer: true` on custom caches (as `crd_annotator` does) to fail fast on missing informers rather than silently returning empty results. + +### Uncached Client Usage +The operator maintains a separate uncached `CtrlClient` (`r.UncachedClient`) for objects not tracked by the cache, specifically cert-manager Issuers/ClusterIssuers and user-provided Secrets (Bitwarden secretRef). Use uncached reads only for: +- Validating existence of external resources (e.g., `assertIssuerRefExists`, `assertSecretRefExists`) +- One-time bootstrap operations (e.g., `CreateDefaultESMResource` in `setup_manager.go`) + +Never use the uncached client for objects in `controllerManagedResources` -- the cache is designed for those. + +### Conditional CRD Cache Registration +Before creating the cache, `NewCacheBuilder` checks whether the cert-manager CRD exists via `isCRDInstalled`. If absent, `certmanagerv1.Certificate` is excluded from the cache object list to prevent startup failure. When the CRD exists, an informer is explicitly registered via `mgr.GetCache().GetInformer()`. Follow this pattern for any future optional CRD dependencies. + +## Watch and Predicate Patterns + +### Predicate Strategy by Resource Type +The controller uses different predicate combinations per resource type to minimize reconciliation: + +| Resource | Watch Method | Predicates | Rationale | +|---|---|---|---| +| ExternalSecretsConfig (primary) | `For()` | `GenerationChangedPredicate` | Ignore status-only updates | +| Deployments | `Watches()` | `Or(GenerationChangedPredicate, LabelChangedPredicate)` + `managedResources` | Skip status/replica updates; catch label changes | +| Secrets | `WatchesMetadata()` | `LabelChangedPredicate` | Avoid caching Secret data in memory | +| ConfigMaps | `Watches()` | `ResourceVersionChangedPredicate` + `managedOrWatchedResources` | Watch both operator-managed and user-referenced (trustedCABundle) ConfigMaps | +| Other managed resources | `Watches()` | `managedResources` (label filter) | Only watch operator-owned objects | +| ExternalSecretsManager | `Watches()` | `GenerationChangedPredicate` (no `managedResources`) | Skip status-only changes; ESM is not labeled `app=external-secrets` | +| Certificates (conditional) | `Watches()` | `managedResources` | Only when cert-manager CRD is installed | +| CRDs (crd_annotator) | `WatchesMetadata()` | `AnnotationChangedPredicate` + label filter | Only metadata matters | + +When adding new watched resources, always apply at minimum the `managedResources` predicate to avoid reconciling unrelated objects. Use `WatchesMetadata()` when only labels/annotations matter -- it avoids fetching full object bodies. + +### Map Function Convention +The `external-secrets-controller` maps events to a single reconcile key: `common.ExternalSecretsConfigObjectName` ("cluster"). The map function checks `hasManagedOrWatchLabel(objLabels)` which matches either the managed resource label (`app=external-secrets` via `ManagedResourceLabelKey`/`ManagedResourceLabelValue`) or the watch label (`externalsecretsconfig.operator.openshift.io/watching`), and returns an empty slice for non-matching objects. The `crd-annotator` uses a separate map function with its own label filter (`external-secrets.io/component=controller`). The `external-secrets-manager` controller uses `EnqueueRequestForObject` directly for its primary CR. Never enqueue multiple requests from a single event. + +## Reconciliation Patterns + +### Error Classification and Requeue Strategy +The operator classifies errors into three categories via `common.ReconcileError`: + +- **IrrecoverableError**: Config validation failures, permission errors, bad requests. Returns `ctrl.Result{}` with no requeue. Examples: missing CRD, invalid cert-manager config, missing env var. +- **UserConfigurationError**: Invalid or incomplete user-provided configuration. Returns `ctrl.Result{}` with no requeue (Degraded=True); recovery is driven by watches. Exception: `IsUserConfigurationNotFound` requeues with 30s since watches won't fire for unmanaged objects. +- **RetryRequiredError**: Transient API failures, conflicts. Requeues with `DefaultRequeueTime` (30s). The helper `common.FromClientError()` auto-classifies based on API error type. + +Convention: never return both a `RequeueAfter` result and an error simultaneously. Return the error alone and let controller-runtime handle backoff, or return `RequeueAfter` with nil error. + +### Condition Update Optimization +Status conditions are updated only when they actually change. The pattern uses `apimeta.SetStatusCondition()` which returns a boolean, and the controller checks both `degradedChanged` and `readyChanged` before issuing a status update call. Follow this pattern to avoid unnecessary API writes. + +### CR Annotation Patching +After all resources are reconciled, CR annotations (managed-annotations tracking + processed annotation) are updated via a single `MergePatch` rather than a full object update. This reduces conflict risk and avoids overwriting spec changes made concurrently. + +## Retry and Conflict Handling + +### UpdateWithRetry Pattern +Regular object updates (non-status) use `retry.RetryOnConflict(retry.DefaultRetry, ...)`. The `UpdateWithRetry` method on `CtrlClient` follows the read-modify-write pattern: it fetches the latest `ResourceVersion`, sets it on the object, then updates. Use `UpdateWithRetry` instead of bare `Update` for any object that might be modified concurrently. + +### Status Update Pattern +Status subresource updates use `retry.RetryOnConflict(retry.DefaultRetry, ...)` directly. The pattern is: fetch latest, deep-copy desired status into it, then call `StatusUpdate`. This appears identically in all three controllers. + +### Bootstrap Resource Creation +`CreateDefaultESMResource` uses `retry.OnError` with a custom `shouldRetryOnError` function that stops on permanent errors (AlreadyExists, Conflict, Invalid, BadRequest, Unauthorized, Forbidden, TooManyRequests) and retries on transient ones. Follow this pattern for one-time resource creation at startup. + +## Concurrency Primitives + +### Resettable sync.Once (`common.Now`) +The `Now` type in `pkg/controller/common/utils.go` extends `sync.Once` with a `Reset()` method using `sync.Mutex` + `atomic.Uint32` with double-checked locking. It is used in both the `external_secrets_manager` controller (to emit a warning event only once per error cycle) and the `external_secrets` controller (for one-shot validation warnings in trusted CA bundle handling, reset on successful reconciliation via `r.now.Reset()`). Use this type when you need one-shot behavior that can be re-armed. + +No goroutines are spawned directly by controller code -- all concurrency is handled by the controller-runtime framework. Do not introduce raw goroutines in reconcile loops. + +## Drift Detection + +### HasObjectChanged Pattern +The `common.HasObjectChanged()` function uses type-specific field comparison (not full `reflect.DeepEqual` on the entire object) to detect drift. Each resource type has a dedicated comparison: +- Deployments: compares replicas, containers (including init containers, security context, resources, ports, probes), volumes, affinity, tolerations, node selector, env vars, service account, DNS policy, template labels, revision history limit +- RBAC: compares only Rules (Roles) or RoleRef+Subjects (Bindings) +- Services: compares Type, Ports, Selector +- Webhooks: compares individual webhook entries by name +- Certificates: compares full `Spec` via `DeepEqual` +- NetworkPolicies: compares PodSelector, PolicyTypes, Ingress, Egress +- ServiceAccounts: metadata-only drift detection (spec comparison always returns `false`) +- Secrets: uses `ObjectMetadataModified()` only (data co-managed by cert-controller) +- ConfigMaps: metadata-only patching pattern (similar to Secrets) + +Annotations are compared using managed-key tracking (`annotationMapsModified`) which only checks keys the operator manages, ignoring annotations set by external controllers (e.g., `deployment.kubernetes.io/revision`). This prevents infinite reconcile loops. + +When adding a new managed resource type, add a case to the `HasObjectChanged` switch and implement field-level comparison rather than using full `DeepEqual` on the entire spec. + +## Asset Decoding +Static manifests are embedded via go-bindata (`pkg/operator/assets/bindata.go`) and decoded at reconcile time using typed `Decode*ObjBytes` helpers. Each decode call allocates a new object. These helpers `panic` on decode failure since the assets are build-time constants. Do not add error handling around these -- a panic here indicates a build or manifest corruption problem. + +## E2E Test Polling +E2E tests use `wait.PollUntilContextTimeout` with 5-second intervals and 2-minute default timeouts. Bitwarden-related tests use 4-minute timeouts due to SDK server startup latency. Follow these conventions for new e2e tests rather than using arbitrary sleep durations. diff --git a/docs/security-guidelines.md b/docs/security-guidelines.md new file mode 100644 index 00000000..a7863e67 --- /dev/null +++ b/docs/security-guidelines.md @@ -0,0 +1,127 @@ +# Security Guidelines + +## Container Security Context (Hardened Defaults) + +All operand containers are hardened programmatically via `updateContainerSecurityContext()` in `pkg/controller/external_secrets/deployments.go`. This function is called for every container (controller, webhook, cert-controller, bitwarden-sdk-server) during deployment reconciliation. The enforced settings are: + +- `AllowPrivilegeEscalation: false` +- `Capabilities.Drop: ["ALL"]` +- `ReadOnlyRootFilesystem: true` +- `RunAsNonRoot: true` +- `RunAsUser: nil` (defers to pod-level or image default; static manifests use `1000`) +- `SeccompProfile.Type: RuntimeDefault` + +When adding a new container or deployment, always call `updateContainerSecurityContext(&container)` on the container spec. Never set `RunAsUser` to `0` or add capabilities. The reconciler drift-detects security context changes and reverts them. + +Static deployment manifests in `bindata/external-secrets/resources/` also set `hostNetwork: false` explicitly. Never change this. + +## Container Image (Non-root) + +All Dockerfiles (`Dockerfile`, `images/ci/Dockerfile`, `images/ci/operand.Dockerfile`) run as `USER 65534:65534` (nobody). Never change the USER to root or remove this line. The base image is `ubi9-minimal`. + +## HTTP/2 Disabled by Default + +In `cmd/external-secrets-operator/main.go`, HTTP/2 is disabled for both metrics and webhook servers to mitigate known HTTP/2 vulnerabilities. The `--enable-http2` flag defaults to `false`. The implementation sets `c.NextProtos = []string{"http/1.1"}` on the TLS config. Do not change this default. + +## Metrics Endpoint Security + +The metrics server uses `filters.WithAuthenticationAndAuthorization` as `FilterProvider`, enforcing authn/authz on the metrics endpoint. Secure metrics serving (HTTPS on `:8443`) is enabled by default (`--metrics-secure=true`). The OpenShift service CA (`/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt`) is loaded for client verification. When modifying metrics configuration, preserve these protections. + +## RBAC Architecture (Three Tiers) + +The repo manages RBAC at three distinct levels. Changing one level does not affect the others. + +**1. Operator's own ClusterRole** (`config/rbac/role.yaml`): Grants permissions the operator process needs. Generated from `+kubebuilder:rbac` markers in `pkg/controller/external_secrets/controller.go` and `pkg/controller/external_secrets_manager/controller.go`. When adding new API interactions, add markers to the appropriate controller file and run `make manifests`. + +**2. Operand ClusterRoles** (static YAML in `bindata/external-secrets/resources/`): +- `clusterrole_external-secrets-controller.yml` -- main controller (secrets CRUD, ESO CRDs) +- `clusterrole_external-secrets-cert-controller.yml` -- cert-controller (webhook configs, secrets for TLS, CRDs) +- `clusterrole_external-secrets-view.yml` -- aggregated to `view`/`edit`/`admin` (read-only ESO resources) +- `clusterrole_external-secrets-edit.yml` -- aggregated to `edit`/`admin` (mutate ESO resources) +- `clusterrole_external-secrets-servicebindings.yml` -- service bindings integration (read-only ESO resources with servicebinding.io/controller label) + +The cert-controller role uses `resourceNames` restrictions on webhook configurations (`secretstore-validate`, `externalsecret-validate`) for its update permissions. Follow this pattern of scoping when adding new permissions. + +**3. Namespace-scoped roles**: `role_external-secrets-leaderelection.yml` restricts leader election to the operand namespace. + +Key convention: The controller ClusterRole grants `secrets` verbs `get/list/watch/create/update/delete/patch`. This is inherently powerful. Never broaden this (e.g., never add `*` verbs or `*` resources). + +## RBAC Label Protection + +The regex `disallowedLabelMatcher` in `install_external_secrets.go` prevents users from overriding internal labels via `ControllerConfig.Labels`: +``` +^app.kubernetes.io\/|^external-secrets.io\/|^rbac.authorization.k8s.io\/|^servicebinding.io\/controller$|^app$ +``` +When adding new internal labels, add them to this regex. + +## Annotation Domain Restrictions + +The CRD validation (CEL rules in `external_secrets_config_types.go`) blocks user annotations with reserved domain prefixes: `kubernetes.io/`, `openshift.io/`, `k8s.io/`, `cert-manager.io/`. These rules are enforced at admission time. Do not weaken them. When the operator adds its own annotations (e.g., `cert-manager.io/inject-ca-from`), it does so programmatically, bypassing user-facing validation. + +## Environment Variable Reservation + +The `ComponentConfig.OverrideEnv` field uses CEL validation to block reserved env var names/prefixes: `KUBERNETES_*`, `EXTERNAL_SECRETS_*`, `HOSTNAME`, `SSL_CERT_DIR`, `SSL_CERT_FILE`. When adding new operator-managed env vars that users should not override, add them to the CEL rule in the CRD type definition. + +## TLS Certificate Management (Dual Mode) + +Two certificate provisioning paths exist, controlled by `spec.controllerConfig.certProvider.certManager.mode`: + +**Built-in cert-controller** (default): A separate deployment (`external-secrets-cert-controller`) manages TLS certificates for the webhook. The cert-controller has its own RBAC and creates/updates the `external-secrets-webhook` Secret. The webhook reads certs from `/tmp/certs` volume mount. + +**cert-manager integration**: When enabled, `Certificate` resources are created from templates in `bindata/`. The `issuerRef` is validated to exist before creating the Certificate (via `assertIssuerRefExists`). The webhook secret name changes to `external-secrets-webhook-cm` to avoid clashing with the cert-controller secret. The `cert-manager.io/inject-ca-from` annotation is conditionally added to webhook configurations and ESO CRDs (via the `crd-annotator` controller) only when `injectAnnotations` is set to `"true"` (not merely when cert-manager mode is Enabled). + +Key constraint: `certManager.mode`, `issuerRef`, and `injectAnnotations` are immutable once set (enforced via `XValidation:rule="self == oldSelf"`). + +## Bitwarden TLS Requirements + +When `BitwardenSecretManagerProvider` is enabled, TLS certificates are mandatory. Users must provide either: +- A `secretRef` pointing to a K8s Secret with keys `tls.crt`, `tls.key`, `ca.crt` (validated via `assertSecretRefExists` using an uncached client) +- cert-manager configuration to auto-generate certificates + +This is enforced at the CRD level via CEL and in controller code. The bitwarden deployment mounts certs at `/certs` and probes use `scheme: HTTPS`. + +## Network Policy Architecture (Default-Deny) + +The operator deploys a **deny-all** base NetworkPolicy (`networkpolicy_deny-all.yaml`) that blocks all ingress and egress for all pods in the operand namespace. Specific allow-policies are then layered: + +- `allow-api-server-egress-for-main-controller-traffic` -- egress to API server (port 6443) +- `allow-api-server-and-webhook-traffic` -- egress to API server + ingress on webhook port 10250 and metrics port 8080 (from monitoring namespace only) +- `allow-api-server-egress-for-cert-controller-traffic` -- only when cert-controller is active +- `allow-api-server-egress-for-bitwarden-sever` -- only when Bitwarden is enabled (note: file has typo "sever" instead of "server") +- `allow-dns` -- DNS egress + +When adding new components or network requirements, follow this pattern: add a conditional static policy in `bindata/` and register it in `createOrApplyStaticNetworkPolicies()` with the appropriate condition. + +User-defined network policies via `spec.controllerConfig.networkPolicies` are prefixed with `eso-user-` and restricted to `Egress` policy type only. The operator auto-selects pods via component label. + +## Webhook Security Configuration + +Validating webhooks use: +- `failurePolicy: Fail` (on the ExternalSecret webhook) -- rejecting requests if the webhook is unavailable +- `sideEffects: None` +- `timeoutSeconds: 5` +- Webhook listens on port `10250` (not the default 443) + +The SecretStore webhook also explicitly sets `failurePolicy: Fail` on both SecretStore and ClusterSecretStore webhooks. Maintain `Fail` for security-critical validations. + +## Singleton Pattern (Cluster-Scoped CRs) + +Both `ExternalSecretsConfig` and `ExternalSecretsManager` are singletons enforced via CEL: `self.metadata.name == 'cluster'`. This prevents privilege escalation through multiple conflicting configurations. Never remove this validation. + +## Trusted CA Bundle Handling + +When proxy configuration is present, a ConfigMap labeled `config.openshift.io/inject-trusted-cabundle: "true"` is created. The Cluster Network Operator (CNO) populates it with cluster-wide CA certificates. The operator mounts this at `/etc/pki/tls/certs` (Go's default cert path) as a read-only volume. The operator never writes CA data directly -- it only manages the label and lets CNO handle the content. + +## Sensitive Data in Tests + +E2E tests reference AWS credentials via a well-known Secret (`aws-creds` in `kube-system`). Credential keys are `aws_access_key_id` and `aws_secret_access_key`. These are read from the cluster, never hardcoded. When adding e2e tests for new providers, follow this pattern: reference credentials from pre-existing cluster Secrets, never embed them in test code or YAML fixtures. + +## Reconciler Drift Detection + +The operator reconciles all managed resources back to desired state. RBAC rules, deployments (including container security context), webhook configurations, network policies, and certificates are compared field-by-field (via `HasObjectChanged` in `pkg/controller/common/utils.go`). If any resource is externally modified (e.g., someone manually adds permissions to a ClusterRole), the operator detects the drift and reverts it. This is a critical security property -- do not disable drift detection for security-sensitive resources. + +Note: drift detection does not currently cover pod-level `SecurityContext`, `hostNetwork`, or webhook `failurePolicy`. Manual changes to those fields will not be automatically reverted. + +## Error Classification for Security + +`FromClientError` in `pkg/controller/common/errors.go` classifies `Unauthorized` and `Forbidden` API errors as `IrrecoverableError`, meaning the operator will not retry. This prevents retry storms against permission boundaries. Maintain this classification when adding new API interactions. diff --git a/docs/testing-guidelines.md b/docs/testing-guidelines.md new file mode 100644 index 00000000..87e5b9ba --- /dev/null +++ b/docs/testing-guidelines.md @@ -0,0 +1,131 @@ +# Testing Guidelines + +## Test Architecture + +This repo has three test tiers. The repo uses `go.work` with modules at `.`, `./test`, `./cmd/external-secrets-operator`, `./tools`. Unit tests live in the root `.` module; API and E2E tests share the `./test` module: + +| Tier | Location | Make Target | Framework | Cluster Required | +|------|----------|-------------|-----------|-----------------| +| Unit | `pkg/**/*_test.go` | `make test-unit` | stdlib `testing` + table-driven | No | +| API Integration | `test/apis/` | `make test-apis` | Ginkgo/Gomega + envtest | No (envtest) | +| E2E | `test/e2e/` | `make test-e2e` | Ginkgo/Gomega + live cluster | Yes | + +`make test` runs `manifests`, `generate`, `fmt`, `vet`, then both `test-apis` and `test-unit` (no cluster needed). It also builds envtest binaries as a prerequisite. + +## Unit Tests (`pkg/`) + +### Conventions +- Files live alongside source in the same package (e.g., `deployments.go` / `deployments_test.go`). +- Use stdlib `testing` package exclusively; no Ginkgo. +- Follow table-driven test pattern: define a `tests` slice of structs with `name`, `preReq`, `wantErr`, and optional validation funcs, then iterate with `t.Run`. +- No build tags; unit tests compile under normal `go test`. + +### Fake Client Pattern +- Use `pkg/controller/client/fakes/FakeCtrlClient` (generated by counterfeiter) to mock the controller-runtime client. +- Set up behavior per test via `mock.ExistsCalls(...)`, `mock.CreateCalls(...)`, `mock.UpdateWithRetryCalls(...)`, etc. +- Assign mock to `r.CtrlClient = mock` on the test reconciler. + +### Test Helpers +- `pkg/controller/external_secrets/test_utils.go` provides `testReconciler(t)`, `testDeployment(name)`, `testService(name)`, `testResourceMetadata(esc)`, and typed decode helpers (`testClusterRole`, `testSecret`, etc.). +- `pkg/controller/commontest/utils.go` provides `TestExternalSecretsConfig()`, `TestExternalSecretsManager()`, `ErrTestClient`, and standard test constants (`TestExternalSecretsImageName`, `TestBitwardenImageName`, `TestExternalSecretsNamespace`). + +### Error Assertion +- Compare exact error strings: `if err == nil || err.Error() != tt.wantErr`. +- Use `t.Setenv()` for environment variables needed by the code under test (e.g., `RELATED_IMAGE_EXTERNAL_SECRETS`). + +### Adding a New Unit Test +1. Create or edit `*_test.go` next to the source file, same package. +2. Define a table-driven test with `name`, mock setup (`preReq`), optional input mutation (`updateExternalSecretsConfig`), and `wantErr` string. +3. Use `testReconciler(t)` for the reconciler, `fakes.FakeCtrlClient{}` for the client. +4. Assert with `t.Errorf`; avoid external assertion libraries in unit tests. + +## API Integration Tests (`test/apis/`) + +### Data-Driven Test Suites +- Tests are defined in YAML files at `api/v1alpha1/tests//.testsuite.yaml`. +- Each file specifies `crdName`, `tests.onCreate` (create + validate or expect error), and `tests.onUpdate` (create, update, validate or expect error). +- The Go code in `test/apis/generator.go` auto-generates Ginkgo specs from these YAML files using `DescribeTable`. + +### envtest Setup +- Suite file: `test/apis/suite_test.go`. +- Bootstraps `envtest.Environment` with CRDs from `config/crd/bases/`. +- Uses Kubernetes API version >= 1.25 (required for CEL validation). +- Uses `komega` for object comparison with `IgnoreAutogeneratedMetadata`. + +### Running +- `make test-apis` invokes `hack/test-apis.sh`, which runs Ginkgo with `--randomize-all --randomize-suites --keep-going --timeout=30m`. +- In CI (`OPENSHIFT_CI=true`), JUnit and coverage artifacts are written to `ARTIFACT_DIR`. + +### Adding a New API Test +1. Create or edit a `.testsuite.yaml` file under `api/v1alpha1/tests//`. +2. Define `onCreate` entries with `initial` YAML (input) and either `expected` YAML (success) or `expectedError` (failure substring). +3. Define `onUpdate` entries with `initial`, `updated`, `expected`/`expectedError`/`expectedStatusError`. Optionally use `initialCRDPatches` for ratcheting tests. +4. The generator picks them up automatically; no Go code changes needed. + +## E2E Tests (`test/e2e/`) + +### Build Tag +- All files in `test/e2e/` and `test/utils/` require `//go:build e2e`. The Makefile passes `-tags e2e` via `go test -C test`. + +### Label-Based Platform Selection +Tests are filtered by Ginkgo labels. The default filter is `"Platform: isSubsetOf {AWS}"`. + +| Label | Scope | Required Secrets | +|-------|-------|-----------------| +| `Platform:AWS` | AWS SecretStore + ExternalSecret + PushSecret | `aws-creds` in `kube-system` | +| `Platform:Generic` | Platform-independent operand tests | None | +| `CrossPlatform:GCP-AWS` | GCP cluster using AWS Secrets Manager | `aws-creds` in `kube-system` | +| `Provider:Bitwarden` | Bitwarden ESO provider via ClusterSecretStore | `bitwarden-creds` in `external-secrets-operator` | +| `API:Bitwarden` | Direct HTTP tests to bitwarden-sdk-server | `bitwarden-creds` in `external-secrets-operator` | +| `Suite:Bitwarden` | Full Bitwarden suite | `bitwarden-creds` in `external-secrets-operator` | +| `Proxy:HTTP` | Proxy-related tests | Proxy configuration | +| `Migration` | Migration/upgrade tests | None | +| `PostUpgradeCheck` | Post-upgrade verification | None | + +Note: some E2E test contexts (e.g., Environment Variables, Deployment Revision History, Annotations, Network Policies, Trusted CA Bundle) have **no platform label** and only run when the label filter is empty (`E2E_GINKGO_LABEL_FILTER=""`). + +Run a specific suite: `make test-e2e E2E_GINKGO_LABEL_FILTER="Provider:Bitwarden"`. +Run all: `make test-e2e E2E_GINKGO_LABEL_FILTER=""`. + +### E2E Structure +- Suite setup (`e2e_suite_test.go`): `BeforeSuite` initializes K8s clients (`kubernetes.Clientset`, `dynamic.DynamicClient`, `controller-runtime client.Client`), sets timeout with 5-minute cleanup buffer, configures JSON/JUnit reports. `AfterSuite` runs global `CleanupESOOperandAndRelated`. +- `BeforeAll` (in `e2e_test.go`): creates test namespace (prefix `external-secrets-e2e-test-`), creates/verifies `ExternalSecretsConfig` CR, waits for operator/operand pods. +- `BeforeEach`: verifies operand pods are ready. +- `AfterEach`: on failure, dumps artifacts via `utils.DumpE2EArtifacts()` to `ARTIFACT_DIR` or `_output/`. +- `AfterAll` (in `e2e_test.go`): deletes test namespace and cluster CR. + +### Test Data +- YAML fixtures embedded via `//go:embed testdata/*` in `e2e_test.go`. +- Pattern substitution in YAML via `utils.ReplacePatternInAsset("${PATTERN}", "value")`. +- Fixtures live in `test/e2e/testdata/`. + +### E2E Test Utilities (`test/utils/`) +Key helpers (all require `e2e` build tag): +- `DynamicResourceLoader`: create/delete K8s resources from YAML files or `Unstructured` objects. +- `VerifyPodsReadyByPrefix(ctx, clientset, namespace, prefixes)`: poll until pods with given name prefixes are Running + Ready. +- `WaitForESOResourceReady(ctx, dynamicClient, gvr, namespace, name, timeout)`: poll until ESO resource has `Ready=True` condition. +- `WaitForExternalSecretsConfigReady(ctx, dynamicClient, name, timeout)`: wait for both `Ready=True` and `Degraded=False`. +- `CleanupESOOperandAndRelated(ctx, cfg)`: best-effort cleanup of operand CRs, webhooks, RBAC, namespace. +- `GetRandomString(n)`: random alphanumeric string for unique resource names. + +### Adding a New E2E Test +1. Add `//go:build e2e` at the top of any new file. +2. Place test in `test/e2e/` package; use existing `suiteClientset`, `suiteDynamicClient`, `suiteRuntimeClient`. +3. Apply a Ginkgo `Label(...)` to the `Context` or `Describe` for platform filtering. +4. Use `Ordered` for tests with shared state dependencies. +5. Create test namespace with `GenerateName: testNamespacePrefix`; clean up in `AfterAll`. +6. Use `Eventually(...).Should(Succeed())` for async assertions with polling. +7. Use `defer loader.DeleteFromFile(...)` for resource cleanup. +8. Add YAML fixtures to `test/e2e/testdata/`. + +## Coverage + +- `make test-unit` writes `cover.out` (Go coverage profile) to the repo root. +- E2E coverage uses a special instrumented image: `make docker-build-coverage` builds from `images/ci/Dockerfile.coverage`. `make e2e-coverage-collect` gathers profiles from the running operator and optionally uploads to Codecov. + +## CI Notes + +- `GOFLAGS` is cleared for test targets (line: `fmt vet test test-unit test-e2e run update-vendor update-dep: GOFLAGS=`) to avoid conflicts with `go.work`. +- envtest uses Kubernetes `1.32.0` binaries (`ENVTEST_K8S_VERSION`). +- E2E timeout defaults to `1h` (`E2E_TIMEOUT`); Ginkgo suite timeout is set to `E2E_TIMEOUT - 5min`. +- `ARTIFACT_DIR` controls where CI artifacts (JUnit XML, JSON reports, failure dumps) are written.