AnouarMohamed/StateSight

StateSight

StateSight is a GitOps forensic platform for Kubernetes.

Its purpose is to compare desired state from Git with live cluster state, explain drift, group it into incidents, and recommend actions (ignore, monitor, investigate, reconcile).

StateSight is not a deployment controller like Argo CD or Flux.

What This Baseline Includes

  • Go API service with versioned routes, request IDs, structured JSON responses, health/readiness, and basic metrics.
  • Go worker service that consumes Redis queue jobs and writes deterministic analysis outputs to Postgres.
  • React + TypeScript + Vite + Tailwind web app with routed pages and API-backed data loading.
  • PostgreSQL migrations for core domain entities, suppression audit records, scoped ignore rules, provisioned OIDC identities, and workspace-qualified relationships.
  • Seed workflow with realistic sample data.
  • Docker Compose local stack for Postgres, Redis, API, worker, and web.
  • Makefile commands for setup, migrate, seed, run, format, test, and docs checks.

Current Limitations (Intentional)

  • Semantic diffing currently covers resource presence, replica counts, first-container images, named pod-template container presence, named container environment entries and resource requests/limits, annotations, metadata labels, and Service selectors; it is not a complete Kubernetes diff engine.
  • Live-state collection uses kubectl for a limited resource set rather than a Kubernetes client integration.
  • Evidence records provide Git/live-state provenance and exact Kubernetes managedFields ownership where available; they do not yet correlate audit logs or prove which actor caused drift.
  • The GitHub webhook endpoint is baseline-only; it does not implement the full GitHub App install/auth flow.
  • Git desired-state ingestion reads plain YAML/JSON manifests; Helm, Kustomize, Argo CD, and Flux integrations are not implemented.
  • The web client does not yet implement an interactive OIDC sign-in flow; authenticated deployments currently expose the verified API boundary for an integrated client or gateway.
  • No auto-remediation.

Authentication and RBAC

Local demo mode defaults to AUTH_REQUIRED=false. When AUTH_REQUIRED=true, the API requires a verified OIDC JWT bearer token for operator API endpoints. The GitHub webhook keeps its independent HMAC signature validation path.
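As an illustration of the webhook's HMAC path, the sketch below assumes the endpoint follows GitHub's documented X-Hub-Signature-256 convention ("sha256=" followed by the hex HMAC-SHA256 of the raw request body). The function names and the demo secret are hypothetical and are not taken from the StateSight source.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// signBody builds the header value a sender would attach for body:
// "sha256=" + hex(HMAC-SHA256(secret, body)).
func signBody(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// validSignature reports whether header is a well-formed signature of body.
// hmac.Equal compares in constant time to avoid leaking timing information.
func validSignature(secret, body []byte, header string) bool {
	const prefix = "sha256="
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("demo-secret") // hypothetical shared secret
	body := []byte(`{"ref":"refs/heads/main"}`)
	header := signBody(secret, body)

	fmt.Println("original body accepted:", validSignature(secret, body, header))
	fmt.Println("tampered body accepted:", validSignature(secret, []byte("tampered"), header))
}
```

The signature must be computed over the raw request bytes, before any JSON decoding, because re-encoding the payload can change the byte sequence.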

Required API configuration:

  • OIDC_ISSUER_URL: OIDC discovery issuer URL.
  • OIDC_AUDIENCE: expected token audience for the StateSight API.
  • OIDC_ALLOW_INSECURE_ISSUER: defaults to false; enable only for a local plain-HTTP identity provider.

At startup the API discovers the provider and rejects an auth-enabled configuration that cannot be initialized. For each protected request it verifies the bearer token's signature against the discovered JWKS and validates the issuer, audience, and token lifetime. Production issuer and JWKS URLs must use HTTPS.

Verified iss and sub claims map to a local user through user_identities; unmapped identities are denied. Roles continue to come from workspace_memberships (viewer, editor, admin). X-User-ID and X-User-Email are not authentication inputs and are not sent by the web client.

GET /api/v1/overview and GET /api/v1/applications additionally require X-Workspace-ID to choose the workspace being viewed. That header selects scope only: the authenticated local user must hold a membership for the selected workspace. Resource-addressed endpoints derive the workspace from the stored resource before enforcing membership.

The database enforces that an application's cluster and source definition belong to its workspace, and that an application-scoped ignore rule belongs to the same workspace as its application. A migration will fail if legacy data violates those tenant relationships; repair that data before deploying the constraint.

Identity provisioning is intentionally administrative until a managed enrollment flow exists. After creating a local user and its workspace membership, bind its verified provider identity using an operator-controlled migration or SQL statement:

INSERT INTO user_identities (issuer, subject, user_id)
VALUES ('https://identity.example.com', '<provider-subject>', '<local-user-uuid>');

Analysis Safety and Configuration

The worker honors:

  • GIT_BIN and GIT_CACHE_DIR for desired-state checkouts.
  • KUBECTL_BIN for live-state collection.
  • ALLOW_SYNTHETIC_LIVE_STATE, which defaults to false.

When kubectl cannot collect live resources, analysis fails by default. Set ALLOW_SYNTHETIC_LIVE_STATE=true only for local pipeline demonstrations; resulting incidents do not represent observations from a cluster.

Evidence Provenance

For each unsuppressed drift incident, the worker persists:

  • desired-state provenance identifying the analyzed Git repository, path, and revision;
  • live-state provenance identifying whether the value was observed through kubectl or generated by explicit synthetic demo fallback;
  • Kubernetes managedFields evidence only when the live object reports ownership of the exact field path compared by the current diff engine.

Git and kubectl records describe where compared values came from and use not-attributed instead of inventing an actor. A managedFields manager is field-ownership evidence, not proof that the manager introduced the drift. Named resource request/limit findings can receive exact ownership evidence; aggregate named-container presence and environment-entry findings intentionally do not. Synthetic live state is recorded as untrusted and does not yield manager attribution.

Ignore Rules

An ignore rule's match_expression is one exact drift field path, such as:

  • spec.replicas
  • spec.template.spec.containers[0].image
  • metadata.annotations.example.com/managed-by
  • metadata.labels.app.kubernetes.io/name
  • spec.selector.app.kubernetes.io/name for a Service
  • spec.template.spec.containers[name=ledger-api].env[name=LOG_LEVEL]
  • spec.template.spec.containers[name=ledger-api].resources.requests.cpu

Matching is case-sensitive and trims surrounding whitespace. Wildcards and regular expressions are not supported. Rules created through the application API or UI are scoped to that application and can optionally specify an exact resource_ref. Active resource-specific application rules are evaluated before application-wide rules, which are evaluated before inherited workspace rules. Within the same scope, the oldest matching rule is used first.

Existing rows with no application_id remain inherited workspace rules for compatibility. They are displayed on application details as read-only because changing one affects every application in that workspace.

A suppressed candidate does not create a drift incident. The worker stores a suppressed_findings audit record linked to the analysis snapshots, including the matching rule name and reason captured at analysis time. Application details expose the audit history under Suppressed and application-owned rule management under Ignore Rules: operators can create, edit, enable, disable, and delete those rules. Editing or deleting a rule changes future evaluation only; existing suppression audit records retain the captured rule explanation. Inherited workspace-rule administration is not implemented.

Operator Interface

The web application is a dense, warm-black Git-oriented investigation console designed around scanning and evidence review. Its compact sidebar and table language are adapted from the local GitOps forensic UI reference while all data and actions remain backed by StateSight APIs. It surfaces compared field values, provenance trust state, absent attribution, and Kubernetes ownership caveats without implying unobserved causality.

The reference prototype contains separate Supabase-backed features, including AI insights, remediation actions, audit views, commit views, sync history, and cluster administration. Those screens are not exposed in StateSight until equivalent backend contracts exist. The shell uses the official orange Git logomark from git-scm.com/community/logos, credited there to Jason Long under CC BY 3.0. The visual and interaction contract is recorded in PRODUCT.md and DESIGN.md.

Architecture Overview

High-level structure:

  • apps/api: HTTP API service
  • apps/worker: async job processor
  • apps/web: frontend app
  • internal/*: service internals and pipeline boundaries
  • pkg/*: reusable domain/model utilities
  • migrations/: SQL schema migrations
  • scripts/migrate, scripts/seed: operational bootstrap commands

Detailed notes:

Local Setup

1) Prerequisites

  • Go 1.25.10+
  • Node 22.13+
  • Docker + Docker Compose

2) Environment

cp .env.example .env
cp apps/web/.env.example apps/web/.env

During npm run dev, Vite proxies /api to the local API by default; keep VITE_API_BASE_URL empty for that workflow. Docker Compose serves the built web application through Nginx on port 5173, with Nginx proxying API requests to the API container.

The checked-in web configuration is for unauthenticated local demonstration. Enabling AUTH_REQUIRED=true requires an OIDC-capable client or gateway that supplies Authorization: Bearer <token>; browser authorization-code/PKCE login remains follow-up work.

3) Start Infrastructure + Services

docker compose up --build -d

4) Run Migrations

make migrate-up

5) Seed Sample Data

make seed

The seed data provides a prebuilt incident for UI inspection. Its source repository is illustrative, not a runnable analysis input. Running a real analysis requires an accessible manifest repository and Kubernetes credentials available to the worker. A containerized worker also needs any kubeconfig path referenced by a cluster record mounted inside its container.

6) Run Services Locally (optional alternative to containerized app services)

make api
make worker
make web

Make Commands

  • make help
  • make setup
  • make up
  • make down
  • make migrate-up
  • make seed
  • make api
  • make worker
  • make web
  • make fmt
  • make test
  • make lint
  • make test-race
  • make security-go
  • make verify-web
  • make workflow-lint
  • make script-lint
  • make docs-check

Required API Endpoints in This Baseline

  • GET /healthz
  • GET /readyz
  • GET /api/v1/overview
  • GET /api/v1/applications
  • POST /api/v1/applications
  • GET /api/v1/applications/:id
  • POST /api/v1/applications/:id/analyze
  • POST /api/v1/applications/:id/ignore-rules
  • PUT /api/v1/applications/:id/ignore-rules/:ruleID
  • PATCH /api/v1/applications/:id/ignore-rules/:ruleID
  • DELETE /api/v1/applications/:id/ignore-rules/:ruleID
  • GET /api/v1/incidents/:id
  • GET /api/v1/incidents/:id/timeline
  • POST /api/v1/github/webhook

Application detail responses include incidents, suppressions, and applicable ignore_rules; suppressed findings include the matching rule name and reason captured at analysis time.
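A client call to one of the workspace-scoped endpoints might be assembled as shown below. The base URL, workspace ID, and token are placeholders, and newOverviewRequest is a hypothetical helper; the sketch only builds the request and does not send it.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newOverviewRequest builds GET /api/v1/overview for one workspace.
// The bearer token is attached only when provided, which matches a local
// demo stack running with AUTH_REQUIRED=false.
func newOverviewRequest(baseURL, workspaceID, token string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/api/v1/overview"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Workspace-ID", workspaceID)
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// Placeholder values: adjust the base URL to wherever the API listens.
	req, err := newOverviewRequest("http://localhost:8080", "<workspace-uuid>", "")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("X-Workspace-ID"))
}
```

The request can then be passed to http.DefaultClient.Do or any configured http.Client.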

Next Suggested Implementation Steps

  1. Correlate persisted provenance with audit, deployment, or controller signals without treating field ownership as causality.
  2. Add deliberate workspace-wide rule management with an appropriate authorization and blast-radius review boundary.
  3. Expand normalization, diff coverage, and incident grouping with focused tests.
  4. Complete the protected operator access flow with browser OIDC login and managed identity provisioning on top of the verified API bearer boundary.
  5. Add GitOps rendering/integration support and hardened Kubernetes collection.
