feat(infra): TMI Component Platform — Kubernetes-native component architecture (roadmap) #414

Description

@ericfitz

Summary

Refactor TMI from a monolith-with-deployment-variability into a Kubernetes-native system of cooperating components. The server binary has accreted significant complexity (most recently content providers with their authorization, fetching, chunking, and embedding logic). This roadmap establishes a generic component model so functionality can be moved out of the monolith over time, and standardizes deployment on container + Kubernetes — removing today's deployment-shape variability (local Docker, Heroku, multiple cloud registries).

This is a tracking/roadmap issue. Individual components ship as their own issues. The first tenant is #347.

Thesis

  • The monolith does not dissolve; it becomes the stateful coordinator: sole DB writer, fetch/egress owner, job-state authority, and result consumer.
  • Components are stateless workers, each declared as a TMIComponent custom resource.

Architecture

  • NATS JetStream — the spine for asynchronous job work (durable subjects, fan-out, autoscaling signal).
  • TMIComponent CRD — the durable, runtime-editable contract for a component type. kubectl apply registers a new type; no monolith redeploy. CRD OpenAPI schema validates declarations before any pod runs. Key fields: jobSubjects, inputMode, egress, spec.config, secretRefs, resources, timeouts, scratchVolume, scaling.
  • Custom controller — reconciles each TMIComponent CR into a Deployment + KEDA ScaledObject + NetworkPolicy + NATS stream/consumer wiring.
  • KEDA — autoscales workers on JetStream queue depth, scale-to-zero capable. Neither the monolith nor the worker decides scaling.
  • Worker heartbeat — workers heartbeat on the bus; the monolith distinguishes "type declared, no healthy instance yet" from "instances present."
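
The controller-to-KEDA step above can be sketched as a pure rendering function. This is a minimal illustration, not the controller's actual code: the `Scaling` dataclass, the `render_scaled_object` helper, and the stream/consumer names in the usage note are hypothetical. The manifest keys follow KEDA's `ScaledObject` resource and its `nats-jetstream` trigger, and `$G` is the default NATS account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaling:
    """Hypothetical shape of the CRD's `scaling` field (an assumption)."""
    max_replicas: int = 5
    lag_threshold: int = 10  # target consumer lag that KEDA scales against


def render_scaled_object(name: str, stream: str, consumer: str,
                         monitoring_endpoint: str, scaling: Scaling) -> dict:
    """Build a KEDA ScaledObject manifest targeting the component's Deployment."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{name}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": name},   # Deployment rendered from the same CR
            "minReplicaCount": 0,               # scale-to-zero when the queue is empty
            "maxReplicaCount": scaling.max_replicas,
            "triggers": [{
                "type": "nats-jetstream",
                "metadata": {                   # KEDA trigger metadata values are strings
                    "natsServerMonitoringEndpoint": monitoring_endpoint,
                    "account": "$G",
                    "stream": stream,
                    "consumer": consumer,
                    "lagThreshold": str(scaling.lag_threshold),
                },
            }],
        },
    }
```

For example, `render_scaled_object("tmi-extractor", "TMI_JOBS", "tmi-extractor", "nats.nats.svc.cluster.local:8222", Scaling())` yields a manifest whose only scaling input is JetStream consumer lag, so neither the monolith nor the worker takes part in the decision.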

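The heartbeat distinction in the Architecture list can be illustrated with a small tracker on the monolith side. This is a sketch under assumptions: the `HeartbeatTracker` class, the 30-second freshness window, and the extra `UNDECLARED` state are all hypothetical, and the bus subscription that would feed `record` is left out.

```python
from enum import Enum


class Availability(Enum):
    UNDECLARED = "undeclared"                       # assumption: no CR for this type
    DECLARED_NO_INSTANCE = "declared-no-instance"   # type declared, no healthy instance yet
    INSTANCES_PRESENT = "instances-present"         # at least one fresh heartbeat


class HeartbeatTracker:
    """Tracks declared component types and the last heartbeat per worker."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self.declared: set[str] = set()
        self.last_seen: dict[tuple[str, str], float] = {}

    def declare(self, component_type: str) -> None:
        """Called when a TMIComponent CR for this type is observed."""
        self.declared.add(component_type)

    def record(self, component_type: str, worker_id: str, now: float) -> None:
        """Called for every heartbeat message received from the bus."""
        self.last_seen[(component_type, worker_id)] = now

    def availability(self, component_type: str, now: float) -> Availability:
        if component_type not in self.declared:
            return Availability.UNDECLARED
        # A worker counts as healthy only while its last heartbeat is within the TTL.
        fresh = any(
            ctype == component_type and now - seen <= self.ttl
            for (ctype, _worker), seen in self.last_seen.items()
        )
        return (Availability.INSTANCES_PRESENT if fresh
                else Availability.DECLARED_NO_INSTANCE)
```

Passing `now` explicitly keeps the sketch deterministic. With scale-to-zero, `DECLARED_NO_INSTANCE` is a normal state rather than an error, which is why it is kept separate from an unknown type.
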
Cross-cutting contracts (derived while designing the first tenant, #347)

  • Egress is a first-class CRD field with three postures: none / fetch-controlled / allowlist. The controller renders the NetworkPolicy from it, and always renders a cluster-layer NetworkPolicy backstop regardless of in-code guarding. fetch-controlled is backed by one shared egress-guard library (owned by the T3 issue) — components never reimplement SSRF defense.
  • Filesystem: read-only root is a hard, universal invariant. Writable space is only ever a size-capped ephemeral emptyDir — never writable root, hostPath, or a PV.
  • Job input: two modes (content-ref / source-locator), one stable job-envelope schema with optional fields, mode declared per-CR. Large payloads travel by reference via a JetStream Object Store, never as large NATS messages.
  • Config has three categories: bootstrap (local, chicken-and-egg) / operational-monolith-local (DB settings service) / shared-cross-component (platform-owned, controller-projected). Components consume a minimal local bootstrap + projected shared config — never the monolith's config cascade. See the config-system issue.
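
One way the controller might render the NetworkPolicy backstop from the three egress postures is sketched below. Several details are assumptions rather than part of the design: that every worker may still reach NATS on its default client port 4222 in the same namespace, that `fetch-controlled` is backstopped by blocking private and link-local ranges, and that `allowlist` entries are CIDRs. The `render_network_policy` helper and the `app` labels are hypothetical, and DNS egress is omitted for brevity.

```python
# Ranges a fetch-controlled worker should never reach (assumed SSRF backstop).
PRIVATE_CIDRS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "169.254.0.0/16"]


def render_network_policy(name: str, posture: str, allow_cidrs=(), bus_labels=None) -> dict:
    """Render an egress-only NetworkPolicy for one component type."""
    bus_labels = bus_labels or {"app": "nats"}
    # Assumption: all postures keep access to the message bus.
    rules = [{
        "to": [{"podSelector": {"matchLabels": bus_labels}}],
        "ports": [{"protocol": "TCP", "port": 4222}],
    }]
    if posture == "none":
        pass  # bus only; everything else is denied by the policy
    elif posture == "fetch-controlled":
        rules.append({"to": [{"ipBlock": {"cidr": "0.0.0.0/0", "except": PRIVATE_CIDRS}}]})
    elif posture == "allowlist":
        if not allow_cidrs:
            raise ValueError("allowlist posture requires at least one CIDR")
        rules.append({"to": [{"ipBlock": {"cidr": cidr}} for cidr in allow_cidrs]})
    else:
        raise ValueError(f"unknown egress posture: {posture!r}")
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{name}-egress"},
        "spec": {
            "podSelector": {"matchLabels": {"app": name}},
            "policyTypes": ["Egress"],
            "egress": rules,
        },
    }
```

Because the policy is rendered for every posture, including `fetch-controlled`, the cluster layer still denies private-range egress even if the in-code guard is bypassed.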

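The job-input contract (two modes, one envelope schema with optional fields, mode declared per-CR) might look like the following. This is only a sketch: the field names `job_id`, `content_ref`, and `source_locator` are hypothetical, and the real envelope schema is defined in the design for #347, not here.

```python
from dataclasses import dataclass
from typing import Optional

INPUT_MODES = ("content-ref", "source-locator")


@dataclass(frozen=True)
class JobEnvelope:
    """One stable envelope for both input modes; mode-specific fields are optional."""
    job_id: str
    input_mode: str
    content_ref: Optional[str] = None      # assumed: key of a payload in the Object Store
    source_locator: Optional[str] = None   # assumed: where the worker should fetch from

    def validate(self, declared_mode: str) -> None:
        """Check the envelope against the input mode declared on the component's CR."""
        if self.input_mode not in INPUT_MODES:
            raise ValueError(f"unknown input mode: {self.input_mode!r}")
        if self.input_mode != declared_mode:
            raise ValueError(
                f"envelope mode {self.input_mode!r} does not match "
                f"declared mode {declared_mode!r}"
            )
        # The field matching the mode must be present; the payload itself
        # never travels inside the envelope.
        required = (self.content_ref if self.input_mode == "content-ref"
                    else self.source_locator)
        if not required:
            raise ValueError(f"{self.input_mode} job is missing its reference field")
```

Keeping both reference fields optional in a single schema lets a later tenant use `source-locator` without changing the envelope that `content-ref` workers already parse.
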
Dev/test — three tiers

Tier | Where | Frequency
-----|-------|----------
Unit/integration | Process-mode; workers as processes against NATS as a CI services: container | Every PR
E2E/cluster | kind with Calico/Cilium (kindnet does not enforce NetworkPolicy); CRD + controller + KEDA | Gated merge check
Local dev | kind / k3d, prod-shaped manifests | Developer-driven

make start-dev is reworked to bring up a local cluster; the standalone-Docker dev path is retired.

Roadmap (component extraction order)

  1. Content extractors (#347, "feat(infra): isolate document/content extractors in a sandboxed worker container (T12/T3)"), the first tenant. Sandboxed tmi-extractor + tmi-chunk-embed workers. Proves the platform contracts.
  2. Code extractor — first source-locator / fetch-controlled tenant (depends on the T3 shared egress-guard library).
  3. OAuth provider — extraction of a synchronous request-path component (see open question below).
  4. Webhook dispatch / orchestration — synchronous request-path component.
  5. "Core" webhook extensions — first-party extension functionality as components.

Open questions / deferred decisions

  • Synchronous component interaction pattern — async bus jobs do not fit latency-sensitive request work (OAuth, webhook dispatch). Deferred until the first such tenant. Stated direction: reuse TMI's existing REST/HTTP API style rather than introduce a second protocol (gRPC, etc.). The TMIComponent CRD is designed to accept an interactionType field later without a breaking change.
  • Controller build vs. Helm templating — preference is a custom controller; a Helm-templated interim is the documented fallback if controller delivery is too large for a milestone.

Design reference

docs/superpowers/specs/2026-05-16-extractor-component-isolation-design.md — full design of the first tenant (#347), which derived these platform contracts.
