Kafka system and components for Suncoast systems platform
Kafka client traffic on kafka.kafka.svc.internal.lan:9092 is configured for:
- SASL_PLAINTEXT
- mechanism SCRAM-SHA-256
An internal broker listener (INTERNAL://:9094) remains plaintext for broker-internal/bootstrap operations.
ACL authorizer is enabled with allow.everyone.if.no.acl.found=true, and bootstrap ACLs are created for admin/connect/schema-registry/ui/graphql-async principals.
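As an illustration of the client-side settings this listener implies, the sketch below renders a Kafka client properties file. The function name and the example credentials are hypothetical; only the endpoint, protocol, and mechanism come from this document.

```shell
# Sketch: print Kafka client properties for the SASL_PLAINTEXT / SCRAM-SHA-256 listener.
# render_client_properties is a hypothetical helper, not part of the platform manifests.
render_client_properties() {
  # $1 = SCRAM username, $2 = SCRAM password
  cat <<EOF
bootstrap.servers=kafka.kafka.svc.internal.lan:9092
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="$1" password="$2";
EOF
}

# Example invocation with placeholder credentials.
render_client_properties "example-user" "example-password"
```

The output can be redirected to a file and passed to Kafka CLI tools via --command-config. Values containing double quotes or backslashes would need escaping before being placed in the JAAS line; this sketch does not handle that.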
manifests/ includes:
- Security bootstrap job kafka-security-bootstrap that idempotently creates/rotates Kafka SCRAM users
- Kubernetes secret kafka-security-credentials (Vault placeholder backed) for workload env injection, using one-secret-per-key format.
Create these Vault KVv2 paths (each with field value):
- secret/data/kafka-admin-username
- secret/data/kafka-admin-password
- secret/data/kafka-connect-username
- secret/data/kafka-connect-password
- secret/data/kafka-schema-registry-username
- secret/data/kafka-schema-registry-password
- secret/data/kafka-ui-username
- secret/data/kafka-ui-password
- secret/data/graphql-kafka-async-username
- secret/data/graphql-kafka-async-password
- secret/data/kafka-nifi-username
- secret/data/kafka-nifi-password
- secret/data/kafka-flink-username
- secret/data/kafka-flink-password
- secret/data/kafka-camel-username
- secret/data/kafka-camel-password
- secret/data/k8s-kafka-keycloak-internal-oidc-issuer-url
- secret/data/k8s-kafka-keycloak-internal-oidc-discovery-url
- secret/data/keycloak-client-id-kafka-gui-proxy
- secret/data/keycloak-client-secret-kafka-gui-proxy
- secret/data/k8s-kafka-kafka-gui-keycloak-cookiesecret
- secret/data/keycloak-client-id-nifi-gui
- secret/data/keycloak-client-secret-nifi-gui
- secret/data/k8s-kafka-nifi-keystore-password
- secret/data/keycloak-client-id-flink-gui-proxy
- secret/data/keycloak-client-secret-flink-gui-proxy
- secret/data/k8s-kafka-flink-gui-keycloak-cookiesecret
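The paths above are written in the KVv2 API form (secret/data/<key>), while the vault kv CLI addresses the same entry as secret/<key> on a KV v2 mount named secret. The sketch below converts between the two and prints, without running, the corresponding vault kv put commands. The helper names and the placeholder value are hypothetical.

```shell
# Convert a KVv2 API-style path (secret/data/<key>) to the CLI form (secret/<key>).
# Paths that do not start with secret/data/ are returned unchanged.
kv_cli_path() {
  case "$1" in
    secret/data/*) printf 'secret/%s\n' "${1#secret/data/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Dry run: print one "vault kv put" command per path, using the field name "value".
print_kv_put_commands() {
  for api_path in "$@"; do
    printf "vault kv put %s value='<replace-me>'\n" "$(kv_cli_path "$api_path")"
  done
}

# Example: two of the paths from the list above.
print_kv_put_commands \
  secret/data/kafka-admin-username \
  secret/data/kafka-admin-password
```

Printing the commands first makes it possible to review them before supplying real values.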
manifests/batch-processing-platform.yaml adds a baseline batch platform in the existing kafka namespace:
- Apache NiFi (StatefulSet) for scheduled ingestion/orchestration
- Apache Flink session cluster (FlinkDeployment managed by the Flink Kubernetes Operator) for distributed batch compute
- Shared Kafka connection config (batch-kafka-config) and SCRAM credentials for NiFi/Flink
manifests/nifi-kubernetes-operator.yaml installs NiFiKop (CRDs + operator) so NiFi flows can be managed declaratively via Kubernetes custom resources.
Current default watch scope is kafka namespace only (safe with existing Argo project boundaries).
Kafka bootstrap also provisions ACLs for:
- kafka-nifi-* principal for batch.* topics
- kafka-flink-* principal for batch.* topics
- kafka-camel-* principal for batch.* topics (for Camel routes/connectors you run in Kafka Connect or separate runtimes)
Account model:
- kafka-* credentials are robot/service principals for workload-to-Kafka auth
- human UI access uses Keycloak only:
  - Kafka UI and Flink UI use Keycloak-gated oauth2-proxy
  - NiFi UI uses native NiFi OIDC with Keycloak
Keycloak authorization:
- Kafka UI, NiFi, and Flink are internal apps and should authenticate against the internal realm
- create internal realm role platform_batch_internal_ui_user
- assign platform_batch_internal_ui_user to users/groups that should access Kafka UI / NiFi / Flink
- configure Keycloak mappers so users receive the groups claim containing platform_batch_internal_ui_user for NiFi policy group matching
- each UI has its own Keycloak OIDC client credentials secret paths listed above
NiFi 2.x is HTTPS-only and OIDC-enabled; UI runs on https://nifi-gui.teleport.app.suncoast.systems/nifi.
Allow these Keycloak redirect URIs for NiFi:
- https://nifi-gui.teleport.app.suncoast.systems/nifi-api/access/oidc/callback
- https://nifi-gui.teleport.app.suncoast.systems:443/nifi-api/access/oidc/callback
- https://nifi-gui.teleport.app.suncoast.systems/nifi-api/access/oidc/callback/consumer (compatibility)
Reference dataflow workloads are managed outside this platform repo.
- app-of-apps repo: https://github.com/dotcomrow/dataflow-apps
- example workload repo: https://github.com/dotcomrow/dataflow-example-app
k8s-kafka provides the runtime platform (Kafka, NiFi, Flink, security).
Pipeline/job manifests should be deployed via the dataflow app-of-apps layer.
NiFi flow lifecycle is handled with NiFiKop resources in the dataflow workload repo:
- NifiCluster (external cluster reference to existing NiFi runtime)
- NifiRegistryClient
- NifiParameterContext
- NifiDataflow
The starter CR templates are stored in dataflow-example-app/templates/nifi-declarative-flow-crs.yaml.
Important:
- NiFiKop external-cluster mode requires non-interactive NiFi API auth (basic or tls) for automation.
- if you use basic, provide username, password, and ca.crt in the referenced Kubernetes secret.
- bucketId and flowId in NifiDataflow come from the versioned flow metadata (bucket.yml / flow definition metadata).
- to manage CRs in dataflow namespace, update operator watch namespaces and Argo project destination allow-lists accordingly.
Both NiFi and Flink use the same shared Kafka config and their own per-workload credentials:
- shared endpoint/mechanism:
  - BATCH_KAFKA_BOOTSTRAP_SERVERS/KAFKA_BOOTSTRAP_SERVERS=kafka.kafka.svc.internal.lan:9092
  - BATCH_KAFKA_SECURITY_PROTOCOL/KAFKA_SECURITY_PROTOCOL=SASL_PLAINTEXT
  - BATCH_KAFKA_SASL_MECHANISM/KAFKA_SASL_MECHANISM=SCRAM-SHA-256
- NiFi principal:
  - username/password from kafka-security-credentials keys nifi_username/nifi_password
- Flink principal:
  - username/password from kafka-security-credentials keys flink_username/flink_password
- ACL scope:
  - bootstrap grants both principals access to batch.* topics.
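A workload script could read the shared settings from either spelling of the variables. The sketch below prefers the BATCH_KAFKA_* name and falls back to KAFKA_*; that preference order and the function name are assumptions made for this example, since the document only states that both names carry the same value.

```shell
# Sketch: print the shared Kafka settings as client properties,
# preferring BATCH_KAFKA_* and falling back to KAFKA_* (assumed order).
print_batch_kafka_settings() {
  printf 'bootstrap.servers=%s\n' "${BATCH_KAFKA_BOOTSTRAP_SERVERS:-${KAFKA_BOOTSTRAP_SERVERS:-}}"
  printf 'security.protocol=%s\n' "${BATCH_KAFKA_SECURITY_PROTOCOL:-${KAFKA_SECURITY_PROTOCOL:-}}"
  printf 'sasl.mechanism=%s\n' "${BATCH_KAFKA_SASL_MECHANISM:-${KAFKA_SASL_MECHANISM:-}}"
}

# Example run in a subshell so the variables do not leak into the caller.
(
  KAFKA_BOOTSTRAP_SERVERS=kafka.kafka.svc.internal.lan:9092
  KAFKA_SECURITY_PROTOCOL=SASL_PLAINTEXT
  KAFKA_SASL_MECHANISM=SCRAM-SHA-256
  print_batch_kafka_settings
)
```

Inside a NiFi or Flink pod the variables would already be present, so the function could be called directly.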
Dynamic Kafka accounts are reconciled continuously by CronJob kafka-security-reconciler (every minute, non-overlapping runs via concurrencyPolicy: Forbid).
Safety behavior for eventual Vault consistency:
- Reconciler only applies a user when both username and password keys exist and both value fields are non-empty.
- VAULT_MIN_SECRET_AGE_SECONDS (default 120) delays apply until secrets are stable for at least that age.
- Kafka admin CLI calls use KAFKA_CLI_TIMEOUT=75s with KAFKA_AUTH_MAX_RETRIES=3 to tolerate slower broker responses without indefinite retries.
- Critical Kafka config updates retry transient failures via KAFKA_COMMAND_MAX_RETRIES=3 (e.g., brief DNS resolution blips).
- Kafka bootstrap uses service IP env (KAFKA_SERVICE_HOST:KAFKA_SERVICE_PORT) by default (KAFKA_USE_SERVICE_IP_BOOTSTRAP=true) to avoid DNS-only failure modes.
- Reconcile job hard timeout is activeDeadlineSeconds=900 to avoid false failures during slower broker/Vault periods.
- If VAULT_SKIP_PLACEHOLDER_VALUES=true, placeholder values are ignored using VAULT_PLACEHOLDER_VALUES CSV.
- While any user is in a transitional/invalid state, stale-user deletion is skipped for safety.
- On each successful reconcile, SCRAM password is re-applied from Vault so password rotations converge automatically.
- Create/update user with two Vault KVv2 secrets (both using field value):
  - secret/data/kafka-scram-user-<username> (Kafka principal username)
  - secret/data/kafka-scram-user-<username>-password (SCRAM password)
- Delete user by deleting both keys:
  - vault kv delete secret/kafka-scram-user-<username>
  - vault kv delete secret/kafka-scram-user-<username>-password
This uses root-level keys under secret/data/ (no nested folder required).
Only keys with prefix kafka-scram-user- are managed by this reconciler.
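The safety gates described earlier (both values present and non-empty, a minimum secret age, and placeholder filtering) can be sketched as a single decision function. This is an illustration, not the reconciler's actual code: the function names, the reason strings, the default for VAULT_SKIP_PLACEHOLDER_VALUES, and the default placeholder list are all assumptions. Only the variable names and the 120-second default come from this document.

```shell
# Defaults: 120 is documented; the other two defaults are assumed for this sketch.
VAULT_MIN_SECRET_AGE_SECONDS="${VAULT_MIN_SECRET_AGE_SECONDS:-120}"
VAULT_SKIP_PLACEHOLDER_VALUES="${VAULT_SKIP_PLACEHOLDER_VALUES:-true}"
VAULT_PLACEHOLDER_VALUES="${VAULT_PLACEHOLDER_VALUES:-replace-me,changeme}"

# Return 0 if $1 is exactly one of the comma-separated placeholder values.
is_placeholder() {
  case ",$VAULT_PLACEHOLDER_VALUES," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# should_apply_user USERNAME_VALUE PASSWORD_VALUE SECRET_AGE_SECONDS
# Prints "apply" or "skip:<reason>"; returns 0 only when the user may be applied.
should_apply_user() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "skip:missing-or-empty"; return 1
  fi
  if [ "$3" -lt "$VAULT_MIN_SECRET_AGE_SECONDS" ]; then
    echo "skip:too-young"; return 1
  fi
  if [ "$VAULT_SKIP_PLACEHOLDER_VALUES" = "true" ] &&
     { is_placeholder "$1" || is_placeholder "$2"; }; then
    echo "skip:placeholder"; return 1
  fi
  echo "apply"
}

# Example: a password that is still a placeholder is not applied.
should_apply_user "ollama-async" "replace-me" 300 || true
```

Under this model, any skip result would also mark the run as having a transitional user, which is the condition under which stale-user deletion is skipped.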
Optional per-user override keys (all field value) are supported:
- secret/data/kafka-scram-user-<username>-scram-iterations (default 4096)
- secret/data/kafka-scram-user-<username>-resource-pattern-type (literal or prefixed)
- secret/data/kafka-scram-user-<username>-topic-all (CSV topics, grants All)
- secret/data/kafka-scram-user-<username>-topic-read (CSV topics, grants Read)
- secret/data/kafka-scram-user-<username>-topic-write (CSV topics, grants Write)
- secret/data/kafka-scram-user-<username>-topic-describe (CSV topics, grants Describe)
- secret/data/kafka-scram-user-<username>-group-read (CSV groups, grants Read)
- secret/data/kafka-scram-user-<username>-group-describe (CSV groups, grants Describe)
- secret/data/kafka-scram-user-<username>-cluster-describe (true/false)
If override keys are not set, default ACLs are applied for async GraphQL flow:
- topics graphql.async.requests.v1, graphql.async.responses.v1, graphql.async.responses.dlq.v1 (All)
- group graphql-async-workers (Read, Describe)
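To make the override keys more concrete, the sketch below shows how a CSV value and a resource pattern type might translate into kafka-acls.sh invocations. It only prints the commands. The function name, the example topic and group names, and the idea that the reconciler issues one command per CSV entry are assumptions; the bootstrap address is the internal listener used elsewhere in this document.

```shell
# Internal listener address taken from the verification commands in this document.
ACL_BOOTSTRAP="${ACL_BOOTSTRAP:-kafka-0.kafka-hs.kafka.svc.internal.lan:9094}"

# print_acl_commands PRINCIPAL RESOURCE_FLAG OPERATION PATTERN_TYPE CSV
# RESOURCE_FLAG is --topic or --group; PATTERN_TYPE is literal or prefixed.
# Dry run: prints one kafka-acls.sh command per non-empty CSV entry.
print_acl_commands() {
  principal=$1 resource_flag=$2 operation=$3 pattern_type=$4
  printf '%s\n' "$5" | tr ',' '\n' | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "/opt/kafka/bin/kafka-acls.sh --bootstrap-server $ACL_BOOTSTRAP --add --allow-principal User:$principal --operation $operation $resource_flag $name --resource-pattern-type $pattern_type"
  done
}

# Example: Read on two hypothetical topic prefixes and one hypothetical group.
print_acl_commands ollama-async --topic Read prefixed 'orders.,payments.'
print_acl_commands ollama-async --group Read literal 'ollama-async-workers'
```

Entries are used as written, so a CSV value with spaces around the commas would need trimming before use.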
Example:
vault kv put secret/kafka-scram-user-ollama-async value='ollama-async'
vault kv put secret/kafka-scram-user-ollama-async-password value='replace-me'

kubectl -n kafka logs job/kafka-security-bootstrap --tail=200
kubectl -n kafka get jobs -l cronjob-name=kafka-security-reconciler --sort-by=.metadata.creationTimestamp
kubectl -n kafka logs job/$(kubectl -n kafka get jobs -l cronjob-name=kafka-security-reconciler -o jsonpath='{.items[-1:].metadata.name}') --tail=200
kubectl -n kafka exec kafka-0 -- /opt/kafka/bin/kafka-configs.sh --bootstrap-server kafka-0.kafka-hs.kafka.svc.internal.lan:9094 --describe --entity-type users
kubectl -n kafka get pods -l app=nifi
kubectl -n kafka get pods -l app=flink
kubectl -n kafka get svc oauth2-proxy nifi flink-oauth2-proxy
kubectl -n kafka get jobs | grep batch-example
kubectl -n kafka logs job/batch-example-flink-submit --tail=200