diff --git a/CHANGELOG.md b/CHANGELOG.md index ae7094f1..3922fe4c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -84,6 +84,18 @@ for that specific tag for the per-commit details. malformed inputs, determinism, concurrency-safe construction, and singleton invariants. Detector migrations to consume `ctx.resolved()` and the resolver-bootstrap-into-Analyzer hook follow in sub-project 1 Phase 5. +- **AKS read-only deploy hardening** (sub-project 2): runbook at + [`shared/runbooks/aks-read-only-deploy.md`](shared/runbooks/aks-read-only-deploy.md), + JVM-flag-preset launcher at [`scripts/aks-launch.sh`](scripts/aks-launch.sh), + and a sentinel test asserting the script contains every required flag. + Enables `codeiq serve` inside an AKS pod with + `securityContext.readOnlyRootFilesystem=true` and a writable `/tmp` + emptyDir: an init-container copies the graph bundle from Nexus into + `/tmp/codeiq-data`; the main container runs `aks-launch.sh /tmp/codeiq-data`. + Zero source-code changes to the serve profile or Neo4j wiring — solved at + the deployment layer plus Spring-Boot-loader / `java.io.tmpdir` / + `-XX:ErrorFile` / `-XX:HeapDumpPath` overrides. Spec at + [`docs/specs/2026-04-28-aks-read-only-deploy-design.md`](docs/specs/2026-04-28-aks-read-only-deploy-design.md). ### Changed diff --git a/docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md b/docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md new file mode 100644 index 00000000..2034d303 --- /dev/null +++ b/docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md @@ -0,0 +1,156 @@ +# Sub-project 2 implementation plan — AKS read-only deploy hardening + +> **Spec:** [`docs/specs/2026-04-28-aks-read-only-deploy-design.md`](../specs/2026-04-28-aks-read-only-deploy-design.md) +> +> **Goal:** ship a runbook + JVM-flag-preset launch script + a sentinel test, so `codeiq serve` runs cleanly inside an AKS pod with read-only root filesystem and writable `/tmp`. 
No source-code changes to the serve profile or Neo4j wiring. +> +> **Scope:** small. Seven files changed, single PR off `main`. Independent of sub-project 1. + +## File map + +| Action | Path | Purpose | +|---|---|---| +| **CREATE** | `docs/specs/2026-04-28-aks-read-only-deploy-design.md` | Architecture spec (✅ done with this plan). | +| **CREATE** | `docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md` | This file. | +| **CREATE** | `shared/runbooks/aks-read-only-deploy.md` | Canonical deploy runbook. | +| **CREATE** | `scripts/aks-launch.sh` | JVM-flag-preset launch wrapper. | +| **CREATE** | `src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java` | Asserts the launch script contains the required flags. Catches drift. | +| **MODIFY** | `CHANGELOG.md` | New `[Unreleased] / Added` bullet. | +| **MODIFY** | `shared/runbooks/engineering-standards.md` | §7.1 cross-link to the new runbook. | + +## Tasks + +### Task 1 — Runbook + +**File:** `shared/runbooks/aks-read-only-deploy.md`. + +**Sections:** Overview · Deploy shape · Init-container pattern (Kubernetes manifest snippet) · JVM flag preset · Local docker smoke · Rollback · Cross-references. + +**Hard requirement:** every command in the runbook must be runnable as-is. No placeholder URLs. Where a Nexus URL is needed, parameterize via `$NEXUS_URL` env, document it once. + +### Task 2 — Launch script + +**File:** `scripts/aks-launch.sh`. + +**Skeleton:** + +```bash +#!/usr/bin/env bash +# AKS read-only deploy launcher for codeiq serve. +# Usage: aks-launch.sh /tmp/codeiq-data +set -euo pipefail + +if [[ $# -ne 1 ]]; then + echo "usage: $(basename "$0") " >&2 + exit 64 +fi +DATA_DIR="$1" + +# Resolve the codeiq JAR location. Container image installs it at /app. +JAR="${CODEIQ_JAR:-/app/code-iq.jar}" + +# Pre-flight: ensure /tmp has enough headroom (1 GB minimum). 
+TMP_FREE_KB="$(df -Pk /tmp | awk 'NR==2 {print $4}')" +if [[ "$TMP_FREE_KB" -lt 1048576 ]]; then + echo "fatal: /tmp has < 1 GB free ($TMP_FREE_KB KB)" >&2 + exit 70 +fi + +# JVM flag preset: every entry has a non-default behavior that without it +# would write outside /tmp. Order is intentional — system properties first, +# then -XX flags, so any -XX value referencing a system property resolves. +JAVA_OPTS=( + -Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader + -Djava.io.tmpdir=/tmp + -XX:ErrorFile=/tmp/hs_err_pid%p.log + -XX:HeapDumpPath=/tmp + -XX:+HeapDumpOnOutOfMemoryError +) + +mkdir -p /tmp/spring-boot-loader + +exec java "${JAVA_OPTS[@]}" -jar "$JAR" serve "$DATA_DIR" +``` + +**Permissions:** `chmod +x scripts/aks-launch.sh` after create. Must be executable (the sentinel test asserts this). + +### Task 3 — Sentinel test + +**File:** `src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java`. + +**Assertions** (one per required flag, plus structural checks): + +```java +@Test void scriptIsExecutable() { ... } +@Test void scriptUsesStrictBashMode() { ... } // set -euo pipefail +@Test void scriptValidatesArgCount() { ... } +@Test void scriptSetsSpringBootLoaderTmpDir() { ... } +@Test void scriptSetsJavaIoTmpdir() { ... } +@Test void scriptSetsJvmErrorFile() { ... } +@Test void scriptSetsHeapDumpPath() { ... } +@Test void scriptEnablesHeapDumpOnOom() { ... } +@Test void scriptExecsJava() { ... } // exec java to PID 1 +``` + +The test reads the script as a `String` and grep-matches each required substring. Cheap, deterministic, drift-proof. + +### Task 4 — CHANGELOG entry + +**File:** `CHANGELOG.md`. + +**Add to `[Unreleased] / ### Added`:** + +```markdown +- AKS read-only deploy hardening (sub-project 2): runbook at + `shared/runbooks/aks-read-only-deploy.md`, JVM-flag-preset launcher at + `scripts/aks-launch.sh`, and a sentinel test asserting the script + contains every required flag. 
Enables `codeiq serve` inside an AKS pod + with read-only root filesystem + writable `/tmp` (init-container + copies bundle from Nexus → `/tmp/codeiq-data`; main container runs + `aks-launch.sh /tmp/codeiq-data`). Zero source-code changes to the + serve profile or Neo4j wiring — solved at the deployment layer plus + Spring-Boot-loader / JVM crash-file path overrides. Spec at + `docs/specs/2026-04-28-aks-read-only-deploy-design.md`. +``` + +### Task 5 — engineering-standards cross-link + +**File:** `shared/runbooks/engineering-standards.md` §7.1. + +Add a one-line bullet right under the existing "deploy surface" sentence: + +```markdown +- AKS read-only deploy is supported via `shared/runbooks/aks-read-only-deploy.md` + and `scripts/aks-launch.sh` (sub-project 2). The Maven Central artifact + the + launch script + an init-container that copies the graph bundle from Nexus + into `/tmp/codeiq-data` is the full surface — no separate hosted backend. +``` + +### Task 6 — Test loop + commit + +```bash +mvn test -Dtest=AksLaunchScriptSentinelTest +mvn test # full suite — confirm nothing else regressed +git add docs/specs/ docs/plans/ shared/runbooks/ scripts/aks-launch.sh \ + src/test/java/io/github/randomcodespace/iq/deploy/ CHANGELOG.md +git commit -m "feat(deploy): AKS read-only deploy hardening (sub-project 2)" +git push -u origin feat/sub-project-2-aks-read-only-deploy +gh pr create --base main \ + --title "feat: AKS read-only deploy hardening (sub-project 2)" \ + --body "..." +``` + +## Acceptance gates + +- [ ] All seven files in the file map exist and are non-empty. +- [ ] Sentinel test green. +- [ ] Full `mvn test` green. +- [ ] Runbook commands are copy-pasteable; no placeholder URLs that the operator can't substitute. +- [ ] PR open against `main`. + +## Out of scope (deliberate) + +- A heavyweight JVM-level filesystem-write detector (Java has no clean `chroot` / `unshare` API; environment-fragile in CI). 
The runbook docker smoke is the SSoT for "did this actually work in a RO root." +- A `/api/diagnostics` endpoint surfacing JVM flag preset values. Tracked separately if ops need it. +- Switching the storage layer to a static snapshot (Approach D in the spec). Reserved as the fallback if init-container copy proves operationally insufficient. +- Helm chart / OCI artifact packaging. The runbook ships a vanilla Kubernetes manifest snippet; productionizing into Helm is the deployer's call. diff --git a/docs/specs/2026-04-28-aks-read-only-deploy-design.md b/docs/specs/2026-04-28-aks-read-only-deploy-design.md new file mode 100644 index 00000000..3cc4c4a7 --- /dev/null +++ b/docs/specs/2026-04-28-aks-read-only-deploy-design.md @@ -0,0 +1,178 @@ +# Sub-project 2 — AKS read-only deploy hardening + +> **Status:** Design ready for implementation. Owner: AI agent + Amit Kumar. Created 2026-04-28. + +## 1. Problem + +`codeiq serve` is meant to run inside an AKS pod with a **read-only root filesystem** (a hardening default for production Kubernetes). Only `/tmp` is writable. The graph bundle is built in CI via `index → enrich → bundle`, uploaded to a private Nexus registry, then pulled and mounted into the pod read-only at deploy time. + +Today's serve mode opens an embedded Neo4j data directory and a fat JAR. Both want to write under the directory they were pointed at: + +- **Neo4j Embedded** acquires a `store_lock` file in the DB directory at open, and writes transaction logs / counts / schema cache files even in nominally read-only modes. +- **Spring Boot fat JAR loader** extracts nested JARs to `~/.m2/spring-boot-loader-tmp/` (or wherever `org.springframework.boot.loader.tmpDir` points) at startup. +- **JVM** writes `hs_err_pid*.log` and heap dumps to the working directory by default on crash. + +Without a deploy-shape change, `serve` fails to boot under `--read-only` because every one of the above tries to write outside `/tmp`. + +## 2. 
Goal + +`codeiq serve` runs cleanly inside an AKS pod that has: + +- root filesystem mounted **read-only**, +- `/tmp` mounted as a writable `emptyDir` or `tmpfs`, +- the graph bundle pulled from Nexus and made available at a known mount path, + +with **zero source-code changes to the serve profile or the Neo4j wiring**. Everything is solved at the deployment layer plus a JVM-flag preset. + +## 3. Non-goals + +- **Not** rewriting the storage layer to a static read-only snapshot (e.g. JSON / Parquet at serve time replacing Neo4j). That's a separate, much larger sub-project. We address it only if the init-container copy approach proves operationally insufficient. +- **Not** adding mutation endpoints or any write surface to serve mode. The serving layer remains strictly read-only per `CLAUDE.md` §"Read-Only Serving Layer". +- **Not** changing the build-CI side of the bundle pipeline (`index`, `enrich`, `bundle`) — that runs on a writable build agent. +- **Not** introducing a hosted backend or static-CDN frontend. The Maven Central + GitHub Releases distribution model from `engineering-standards.md` §7.1 is unchanged. AKS deploy is one of several runtime targets a downstream consumer might pick; the artifacts are the same JAR. + +## 4. 
Approach: init-container copy + JVM flag preset + +``` + ┌──────────────────────────────┐ + │ Build CI │ + │ index → enrich → bundle.zip │ + │ upload to Nexus │ + └───────────────┬──────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ AKS pod (root FS = read-only, /tmp writable) │ +│ │ +│ ┌─────────────── init-container ───────────────┐ │ +│ │ curl --fail "$NEXUS_URL/$BUNDLE" -o /tmp/bundle.zip │ │ +│ │ unzip /tmp/bundle.zip -d /tmp/codeiq-data/ │ │ +│ └────────────────────┬─────────────────────────┘ │ +│ │ (volume share: /tmp via emptyDir) │ +│ ▼ │ +│ ┌─────────────── main container ───────────────┐ │ +│ │ scripts/aks-launch.sh /tmp/codeiq-data │ │ +│ │ → java [JVM flag preset] -jar code-iq.jar │ │ +│ │ serve /tmp/codeiq-data │ │ +│ └──────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +The init-container is doing one thing: making the immutable bundle physically present under `/tmp/codeiq-data` so Neo4j can open it in normal (read+write-to-its-own-dir) mode. The main container then runs `serve` with the JVM flags below. + +### Why this over the alternatives + +| Approach | Verdict | Reasoning | +|---|---|---| +| **A. Init-container copy + JVM flags** *(chosen)* | ✅ | Minimal blast radius — zero source-code changes to serve / Neo4j wiring. Neo4j gets a writable directory under `/tmp`, the rest is JVM flags. Easy to test (`docker run --read-only --tmpfs /tmp ...`). | +| B. Neo4j RO mode + writable temp dir redirects | ❌ | Embedded Neo4j 2026.04.0 still acquires `store_lock` at open. `dbms.directories.transaction.logs.root` redirect needs careful per-version validation. Neo4j's RO mode is more brittle than copying the dir. | +| C. Bake bundle into container image | ❌ | Couples release cadence to image build; large image; container's writable upper layer is ALSO read-only when mounted `--read-only`, so Neo4j still fails. | +| D. 
Replace Neo4j with static snapshot | ❌ | Throws away the entire read API surface (Cypher, indexes, full-text search). Massive scope. Reserved as the "if A doesn't hold" fallback. | + +## 5. JVM flag preset + +These flags compose at launch via `scripts/aks-launch.sh`. Every entry except the optional diagnostics flag at the end redirects a default that would otherwise write outside `/tmp`. + +```bash +JAVA_OPTS=( + # Spring Boot fat JAR extracts nested JARs at startup. Default is + # ~/.m2/spring-boot-loader-tmp/ which sits under HOME, outside /tmp. + "-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader" + + # Java standard temp dir. Spring Boot's multipart upload temp area, + # any Files.createTempFile call, JNA / Netty native lib extraction. + "-Djava.io.tmpdir=/tmp" + + # JVM crash dump file (default: cwd/hs_err_pid.log). + "-XX:ErrorFile=/tmp/hs_err_pid%p.log" + + # JVM heap dump on OOM (default: cwd). + "-XX:HeapDumpPath=/tmp" + "-XX:+HeapDumpOnOutOfMemoryError" + + # Optional diagnostics: native memory tracking, read on demand with + # `jcmd PID VM.native_memory`. It writes no files, so it is not needed + # for the read-only root and is not part of scripts/aks-launch.sh. + "-XX:NativeMemoryTracking=summary" +) +``` + +The preset is **wrapper-script-encoded, not pom.xml**: pom.xml controls the build, not the runtime JVM. The script is the contract surface for AKS deploy. + +## 6. Audit findings + +| Surface | Default location | Conflict with RO root | Fix | +|---|---|---|---| +| Neo4j `store_lock` + tx logs + counts cache | `/.codeiq/graph/graph.db/` | 🚩 yes | Init-container copies bundle to `/tmp/codeiq-data`. No code change. 
| +| Spring Boot fat JAR extraction | `~/.m2/spring-boot-loader-tmp/` | 🚩 yes | `-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader` | +| Java standard temp | `java.io.tmpdir` (default `/tmp` on Linux but worth being explicit) | 🟡 environment-dependent | `-Djava.io.tmpdir=/tmp` | +| JVM crash files (`hs_err_pid*.log`) | cwd | 🚩 yes | `-XX:ErrorFile=/tmp/hs_err_pid%p.log` | +| JVM heap dumps on OOM | cwd | 🚩 yes | `-XX:HeapDumpPath=/tmp` | +| Logback file appenders | none — `logback-spring.xml` is console-only | ✅ no | No change. Verified at `src/main/resources/logback-spring.xml`. | +| H2 analysis cache | `/.codeiq/cache/` | ✅ no — index-time only | No change. | +| React SPA static assets | classpath: `static/` | ✅ no | No change. | +| Picocli + Spring AI MCP | in-memory + classpath | ✅ no | No change. | +| Symbol resolver SPI (sub-project 1) | in-memory; index-time only | ✅ no | No change. | + +## 7. Test approach + +**Layer 1 — JVM-flag preset sentinel** (unit, fast, CI-gated) + +A unit test reads `scripts/aks-launch.sh` and asserts each required `-D` / `-XX:` flag is present and points at a `/tmp` path. Catches drift if someone trims the script. Cheap to keep green. + +**Layer 2 — Local docker smoke** (manual, runbook-described, not CI-gated) + +The runbook documents: + +```bash +docker run --rm --read-only --tmpfs /tmp:rw,size=2g \ + -v /path/to/bundle:/mnt/bundle:ro \ + codeiq:latest \ + /usr/local/bin/aks-launch.sh /tmp/codeiq-data +``` + +The smoke is the *only* honest test of the RO-root assumption — JVM-level filesystem-write detection inside JUnit is environment-fragile (CI runners have different access patterns, no clean `chroot` API in Java). The runbook smoke is the SSoT for "did this actually work?". + +**Layer 3 — Integration smoke** (existing `IntegrationSmokeTest`) + +The existing `IntegrationSmokeTest` boots `serve` with a real Neo4j data dir from `INTEGRATION_TEST_DIR`. 
Once the runbook lands, follow-up: extend that test to assert no files appear in `Path.of(".").toAbsolutePath()` after startup. Tracked as a follow-up; not blocking for this sub-project. + +## 8. Backward compatibility + +- Existing `codeiq serve ` continues to work on a writable filesystem. The launch script is **optional** — a developer-machine launch keeps using `java -jar code-iq-*-cli.jar serve ` with no flags. +- No new dependencies. No code changes outside the test surface and the script. +- Not breaking the Maven Central + GitHub Releases distribution channel; consumers who pull the JAR and run it from a local CLI are unaffected. + +## 9. Risks + +| Risk | Mitigation | +|---|---| +| `/tmp` size cap on AKS too small for the graph bundle | Document `emptyDir.sizeLimit: 4Gi` (or larger per repo size) in the runbook init-container manifest. Pre-flight check in the script — fail fast if `/tmp` has < N MB free. | +| Bundle download from Nexus fails — pod stuck in init | Init-container uses `curl --fail` so a 4xx/5xx aborts. Add a max-retry with backoff in the runbook init-container example. | +| Init-container copy slow on first deploy (large DB) | Document the trade-off; for very large repos, consider Approach D (static snapshot) as a follow-up — out of scope here. | +| Future Spring Boot release changes the loader temp-dir flag name | Sentinel test catches the flag presence; runbook lists the flag as Spring Boot 4.x — re-validate on Spring Boot 5.x upgrade. | +| Neo4j version change introduces a new write target outside the data dir | Caught by the runbook docker smoke before merge of any Neo4j upgrade PR. Make the smoke part of the upgrade checklist in `release.md`. | + +## 10. Determinism + observability + +- Determinism is unaffected — this is a deploy-layer change. The graph itself is byte-identical for the same input regardless of where it's served from. 
+- Add a `/api/diagnostics` (out of scope; tracked) that surfaces the JVM flag preset values for ops verification. Until then, ops can read the launch script directly inside the running container. + +## 11. Acceptance criteria + +1. **Spec** lands at `docs/specs/2026-04-28-aks-read-only-deploy-design.md`. +2. **Plan** lands at `docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md`. +3. **Runbook** at `shared/runbooks/aks-read-only-deploy.md` covers: deploy shape, init-container manifest snippet, JVM flag preset, docker smoke, rollback. +4. **Launch script** at `scripts/aks-launch.sh` composes the JVM flag preset and execs `java -jar`. Has `set -euo pipefail` and validates its single argument. +5. **Sentinel test** at `src/test/java/.../deploy/AksLaunchScriptSentinelTest.java` asserts the script contains every required flag. +6. **CHANGELOG.md** `[Unreleased] / Added` entry. +7. **engineering-standards.md §7.1** cross-link to the new runbook. +8. **`mvn test`** green. +9. **PR** opened against `main`. Independent of sub-project 1 — separate base, separate review. + +## 12. References + +- `~/.claude/CLAUDE.md` — "Deployment assumption: solutions may run behind a corporate firewall / air-gapped" +- `~/.claude/rules/build.md` — "Self-contained build", "No runtime network calls to the public internet" +- `CLAUDE.md` (project) — "Read-Only Serving Layer", "Pipeline is index → enrich → serve" +- `shared/runbooks/engineering-standards.md` §7.1 — "Deploy targets" +- Spring Boot reference, "Loader" — `org.springframework.boot.loader.tmpDir` system property +- Neo4j 2026.04.0 — embedded API; `store_lock` behavior diff --git a/scripts/aks-launch.sh b/scripts/aks-launch.sh new file mode 100755 index 00000000..a561a846 --- /dev/null +++ b/scripts/aks-launch.sh @@ -0,0 +1,54 @@ +#!/usr/bin/env bash +# AKS read-only deploy launcher for `codeiq serve`. 
+# +# Encodes the JVM flag preset that lets `serve` boot under +# securityContext.readOnlyRootFilesystem=true with /tmp mounted writable. +# Spec: docs/specs/2026-04-28-aks-read-only-deploy-design.md. +# Runbook: shared/runbooks/aks-read-only-deploy.md. +# +# Usage: aks-launch.sh /tmp/codeiq-data +set -euo pipefail + +if [[ $# -ne 1 ]]; then + echo "usage: $(basename "$0") " >&2 + exit 64 +fi +DATA_DIR="$1" + +if [[ ! -d "$DATA_DIR" ]]; then + echo "fatal: data dir does not exist: $DATA_DIR" >&2 + exit 66 +fi + +# Resolve the codeiq JAR. Container image installs it at /app/code-iq.jar +# by default; override via $CODEIQ_JAR for local testing. +JAR="${CODEIQ_JAR:-/app/code-iq.jar}" +if [[ ! -f "$JAR" ]]; then + echo "fatal: codeiq JAR not found at $JAR (override with \$CODEIQ_JAR)" >&2 + exit 66 +fi + +# Pre-flight: ensure /tmp has enough headroom. 1 GB is the absolute floor — +# Neo4j tx logs + Spring Boot loader extraction + JVM heap dump on OOM +# headroom. Real deploys want 2–4 GB depending on graph size. +TMP_FREE_KB="$(df -Pk /tmp | awk 'NR==2 {print $4}')" +if [[ "${TMP_FREE_KB:-0}" -lt 1048576 ]]; then + echo "fatal: /tmp has < 1 GB free (${TMP_FREE_KB:-?} KB)" >&2 + exit 70 +fi + +mkdir -p /tmp/spring-boot-loader + +# JVM flag preset. Every entry has a non-default behavior that without it +# would write outside /tmp. Order: -D system properties first, then -XX. +# Don't reorder — keep it greppable for the sentinel test. +JAVA_OPTS=( + -Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader + -Djava.io.tmpdir=/tmp + -XX:ErrorFile=/tmp/hs_err_pid%p.log + -XX:HeapDumpPath=/tmp + -XX:+HeapDumpOnOutOfMemoryError +) + +# Exec to PID 1 so signals (SIGTERM on pod stop) reach the JVM directly. 
+exec java "${JAVA_OPTS[@]}" -jar "$JAR" serve "$DATA_DIR" diff --git a/shared/runbooks/aks-read-only-deploy.md b/shared/runbooks/aks-read-only-deploy.md new file mode 100644 index 00000000..b113e734 --- /dev/null +++ b/shared/runbooks/aks-read-only-deploy.md @@ -0,0 +1,225 @@ +# Runbook: AKS read-only deploy + +> **Audience:** ops engineers deploying `codeiq serve` to an AKS cluster (or any Kubernetes cluster with `securityContext.readOnlyRootFilesystem: true`). +> +> **Spec:** [`docs/specs/2026-04-28-aks-read-only-deploy-design.md`](../../docs/specs/2026-04-28-aks-read-only-deploy-design.md). Full architecture rationale lives there; this runbook is the operational checklist. + +## 1. Overview + +`codeiq serve` runs inside an AKS pod with the root filesystem mounted read-only and `/tmp` mounted writable. The graph bundle is built in CI (`index → enrich → bundle`), uploaded to Nexus, then pulled at deploy time by an init-container into `/tmp/codeiq-data`. The main container runs the launch wrapper at `scripts/aks-launch.sh` which composes the JVM flag preset and execs `java -jar code-iq.jar serve /tmp/codeiq-data`. + +Three deployment-layer pieces enable this with **zero source-code changes** to the serve profile: + +1. The graph bundle physically lives under `/tmp/codeiq-data` so embedded Neo4j has a writable directory for its `store_lock`, transaction logs, and counts cache. +2. JVM flags redirect Spring-Boot-loader extraction, crash dumps, and heap dumps to `/tmp`. +3. The launch wrapper enforces the flag preset in one place. + +## 2. Deploy shape + +``` +Build CI (any writable agent — GitHub Actions, GitLab, etc.) 
+ └─ codeiq index $REPO + └─ codeiq enrich $REPO ──▶ $REPO/.codeiq/graph/graph.db/ + └─ codeiq bundle $REPO ──▶ bundle.zip (graph + manifest) + └─ curl -u $NEXUS_USER:$NEXUS_PASS \ + --upload-file bundle.zip \ + "$NEXUS_URL/repository/codeiq-bundles/$BUNDLE_VERSION/bundle.zip" + +AKS deploy (one Pod per service) + init-container "fetch-bundle" download from Nexus → /tmp/codeiq-data/ + main container "codeiq-serve" /usr/local/bin/aks-launch.sh /tmp/codeiq-data + listens on :8080 (configurable) +``` + +## 3. Init-container Kubernetes manifest + +Drop the snippet below into your Pod spec. The init-container shares an `emptyDir` mount with the main container so the unzipped bundle is visible at the same path in both. + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: codeiq + namespace: codeiq +spec: + replicas: 1 + selector: { matchLabels: { app: codeiq } } + template: + metadata: { labels: { app: codeiq } } + spec: + securityContext: + runAsNonRoot: true + runAsUser: 65532 + fsGroup: 65532 + volumes: + - name: tmp + emptyDir: + medium: Memory # tmpfs — fastest; switch to "" for disk-backed + sizeLimit: 4Gi # tune to bundle size + Neo4j tx-log headroom + initContainers: + - name: fetch-bundle + image: alpine:3.20 + command: [sh, -c] + args: + - | + set -euo pipefail + apk add --no-cache curl unzip > /dev/null + curl --fail --silent --show-error \ + -u "$NEXUS_USER:$NEXUS_PASS" \ + "$NEXUS_URL/repository/codeiq-bundles/$BUNDLE_VERSION/bundle.zip" \ + -o /tmp/bundle.zip + mkdir -p /tmp/codeiq-data + unzip -q /tmp/bundle.zip -d /tmp/codeiq-data + rm -f /tmp/bundle.zip + env: + - name: NEXUS_URL + valueFrom: { secretKeyRef: { name: codeiq-nexus, key: url } } + - name: NEXUS_USER + valueFrom: { secretKeyRef: { name: codeiq-nexus, key: user } } + - name: NEXUS_PASS + valueFrom: { secretKeyRef: { name: codeiq-nexus, key: pass } } + - name: BUNDLE_VERSION + value: "0.1.0" # bumped per release + volumeMounts: + - name: tmp + mountPath: /tmp + securityContext: 
+ readOnlyRootFilesystem: true + allowPrivilegeEscalation: false + capabilities: { drop: [ALL] } + containers: + - name: codeiq-serve + image: ghcr.io/randomcodespace/codeiq:0.1.0 + command: [/usr/local/bin/aks-launch.sh, /tmp/codeiq-data] + ports: + - { name: http, containerPort: 8080 } + readinessProbe: + httpGet: { path: /actuator/health/readiness, port: http } + initialDelaySeconds: 20 + periodSeconds: 5 + livenessProbe: + httpGet: { path: /actuator/health/liveness, port: http } + initialDelaySeconds: 60 + periodSeconds: 10 + resources: + requests: { cpu: 500m, memory: 1Gi } + limits: { cpu: 2, memory: 4Gi } + volumeMounts: + - name: tmp + mountPath: /tmp + securityContext: + readOnlyRootFilesystem: true # enforces the model + allowPrivilegeEscalation: false + runAsNonRoot: true + capabilities: { drop: [ALL] } + seccompProfile: { type: RuntimeDefault } +``` + +**Volume sizing:** the `emptyDir.sizeLimit: 4Gi` covers a typical mid-size repo's graph + Neo4j transaction-log headroom + Spring Boot loader extraction (~50 MB) + JVM heap-dump headroom. Bump for very large bundles. The pre-flight check in `aks-launch.sh` aborts startup if `/tmp` has < 1 GB free, which is the absolute floor. + +**Image:** the container image must install the launch script at `/usr/local/bin/aks-launch.sh` and the JAR at `/app/code-iq.jar` (or set `CODEIQ_JAR=...`). Reference Dockerfile: + +```dockerfile +FROM eclipse-temurin:25-jre-alpine +RUN apk add --no-cache bash +WORKDIR /app +COPY code-iq-*-cli.jar /app/code-iq.jar +COPY scripts/aks-launch.sh /usr/local/bin/aks-launch.sh +RUN chmod +x /usr/local/bin/aks-launch.sh +USER 65532:65532 +ENTRYPOINT ["/usr/local/bin/aks-launch.sh"] +``` + +## 4. JVM flag preset (canonical reference) + +Encoded in `scripts/aks-launch.sh`. Updating the preset means updating the script (and the sentinel test catches the drift). Every flag has a non-default behavior that without it would write outside `/tmp`. 
+ +| Flag | Default | Why required | +|---|---|---| +| `-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader` | `~/.m2/spring-boot-loader-tmp` | Spring Boot fat JAR extracts nested JARs to `$HOME` by default — outside `/tmp`. | +| `-Djava.io.tmpdir=/tmp` | OS-default (`/tmp` on Linux) | Explicit so multipart uploads, JNA / Netty native lib extraction land where we expect across base images. | +| `-XX:ErrorFile=/tmp/hs_err_pid%p.log` | cwd | JVM crash dump default is the working directory. | +| `-XX:HeapDumpPath=/tmp` | cwd | Heap dump on OOM default is cwd. | +| `-XX:+HeapDumpOnOutOfMemoryError` | off | Without this the path flag never fires. | + +## 5. Verification + +### 5.1 Local docker smoke (the gate) + +This is the **single source of truth** for "did the deploy assumption actually hold." JVM-level write detection inside JUnit is environment-fragile; running the actual binary inside the actual constraint shape is the only honest test. + +```bash +# Build the image once. +docker build -t codeiq:smoke . + +# Run with --read-only and a tmpfs /tmp, mount a known-good bundle as RO. +docker run --rm \ + --read-only \ + --tmpfs /tmp:rw,size=2g,mode=1777 \ + -v "$PWD/test-bundle:/mnt/bundle:ro" \ + -p 8080:8080 \ + --entrypoint sh \ + codeiq:smoke \ + -c ' + cp -r /mnt/bundle/. /tmp/codeiq-data && + /usr/local/bin/aks-launch.sh /tmp/codeiq-data + ' + +# In another terminal: +curl -fsS http://localhost:8080/api/stats > /tmp/stats.json +jq '.graph.nodes' /tmp/stats.json # > 0 confirms the graph loaded +``` + +If the container exits non-zero with `Read-only file system` or `Permission denied`, **do not paper over with `--read-only=false`**. Investigate which path the new code is trying to write to, and either fix the code or extend the JVM flag preset. + +### 5.2 Sentinel test (drift catcher) + +```bash +mvn test -Dtest=AksLaunchScriptSentinelTest +``` + +Asserts every required flag is in `scripts/aks-launch.sh`. CI-gated. Catches accidental flag removal. 
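For a quick local check before pushing, the same substring match can be approximated in shell. This is a minimal sketch, not the project's test: the function name `check_launch_flags` and the `REQUIRED_FLAGS` array are illustrative assumptions, and the flag list is copied from the table in §4.

```bash
#!/usr/bin/env bash
# Sketch only: a shell approximation of the sentinel check.
# check_launch_flags and REQUIRED_FLAGS are hypothetical names, not part of the repo.

# Substrings the launch script is expected to contain (from the §4 flag table).
REQUIRED_FLAGS=(
  '-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader'
  '-Djava.io.tmpdir=/tmp'
  '-XX:ErrorFile=/tmp/hs_err_pid%p.log'
  '-XX:HeapDumpPath=/tmp'
  '-XX:+HeapDumpOnOutOfMemoryError'
)

# check_launch_flags SCRIPT_PATH
# Prints each missing flag to stderr; returns 0 only if every flag is present.
check_launch_flags() {
  local script_path="$1" missing=0 flag
  for flag in "${REQUIRED_FLAGS[@]}"; do
    # -F: fixed-string match, -q: quiet, --: end of options (flags start with '-').
    if ! grep -Fq -- "$flag" "$script_path"; then
      echo "missing flag: $flag" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Sourcing the snippet and running `check_launch_flags scripts/aks-launch.sh` from the repo root should return 0 when every flag is present. The Maven test remains the CI gate; this only shortens the local feedback loop.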
+ +### 5.3 In-cluster smoke (post-deploy) + +```bash +kubectl -n codeiq port-forward deploy/codeiq 8080:8080 & +curl -fsS http://localhost:8080/actuator/health +curl -fsS http://localhost:8080/api/stats | jq '.graph.nodes' +``` + +## 6. Rollback + +The deploy artifact is the immutable bundle in Nexus + the immutable container image. Rollback is "redeploy the previous bundle version." + +```bash +# Bundle rollback — re-tag the previous bundle version, redeploy. +# BUNDLE_VERSION is set on the fetch-bundle init-container (see §3), so target it. +kubectl -n codeiq set env deploy/codeiq \ + --containers='fetch-bundle' BUNDLE_VERSION=0.0.49 + +# Image rollback (CVE patch / launcher fix). +kubectl -n codeiq set image deploy/codeiq \ + codeiq-serve=ghcr.io/randomcodespace/codeiq:0.0.49 +``` + +For full release / rollback policy see [`shared/runbooks/release.md`](release.md) and [`shared/runbooks/rollback.md`](rollback.md). This runbook covers the AKS-specific bits only. + +## 7. Troubleshooting + +| Symptom | Likely cause | Fix | +|---|---|---| +| `Read-only file system` at startup | A new code path is writing outside `/tmp`. | Run the docker smoke (§5.1) — the stack trace points at the path. Either redirect via the JVM flag preset (extend §4) or fix the code. | +| `lock acquired by another process` from Neo4j | Two pods sharing the same `/tmp` volume — only legal in single-replica mode. | Set `replicas: 1`, or split each replica's `emptyDir` (default — they're per-pod). | +| `out of disk space` during init-container | `emptyDir.sizeLimit` too small for the bundle. | Bump `sizeLimit` in the manifest. | +| `BUNDLE_VERSION` not found at Nexus | Stale tag, or release never landed. | Verify the upload step in build CI; check Nexus repository UI. | +| Pod restart loop after a clean start | Likely a heap dump filling `/tmp` — the `emptyDir.sizeLimit` cap reached (the `--tmpfs` size in the docker smoke). | Bump `sizeLimit`; investigate the OOM root cause via the heap dump pulled out of the previous pod. | + +## 8. 
Cross-references + +- Spec: [`docs/specs/2026-04-28-aks-read-only-deploy-design.md`](../../docs/specs/2026-04-28-aks-read-only-deploy-design.md) +- Plan: [`docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md`](../../docs/plans/2026-04-28-sub-project-2-aks-read-only-deploy.md) +- Engineering standards: [`engineering-standards.md`](engineering-standards.md) §7.1 Deploy targets +- Release process: [`release.md`](release.md) +- Rollback: [`rollback.md`](rollback.md) +- Launch script: [`scripts/aks-launch.sh`](../../scripts/aks-launch.sh) +- Sentinel test: [`src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java`](../../src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java) diff --git a/shared/runbooks/engineering-standards.md b/shared/runbooks/engineering-standards.md index d6b2eee4..03dbff9e 100644 --- a/shared/runbooks/engineering-standards.md +++ b/shared/runbooks/engineering-standards.md @@ -136,6 +136,16 @@ Hello-world / pipeline proof: `git tag -l 'v0.0.1-beta.*' | wc -l` is non-zero ( If the product later needs a hosted demo or container surface, that is a **new RAN-* issue**, not a re-open of RAN-46. +#### 7.1.1 Read-only deploy targets (sub-project 2) + +When a downstream consumer wants to run `codeiq serve` inside a hardened container runtime — Kubernetes / AKS / OpenShift with `securityContext.readOnlyRootFilesystem=true`, or any environment where the root filesystem is mounted read-only and only `/tmp` is writable — the canonical pattern is: + +1. CI builds the bundle (`index → enrich → bundle`) and uploads the zip to a private artifact registry (e.g. Nexus). +2. An **init-container** copies the bundle into a writable `/tmp/codeiq-data` (`emptyDir` with `medium: Memory` for tmpfs, or default for disk-backed). +3. 
The main container runs [`scripts/aks-launch.sh`](../../scripts/aks-launch.sh) which composes the JVM flag preset (Spring-Boot-loader tmpDir, `java.io.tmpdir`, `-XX:ErrorFile`, `-XX:HeapDumpPath`) and exec's `java -jar code-iq.jar serve /tmp/codeiq-data`. + +Zero source-code changes to the serve profile or Neo4j wiring — solved at the deployment layer plus the JVM-flag-preset launcher. Drift caught by `AksLaunchScriptSentinelTest`. Full deploy / verify / rollback steps in [`shared/runbooks/aks-read-only-deploy.md`](aks-read-only-deploy.md). Architecture rationale in [`docs/specs/2026-04-28-aks-read-only-deploy-design.md`](../../docs/specs/2026-04-28-aks-read-only-deploy-design.md). + --- ## 8. Documentation diff --git a/src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java b/src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java new file mode 100644 index 00000000..0a15b3a9 --- /dev/null +++ b/src/test/java/io/github/randomcodespace/iq/deploy/AksLaunchScriptSentinelTest.java @@ -0,0 +1,132 @@ +package io.github.randomcodespace.iq.deploy; + +import org.junit.jupiter.api.BeforeAll; +import org.junit.jupiter.api.Test; +import org.junit.jupiter.api.TestInstance; + +import java.io.IOException; +import java.nio.file.Files; +import java.nio.file.Path; + +import static org.junit.jupiter.api.Assertions.*; + +/** + * Sub-project 2 — sentinel test for {@code scripts/aks-launch.sh}. + * + *

The script encodes the JVM flag preset that lets {@code codeiq serve} + * run inside an AKS pod with {@code securityContext.readOnlyRootFilesystem=true} + * and {@code /tmp} mounted writable. The preset is the deploy contract — losing + * any flag here means the next deploy fails with a "read-only file system" + * error at startup. This test asserts every flag is present so flag drift + * during refactors fails CI, not production. + * + *

The test is deliberately a string-grep against the script source rather + * than an exec — the runbook's docker smoke + * ({@code shared/runbooks/aks-read-only-deploy.md} §5.1) is the SSoT for + * "did the deploy assumption actually hold." The unit test only catches + * drift in the file we control. + */ +@TestInstance(TestInstance.Lifecycle.PER_CLASS) +class AksLaunchScriptSentinelTest { + + private static final Path SCRIPT_PATH = Path.of("scripts/aks-launch.sh"); + + private String script; + + @BeforeAll + void loadScript() throws IOException { + assertTrue(Files.exists(SCRIPT_PATH), + "scripts/aks-launch.sh missing — sub-project 2 deploy contract is broken"); + script = Files.readString(SCRIPT_PATH); + } + + @Test + void scriptIsExecutable() { + // Posix permissions check via file attribute. On non-Posix runners + // (e.g. Windows CI), Files.isExecutable is the best we can do. + assertTrue(Files.isExecutable(SCRIPT_PATH), + "aks-launch.sh must be chmod +x — runbook installs it at " + + "/usr/local/bin/aks-launch.sh and the container ENTRYPOINT exec's it"); + } + + @Test + void scriptUsesStrictBashMode() { + assertTrue(script.contains("set -euo pipefail"), + "aks-launch.sh must use 'set -euo pipefail' — silent failures in init " + + "(e.g. /tmp pre-flight check) would let the JVM start in a broken state"); + } + + @Test + void scriptValidatesArgCount() { + assertTrue(script.contains("$# -ne 1"), + "aks-launch.sh must reject any argv shape other than exactly one data-dir arg"); + } + + @Test + void scriptSetsSpringBootLoaderTmpDir() { + // Without this, Spring Boot extracts nested JARs to ~/.m2/spring-boot-loader-tmp/ + // — outside /tmp, fails under read-only HOME. 
+        assertTrue(script.contains("-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader"),
+                "spring-boot-loader tmpDir must be redirected to /tmp/spring-boot-loader");
+    }
+
+    @Test
+    void scriptSetsJavaIoTmpdir() {
+        // Explicit even though /tmp is the Linux default: multipart upload
+        // temps, JNA, Netty native-lib extraction all use this; making it
+        // explicit means base-image-default drift can't break us.
+        assertTrue(script.contains("-Djava.io.tmpdir=/tmp"),
+                "java.io.tmpdir must be explicitly /tmp");
+    }
+
+    @Test
+    void scriptSetsJvmErrorFile() {
+        // Default: cwd. cwd under read-only root = unwritable. JVM crash
+        // would silently drop the dump file. Setting this captures crashes
+        // in /tmp where ops can extract them via kubectl cp.
+        assertTrue(script.contains("-XX:ErrorFile=/tmp/hs_err_pid%p.log"),
+                "JVM ErrorFile must land in /tmp so crash dumps survive a read-only root");
+    }
+
+    @Test
+    void scriptSetsHeapDumpPath() {
+        assertTrue(script.contains("-XX:HeapDumpPath=/tmp"),
+                "Heap dump path must be /tmp (cwd is read-only)");
+    }
+
+    @Test
+    void scriptEnablesHeapDumpOnOom() {
+        // Without this, HeapDumpPath is inert.
+        assertTrue(script.contains("-XX:+HeapDumpOnOutOfMemoryError"),
+                "+HeapDumpOnOutOfMemoryError must be on so the configured path actually fires");
+    }
+
+    @Test
+    void scriptExecsJavaAsPid1() {
+        // exec (not just call) lets SIGTERM from kubelet reach the JVM
+        // directly on pod shutdown; without it, bash sits between as PID 1
+        // and the JVM doesn't get the signal until bash propagates (or
+        // doesn't, depending on shell traps).
+        assertTrue(script.contains("exec java"),
+                "must `exec java` so the JVM is PID 1 and receives SIGTERM directly on pod stop");
+    }
+
+    @Test
+    void scriptDoesPreflightTmpCheck() {
+        // The 1 GB floor is documented in the spec. Pre-flighting catches
+        // a misconfigured emptyDir.sizeLimit before Neo4j blows up halfway
+        // through opening its store.
+        assertTrue(script.contains("/tmp") && script.contains("df -Pk"),
+                "must pre-flight /tmp free space via 'df -Pk /tmp'");
+        assertTrue(script.contains("1048576"),
+                "the 1 GB minimum (1048576 KB) floor must be enforced, see spec §9 risks");
+    }
+
+    @Test
+    void scriptCreatesSpringBootLoaderDir() {
+        // mkdir is idempotent (-p): the dir might already exist on a
+        // restart-without-pod-recreate path.
+        assertTrue(script.contains("mkdir -p /tmp/spring-boot-loader"),
+                "must create the spring-boot-loader extraction dir before java starts");
+    }
+}
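
For orientation, a launcher that would satisfy every string the sentinel test greps for could look roughly like the sketch below. This is not the contents of `scripts/aks-launch.sh`. The names `JVM_FLAGS`, `MIN_TMP_KB`, `tmp_free_kb`, and `launch`, along with the error messages and the non-usage exit code, are assumptions made for illustration. The flag strings, the 1048576 KB floor, the `CODEIQ_JAR` default, exit code 64 for bad usage, and the `serve <data-dir>` invocation come from the test, the plan skeleton, and the standards section.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an AKS read-only launcher; not the shipped script.
set -euo pipefail

# 1 GB floor for free space on /tmp, expressed in KB.
MIN_TMP_KB=1048576

# JVM flag preset: every write the JVM makes is redirected under /tmp.
JVM_FLAGS=(
  "-Dorg.springframework.boot.loader.tmpDir=/tmp/spring-boot-loader"
  "-Djava.io.tmpdir=/tmp"
  "-XX:ErrorFile=/tmp/hs_err_pid%p.log"
  "-XX:+HeapDumpOnOutOfMemoryError"
  "-XX:HeapDumpPath=/tmp"
)

# Print the free space on /tmp in KB (column 4 of POSIX-format df output).
tmp_free_kb() {
  df -Pk /tmp | awk 'NR == 2 { print $4 }'
}

# Validate arguments, pre-flight /tmp, then replace the shell with the JVM.
launch() {
  if [[ $# -ne 1 ]]; then
    echo "usage: aks-launch.sh <data-dir>" >&2
    return 64
  fi
  local data_dir="$1"
  local jar="${CODEIQ_JAR:-/app/code-iq.jar}"

  local free_kb
  free_kb="$(tmp_free_kb)"
  if (( free_kb < MIN_TMP_KB )); then
    echo "need ${MIN_TMP_KB} KB free on /tmp, found ${free_kb} KB" >&2
    return 1
  fi

  # Idempotent: the directory may survive a container restart within a pod.
  mkdir -p /tmp/spring-boot-loader

  # exec makes the JVM PID 1 so it receives SIGTERM from the kubelet directly.
  exec java "${JVM_FLAGS[@]}" -jar "$jar" serve "$data_dir"
}
```

A real launcher would end with `launch "$@"`. The call is left out here so the sketch can be sourced and inspected without starting a JVM.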