From 62afaa891109f1abc15f6a72e7f8c22ef88d8fbf Mon Sep 17 00:00:00 2001 From: Amit Kumar Date: Tue, 28 Apr 2026 06:24:50 +0000 Subject: [PATCH 01/10] chore(deps): bump Node from v20.11.0 to v22.12.0 for Vite 8 (unblocks #86) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Vite 8 (pulled in by PR #86's vite group bump) raised its engine requirement to ^20.19.0 || >=22.12.0. Our pinned v20.11.0 fails the frontend-maven-plugin `npm run build` step immediately: SyntaxError: The requested module 'node:util' does not provide an export named 'styleText' (rolldown, which Vite 8 uses, depends on `node:util.styleText` — only in Node 20.18+/22.x). Pinning to v22.12.0 — the minimum v22 release that satisfies Vite 8 — keeps us on a currently-supported LTS line and unblocks dependabot's #86 vite group bump. Co-Authored-By: Claude Opus 4.7 (1M context) --- pom.xml | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/pom.xml b/pom.xml index 6e3e1528..1a49dc8e 100644 --- a/pom.xml +++ b/pom.xml @@ -260,7 +260,17 @@ 2.0.0 src/main/frontend - v20.11.0 + + v22.12.0 ${frontend.skip} From b1872f48add7b1491d15a49e8aefb227c14f4d40 Mon Sep 17 00:00:00 2001 From: Amit Kumar Date: Tue, 28 Apr 2026 07:25:13 +0000 Subject: [PATCH 02/10] checkpoint: pre-yolo 2026-04-28T07:25:13 --- AGENTS.md | 2 +- ...04-28-serve-path-prod-readiness-counter.md | 156 +++++++++++++++ .../2026-04-28-serve-path-prod-readiness.md | 177 ++++++++++++++++++ 3 files changed, 334 insertions(+), 1 deletion(-) create mode 100644 docs/audits/2026-04-28-serve-path-prod-readiness-counter.md create mode 100644 docs/audits/2026-04-28-serve-path-prod-readiness.md diff --git a/AGENTS.md b/AGENTS.md index 80dbccf4..e598e400 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -46,7 +46,7 @@ If you hit something requiring GitHub App / PAT / OAuth that the runtime cannot # Memory Context -# [codeiq] recent context, 2026-04-28 1:14am UTC +# [codeiq] recent context, 2026-04-28 6:43am UTC No previous sessions found. \ No newline at end of file diff --git a/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md b/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md new file mode 100644 index 00000000..ab840594 --- /dev/null +++ b/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md @@ -0,0 +1,156 @@ +# Counter-Audit: serve-path Production Readiness +**Original audit:** `docs/audits/2026-04-28-serve-path-prod-readiness.md` (15 findings: 6 HIGH / 7 MEDIUM / 2 LOW) +**Counter-audit date:** 2026-04-28 +**Method:** Every finding verified against actual source files. Net-new findings added from independent inspection of GraphController, McpTools, ServeCommand, GraphHealthIndicator, CorsConfig, BundleCommand, SafeFileReader, CodeIqConfig, McpLimitsConfig, McpAuthConfig, GraphStore, application.yml, pom.xml, security.yml, ci-java.yml, and frontend/package.json. + +--- + +## Section A — Corrections to Original Audit + +### A1. Finding #6 overstated — `markReady()` fires AFTER `bootstrapNeo4jFromCache()`, not before + +**Original claim:** "`markReady()` fires before the graph is loaded" — traffic is routed during bootstrap. + +**What the code actually does:** `ServeCommand.java` lines 83–126: `graphBootstrapper.bootstrapNeo4jFromCache()` is called at line 84 and returns only when bootstrap completes. `markReady()` is called at line 126, after that return and after the node/edge count is printed to stdout. The auto-bootstrap race window the audit describes does not exist in the current code. + +**What IS real:** `GraphHealthIndicator` is not wired into the readiness group in `application.yml` (no `management.endpoint.health.group.readiness.include: readinessState,graph`). So the Kubernetes readiness probe does not gate on graph health. If a future code change moves `markReady()` earlier, or if bootstrap is made async, this becomes acute. The config gap is the real finding. + +**Correction:** Downgrade finding #6 from HIGH to MEDIUM. The fix (adding `management.endpoint.health.group.readiness.include: readinessState,graph` to the serving profile in `application.yml`) is still valid and still needed, but the acute rollout-during-bootstrap race does not currently exist. + +--- + +### A2. Finding #10 is half-wrong — REST `traceImpact` IS depth-capped; MCP is not + +**Original claim:** "the API endpoint `/api/triage/impact/{id}` (`GraphController:188`) doesn't appear to bound it." + +**What the code actually does:** `GraphController.java` line 192: +```java +int cappedDepth = Math.min(depth, config.getMaxDepth()); +``` +`config.getMaxDepth()` defaults to 10 (`CodeIqConfig.java`). The REST endpoint is bounded. + +**What IS true — in the other direction:** `McpTools.traceImpact` passes `depth != null ? depth : 3` directly to `queryService.traceImpact` with no `Math.min` guard. The MCP path is unbounded. + +**Correction:** The unbounded-depth defect exists on the MCP tool, not the REST controller. The fix targets `McpTools.traceImpact`, not `GraphController`. Severity (MEDIUM) is unchanged; affected surface is corrected. See also C3 below. + +--- + +### A3. Finding #11 is partially wrong — `SafeFileReader` does enforce a byte cap + +**Original claim:** "no `Content-Length` cap matches `getMaxFileBytes`" — implies size is unbounded. + +**What the code actually does:** `GraphController.readFile` calls `SafeFileReader.read(resolved, startLine, endLine, config.getMaxFileBytes())`. `SafeFileReader` enforces the byte cap for both full-file and line-range reads. The cap is real and working. + +**What IS real:** Content-type is not sniffed. Binary files (`.jks`, `.so`, `.png`) are served as `text/plain` with no early reject. The slow-client connection exhaustion concern is valid. The size-cap concern is not. + +**Correction:** Remove "no Content-Length cap" from finding #11. The binary content-type concern stands; severity (MEDIUM) is unchanged. + +--- + +## Section B — Severity Adjustments + +### B1. Finding #6 — downgrade from HIGH to MEDIUM (per A1) + +The bootstrap-before-markReady ordering eliminates the acute readiness race. The residual gap — readiness group not configured — is MEDIUM: probes do not gate on graph health, which can cause transient 503s after a pod restart if Neo4j is slow to open, but the bootstrap path completes before traffic is accepted under normal conditions. + +--- + +### B2. Finding #4 — scope extended; severity remains HIGH + +Finding #4 correctly flags null checksums in `BundleCommand.createManifest` (line 241: `null` passed as the checksums argument). This is confirmed. However the finding misses a co-equal defect: `generateServeShell` (lines 265–274 in `BundleCommand.java`) emits a `serve.sh` that unconditionally downloads the JAR at runtime: + +```bash +curl -fL -o "$JAR" \ + "https://repo1.maven.org/maven2/io/github/randomcodespace/iq/code-iq/${VERSION}/code-iq-${VERSION}-cli.jar" +``` + +An equivalent `serve.bat` exists for Windows. This is a direct violation of `build.md` ("No runtime network calls to the public internet"). Bundles deployed in air-gapped environments silently fail to start when the JAR is absent. A compromised Maven Central namespace could substitute a malicious JAR, and even correct checksums in `manifest.json` would not protect against it because the downloaded JAR is not verified against them. The fix for finding #4 must address both the null checksums **and** the runtime download. + +--- + +## Section C — Missed Findings + +### C1. HIGH — `getCachedData()` loads the full graph into heap on every topology MCP tool call + +**Symptom in prod:** Seven MCP tools — `serviceDetail`, `blastRadius`, `findPath`, `findBottlenecks`, `findCircularDeps`, `findDeadServices`, `findNode` — all begin by calling `getCachedData()` (`McpTools.java` lines 83–92). `getCachedData()` calls `graphStore.findAll()`, which executes two full graph scans (`findAllNodes` + `findAllEdges`) and materialises every node and every edge into a `GraphData` record on the Java heap. On a 5M-node enriched graph this is multiple gigabytes per call. No result is cached between invocations — each call pays the full allocation cost. Two concurrent `blast_radius` invocations double-allocate. This is an OOM vector independent of the `run_cypher` issue in finding #2, and it is triggered by normal topology tool usage, not by adversarial queries. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:83–92` (`getCachedData`); `src/main/java/io/github/randomcodespace/iq/graph/GraphStore.java` (`findAll`, `findAllNodes`, `findAllEdges`). + +**Severity:** HIGH + +**Fix proposal:** Replace `getCachedData()` with targeted Cypher queries per tool (e.g. `blastRadius` needs only the subgraph reachable from the seed node, not all 5M nodes). If a full snapshot is required for correctness, populate it once at serve startup into a size-bounded `SoftReference` and invalidate on graph reload. Add a Caffeine cache with a 60-second TTL and a max-weight bound in bytes. Effort: M. + +--- + +### C2. MEDIUM — Swagger UI exposed unauthenticated; full API schema readable by any cluster workload + +**Symptom in prod:** `pom.xml` includes `springdoc-openapi-starter-webmvc-ui:3.0.3`. SpringDoc auto-registers `/swagger-ui/index.html`, `/swagger-ui.html`, and `/v3/api-docs` at startup with no authentication guard. Because there is no Spring Security on the classpath (finding #1), no filter intercepts these paths. Any actor who can reach the pod gets the complete OpenAPI schema: every endpoint path, parameter name, response shape, and the full enumeration of `NodeKind` / `EdgeKind` values. This is reconnaissance-in-depth that lowers the cost of exploiting finding #1. Neither `springdoc.swagger-ui.enabled` nor `springdoc.api-docs.enabled` is set to `false` in the serving profile of `application.yml`. + +**File / location:** `pom.xml` (`springdoc-openapi-starter-webmvc-ui:3.0.3`); `src/main/resources/application.yml` (no `springdoc.*` keys in serving profile). + +**Severity:** MEDIUM + +**Fix proposal:** In `application.yml` serving profile: `springdoc.swagger-ui.enabled: false` and `springdoc.api-docs.enabled: false`. Provide opt-in `codeiq.serving.swagger-ui.enabled: true` for local development. When auth (finding #1) is implemented, gate `/swagger-ui/**` and `/v3/api-docs/**` behind the same bearer check. Effort: XS. + +--- + +### C3. MEDIUM — `McpTools.traceImpact` has no depth cap on the MCP path + +**Symptom in prod:** As established in A2, `McpTools.traceImpact` forwards caller-supplied `depth` to `queryService.traceImpact` without any `Math.min` guard. A malicious or runaway MCP client sends `depth=1000` on a hub node; the resulting `RELATES_TO*1..1000` variable-length Cypher match runs until the transaction timeout fires — which is also not configured (finding #2). The REST endpoint at `GraphController:192` is safe; the MCP surface is not. `McpLimitsConfig` already defines a `maxDepth` field that is never consumed here. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (`traceImpact` method, depth forwarding). + +**Severity:** MEDIUM + +**Fix proposal:** `int safedDepth = depth != null ? Math.min(depth, mcpLimitsConfig.maxDepth()) : 3;` — wire the already-parsed `maxDepth` from `McpLimitsConfig`. Effort: XS. + +--- + +### C4. MEDIUM — `semgrep` installed from PyPI without a pinned version in `security.yml` + +**Symptom in prod:** `.github/workflows/security.yml` line 94 runs `python -m pip install --quiet --upgrade pip semgrep` with no version pin. Every workflow run fetches the latest `semgrep` release from PyPI at the moment of execution. Every other tool in the same workflow is pinned: `osv-scanner` uses `OSV_SCANNER_VERSION: 2.3.5` with a named release download; `gitleaks` uses `GITLEAKS_VERSION: 8.30.1`; all GitHub Actions are SHA-pinned. A compromised PyPI release of `semgrep` (or a transitive dependency) would execute arbitrary code inside the SAST job, which runs with `contents: read` permission and access to `GITHUB_TOKEN`. This directly contradicts the workflow's header comment: "All actions SHA-pinned per Scorecard `Pinned-Dependencies`." + +**File / location:** `.github/workflows/security.yml:94`. + +**Severity:** MEDIUM + +**Fix proposal:** Pin to a specific version: `pip install semgrep==` (resolve via PyPI). Better: use `returntocorp/semgrep-action` pinned by commit SHA (free for OSS), which eliminates the PyPI install entirely and aligns with the workflow's existing SHA-pinning posture. Effort: XS. + +--- + +### C5. LOW — CLAUDE.md tech-stack version table is stale on three components + +**Symptom:** CLAUDE.md "Tech Stack" section states Spring Boot `4.0.5`, Spring AI `2.0.0-M3`, Neo4j Embedded `2026.02.3`. Actual values in `pom.xml`: Spring Boot `4.0.6`, Spring AI `2.0.0-M4`, Neo4j `2026.04.0`. Stale docs cause reviewers and automated tooling to reference wrong versions when checking CVE databases or compatibility matrices. + +**File / location:** `CLAUDE.md` (Tech Stack section); `pom.xml` (ground truth). + +**Severity:** LOW + +**Fix proposal:** Update three lines in CLAUDE.md. Note that `pom.xml` is the SSoT; CLAUDE.md is informational. Effort: XS. + +--- + +## Summary Table + +| # | Original Finding | Verdict | Adjustment | +|---|-----------------|---------|------------| +| 1 | No auth on API/MCP | Confirmed | — | +| 2 | `run_cypher` no cap / timeout / READ mode | Confirmed | — | +| 3 | No rate limiting | Confirmed | — | +| 4 | Unsigned bundle + null checksums | Confirmed + extended | serve.sh/bat runtime Maven Central download is co-equal defect; fix must cover both | +| 5 | `/api/file` ships secrets in bundle | Confirmed | — | +| 6 | Readiness fires before graph load | **Partially wrong** | Downgrade HIGH → MEDIUM; bootstrap-before-markReady ordering is correct; readiness group config gap is the real issue | +| 7 | No `@RestControllerAdvice`; stack trace leak | Confirmed | — | +| 8 | MCP errors return HTTP 200 | Confirmed | — | +| 9 | No structured logs / request ID / MDC | Confirmed | — | +| 10 | `findShortestPath` + `traceImpact` unbounded | **Partially wrong** | REST `traceImpact` IS capped via `Math.min`; MCP `traceImpact` is NOT; fix target corrected | +| 11 | `/api/file` no size cap; binary served as text | **Partially wrong** | Size cap IS enforced by SafeFileReader; binary content-type issue stands; remove size-cap claim | +| 12 | `GraphHealthIndicator` uncached count on every probe | Confirmed | — | +| 13 | CORS defaults wrong; no CSP / security headers | Confirmed | — | +| 14 | Bad YAML silently uses defaults; no fail-fast | Confirmed | — | +| 15 | Zero integration tests for auth / rate-limit path | Confirmed | — | +| C1 | `getCachedData()` full graph load per topology call | **NET NEW** | HIGH | +| C2 | Swagger UI unauthenticated | **NET NEW** | MEDIUM | +| C3 | MCP `traceImpact` no depth cap | **NET NEW** | MEDIUM | +| C4 | `semgrep` unpinned in `security.yml` | **NET NEW** | MEDIUM | +| C5 | CLAUDE.md version table stale | **NET NEW** | LOW | diff --git a/docs/audits/2026-04-28-serve-path-prod-readiness.md b/docs/audits/2026-04-28-serve-path-prod-readiness.md new file mode 100644 index 00000000..1b572d62 --- /dev/null +++ b/docs/audits/2026-04-28-serve-path-prod-readiness.md @@ -0,0 +1,177 @@ +## 1. HIGH — MCP and REST API are fully unauthenticated; one curl from anywhere on the cluster reads the whole graph + +**Symptom in prod:** Pod has no auth on `/api/**` or `/mcp` (no Spring Security on classpath, no `@PreAuthorize`, no filter, no token check). Any other workload in the AKS namespace — including a compromised sidecar in another tenant's pod that resolves the codeiq Service — can hit `GET /api/file?path=...` and exfiltrate every byte under the analyzed codebase root, plus run arbitrary read-only Cypher via `POST /mcp` `run_cypher`. The unified config defines `mcp.auth.mode: bearer|mtls` (`McpAuthConfig`) but **nothing wires it into a filter** — the field is dead. East-west attack on multi-tenant pipeline = data exfil from other tenants' analyzed source. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:39` (no `@PreAuthorize`); `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:269` (no auth check); `pom.xml` (no `spring-boot-starter-security`); `src/main/java/io/github/randomcodespace/iq/config/unified/McpAuthConfig.java` (config class, never consumed). + +**Severity:** HIGH + +**Fix proposal:** Add `spring-boot-starter-security`. Implement `SecurityFilterChain` in a new `config/SecurityConfig.java` that, when `codeiq.mcp.auth.mode=bearer`, requires `Authorization: Bearer ${CODEIQ_MCP_TOKEN}` on `/api/**` AND `/mcp/**` (constant-time compare). Permit only `/actuator/health/*`. Default `mode=none` permitted only when `spring.profiles.active` contains `local`. Effort: M. + +--- + +## 2. HIGH — `run_cypher` has zero result-set cap, zero query timeout, and runs in the default (read+write) tx mode + +**Symptom in prod:** A single MCP client sends `MATCH (a:CodeNode), (b:CodeNode), (c:CodeNode) RETURN a, b, c LIMIT 999999999`. `runCypher` accumulates rows in an `ArrayList>` with no cap, the JVM heap fills, `OutOfMemoryError` triggers (heap dump goes to `/tmp` per `aks-launch.sh:51`, eats tmpfs), pod is `OOMKilled`. Tenant outage ≥60s while replica restarts and re-bootstraps Neo4j. Embedded Neo4j has no per-query memory limit configured (`Neo4jConfig.java`, no `dbms.memory.transaction.max_size`). Additionally, `tx.execute(query)` runs in default access mode, not READ — so a procedure registered later (or one this regex-blocklist misses) could mutate. The CLAUDE.md "Gotchas" already calls out RAN-31 ("pin run_cypher to Neo4j READ access mode") but the current code at `mcp/McpTools.java:296` still uses `graphDb.beginTx()` not `beginTx(KernelTransaction.Type.IMPLICIT, AUTH_DISABLED, AccessMode.Static.READ, timeoutMs, MILLIS)`. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:269-318` (`runCypher`); `mcp/McpTools.java:311` (unbounded `rows.add`); `src/main/java/io/github/randomcodespace/iq/config/Neo4jConfig.java` (no transaction-timeout / memory settings). + +**Severity:** HIGH + +**Fix proposal:** Use `graphDb.beginTx(perToolTimeoutMs, MILLIS)` (transaction timeout already in `McpLimitsConfig.perToolTimeoutMs=15000`). Cap rows at `mcp.limits.max_results` (500) and stop iterating; return a `truncated: true` flag. Cap accumulated payload bytes at `mcp.limits.max_payload_bytes` (2 MB) by serializing-as-you-go. Configure `dbms.memory.transaction.max_size=512m` in `Neo4jConfig`. Effort: S. + +--- + +## 3. HIGH — No rate limiting anywhere; one MCP client saturates the pod for everyone + +**Symptom in prod:** `mcp.limits.rate_per_minute: 300` is defined in `McpLimitsConfig` and parsed by `UnifiedConfigLoader.java:166` but **no filter or interceptor enforces it** (zero hits for `Bucket4j|Resilience4j|RateLimiter|HandlerInterceptor` in main source). One agent client in a runaway loop fires `find_cycles` (which runs `MATCH p=(a)-[:RELATES_TO*2..10]->(a)` — graph-wide variable-length match, no per-call limit) at hundreds of QPS. Tomcat virtual-thread executor saturates Neo4j page cache, p99 on `/api/stats` jumps from 50 ms to multi-second, readiness probe (`periodSeconds: 5`) starts to flake, kubelet restarts the pod (`replicas: 1` — no failover), tenant goes dark. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (no rate limiter); `src/main/java/io/github/randomcodespace/iq/api/GraphController.java` (no rate limiter); `src/main/java/io/github/randomcodespace/iq/config/unified/McpLimitsConfig.java` (`ratePerMinute` parsed but unused). + +**Severity:** HIGH + +**Fix proposal:** Add Bucket4j (Apache-2.0, single dep, ~80 KB). Register an `OncePerRequestFilter` keyed by `Authorization` token (or remote IP fallback) with a refill-per-second token bucket sized at `mcp.limits.rate_per_minute / 60`. 429 with `Retry-After` header on bucket exhaustion. Apply to `/api/**` and `/mcp/**`. Effort: S. + +--- + +## 4. HIGH — Bundle is unsigned and unverified; init-container blindly unzips whatever Nexus serves + +**Symptom in prod:** AKS init-container (`shared/runbooks/aks-read-only-deploy.md:48-72`) runs `curl -u $NEXUS_USER:$NEXUS_PASS .../bundle.zip | unzip` with no checksum verification, no signature check. `ArtifactManifest` defines a `checksums` field (`Map`) but `BundleCommand.createManifest` (`cli/BundleCommand.java`) passes `null` for it (sed shows `null` literal in the constructor call). On Nexus credential compromise OR a malicious internal user with `codeiq-bundles` write access, an attacker swaps `bundle.zip` with one that contains a `graph.db/` planted with a Cypher full-text index that triggers JNDI lookup, OR a `serve.sh` that is NEVER actually invoked at runtime but still — once bundles are signed, you can also trust `manifest.json`. Single tenant's bundle becomes a foothold across the whole pipeline because the same Nexus path is served to every replica. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/cli/BundleCommand.java:141-150` (manifest checksum field passed `null`); `src/main/java/io/github/randomcodespace/iq/intelligence/ArtifactManifest.java` (record defines `checksums` but never populated); `shared/runbooks/aks-read-only-deploy.md:48-72` (no `sha256sum -c` step). + +**Severity:** HIGH + +**Fix proposal:** In `BundleCommand`, after writing each entry, accumulate SHA-256 in a `MessageDigest` and emit the map. Write a sibling `bundle.zip.sha256` file uploaded next to the bundle. In the init-container, fetch `.sha256` first and `sha256sum -c` before unzip. For tamper-resistance, also sign with cosign / GPG (Sigstore = supply-chain consistent with §7.1 of engineering-standards). Effort: M. + +--- + +## 5. HIGH — `/api/file` reads anything under the codebase root; bundle ships full source — credentials, .env, .pem all readable + +**Symptom in prod:** `GraphController.readFile` (line 255) and `McpTools.readFile` (line 394) traverse-protect to the codebase root, but the bundle (`BundleCommand`, `source/` directory) ships **the entire source tree** including `.env`, `.aws/credentials` if committed, private keys checked in by mistake, secrets in `application-local.yml`. An authenticated MCP client (or unauthenticated, until #1 is fixed) calls `read_file(path=".env")` and prints the file. There is no extension allow-list, no `.gitignore`-aware filter at bundle time, no scrubber. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:255-310` (`readFile`); `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:394-420`; `src/main/java/io/github/randomcodespace/iq/cli/BundleCommand.java` (`source/` packaging — no exclusion). + +**Severity:** HIGH + +**Fix proposal:** At bundle time, exclude a curated set: `**/.env*`, `**/*.pem`, `**/*.key`, `**/id_rsa*`, `**/credentials`, `**/secrets/**`, anything matched by `.gitignore`. At read time, reject those same patterns even if they slip through. Add a `serving.read_file_extension_allowlist` config (default = source-code extensions only). Effort: S. + +--- + +## 6. HIGH — `/actuator/health/readiness` returns 200 before the graph is loaded + +**Symptom in prod:** `ServeCommand.markReady()` publishes `ReadinessState.ACCEPTING_TRAFFIC` after the Spring context is up, but `GraphHealthIndicator` (`health/GraphHealthIndicator.java`) is registered as a generic `HealthIndicator`, not under the readiness group. With Spring Boot's defaults, custom `HealthIndicator`s land in the liveness+readiness composite **only if they're added to the `readiness` group**. Right now: pod becomes "ready" the moment Spring starts (~8-16s per CLAUDE.md) but `GraphBootstrapper` is still loading H2 → Neo4j (can take seconds-to-minutes for big graphs). Readiness probe passes, kube-proxy routes traffic, every request 503s with "Neo4j graph not available" (`GraphController.requireQueryService:line ~30`). On rolling deploy this also means the new pod is marked ready before old pod is drained → 100% error rate during the rollover window. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/cli/ServeCommand.java:~110` (`markReady()`); `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java:1-40` (no readiness group); `application.yml` `serving` profile (`management.health.readinessstate.enabled: true` but no `management.endpoint.health.group.readiness.include: graph,readinessState`). + +**Severity:** HIGH + +**Fix proposal:** Move `markReady()` to fire **after** `GraphBootstrapper` returns AND `graphStore.count() > 0`. Add to `application.yml` (serving profile): `management.endpoint.health.group.readiness.include: readinessState,graph`. Add a regression test. Effort: S. + +--- + +## 7. MEDIUM — No `@RestControllerAdvice`; uncaught exceptions return generic 500s with stack-trace bodies, no error envelope + +**Symptom in prod:** `grep '@ControllerAdvice'` returns zero hits in `src/main/java`. When `QueryService.nodesByKind` throws (Neo4j tx died, NPE on a malformed cached node, etc.), Spring's default error attributes return a JSON body with `"trace": "...full stack..."` if `server.error.include-stacktrace` defaults haven't been turned off — and nothing in `application.yml` turns it off. On-call sees redacted `INTERNAL_SERVER_ERROR` in clients but the response body leaks classnames + line numbers (CWE-209). MCP tools partially mask this by returning `{"error": "..."}` 200 (which is its OWN problem — see finding #8). REST has no consistent error envelope at all. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java` (mixed `ResponseStatusException` + raw return); no `*ControllerAdvice*.java` files; missing `server.error.include-stacktrace=never` in `application.yml`. + +**Severity:** MEDIUM + +**Fix proposal:** Add `api/GlobalExceptionHandler.java` with `@RestControllerAdvice`. Map `ResponseStatusException` through, all others to `{"code": "INTERNAL", "message": , "request_id": }` with HTTP 500. Set `server.error.include-stacktrace: never` and `server.error.include-message: never` in the serving profile. Effort: S. + +--- + +## 8. MEDIUM — MCP tools return `{"error": "..."}` with HTTP 200, defeating client retry logic and observability + +**Symptom in prod:** Every `catch (Exception e)` in `McpTools` returns `toJson(Map.of(PROP_ERROR, e.getMessage()))` as a successful 200 response. Spring Boot metrics (`http.server.requests`) record these as 2xx, so error-rate dashboards stay green during incidents. MCP clients with retry-on-non-2xx never retry, never alert. Worse, `e.getMessage()` from a Neo4j parse error can leak query structure / node IDs from another tenant if a path-traversal bug ever lands. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (35+ `catch (Exception e) { return toJson(Map.of(PROP_ERROR, e.getMessage())); }` blocks). + +**Severity:** MEDIUM + +**Fix proposal:** Define error codes (`INVALID_INPUT`, `NOT_FOUND`, `INTERNAL`, `RATE_LIMITED`). Return MCP-spec-compliant errors (Spring AI MCP supports throwing — verify on its API). At minimum: log with stack trace at WARN, return `{"error": {"code": "INTERNAL", "message": "internal error", "request_id": ...}}` with the actual message redacted unless it's an `IllegalArgumentException`. Effort: S. + +--- + +## 9. MEDIUM — No structured logs, no request ID, no MDC; on-call has no way to correlate a slow request to a Neo4j query + +**Symptom in prod:** `grep MDC.put|requestId|X-Request-ID|OncePerRequestFilter` in `src/main/java`: zero hits. Pod logs are default Spring Boot text format. When customer reports "the graph endpoint hung for 30s at 14:32", on-call has only timestamp matching to find the query, no per-request span ID. With virtual threads enabled (`spring.threads.virtual.enabled: true`) and N concurrent slow requests, log lines interleave with no way to demux. + +**File / location:** `src/main/resources/logback*.xml` (none — uses Spring Boot default); `src/main/resources/application.yml` (no `logging.pattern.level`); no `RequestIdFilter`. + +**Severity:** MEDIUM + +**Fix proposal:** Add `logback-spring.xml` with JSON appender (logstash-logback-encoder, MIT, single dep) gated on `spring.profiles.active=serving`. Add a `RequestIdFilter` (`OncePerRequestFilter`) that pulls `X-Request-ID` or generates a UUID, populates MDC, returns it in the response header. Add `Micrometer` timers around each `@McpTool` (Spring AI auto-instruments REST). Expose `/actuator/prometheus` (currently `metrics` is exposed but not the Prometheus scrape endpoint). Effort: M. + +--- + +## 10. MEDIUM — `GraphStore.findShortestPath` and `traceImpact` have unbounded depth or fixed `[*..20]` with no row limit, no time guard + +**Symptom in prod:** `GraphStore.findShortestPath` (line 453) runs `MATCH p = shortestPath((a)-[*..20]-(b)) RETURN [n IN nodes(p) | n.id]` — fine on small graphs, on a 5M-node enriched bundle this is 30+ seconds. `traceImpact` runs `MATCH (a)-[:RELATES_TO*1..$depth]->(b)` with `depth` capped at 10 by `McpTools.traceImpact:line ~349` — but the API endpoint `/api/triage/impact/{id}` (`GraphController:188`) doesn't appear to bound it. With 99 detector kinds and `RELATES_TO*1..10` on a hub node (e.g. a popular library import), this is a Cartesian explosion. No `WITH p LIMIT N` cap, no `dbms.transaction.timeout` configured. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/graph/GraphStore.java:453` (`shortestPath`); `:line for traceImpact`; `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:188` (`triage/impact`). + +**Severity:** MEDIUM + +**Fix proposal:** Set `dbms.transaction.timeout=30s` in `Neo4jConfig`. Add `LIMIT $maxNodes` (e.g. 10000) on every `*..N` query. Bound `depth` ≤ 5 in REST endpoint and validate. Effort: S. + +--- + +## 11. MEDIUM — `/api/file` content-type is `text/plain` for all files; binary data dumps; no `Content-Length` cap matches `getMaxFileBytes` + +**Symptom in prod:** `readFile` returns binary files (a checked-in `.png`, `.jks` keystore, native `.so`) as `text/plain` with garbled UTF-8. Browser logs the entire base64-mangled body. The implementation reads via `SafeFileReader.read(resolved, startLine, endLine, config.getMaxFileBytes())` so size is bounded, but content-type isn't sniffed and there's no early-reject for non-text files. Slow client reading 1 MB file at 1 KB/s — keeps a virtual thread + a Tomcat connection occupied for 1000s. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:255-310`. + +**Severity:** MEDIUM + +**Fix proposal:** Probe content type with `Files.probeContentType` or magic-byte check; if not `text/*`, return 415. Set `server.tomcat.connection-timeout=10s`, `server.tomcat.max-swallow-size=1MB`. Effort: S. + +--- + +## 12. MEDIUM — `GraphHealthIndicator.health()` calls `graphStore.count()` on every probe — `MATCH (n:CodeNode) RETURN count(n)` against an embedded DB + +**Symptom in prod:** Readiness probe `periodSeconds: 5` → 12 full Cypher count queries per minute, each holding a transaction open. On a 5M-node graph with concurrent user traffic, this contends with the page cache. Liveness probe also fires every 10s. The current implementation has no cache/throttle. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java:30`. + +**Severity:** MEDIUM + +**Fix proposal:** Cache `count()` result for 30 s in an `AtomicReference`. Or: only verify "graph reachable" via a constant-time `tx.execute("RETURN 1").hasNext()`. Effort: S. + +--- + +## 13. MEDIUM — `CorsConfig` default allows `http://localhost:[*]` and `http://127.0.0.1:[*]`; in cluster, this is wrong but undetected; no CSP + +**Symptom in prod:** Default `codeiq.cors.allowed-origin-patterns` (`config/CorsConfig.java:14`) is hardcoded to dev-loopback patterns. In AKS, the React UI is served same-origin (no CORS needed) — this is fine — but if anyone exposes the API behind a reverse proxy at a different origin, they'll get cryptic CORS failures because the YAML doesn't override it (`codeiq.yml.example` doesn't even include it). Worse: zero CSP / X-Frame-Options / X-Content-Type-Options headers means the served React UI is clickjackable and the JSON endpoints can be loaded into a hostile origin's `