diff --git a/AGENTS.md b/AGENTS.md index 80dbccf4..e598e400 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -46,7 +46,7 @@ If you hit something requiring GitHub App / PAT / OAuth that the runtime cannot # Memory Context -# [codeiq] recent context, 2026-04-28 1:14am UTC +# [codeiq] recent context, 2026-04-28 6:43am UTC No previous sessions found. \ No newline at end of file diff --git a/CHANGELOG.md b/CHANGELOG.md index dd6060fa..1bc95370 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -225,6 +225,146 @@ for that specific tag for the per-commit details. path-B board ruling, they are not to be re-introduced without an explicit board reversal — see `shared/runbooks/engineering-standards.md` §5.1. +### Security + +- **Production-readiness PR 1 of 5 — security baseline.** First half of the + audit findings catalogued under `docs/audits/2026-04-28-serve-path-prod-readiness.md` + (+ `-counter.md`). Closes audit findings #1, #7, #13 (HIGH/MEDIUM) and C2 (MEDIUM). + - **Bearer-token auth on `/api/**` and `/mcp/**`** (audit #1). Added + `spring-boot-starter-security`. New `config/security/SecurityConfig`, + `BearerAuthFilter`, `TokenResolver`. Token source priority: + `CODEIQ_MCP_TOKEN` env > `codeiq.mcp.auth.token` config > startup failure. + Constant-time compare via SHA-256 pre-hash + `MessageDigest.isEqual` — + 32-byte digests on both sides defeat the length oracle. RFC 7235 §2.1 + case-insensitive scheme matching (`Bearer`, `bearer`, etc.). Authorization + header value never reaches a logger from this code. Permit list: + `/`, `/index.html`, `/favicon.ico`, `/assets/**`, `/static/**`, `/error`, + `/actuator/health/{liveness,readiness}` — everything else under + `/api/**`, `/mcp/**`, `/actuator/**` requires the bearer token. + - **Fail-fast on misconfiguration** (audit #14 partial). `mode=bearer` with + no token resolved → throws at startup. `mode=none` with active `serving` + profile and `allow_unauthenticated` not explicitly set → throws at + startup. `mode=mtls` is reserved and explicitly throws "not yet + implemented" rather than silently passing through. + - **Defensive response headers** (audit #13). New + `config/security/SecurityHeadersFilter` sets `X-Content-Type-Options: + nosniff`, `X-Frame-Options: DENY`, `Content-Security-Policy: default-src + 'self'; ... frame-ancestors 'none'`, `Referrer-Policy: no-referrer`, + `Permissions-Policy` disabling geolocation/camera/microphone. + `Strict-Transport-Security: max-age=31536000; includeSubDomains` is set + only when `X-Forwarded-Proto: https` is present (AKS terminates TLS at + ingress) — setting HSTS over plain HTTP would lock out misconfigured envs. + - **Uniform error envelope** (audit #7). New + `api/GlobalExceptionHandler` (`@RestControllerAdvice`, + `@Profile("serving")`) maps every uncaught exception to + `{"code","message","request_id"}` with the right HTTP status. + `IllegalArgumentException` → 400 with surfaced message. + `ResponseStatusException` → status code passes through. Anything else → + 500 with generic message; the actual exception is logged at WARN with + the `request_id` so on-call can correlate without leaking stack frames + to the client. `application.yml` now sets + `server.error.include-stacktrace: never` + `include-message: never` + + `include-binding-errors: never` as belt-and-suspenders. + - **Default CORS deny-all in serving** (audit #13). `config/CorsConfig` + default changed from loopback patterns to empty. Empty means register + no mappings → Spring MVC rejects all preflighted cross-origin requests. + Operators who genuinely need cross-origin (e.g. dev with a separate + Vite server on a different port) explicitly set + `codeiq.cors.allowed-origin-patterns`. Logs the resolved state at + startup. The React UI at `/` is unaffected — it's served same-origin. + - **Swagger UI / api-docs disabled in serving** (counter-audit C2). + `springdoc.api-docs.enabled: false` + `springdoc.swagger-ui.enabled: false` + in the serving profile of `application.yml`. The OpenAPI schema is + reconnaissance data; reachable only when running locally or with the + indexing profile. + - **`management.endpoints.web.exposure.include` narrowed** to `health,info` + in serving (was `health,info,metrics`); `health.show-details: never`. + Defense-in-depth alongside the `SecurityFilterChain` `authenticated()` + rule on `/actuator/**`. + - **Spring Security autoconfig excluded outside serving.** Without the + `serving` profile (CLI, tests, IDE runs), Spring Security's default + HTTP Basic chain would lock all endpoints — adding the starter would + break ~3000 existing tests that pass through MockMvc with no token. + `application.yml` excludes `SecurityAutoConfiguration`, + `SecurityFilterAutoConfiguration`, `UserDetailsServiceAutoConfiguration` + at the default level; the `serving` profile re-enables them by listing + only `UserDetailsServiceAutoConfiguration` (so the auto user/password + is suppressed but the filter chain is built from `SecurityConfig`). + - **Tests:** 31 new unit tests across `BearerAuthFilterTest` (14 cases: + missing/wrong/empty/correct/lowercase scheme, length-oracle defense, + log-leak audit, `shouldNotFilter` paths, `SecurityContextHolder` cleanup), + `TokenResolverTest` (9 cases for mode/profile/env-priority/fail-fast), + `SecurityHeadersFilterTest` (5 cases for header presence/HSTS gating), + `GlobalExceptionHandlerTest` (3 cases verifying the envelope shape and + no stack-trace leak). Full suite: 3453 tests / 0 failures / 0 errors. + + **Known follow-up (not in this PR):** the React UI cannot read env vars, + so the SPA shell is unauthenticated to access static assets. API/MCP calls + from the UI must inject `Authorization: Bearer ` from + operator-supplied localStorage. A first-class UI auth bootstrap (login + flow + token-issuance endpoint, OR server-side template injection) is its + own design — tracked as a follow-up issue. + +- **Production-readiness PR 2 of 5 — resource limits & abuse protection.** + Closes audit findings #2, #3, C1 (HIGH) and #10, #11 (MEDIUM). + - **Cypher transaction timeout** (audit #2). Neo4j embedded + `GraphDatabaseSettings.transaction_timeout = 30s` configured in + `Neo4jConfig` — every transaction in the JVM, including `run_cypher` + and graph traversals, gets a hard wall-clock cap. Catches runaway + variable-length matches before they starve the page cache. + - **Result-set cap on `run_cypher`** (audit #2). Hard row cap at + `mcp.limits.max_results` (default 500); excess rows dropped, response + carries `truncated: true` + `max_results: N`. Defends the JVM heap + against `MATCH (a),(b),(c) RETURN a,b,c LIMIT 999999999` blowups. + - **MCP `traceImpact` depth cap** (audit #10 corrected, C3). New + `mcp.limits.max_depth` field (default 10) wired into + `McpTools.traceImpact` via `Math.min`. Defends against + `RELATES_TO*1..1000` Cartesian explosions on hub nodes. + - **TTL snapshot cache on topology tools** (audit C1). `McpTools. + getCachedData()` now backed by a 60-second TTL snapshot. Without it, + every concurrent `service_dependencies` / `blast_radius` / + `find_path` / `find_bottlenecks` / `find_circular_deps` / + `find_dead_services` / `find_node` call paid the full + `graphStore.findAll()` cost and double-allocated multi-GB heaps. + A bridge fix; the proper refactor (TopologyService → per-tool Cypher) + is a tracked follow-up. + - **Per-client rate limiter** (audit #3). New `RateLimitFilter` using + Bucket4j 8.18.0 (Apache-2.0). Token bucket sized at + `mcp.limits.rate_per_minute` (default 300). Keyed by SHA-256 hash of + the `Authorization` header (so the token never lives in our key map), + falls back to `X-Forwarded-For` (first hop) or `RemoteAddr`. 429 + response with `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining` + headers. Registered before `BearerAuthFilter` so unauthenticated + brute-force is also throttled. + - **`/api/file` content-type sniff** (audit #11 corrected). Added + `Files.probeContentType` guard — non-text MIMEs (`.jks`, `.so`, + `.png`, native libs) return HTTP 415 with the probed type, instead + of being served as garbled `text/plain`. Allowlist: `text/*`, + `application/json`, `application/xml`, `application/x-yaml`, + `application/javascript`. The byte cap (already enforced by + `SafeFileReader`) is unchanged. + - **Tomcat slow-client tarpit** (audit #11). `server.tomcat.connection- + timeout: 10s`, `max-swallow-size: 1MB` in the serving profile — + drops connections that hold a virtual thread + Tomcat connection at + 1 KB/s. + - **CodeQL hardening on the security baseline.** Sanitised request + method + URI before logging in `BearerAuthFilter` (CWE-117 / CodeQL + `java/log-injection`); removed env-var name from the bearer-token + bootstrap log line in `TokenResolver` (CodeQL `java/sensitive-log`); + documented the deliberate stateless-bearer rationale on + `SecurityConfig.csrf(disable)` (CodeQL `java/spring-disabled-csrf-protection` + — no exploit path on a no-cookie surface). + - **Tests:** new `RateLimitFilterTest` (10 cases: under/over limit, + separate buckets per client, header-hashing, X-Forwarded-For + precedence, permit-list, default-rate fallback). Existing 6 test + classes updated for the new `McpTools` ctor signature. Full suite: + 3672 tests / 0 failures / 0 errors. + + **Known follow-up:** TopologyService still walks the full snapshot + in-memory after the cache hit — long-term plan is to rewrite each + topology tool as a targeted Cypher query so the snapshot isn't needed. + The cache is the bridge; the rewrite reduces peak memory. + ## [0.1.0] - 2026-03-28 First general-availability cut. See the diff --git a/CLAUDE.md b/CLAUDE.md index be136f3a..1f151ad8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -433,6 +433,20 @@ bean for code paths that haven't been ported yet. - **Parallel agent conflicts**: Don't dispatch multiple agents editing the same files concurrently. Use worktree isolation or sequential execution. - **SonarCloud project key**: `RandomCodeSpace_codeiq`, org: `randomcodespace` - **CI workflow**: Single `ci-java.yml` runs build + SonarCloud analysis. No cross-platform builds needed (JVM). +- **Spring Security only loads in the `serving` profile.** `application.yml` excludes `SecurityAutoConfiguration` + `SecurityFilterAutoConfiguration` + `UserDetailsServiceAutoConfiguration` at the **default** level so adding `spring-boot-starter-security` doesn't break ~3000 MockMvc tests by activating a default HTTP Basic chain. The `serving` profile re-enables them by listing only `UserDetailsServiceAutoConfiguration` (suppresses the auto user/password printout); the chain itself is built by `config/security/SecurityConfig`. **Don't** drop the default exclude — non-serving contexts (CLI, tests) must have no Spring Security wiring at all. +- **`BearerAuthFilter.shouldNotFilter` and `SecurityConfig.permitAll()` paths must stay in sync.** The filter runs before Spring's `AuthorizationFilter`, so if a path is in `permitAll()` but NOT in `shouldNotFilter`, the filter rejects it with 401 before Spring's chain can permit it. Open paths today: `/`, `/index.html`, `/favicon.ico`, `/assets/**`, `/static/**`, `/error`, `/actuator/health`, `/actuator/health/liveness`, `/actuator/health/readiness`. Adding any new permit-all endpoint requires updating BOTH places. +- **Constant-time bearer-token compare uses SHA-256 pre-hash.** Both the provided and expected token are hashed with SHA-256 before `MessageDigest.isEqual`. SHA-256 always produces 32-byte digests, so `isEqual` runs over fixed-size arrays — defeats the length oracle that makes raw `isEqual` unsafe across mismatched-length inputs. **Don't** "optimize" by removing the hash and comparing raw token bytes; that re-introduces the oracle. +- **Never log the `Authorization` header.** `BearerAuthFilter` deliberately never passes the header value to a logger, even at DEBUG. The rejection log line carries only `method` and `requestURI`. There's a regression test (`tokenValueNeverAppearsInLogs`) that captures all log lines for the filter and asserts the secret substring is absent. +- **`mode=none` + active `serving` profile = startup failure** unless `codeiq.mcp.auth.allow_unauthenticated=true` is **explicitly** set. This is by design — operators must opt into permissive mode deliberately. `mode=mtls` is reserved and currently throws "not yet implemented" (better than silently passing through). +- **`server.error.include-stacktrace: never`** is set in the serving profile as defense-in-depth alongside `GlobalExceptionHandler`. Don't enable it for "easier debugging" — stack frames in the response body leak class names + paths (CWE-209). Use the `request_id` in the envelope to correlate to the WARN log line where the full stack is captured. +- **Cypher transaction wall-clock cap is configured at the DBMS level**, not per-call. `Neo4jConfig.databaseManagementService(...)` sets `GraphDatabaseSettings.transaction_timeout = 30s` so every transaction gets the cap automatically. Don't reach for `graphDb.beginTx(timeout, unit)` overload in tool code — the test suite mocks `beginTx()` with no args and the overload changes the matcher signature, breaking the existing stubs across `McpToolsTest` / `McpToolsExpandedTest` / `McpToolsEvidenceTest`. +- **`McpTools.runCypher` row cap is enforced in the iteration loop, not via `LIMIT`.** After `maxResults` rows are accumulated the loop breaks and the response carries `truncated: true` + `max_results: N`. Don't try to inject `LIMIT N` into the user-supplied query string — that would require parsing the query (and the user's query may already have its own LIMIT). +- **`McpTools.getCachedData()` 60-second TTL snapshot is a bridge fix.** It's NOT the proper solution — the proper solution is to rewrite each topology MCP tool to use a targeted Cypher query so the full graph never needs to live on heap. The cache caps peak memory under concurrent calls but the snapshot itself is still multi-GB on large graphs. When that refactor lands, the `AtomicReference` and `getCachedData()` itself can be deleted. +- **`RateLimitFilter` keys by `sha256(Authorization)`** — the raw token NEVER goes into the bucket key map. The 16-hex-char digest is enough collision resistance for keying. Falls back to `X-Forwarded-For` (first hop) → `RemoteAddr` when no auth header is present. Buckets live in a `ConcurrentHashMap` — bounded in practice by `num_distinct_clients`, which for the single-tenant pod shape is small. Swap to a Caffeine cache with a max-size eviction if multi-tenant exposure is ever added. +- **Filter chain order in `serving` profile**: `SecurityHeadersFilter` → `RateLimitFilter` → `BearerAuthFilter` → ... → controller. Each `addFilterBefore(X, UsernamePasswordAuthenticationFilter.class)` inserts X immediately before UPAFilter, pushing the previously-inserted filter farther from the target — so the **registration order in `SecurityConfig.servingFilterChain` IS the chain order**. Don't shuffle without re-reasoning about it: if `RateLimitFilter` ran AFTER `BearerAuthFilter`, an unauthenticated brute-force attempt would never get throttled (would just see 401 over and over, hitting the slow path). +- **`Files.probeContentType` is best-effort** — JDK 25 on Linux uses `/etc/mime.types` + magic-byte fallback. It returns `null` if the type can't be determined; treat that as "let it through" (the byte cap in `SafeFileReader` still bounds size). The allowlist for `/api/file` is `text/*` + `application/{json,xml,x-yaml,javascript}` — extending requires adding to the explicit list in `GraphController.readFile`. +- **Sanitize user-controlled values before logging.** `BearerAuthFilter.sanitizeForLog(String)` strips `\p{Cntrl}` and truncates at 256 chars. Use it on anything tainted by `request.getRequestURI()`, `request.getMethod()`, headers, etc. before passing to a logger. CodeQL `java/log-injection` will flag direct `log.warn("... {} ...", request.getRequestURI())` as a vuln. +- **`mcp.limits.max_depth` is a NEW field on `McpLimitsConfig`** (default 10). Audit #10 / C3 — the original audit assumed it existed but it didn't. When adding new MCP traversal tools, cap depth via `Math.min(callerSupplied, maxDepth)` before passing to Cypher. The REST endpoint already had this guard via `config.getMaxDepth()` from `CodeIqConfig`; the MCP path now mirrors it via `McpLimitsConfig.maxDepth()`. ## Supply-chain observability (OpenSSF) diff --git a/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md b/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md new file mode 100644 index 00000000..ab840594 --- /dev/null +++ b/docs/audits/2026-04-28-serve-path-prod-readiness-counter.md @@ -0,0 +1,156 @@ +# Counter-Audit: serve-path Production Readiness +**Original audit:** `docs/audits/2026-04-28-serve-path-prod-readiness.md` (15 findings: 6 HIGH / 7 MEDIUM / 2 LOW) +**Counter-audit date:** 2026-04-28 +**Method:** Every finding verified against actual source files. Net-new findings added from independent inspection of GraphController, McpTools, ServeCommand, GraphHealthIndicator, CorsConfig, BundleCommand, SafeFileReader, CodeIqConfig, McpLimitsConfig, McpAuthConfig, GraphStore, application.yml, pom.xml, security.yml, ci-java.yml, and frontend/package.json. + +--- + +## Section A — Corrections to Original Audit + +### A1. Finding #6 overstated — `markReady()` fires AFTER `bootstrapNeo4jFromCache()`, not before + +**Original claim:** "`markReady()` fires before the graph is loaded" — traffic is routed during bootstrap. + +**What the code actually does:** `ServeCommand.java` lines 83–126: `graphBootstrapper.bootstrapNeo4jFromCache()` is called at line 84 and returns only when bootstrap completes. `markReady()` is called at line 126, after that return and after the node/edge count is printed to stdout. The auto-bootstrap race window the audit describes does not exist in the current code. + +**What IS real:** `GraphHealthIndicator` is not wired into the readiness group in `application.yml` (no `management.endpoint.health.group.readiness.include: readinessState,graph`). So the Kubernetes readiness probe does not gate on graph health. If a future code change moves `markReady()` earlier, or if bootstrap is made async, this becomes acute. The config gap is the real finding. + +**Correction:** Downgrade finding #6 from HIGH to MEDIUM. The fix (adding `management.endpoint.health.group.readiness.include: readinessState,graph` to the serving profile in `application.yml`) is still valid and still needed, but the acute rollout-during-bootstrap race does not currently exist. + +--- + +### A2. Finding #10 is half-wrong — REST `traceImpact` IS depth-capped; MCP is not + +**Original claim:** "the API endpoint `/api/triage/impact/{id}` (`GraphController:188`) doesn't appear to bound it." + +**What the code actually does:** `GraphController.java` line 192: +```java +int cappedDepth = Math.min(depth, config.getMaxDepth()); +``` +`config.getMaxDepth()` defaults to 10 (`CodeIqConfig.java`). The REST endpoint is bounded. + +**What IS true — in the other direction:** `McpTools.traceImpact` passes `depth != null ? depth : 3` directly to `queryService.traceImpact` with no `Math.min` guard. The MCP path is unbounded. + +**Correction:** The unbounded-depth defect exists on the MCP tool, not the REST controller. The fix targets `McpTools.traceImpact`, not `GraphController`. Severity (MEDIUM) is unchanged; affected surface is corrected. See also C3 below. + +--- + +### A3. Finding #11 is partially wrong — `SafeFileReader` does enforce a byte cap + +**Original claim:** "no `Content-Length` cap matches `getMaxFileBytes`" — implies size is unbounded. + +**What the code actually does:** `GraphController.readFile` calls `SafeFileReader.read(resolved, startLine, endLine, config.getMaxFileBytes())`. `SafeFileReader` enforces the byte cap for both full-file and line-range reads. The cap is real and working. + +**What IS real:** Content-type is not sniffed. Binary files (`.jks`, `.so`, `.png`) are served as `text/plain` with no early reject. The slow-client connection exhaustion concern is valid. The size-cap concern is not. + +**Correction:** Remove "no Content-Length cap" from finding #11. The binary content-type concern stands; severity (MEDIUM) is unchanged. + +--- + +## Section B — Severity Adjustments + +### B1. Finding #6 — downgrade from HIGH to MEDIUM (per A1) + +The bootstrap-before-markReady ordering eliminates the acute readiness race. The residual gap — readiness group not configured — is MEDIUM: probes do not gate on graph health, which can cause transient 503s after a pod restart if Neo4j is slow to open, but the bootstrap path completes before traffic is accepted under normal conditions. + +--- + +### B2. Finding #4 — scope extended; severity remains HIGH + +Finding #4 correctly flags null checksums in `BundleCommand.createManifest` (line 241: `null` passed as the checksums argument). This is confirmed. However the finding misses a co-equal defect: `generateServeShell` (lines 265–274 in `BundleCommand.java`) emits a `serve.sh` that unconditionally downloads the JAR at runtime: + +```bash +curl -fL -o "$JAR" \ + "https://repo1.maven.org/maven2/io/github/randomcodespace/iq/code-iq/${VERSION}/code-iq-${VERSION}-cli.jar" +``` + +An equivalent `serve.bat` exists for Windows. This is a direct violation of `build.md` ("No runtime network calls to the public internet"). Bundles deployed in air-gapped environments silently fail to start when the JAR is absent. A compromised Maven Central namespace could substitute a malicious JAR, and even correct checksums in `manifest.json` would not protect against it because the downloaded JAR is not verified against them. The fix for finding #4 must address both the null checksums **and** the runtime download. + +--- + +## Section C — Missed Findings + +### C1. HIGH — `getCachedData()` loads the full graph into heap on every topology MCP tool call + +**Symptom in prod:** Seven MCP tools — `serviceDetail`, `blastRadius`, `findPath`, `findBottlenecks`, `findCircularDeps`, `findDeadServices`, `findNode` — all begin by calling `getCachedData()` (`McpTools.java` lines 83–92). `getCachedData()` calls `graphStore.findAll()`, which executes two full graph scans (`findAllNodes` + `findAllEdges`) and materialises every node and every edge into a `GraphData` record on the Java heap. On a 5M-node enriched graph this is multiple gigabytes per call. No result is cached between invocations — each call pays the full allocation cost. Two concurrent `blast_radius` invocations double-allocate. This is an OOM vector independent of the `run_cypher` issue in finding #2, and it is triggered by normal topology tool usage, not by adversarial queries. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:83–92` (`getCachedData`); `src/main/java/io/github/randomcodespace/iq/graph/GraphStore.java` (`findAll`, `findAllNodes`, `findAllEdges`). + +**Severity:** HIGH + +**Fix proposal:** Replace `getCachedData()` with targeted Cypher queries per tool (e.g. `blastRadius` needs only the subgraph reachable from the seed node, not all 5M nodes). If a full snapshot is required for correctness, populate it once at serve startup into a size-bounded `SoftReference` and invalidate on graph reload. Add a Caffeine cache with a 60-second TTL and a max-weight bound in bytes. Effort: M. + +--- + +### C2. MEDIUM — Swagger UI exposed unauthenticated; full API schema readable by any cluster workload + +**Symptom in prod:** `pom.xml` includes `springdoc-openapi-starter-webmvc-ui:3.0.3`. SpringDoc auto-registers `/swagger-ui/index.html`, `/swagger-ui.html`, and `/v3/api-docs` at startup with no authentication guard. Because there is no Spring Security on the classpath (finding #1), no filter intercepts these paths. Any actor who can reach the pod gets the complete OpenAPI schema: every endpoint path, parameter name, response shape, and the full enumeration of `NodeKind` / `EdgeKind` values. This is reconnaissance-in-depth that lowers the cost of exploiting finding #1. Neither `springdoc.swagger-ui.enabled` nor `springdoc.api-docs.enabled` is set to `false` in the serving profile of `application.yml`. + +**File / location:** `pom.xml` (`springdoc-openapi-starter-webmvc-ui:3.0.3`); `src/main/resources/application.yml` (no `springdoc.*` keys in serving profile). + +**Severity:** MEDIUM + +**Fix proposal:** In `application.yml` serving profile: `springdoc.swagger-ui.enabled: false` and `springdoc.api-docs.enabled: false`. Provide opt-in `codeiq.serving.swagger-ui.enabled: true` for local development. When auth (finding #1) is implemented, gate `/swagger-ui/**` and `/v3/api-docs/**` behind the same bearer check. Effort: XS. + +--- + +### C3. MEDIUM — `McpTools.traceImpact` has no depth cap on the MCP path + +**Symptom in prod:** As established in A2, `McpTools.traceImpact` forwards caller-supplied `depth` to `queryService.traceImpact` without any `Math.min` guard. A malicious or runaway MCP client sends `depth=1000` on a hub node; the resulting `RELATES_TO*1..1000` variable-length Cypher match runs until the transaction timeout fires — which is also not configured (finding #2). The REST endpoint at `GraphController:192` is safe; the MCP surface is not. `McpLimitsConfig` already defines a `maxDepth` field that is never consumed here. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (`traceImpact` method, depth forwarding). + +**Severity:** MEDIUM + +**Fix proposal:** `int safedDepth = depth != null ? Math.min(depth, mcpLimitsConfig.maxDepth()) : 3;` — wire the already-parsed `maxDepth` from `McpLimitsConfig`. Effort: XS. + +--- + +### C4. MEDIUM — `semgrep` installed from PyPI without a pinned version in `security.yml` + +**Symptom in prod:** `.github/workflows/security.yml` line 94 runs `python -m pip install --quiet --upgrade pip semgrep` with no version pin. Every workflow run fetches the latest `semgrep` release from PyPI at the moment of execution. Every other tool in the same workflow is pinned: `osv-scanner` uses `OSV_SCANNER_VERSION: 2.3.5` with a named release download; `gitleaks` uses `GITLEAKS_VERSION: 8.30.1`; all GitHub Actions are SHA-pinned. A compromised PyPI release of `semgrep` (or a transitive dependency) would execute arbitrary code inside the SAST job, which runs with `contents: read` permission and access to `GITHUB_TOKEN`. This directly contradicts the workflow's header comment: "All actions SHA-pinned per Scorecard `Pinned-Dependencies`." + +**File / location:** `.github/workflows/security.yml:94`. + +**Severity:** MEDIUM + +**Fix proposal:** Pin to a specific version: `pip install semgrep==` (resolve via PyPI). Better: use `returntocorp/semgrep-action` pinned by commit SHA (free for OSS), which eliminates the PyPI install entirely and aligns with the workflow's existing SHA-pinning posture. Effort: XS. + +--- + +### C5. LOW — CLAUDE.md tech-stack version table is stale on three components + +**Symptom:** CLAUDE.md "Tech Stack" section states Spring Boot `4.0.5`, Spring AI `2.0.0-M3`, Neo4j Embedded `2026.02.3`. Actual values in `pom.xml`: Spring Boot `4.0.6`, Spring AI `2.0.0-M4`, Neo4j `2026.04.0`. Stale docs cause reviewers and automated tooling to reference wrong versions when checking CVE databases or compatibility matrices. + +**File / location:** `CLAUDE.md` (Tech Stack section); `pom.xml` (ground truth). + +**Severity:** LOW + +**Fix proposal:** Update three lines in CLAUDE.md. Note that `pom.xml` is the SSoT; CLAUDE.md is informational. Effort: XS. + +--- + +## Summary Table + +| # | Original Finding | Verdict | Adjustment | +|---|-----------------|---------|------------| +| 1 | No auth on API/MCP | Confirmed | — | +| 2 | `run_cypher` no cap / timeout / READ mode | Confirmed | — | +| 3 | No rate limiting | Confirmed | — | +| 4 | Unsigned bundle + null checksums | Confirmed + extended | serve.sh/bat runtime Maven Central download is co-equal defect; fix must cover both | +| 5 | `/api/file` ships secrets in bundle | Confirmed | — | +| 6 | Readiness fires before graph load | **Partially wrong** | Downgrade HIGH → MEDIUM; bootstrap-before-markReady ordering is correct; readiness group config gap is the real issue | +| 7 | No `@RestControllerAdvice`; stack trace leak | Confirmed | — | +| 8 | MCP errors return HTTP 200 | Confirmed | — | +| 9 | No structured logs / request ID / MDC | Confirmed | — | +| 10 | `findShortestPath` + `traceImpact` unbounded | **Partially wrong** | REST `traceImpact` IS capped via `Math.min`; MCP `traceImpact` is NOT; fix target corrected | +| 11 | `/api/file` no size cap; binary served as text | **Partially wrong** | Size cap IS enforced by SafeFileReader; binary content-type issue stands; remove size-cap claim | +| 12 | `GraphHealthIndicator` uncached count on every probe | Confirmed | — | +| 13 | CORS defaults wrong; no CSP / security headers | Confirmed | — | +| 14 | Bad YAML silently uses defaults; no fail-fast | Confirmed | — | +| 15 | Zero integration tests for auth / rate-limit path | Confirmed | — | +| C1 | `getCachedData()` full graph load per topology call | **NET NEW** | HIGH | +| C2 | Swagger UI unauthenticated | **NET NEW** | MEDIUM | +| C3 | MCP `traceImpact` no depth cap | **NET NEW** | MEDIUM | +| C4 | `semgrep` unpinned in `security.yml` | **NET NEW** | MEDIUM | +| C5 | CLAUDE.md version table stale | **NET NEW** | LOW | diff --git a/docs/audits/2026-04-28-serve-path-prod-readiness.md b/docs/audits/2026-04-28-serve-path-prod-readiness.md new file mode 100644 index 00000000..1b572d62 --- /dev/null +++ b/docs/audits/2026-04-28-serve-path-prod-readiness.md @@ -0,0 +1,177 @@ +## 1. HIGH — MCP and REST API are fully unauthenticated; one curl from anywhere on the cluster reads the whole graph + +**Symptom in prod:** Pod has no auth on `/api/**` or `/mcp` (no Spring Security on classpath, no `@PreAuthorize`, no filter, no token check). Any other workload in the AKS namespace — including a compromised sidecar in another tenant's pod that resolves the codeiq Service — can hit `GET /api/file?path=...` and exfiltrate every byte under the analyzed codebase root, plus run arbitrary read-only Cypher via `POST /mcp` `run_cypher`. The unified config defines `mcp.auth.mode: bearer|mtls` (`McpAuthConfig`) but **nothing wires it into a filter** — the field is dead. East-west attack on multi-tenant pipeline = data exfil from other tenants' analyzed source. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:39` (no `@PreAuthorize`); `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:269` (no auth check); `pom.xml` (no `spring-boot-starter-security`); `src/main/java/io/github/randomcodespace/iq/config/unified/McpAuthConfig.java` (config class, never consumed). + +**Severity:** HIGH + +**Fix proposal:** Add `spring-boot-starter-security`. Implement `SecurityFilterChain` in a new `config/SecurityConfig.java` that, when `codeiq.mcp.auth.mode=bearer`, requires `Authorization: Bearer ${CODEIQ_MCP_TOKEN}` on `/api/**` AND `/mcp/**` (constant-time compare). Permit only `/actuator/health/*`. Default `mode=none` permitted only when `spring.profiles.active` contains `local`. Effort: M. + +--- + +## 2. HIGH — `run_cypher` has zero result-set cap, zero query timeout, and runs in the default (read+write) tx mode + +**Symptom in prod:** A single MCP client sends `MATCH (a:CodeNode), (b:CodeNode), (c:CodeNode) RETURN a, b, c LIMIT 999999999`. `runCypher` accumulates rows in an `ArrayList>` with no cap, the JVM heap fills, `OutOfMemoryError` triggers (heap dump goes to `/tmp` per `aks-launch.sh:51`, eats tmpfs), pod is `OOMKilled`. Tenant outage ≥60s while replica restarts and re-bootstraps Neo4j. Embedded Neo4j has no per-query memory limit configured (`Neo4jConfig.java`, no `dbms.memory.transaction.max_size`). Additionally, `tx.execute(query)` runs in default access mode, not READ — so a procedure registered later (or one this regex-blocklist misses) could mutate. The CLAUDE.md "Gotchas" already calls out RAN-31 ("pin run_cypher to Neo4j READ access mode") but the current code at `mcp/McpTools.java:296` still uses `graphDb.beginTx()` not `beginTx(KernelTransaction.Type.IMPLICIT, AUTH_DISABLED, AccessMode.Static.READ, timeoutMs, MILLIS)`. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:269-318` (`runCypher`); `mcp/McpTools.java:311` (unbounded `rows.add`); `src/main/java/io/github/randomcodespace/iq/config/Neo4jConfig.java` (no transaction-timeout / memory settings). + +**Severity:** HIGH + +**Fix proposal:** Use `graphDb.beginTx(perToolTimeoutMs, MILLIS)` (transaction timeout already in `McpLimitsConfig.perToolTimeoutMs=15000`). Cap rows at `mcp.limits.max_results` (500) and stop iterating; return a `truncated: true` flag. Cap accumulated payload bytes at `mcp.limits.max_payload_bytes` (2 MB) by serializing-as-you-go. Configure `dbms.memory.transaction.max_size=512m` in `Neo4jConfig`. Effort: S. + +--- + +## 3. HIGH — No rate limiting anywhere; one MCP client saturates the pod for everyone + +**Symptom in prod:** `mcp.limits.rate_per_minute: 300` is defined in `McpLimitsConfig` and parsed by `UnifiedConfigLoader.java:166` but **no filter or interceptor enforces it** (zero hits for `Bucket4j|Resilience4j|RateLimiter|HandlerInterceptor` in main source). One agent client in a runaway loop fires `find_cycles` (which runs `MATCH p=(a)-[:RELATES_TO*2..10]->(a)` — graph-wide variable-length match, no per-call limit) at hundreds of QPS. Tomcat virtual-thread executor saturates Neo4j page cache, p99 on `/api/stats` jumps from 50 ms to multi-second, readiness probe (`periodSeconds: 5`) starts to flake, kubelet restarts the pod (`replicas: 1` — no failover), tenant goes dark. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (no rate limiter); `src/main/java/io/github/randomcodespace/iq/api/GraphController.java` (no rate limiter); `src/main/java/io/github/randomcodespace/iq/config/unified/McpLimitsConfig.java` (`ratePerMinute` parsed but unused). + +**Severity:** HIGH + +**Fix proposal:** Add Bucket4j (Apache-2.0, single dep, ~80 KB). Register an `OncePerRequestFilter` keyed by `Authorization` token (or remote IP fallback) with a refill-per-second token bucket sized at `mcp.limits.rate_per_minute / 60`. 429 with `Retry-After` header on bucket exhaustion. Apply to `/api/**` and `/mcp/**`. Effort: S. + +--- + +## 4. HIGH — Bundle is unsigned and unverified; init-container blindly unzips whatever Nexus serves + +**Symptom in prod:** AKS init-container (`shared/runbooks/aks-read-only-deploy.md:48-72`) runs `curl -u $NEXUS_USER:$NEXUS_PASS .../bundle.zip | unzip` with no checksum verification, no signature check. `ArtifactManifest` defines a `checksums` field (`Map`) but `BundleCommand.createManifest` (`cli/BundleCommand.java`) passes `null` for it (sed shows `null` literal in the constructor call). On Nexus credential compromise OR a malicious internal user with `codeiq-bundles` write access, an attacker swaps `bundle.zip` with one that contains a `graph.db/` planted with a Cypher full-text index that triggers JNDI lookup, OR a `serve.sh` that is NEVER actually invoked at runtime but still — once bundles are signed, you can also trust `manifest.json`. Single tenant's bundle becomes a foothold across the whole pipeline because the same Nexus path is served to every replica. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/cli/BundleCommand.java:141-150` (manifest checksum field passed `null`); `src/main/java/io/github/randomcodespace/iq/intelligence/ArtifactManifest.java` (record defines `checksums` but never populated); `shared/runbooks/aks-read-only-deploy.md:48-72` (no `sha256sum -c` step). + +**Severity:** HIGH + +**Fix proposal:** In `BundleCommand`, after writing each entry, accumulate SHA-256 in a `MessageDigest` and emit the map. Write a sibling `bundle.zip.sha256` file uploaded next to the bundle. In the init-container, fetch `.sha256` first and `sha256sum -c` before unzip. For tamper-resistance, also sign with cosign / GPG (Sigstore = supply-chain consistent with §7.1 of engineering-standards). Effort: M. + +--- + +## 5. HIGH — `/api/file` reads anything under the codebase root; bundle ships full source — credentials, .env, .pem all readable + +**Symptom in prod:** `GraphController.readFile` (line 255) and `McpTools.readFile` (line 394) traverse-protect to the codebase root, but the bundle (`BundleCommand`, `source/` directory) ships **the entire source tree** including `.env`, `.aws/credentials` if committed, private keys checked in by mistake, secrets in `application-local.yml`. An authenticated MCP client (or unauthenticated, until #1 is fixed) calls `read_file(path=".env")` and prints the file. There is no extension allow-list, no `.gitignore`-aware filter at bundle time, no scrubber. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:255-310` (`readFile`); `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java:394-420`; `src/main/java/io/github/randomcodespace/iq/cli/BundleCommand.java` (`source/` packaging — no exclusion). + +**Severity:** HIGH + +**Fix proposal:** At bundle time, exclude a curated set: `**/.env*`, `**/*.pem`, `**/*.key`, `**/id_rsa*`, `**/credentials`, `**/secrets/**`, anything matched by `.gitignore`. At read time, reject those same patterns even if they slip through. Add a `serving.read_file_extension_allowlist` config (default = source-code extensions only). Effort: S. + +--- + +## 6. HIGH — `/actuator/health/readiness` returns 200 before the graph is loaded + +**Symptom in prod:** `ServeCommand.markReady()` publishes `ReadinessState.ACCEPTING_TRAFFIC` after the Spring context is up, but `GraphHealthIndicator` (`health/GraphHealthIndicator.java`) is registered as a generic `HealthIndicator`, not under the readiness group. With Spring Boot's defaults, custom `HealthIndicator`s land in the liveness+readiness composite **only if they're added to the `readiness` group**. Right now: pod becomes "ready" the moment Spring starts (~8-16s per CLAUDE.md) but `GraphBootstrapper` is still loading H2 → Neo4j (can take seconds-to-minutes for big graphs). Readiness probe passes, kube-proxy routes traffic, every request 503s with "Neo4j graph not available" (`GraphController.requireQueryService:line ~30`). On rolling deploy this also means the new pod is marked ready before old pod is drained → 100% error rate during the rollover window. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/cli/ServeCommand.java:~110` (`markReady()`); `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java:1-40` (no readiness group); `application.yml` `serving` profile (`management.health.readinessstate.enabled: true` but no `management.endpoint.health.group.readiness.include: graph,readinessState`). + +**Severity:** HIGH + +**Fix proposal:** Move `markReady()` to fire **after** `GraphBootstrapper` returns AND `graphStore.count() > 0`. Add to `application.yml` (serving profile): `management.endpoint.health.group.readiness.include: readinessState,graph`. Add a regression test. Effort: S. + +--- + +## 7. MEDIUM — No `@RestControllerAdvice`; uncaught exceptions return generic 500s with stack-trace bodies, no error envelope + +**Symptom in prod:** `grep '@ControllerAdvice'` returns zero hits in `src/main/java`. When `QueryService.nodesByKind` throws (Neo4j tx died, NPE on a malformed cached node, etc.), Spring's default error attributes return a JSON body with `"trace": "...full stack..."` if `server.error.include-stacktrace` defaults haven't been turned off — and nothing in `application.yml` turns it off. On-call sees redacted `INTERNAL_SERVER_ERROR` in clients but the response body leaks classnames + line numbers (CWE-209). MCP tools partially mask this by returning `{"error": "..."}` 200 (which is its OWN problem — see finding #8). REST has no consistent error envelope at all. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java` (mixed `ResponseStatusException` + raw return); no `*ControllerAdvice*.java` files; missing `server.error.include-stacktrace=never` in `application.yml`. + +**Severity:** MEDIUM + +**Fix proposal:** Add `api/GlobalExceptionHandler.java` with `@RestControllerAdvice`. Map `ResponseStatusException` through, all others to `{"code": "INTERNAL", "message": , "request_id": }` with HTTP 500. Set `server.error.include-stacktrace: never` and `server.error.include-message: never` in the serving profile. Effort: S. + +--- + +## 8. MEDIUM — MCP tools return `{"error": "..."}` with HTTP 200, defeating client retry logic and observability + +**Symptom in prod:** Every `catch (Exception e)` in `McpTools` returns `toJson(Map.of(PROP_ERROR, e.getMessage()))` as a successful 200 response. Spring Boot metrics (`http.server.requests`) record these as 2xx, so error-rate dashboards stay green during incidents. MCP clients with retry-on-non-2xx never retry, never alert. Worse, `e.getMessage()` from a Neo4j parse error can leak query structure / node IDs from another tenant if a path-traversal bug ever lands. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/mcp/McpTools.java` (35+ `catch (Exception e) { return toJson(Map.of(PROP_ERROR, e.getMessage())); }` blocks). + +**Severity:** MEDIUM + +**Fix proposal:** Define error codes (`INVALID_INPUT`, `NOT_FOUND`, `INTERNAL`, `RATE_LIMITED`). Return MCP-spec-compliant errors (Spring AI MCP supports throwing — verify on its API). At minimum: log with stack trace at WARN, return `{"error": {"code": "INTERNAL", "message": "internal error", "request_id": ...}}` with the actual message redacted unless it's an `IllegalArgumentException`. Effort: S. + +--- + +## 9. MEDIUM — No structured logs, no request ID, no MDC; on-call has no way to correlate a slow request to a Neo4j query + +**Symptom in prod:** `grep MDC.put|requestId|X-Request-ID|OncePerRequestFilter` in `src/main/java`: zero hits. Pod logs are default Spring Boot text format. When customer reports "the graph endpoint hung for 30s at 14:32", on-call has only timestamp matching to find the query, no per-request span ID. With virtual threads enabled (`spring.threads.virtual.enabled: true`) and N concurrent slow requests, log lines interleave with no way to demux. + +**File / location:** `src/main/resources/logback*.xml` (none — uses Spring Boot default); `src/main/resources/application.yml` (no `logging.pattern.level`); no `RequestIdFilter`. + +**Severity:** MEDIUM + +**Fix proposal:** Add `logback-spring.xml` with JSON appender (logstash-logback-encoder, MIT, single dep) gated on `spring.profiles.active=serving`. Add a `RequestIdFilter` (`OncePerRequestFilter`) that pulls `X-Request-ID` or generates a UUID, populates MDC, returns it in the response header. Add `Micrometer` timers around each `@McpTool` (Spring AI auto-instruments REST). Expose `/actuator/prometheus` (currently `metrics` is exposed but not the Prometheus scrape endpoint). Effort: M. + +--- + +## 10. MEDIUM — `GraphStore.findShortestPath` and `traceImpact` have unbounded depth or fixed `[*..20]` with no row limit, no time guard + +**Symptom in prod:** `GraphStore.findShortestPath` (line 453) runs `MATCH p = shortestPath((a)-[*..20]-(b)) RETURN [n IN nodes(p) | n.id]` — fine on small graphs, on a 5M-node enriched bundle this is 30+ seconds. `traceImpact` runs `MATCH (a)-[:RELATES_TO*1..$depth]->(b)` with `depth` capped at 10 by `McpTools.traceImpact:line ~349` — but the API endpoint `/api/triage/impact/{id}` (`GraphController:188`) doesn't appear to bound it. With 99 detector kinds and `RELATES_TO*1..10` on a hub node (e.g. a popular library import), this is a Cartesian explosion. No `WITH p LIMIT N` cap, no `dbms.transaction.timeout` configured. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/graph/GraphStore.java:453` (`shortestPath`); `:line for traceImpact`; `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:188` (`triage/impact`). + +**Severity:** MEDIUM + +**Fix proposal:** Set `dbms.transaction.timeout=30s` in `Neo4jConfig`. Add `LIMIT $maxNodes` (e.g. 10000) on every `*..N` query. Bound `depth` ≤ 5 in REST endpoint and validate. Effort: S. + +--- + +## 11. MEDIUM — `/api/file` content-type is `text/plain` for all files; binary data dumps; no `Content-Length` cap matches `getMaxFileBytes` + +**Symptom in prod:** `readFile` returns binary files (a checked-in `.png`, `.jks` keystore, native `.so`) as `text/plain` with garbled UTF-8. Browser logs the entire base64-mangled body. The implementation reads via `SafeFileReader.read(resolved, startLine, endLine, config.getMaxFileBytes())` so size is bounded, but content-type isn't sniffed and there's no early-reject for non-text files. Slow client reading 1 MB file at 1 KB/s — keeps a virtual thread + a Tomcat connection occupied for 1000s. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/api/GraphController.java:255-310`. + +**Severity:** MEDIUM + +**Fix proposal:** Probe content type with `Files.probeContentType` or magic-byte check; if not `text/*`, return 415. Set `server.tomcat.connection-timeout=10s`, `server.tomcat.max-swallow-size=1MB`. Effort: S. + +--- + +## 12. MEDIUM — `GraphHealthIndicator.health()` calls `graphStore.count()` on every probe — `MATCH (n:CodeNode) RETURN count(n)` against an embedded DB + +**Symptom in prod:** Readiness probe `periodSeconds: 5` → 12 full Cypher count queries per minute, each holding a transaction open. On a 5M-node graph with concurrent user traffic, this contends with the page cache. Liveness probe also fires every 10s. The current implementation has no cache/throttle. + +**File / location:** `src/main/java/io/github/randomcodespace/iq/health/GraphHealthIndicator.java:30`. + +**Severity:** MEDIUM + +**Fix proposal:** Cache `count()` result for 30 s in an `AtomicReference`. Or: only verify "graph reachable" via a constant-time `tx.execute("RETURN 1").hasNext()`. Effort: S. + +--- + +## 13. MEDIUM — `CorsConfig` default allows `http://localhost:[*]` and `http://127.0.0.1:[*]`; in cluster, this is wrong but undetected; no CSP + +**Symptom in prod:** Default `codeiq.cors.allowed-origin-patterns` (`config/CorsConfig.java:14`) is hardcoded to dev-loopback patterns. In AKS, the React UI is served same-origin (no CORS needed) — this is fine — but if anyone exposes the API behind a reverse proxy at a different origin, they'll get cryptic CORS failures because the YAML doesn't override it (`codeiq.yml.example` doesn't even include it). Worse: zero CSP / X-Frame-Options / X-Content-Type-Options headers means the served React UI is clickjackable and the JSON endpoints can be loaded into a hostile origin's `