RandomCodeSpace · aksOps · Apr 28, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -46,7 +46,7 @@ If you hit something requiring GitHub App / PAT / OAuth that the runtime cannot
 <claude-mem-context>
 # Memory Context
 
-# [codeiq] recent context, 2026-04-28 1:14am UTC
+# [codeiq] recent context, 2026-04-28 6:43am UTC
 
 No previous sessions found.
 </claude-mem-context>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -225,6 +225,146 @@ for that specific tag for the per-commit details.
   path-B board ruling, they are not to be re-introduced without an explicit
   board reversal — see `shared/runbooks/engineering-standards.md` §5.1.
 
+### Security
+
+- **Production-readiness PR 1 of 5 — security baseline.** First half of the
+  audit findings catalogued under `docs/audits/2026-04-28-serve-path-prod-readiness.md`
+  (+ `-counter.md`). Closes audit findings #1, #7, #13 (HIGH/MEDIUM) and C2 (MEDIUM).
+  - **Bearer-token auth on `/api/**` and `/mcp/**`** (audit #1). Added
+    `spring-boot-starter-security`. New `config/security/SecurityConfig`,
+    `BearerAuthFilter`, `TokenResolver`. Token source priority:
+    `CODEIQ_MCP_TOKEN` env > `codeiq.mcp.auth.token` config > startup failure.
+    Constant-time compare via SHA-256 pre-hash + `MessageDigest.isEqual` —
+    32-byte digests on both sides defeat the length oracle. RFC 7235 §2.1
+    case-insensitive scheme matching (`Bearer`, `bearer`, etc.). Authorization
+    header value never reaches a logger from this code. Permit list:
+    `/`, `/index.html`, `/favicon.ico`, `/assets/**`, `/static/**`, `/error`,
+    `/actuator/health/{liveness,readiness}` — everything else under
+    `/api/**`, `/mcp/**`, `/actuator/**` requires the bearer token.
+  - **Fail-fast on misconfiguration** (audit #14 partial). `mode=bearer` with
+    no token resolved → throws at startup. `mode=none` with active `serving`
+    profile and `allow_unauthenticated` not explicitly set → throws at
+    startup. `mode=mtls` is reserved and explicitly throws "not yet
+    implemented" rather than silently passing through.
+  - **Defensive response headers** (audit #13). New
+    `config/security/SecurityHeadersFilter` sets `X-Content-Type-Options:
+    nosniff`, `X-Frame-Options: DENY`, `Content-Security-Policy: default-src
+    'self'; ... frame-ancestors 'none'`, `Referrer-Policy: no-referrer`,
+    `Permissions-Policy` disabling geolocation/camera/microphone.
+    `Strict-Transport-Security: max-age=31536000; includeSubDomains` is set
+    only when `X-Forwarded-Proto: https` is present (AKS terminates TLS at
+    ingress) — setting HSTS over plain HTTP would lock out misconfigured envs.
+  - **Uniform error envelope** (audit #7). New
+    `api/GlobalExceptionHandler` (`@RestControllerAdvice`,
+    `@Profile("serving")`) maps every uncaught exception to
+    `{"code","message","request_id"}` with the right HTTP status.
+    `IllegalArgumentException` → 400 with surfaced message.
+    `ResponseStatusException` → status code passes through. Anything else →
+    500 with generic message; the actual exception is logged at WARN with
+    the `request_id` so on-call can correlate without leaking stack frames
+    to the client. `application.yml` now sets
+    `server.error.include-stacktrace: never` + `include-message: never` +
+    `include-binding-errors: never` as belt-and-suspenders.
+  - **Default CORS deny-all in serving** (audit #13). `config/CorsConfig`
+    default changed from loopback patterns to empty. Empty means register
+    no mappings → Spring MVC rejects all preflighted cross-origin requests.
+    Operators who genuinely need cross-origin (e.g. dev with a separate
+    Vite server on a different port) explicitly set
+    `codeiq.cors.allowed-origin-patterns`. Logs the resolved state at
+    startup. The React UI at `/` is unaffected — it's served same-origin.
+  - **Swagger UI / api-docs disabled in serving** (counter-audit C2).
+    `springdoc.api-docs.enabled: false` + `springdoc.swagger-ui.enabled: false`
+    in the serving profile of `application.yml`. The OpenAPI schema is
+    reconnaissance data; reachable only when running locally or with the
+    indexing profile.
+  - **`management.endpoints.web.exposure.include` narrowed** to `health,info`
+    in serving (was `health,info,metrics`); `health.show-details: never`.
+    Defense-in-depth alongside the `SecurityFilterChain` `authenticated()`
+    rule on `/actuator/**`.
+  - **Spring Security autoconfig excluded outside serving.** Without the
+    `serving` profile (CLI, tests, IDE runs), Spring Security's default
+    HTTP Basic chain would lock all endpoints — adding the starter would
+    break ~3000 existing tests that pass through MockMvc with no token.
+    `application.yml` excludes `SecurityAutoConfiguration`,
+    `SecurityFilterAutoConfiguration`, `UserDetailsServiceAutoConfiguration`
+    at the default level; the `serving` profile narrows that exclude list
+    to only `UserDetailsServiceAutoConfiguration`, re-enabling the other two
+    (auto user/password stays suppressed; `SecurityConfig` builds the chain).
+  - **Tests:** 31 new unit tests across `BearerAuthFilterTest` (14 cases:
+    missing/wrong/empty/correct/lowercase scheme, length-oracle defense,
+    log-leak audit, `shouldNotFilter` paths, `SecurityContextHolder` cleanup),
+    `TokenResolverTest` (9 cases for mode/profile/env-priority/fail-fast),
+    `SecurityHeadersFilterTest` (5 cases for header presence/HSTS gating),
+    `GlobalExceptionHandlerTest` (3 cases verifying the envelope shape and
+    no stack-trace leak). Full suite: 3453 tests / 0 failures / 0 errors.
+
+  **Known follow-up (not in this PR):** the React UI cannot read env vars,
+  so the SPA shell and its static assets are served without authentication.
+  API/MCP calls from the UI must send `Authorization: Bearer <token>`, read
+  from a token the operator places in localStorage. A first-class UI auth
+  bootstrap (login flow + token-issuance endpoint, OR server-side template
+  injection) needs its own design and is tracked as a follow-up issue.
+
+- **Production-readiness PR 2 of 5 — resource limits & abuse protection.**
+  Closes audit findings #2, #3, C1 (HIGH) and #10, #11 (MEDIUM).
+  - **Cypher transaction timeout** (audit #2). Neo4j embedded
+    `GraphDatabaseSettings.transaction_timeout = 30s` configured in
+    `Neo4jConfig` — every transaction in the JVM, including `run_cypher`
+    and graph traversals, gets a hard wall-clock cap. Catches runaway
+    variable-length matches before they starve the page cache.
+  - **Result-set cap on `run_cypher`** (audit #2). Hard row cap at
+    `mcp.limits.max_results` (default 500); excess rows dropped, response
+    carries `truncated: true` + `max_results: N`. Defends the JVM heap
+    against `MATCH (a),(b),(c) RETURN a,b,c LIMIT 999999999` blowups.
+  - **MCP `traceImpact` depth cap** (audit #10 corrected, C3). New
+    `mcp.limits.max_depth` field (default 10) wired into
+    `McpTools.traceImpact` via `Math.min`. Defends against
+    `RELATES_TO*1..1000` path-count explosions on hub nodes.
+  - **TTL snapshot cache on topology tools** (audit C1).
+    `McpTools.getCachedData()` is now backed by a 60-second TTL snapshot. Without it,
+    every concurrent `service_dependencies` / `blast_radius` /
+    `find_path` / `find_bottlenecks` / `find_circular_deps` /
+    `find_dead_services` / `find_node` call paid the full
+    `graphStore.findAll()` cost and double-allocated multi-GB heaps.
+    A bridge fix; the proper refactor (TopologyService → per-tool Cypher)
+    is a tracked follow-up.
+  - **Per-client rate limiter** (audit #3). New `RateLimitFilter` using
+    Bucket4j 8.18.0 (Apache-2.0). Token bucket sized at
+    `mcp.limits.rate_per_minute` (default 300). Keyed by SHA-256 hash of
+    the `Authorization` header (so the token never lives in our key map),
+    falls back to `X-Forwarded-For` (first hop) or `RemoteAddr`. 429
+    response with `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`
+    headers. Registered before `BearerAuthFilter` so unauthenticated
+    brute-force is also throttled.
+  - **`/api/file` content-type sniff** (audit #11 corrected). Added
+    `Files.probeContentType` guard: non-text files (`.jks`, `.so`,
+    `.png`, native libs) return HTTP 415 with the probed type, instead
+    of being served as garbled `text/plain`. Allowlist: `text/*`,
+    `application/json`, `application/xml`, `application/x-yaml`,
+    `application/javascript`. The byte cap (already enforced by
+    `SafeFileReader`) is unchanged.
+  - **Tomcat slow-client tarpit** (audit #11).
+    `server.tomcat.connection-timeout: 10s`, `max-swallow-size: 1MB` in the
+    serving profile. Drops connections that hold a virtual thread + Tomcat
+    connection at 1 KB/s.
+  - **CodeQL hardening on the security baseline.** Sanitised request
+    method + URI before logging in `BearerAuthFilter` (CWE-117 / CodeQL
+    `java/log-injection`); removed env-var name from the bearer-token
+    bootstrap log line in `TokenResolver` (CodeQL `java/sensitive-log`);
+    documented the deliberate stateless-bearer rationale on
+    `SecurityConfig.csrf(disable)` (CodeQL `java/spring-disabled-csrf-protection`
+    — no exploit path on a no-cookie surface).
+  - **Tests:** new `RateLimitFilterTest` (10 cases: under/over limit,
+    separate buckets per client, header-hashing, X-Forwarded-For
+    precedence, permit-list, default-rate fallback). Existing 6 test
+    classes updated for the new `McpTools` ctor signature. Full suite:
+    3672 tests / 0 failures / 0 errors.
+
+  **Known follow-up:** TopologyService still walks the full snapshot
+  in-memory after the cache hit — long-term plan is to rewrite each
+  topology tool as a targeted Cypher query so the snapshot isn't needed.
+  The cache is the bridge; the rewrite reduces peak memory.
+
 ## [0.1.0] - 2026-03-28
 
 First general-availability cut. See the

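The constant-time bearer compare described in the CHANGELOG entry above (case-insensitive `Bearer` scheme, SHA-256 pre-hash, `MessageDigest.isEqual`) can be sketched roughly as follows. This is a minimal illustration only, not the code in `BearerAuthFilter`: the class name `BearerTokenCheck`, the method `matches`, and the handling of empty tokens are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not the project's BearerAuthFilter. */
final class BearerTokenCheck {
    // Scheme name plus the single separating space.
    private static final String SCHEME = "Bearer ";

    private BearerTokenCheck() {}

    /** Returns true only when the header carries the expected bearer token. */
    static boolean matches(String authorizationHeader, String expectedToken) {
        if (authorizationHeader == null || expectedToken == null || expectedToken.isEmpty()) {
            return false;
        }
        // Case-insensitive scheme match ("Bearer", "bearer", "BEARER", ...).
        if (!authorizationHeader.regionMatches(true, 0, SCHEME, 0, SCHEME.length())) {
            return false;
        }
        String provided = authorizationHeader.substring(SCHEME.length()).trim();
        if (provided.isEmpty()) {
            return false;
        }
        // Hash both sides so isEqual always compares two 32-byte arrays,
        // regardless of how long the presented token is.
        return MessageDigest.isEqual(sha256(provided), sha256(expectedToken));
    }

    private static byte[] sha256(String value) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Under these assumptions, `BearerTokenCheck.matches("bearer s3cret", "s3cret")` is accepted, while a wrong token of any length is rejected after the same fixed-size digest comparison. Note that the sketch never logs or returns the header value.
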
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -433,6 +433,20 @@ bean for code paths that haven't been ported yet.
 - **Parallel agent conflicts**: Don't dispatch multiple agents editing the same files concurrently. Use worktree isolation or sequential execution.
 - **SonarCloud project key**: `RandomCodeSpace_codeiq`, org: `randomcodespace`
 - **CI workflow**: Single `ci-java.yml` runs build + SonarCloud analysis. No cross-platform builds needed (JVM).
+- **Spring Security only loads in the `serving` profile.** `application.yml` excludes `SecurityAutoConfiguration` + `SecurityFilterAutoConfiguration` + `UserDetailsServiceAutoConfiguration` at the **default** level so adding `spring-boot-starter-security` doesn't break ~3000 MockMvc tests by activating a default HTTP Basic chain. The `serving` profile narrows the exclude list to only `UserDetailsServiceAutoConfiguration`, which re-enables the other two while still suppressing the auto user/password printout; the chain itself is built by `config/security/SecurityConfig`. **Don't** drop the default exclude: non-serving contexts (CLI, tests) must have no Spring Security wiring at all.
+- **`BearerAuthFilter.shouldNotFilter` and `SecurityConfig.permitAll()` paths must stay in sync.** The filter runs before Spring's `AuthorizationFilter`, so if a path is in `permitAll()` but NOT in `shouldNotFilter`, the filter rejects it with 401 before Spring's chain can permit it. Open paths today: `/`, `/index.html`, `/favicon.ico`, `/assets/**`, `/static/**`, `/error`, `/actuator/health`, `/actuator/health/liveness`, `/actuator/health/readiness`. Adding any new permit-all endpoint requires updating BOTH places.
+- **Constant-time bearer-token compare uses SHA-256 pre-hash.** Both the provided and expected token are hashed with SHA-256 before `MessageDigest.isEqual`. SHA-256 always produces 32-byte digests, so `isEqual` runs over fixed-size arrays — defeats the length oracle that makes raw `isEqual` unsafe across mismatched-length inputs. **Don't** "optimize" by removing the hash and comparing raw token bytes; that re-introduces the oracle.
+- **Never log the `Authorization` header.** `BearerAuthFilter` deliberately never passes the header value to a logger, even at DEBUG. The rejection log line carries only `method` and `requestURI`. There's a regression test (`tokenValueNeverAppearsInLogs`) that captures all log lines for the filter and asserts the secret substring is absent.
+- **`mode=none` + active `serving` profile = startup failure** unless `codeiq.mcp.auth.allow_unauthenticated=true` is **explicitly** set. This is by design — operators must opt into permissive mode deliberately. `mode=mtls` is reserved and currently throws "not yet implemented" (better than silently passing through).
+- **`server.error.include-stacktrace: never`** is set in the serving profile as defense-in-depth alongside `GlobalExceptionHandler`. Don't enable it for "easier debugging" — stack frames in the response body leak class names + paths (CWE-209). Use the `request_id` in the envelope to correlate to the WARN log line where the full stack is captured.
+- **Cypher transaction wall-clock cap is configured at the DBMS level**, not per-call. `Neo4jConfig.databaseManagementService(...)` sets `GraphDatabaseSettings.transaction_timeout = 30s` so every transaction gets the cap automatically. Don't reach for the `graphDb.beginTx(timeout, unit)` overload in tool code: the test suite mocks `beginTx()` with no args, and the overload changes the matcher signature, breaking the existing stubs across `McpToolsTest` / `McpToolsExpandedTest` / `McpToolsEvidenceTest`.
+- **`McpTools.runCypher` row cap is enforced in the iteration loop, not via `LIMIT`.** After `maxResults` rows are accumulated the loop breaks and the response carries `truncated: true` + `max_results: N`. Don't try to inject `LIMIT N` into the user-supplied query string — that would require parsing the query (and the user's query may already have its own LIMIT).
+- **`McpTools.getCachedData()` 60-second TTL snapshot is a bridge fix.** It's NOT the proper solution — the proper solution is to rewrite each topology MCP tool to use a targeted Cypher query so the full graph never needs to live on heap. The cache caps peak memory under concurrent calls but the snapshot itself is still multi-GB on large graphs. When that refactor lands, the `AtomicReference<CachedSnapshot>` and `getCachedData()` itself can be deleted.
+- **`RateLimitFilter` keys by `sha256(Authorization)`** — the raw token NEVER goes into the bucket key map. The 16-hex-char digest is enough collision resistance for keying. Falls back to `X-Forwarded-For` (first hop) → `RemoteAddr` when no auth header is present. Buckets live in a `ConcurrentHashMap` — bounded in practice by `num_distinct_clients`, which for the single-tenant pod shape is small. Swap to a Caffeine cache with a max-size eviction if multi-tenant exposure is ever added.
+- **Filter chain order in `serving` profile**: `SecurityHeadersFilter` → `RateLimitFilter` → `BearerAuthFilter` → ... → controller. Each `addFilterBefore(X, UsernamePasswordAuthenticationFilter.class)` inserts X immediately before UPAFilter, pushing the previously-inserted filter farther from the target — so the **registration order in `SecurityConfig.servingFilterChain` IS the chain order**. Don't shuffle without re-reasoning about it: if `RateLimitFilter` ran AFTER `BearerAuthFilter`, an unauthenticated brute-force attempt would never get throttled (would just see 401 over and over, hitting the slow path).
+- **`Files.probeContentType` is best-effort** — JDK 25 on Linux uses `/etc/mime.types` + magic-byte fallback. It returns `null` if the type can't be determined; treat that as "let it through" (the byte cap in `SafeFileReader` still bounds size). The allowlist for `/api/file` is `text/*` + `application/{json,xml,x-yaml,javascript}` — extending requires adding to the explicit list in `GraphController.readFile`.
+- **Sanitize user-controlled values before logging.** `BearerAuthFilter.sanitizeForLog(String)` strips `\p{Cntrl}` and truncates at 256 chars. Use it on anything tainted by `request.getRequestURI()`, `request.getMethod()`, headers, etc. before passing to a logger. CodeQL `java/log-injection` will flag direct `log.warn("... {} ...", request.getRequestURI())` as a vuln.
+- **`mcp.limits.max_depth` is a NEW field on `McpLimitsConfig`** (default 10). Audit #10 / C3 — the original audit assumed it existed but it didn't. When adding new MCP traversal tools, cap depth via `Math.min(callerSupplied, maxDepth)` before passing to Cypher. The REST endpoint already had this guard via `config.getMaxDepth()` from `CodeIqConfig`; the MCP path now mirrors it via `McpLimitsConfig.maxDepth()`.
 
 ## Supply-chain observability (OpenSSF)