From f60d8980efe0a47f8aadf62b37621686df1401c0 Mon Sep 17 00:00:00 2001
From: Amit Kumar
Date: Mon, 4 May 2026 16:36:57 +0000
Subject: [PATCH] perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph
scale identified four cumulative offenders fighting for the same cgroup
memory limit:

- McpTools and TopologyController each held an independent in-heap
  topology snapshot (~150 MB at this graph size). Under mixed REST + MCP
  traffic both lived on heap simultaneously.
- TopologyController's snapshot had no TTL; once loaded, held for the
  lifetime of the process.
- Spring `@EnableCaching` was on but no `CacheManager` bean was
  registered, so every `@Cacheable` region in QueryService fell back to
  ConcurrentMapCacheManager (unbounded, no TTL, no eviction).
- Neo4j embedded auto-grabbed ~50% of free RAM for its off-heap page
  cache at startup, racing the JVM heap inside a single cgroup.

Changes:

- Extract `query/TopologySnapshotProvider` as the single owner of the
  topology snapshot; both McpTools and TopologyController now consume
  it. 60 s TTL deduplicates concurrent loads and lets idle pods release
  the heap. The Snapshot record carries a `loaded` flag so the
  controller can still distinguish "no source available" (404) from
  "graph is empty" (200), preserving the legacy contract.
- Switch `cache.type: simple` → `caffeine` with
  `maximumSize=1000, expireAfterWrite=5m` in the serving profile; add
  the Caffeine dependency.
- Cap Neo4j page cache at 256 MiB via
  `GraphDatabaseSettings.pagecache_memory` in Neo4jConfig.
- Add `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC
  -XX:+ExitOnOutOfMemoryError` to scripts/aks-launch.sh so the JVM heap
  is pinned to half the cgroup limit, leaving room for Neo4j page cache
  + Metaspace + JIT + Tomcat NIO buffers + OS slack.
- Add `shared/runbooks/aks-oom-quick-fix.md` with diagnostic commands,
  the Deployment YAML patch, and the OOMKilled-vs-readiness-flap
  decision tree.

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %,
no more OOMKilled events, idle pod releases topology snapshot after
60 s. All 3706 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 pom.xml                                      |  13 ++
 scripts/aks-launch.sh                        |  24 ++-
 shared/runbooks/aks-oom-quick-fix.md         | 162 ++++++++++++++++++
 .../iq/api/TopologyController.java           | 141 ++++-----------
 .../iq/config/Neo4jConfig.java               |  11 +-
 .../randomcodespace/iq/mcp/McpTools.java     |  67 +++-----
 .../iq/query/TopologySnapshotProvider.java   | 119 +++++++++++++
 src/main/resources/application.yml           |  11 +-
 .../api/TopologyControllerExtendedTest.java  |   9 +-
 .../iq/api/TopologyEndpointTest.java         |   8 +-
 .../iq/mcp/McpToolsEvidenceTest.java         |  11 +-
 .../iq/mcp/McpToolsExpandedTest.java         |   7 +-
 .../randomcodespace/iq/mcp/McpToolsTest.java |   6 +-
 13 files changed, 425 insertions(+), 164 deletions(-)
 create mode 100644 shared/runbooks/aks-oom-quick-fix.md
 create mode 100644 src/main/java/io/github/randomcodespace/iq/query/TopologySnapshotProvider.java

diff --git a/pom.xml b/pom.xml
index 5f86377c..a5867da9 100644
--- a/pom.xml
+++ b/pom.xml
@@ -170,6 +170,19 @@
 8.18.0
+
+
+ com.github.ben-manes.caffeine
+ caffeine
+
+