perf(serve): bound JVM/Neo4j memory and dedupe topology snapshot #118
Merged
OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph scale identified four cumulative offenders fighting for the same cgroup memory limit:

- McpTools and TopologyController each held an independent in-heap topology snapshot (~150 MB at this graph size). Under mixed REST + MCP traffic both lived on heap simultaneously.
- TopologyController's snapshot had no TTL; once loaded, held for the lifetime of the process.
- Spring `@EnableCaching` was on but no `CacheManager` bean was registered, so every `@Cacheable` region in QueryService fell back to ConcurrentMapCacheManager (unbounded, no TTL, no eviction).
- Neo4j embedded auto-grabbed ~50% of free RAM for its off-heap page cache at startup, racing the JVM heap inside a single cgroup.

Changes:

- Extract `query/TopologySnapshotProvider` as the single owner of the topology snapshot; both McpTools and TopologyController now consume it. 60 s TTL deduplicates concurrent loads and lets idle pods release the heap. The Snapshot record carries a `loaded` flag so the controller can still distinguish "no source available" (404) from "graph is empty" (200), preserving the legacy contract.
- Switch `cache.type: simple` → `caffeine` with `maximumSize=1000, expireAfterWrite=5m` in the serving profile; add the Caffeine dependency.
- Cap Neo4j page cache at 256 MiB via `GraphDatabaseSettings.pagecache_memory` in Neo4jConfig.
- Add `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError` to scripts/aks-launch.sh so the JVM heap is pinned to half the cgroup limit, leaving room for Neo4j page cache + Metaspace + JIT + Tomcat NIO buffers + OS slack.
- Add `shared/runbooks/aks-oom-quick-fix.md` with diagnostic commands, the Deployment YAML patch, and the OOMKilled-vs-readiness-flap decision tree.

Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %, no more OOMKilled events, idle pod releases topology snapshot after 60 s. All 3706 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
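The single-owner snapshot with a TTL and a `loaded` flag, as described in the commit message, might look roughly like the sketch below. This is not the PR's `TopologySnapshotProvider`: the class name `TtlSnapshotProvider`, the `Snapshot` fields, the `statusFor` helper, and the injectable clock are all hypothetical, and the graph is reduced to a list of node names for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical snapshot shape: `loaded` separates "no source" from "empty graph". */
record Snapshot(boolean loaded, List<String> nodes) {
    static Snapshot unavailable() {
        return new Snapshot(false, List.of());
    }
}

/** Sketch of a single-owner, TTL-bound snapshot holder (all names are illustrative). */
final class TtlSnapshotProvider {
    private final Supplier<Optional<List<String>>> source; // empty Optional = no source available
    private final long ttlNanos;
    private final LongSupplier clock;                       // injectable so tests can control time

    private Snapshot cached;     // guarded by `this`
    private long loadedAtNanos;  // guarded by `this`

    TtlSnapshotProvider(Supplier<Optional<List<String>>> source, Duration ttl, LongSupplier clock) {
        this.source = source;
        this.ttlNanos = ttl.toNanos();
        this.clock = clock;
    }

    /**
     * Returns the cached snapshot, reloading at most once per TTL window.
     * Callers that arrive during a load block on the monitor and then reuse the fresh snapshot.
     */
    synchronized Snapshot get() {
        long now = clock.getAsLong();
        if (cached == null || now - loadedAtNanos >= ttlNanos) {
            cached = source.get()
                    .map(nodes -> new Snapshot(true, List.copyOf(nodes)))
                    .orElseGet(Snapshot::unavailable);
            loadedAtNanos = now;
        }
        return cached;
    }

    /** Drops an expired snapshot so an idle process can free it; intended to be called from a timer. */
    synchronized void evictIfExpired() {
        if (cached != null && clock.getAsLong() - loadedAtNanos >= ttlNanos) {
            cached = null;
        }
    }

    /** Maps a snapshot to the status codes named in the PR: 404 for no source, 200 otherwise. */
    static int statusFor(Snapshot s) {
        return s.loaded() ? 200 : 404;
    }
}
```

In this sketch both consumers would share one `TtlSnapshotProvider` instance, so at most one snapshot is on heap at a time. Lazy expiry in `get()` alone never frees memory on a pod that receives no traffic, which is why `evictIfExpired()` is included; how the real provider releases the snapshot when idle is not stated in the PR.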
Summary
OOM review of `codeiq serve` on AKS at the typical ~200 K-node graph scale identified four cumulative offenders fighting for the same cgroup memory limit. This PR addresses all four:

- `McpTools` and `TopologyController` each held an independent in-heap topology snapshot. Extracted a single `query/TopologySnapshotProvider` (60 s TTL, idle-releasable) shared by both. The Snapshot record carries a `loaded` flag so the controller can still distinguish "no source available" (404) from "graph is empty" (200), preserving the legacy contract.
- `@EnableCaching` was on but no `CacheManager` bean was registered → unbounded `ConcurrentMapCacheManager`. Switched the serving profile to Caffeine (`maximumSize=1000, expireAfterWrite=5m`).
- Capped the Neo4j page cache at 256 MiB via `GraphDatabaseSettings.pagecache_memory` so embedded Neo4j stops auto-grabbing ~50 % of free RAM at startup.
- Added `-XX:MaxRAMPercentage=50 -XX:InitialRAMPercentage=25 -XX:+UseG1GC -XX:+ExitOnOutOfMemoryError` to `scripts/aks-launch.sh` so the heap is pinned to half the cgroup limit, leaving room for Neo4j + Metaspace + JIT + Tomcat NIO + OS slack.
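To illustrate why a size cap plus expire-after-write bounds cache memory where a plain concurrent map does not, here is a stdlib-only stand-in. It is not Caffeine and does not reproduce Caffeine's eviction policy: `BoundedCache` is a hypothetical class that evicts the oldest-written entry once the cap is exceeded and reloads entries older than the TTL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Stdlib stand-in for a size- and time-bounded cache; NOT Caffeine's implementation. */
final class BoundedCache<K, V> {
    /** A cached value together with the time it was written. */
    private record Timed<T>(T value, long writtenAtNanos) {}

    private final long ttlNanos;
    private final LongSupplier clock; // injectable so tests can control time
    private final Map<K, Timed<V>> map;

    BoundedCache(int maximumSize, long ttlNanos, LongSupplier clock) {
        this.ttlNanos = ttlNanos;
        this.clock = clock;
        // Insertion-ordered map that drops its oldest entry once maximumSize is exceeded.
        this.map = new LinkedHashMap<K, Timed<V>>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maximumSize;
            }
        };
    }

    /** Returns the cached value, invoking the loader on a miss or after the entry has expired. */
    synchronized V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Timed<V> entry = map.get(key);
        if (entry == null || now - entry.writtenAtNanos() >= ttlNanos) {
            map.remove(key); // remove first so the rewritten entry becomes the newest
            entry = new Timed<>(loader.apply(key), now);
            map.put(key, entry);
        }
        return entry.value();
    }

    synchronized int size() {
        return map.size();
    }
}
```

With `maximumSize=1000` the map can never hold more than 1000 entries regardless of how many distinct keys are queried, which is the property the unbounded fallback lacked. The sketch only expires entries when they are read again; a production cache would also clean up in the background.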
Plus a runbook at `shared/runbooks/aks-oom-quick-fix.md` with the diagnostic flow (OOMKilled vs readiness-flap) and the Deployment YAML patch.
Net effect at 200 K nodes / 4 GiB pod: peak heap ceiling drops ~50 %,
no more OOMKilled events, idle pod releases topology snapshot after
60 s.
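The budget behind these numbers can be checked with simple arithmetic. The sketch below assumes the heap ceiling is exactly the cgroup limit times `MaxRAMPercentage` (the JVM may align the real value slightly differently) and uses the 4 GiB pod and 256 MiB page cache from this PR; `MemoryBudget` and `MemoryBudgetDemo` are hypothetical names.

```java
/** Back-of-envelope container memory budget; all inputs are supplied by the caller. */
record MemoryBudget(long limitBytes, long heapBytes, long pageCacheBytes) {
    static final long MIB = 1024L * 1024L;

    /** Assumes heap ceiling = limit * maxRamPercentage / 100, ignoring JVM alignment. */
    static MemoryBudget of(long limitBytes, double maxRamPercentage, long pageCacheBytes) {
        long heap = (long) (limitBytes * (maxRamPercentage / 100.0));
        return new MemoryBudget(limitBytes, heap, pageCacheBytes);
    }

    /** What is left for Metaspace, JIT code cache, NIO buffers, thread stacks, and OS slack. */
    long remainingBytes() {
        return limitBytes - heapBytes - pageCacheBytes;
    }
}

final class MemoryBudgetDemo {
    public static void main(String[] args) {
        // 4 GiB pod, -XX:MaxRAMPercentage=50, 256 MiB Neo4j page cache.
        MemoryBudget b = MemoryBudget.of(4096 * MemoryBudget.MIB, 50.0, 256 * MemoryBudget.MIB);
        System.out.printf("heap=%d MiB, pageCache=%d MiB, remaining=%d MiB%n",
                b.heapBytes() / MemoryBudget.MIB,
                b.pageCacheBytes() / MemoryBudget.MIB,
                b.remainingBytes() / MemoryBudget.MIB);
        // prints: heap=2048 MiB, pageCache=256 MiB, remaining=1792 MiB
    }
}
```

Under these assumptions roughly 1.75 GiB of the pod limit stays outside the heap and page cache, which is the headroom the launch flags are meant to protect.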
Test plan
🤖 Generated with Claude Code