feat: frontend rebuild (dashboard, scalable service map, MCP console) + triage/SQLite hardening #98
Merged
Reduces the MCP HTTP-streamable surface from 21 tools to 7, the minimum set needed for an LLM-driven incident-triage workflow on a 120-service SQLite deployment that's currently OOMing within an hour.

Kept (7): get_anomaly_timeline, get_service_map, get_service_health, root_cause_analysis, impact_analysis, trace_graph, search_logs.

Cut (14): get_system_graph, tail_logs, get_trace, search_traces, get_metrics, get_dashboard_stats, get_storage_status, find_similar_logs, get_alerts, correlated_signals, get_error_chains, get_investigations, get_investigation, get_graph_snapshot.

The cut tools fall into three buckets:
(a) duplicates of a kept tool with a slightly different framing (get_system_graph ≈ get_service_map, get_error_chains is folded into root_cause_analysis);
(b) tools that require subsystems being dropped in follow-up commits (find_similar_logs → vectordb, get_graph_snapshot → snapshot table);
(c) tools that belong to a separate forensic-analytics workflow not part of active triage (get_investigations, get_dashboard_stats).

MCP clients calling cut tools receive an "unknown tool" RPC error. There is no deprecation period; the cut is intentional and immediate.

Files touched:
- cache.go: cacheable list re-sorted to mirror toolDefs
- tools.go: dispatcher collapsed to the 7-case switch
- tools_ran20_test.go (find_similar_logs only): deleted
- server_ran22_test.go: pared down to the constructor-tenant signature test now that the HTTP find_similar_logs flow is gone (the no-header default-tenant invariant is covered by tenant_isolation_test.go)
- tenant_isolation_test.go: drops subtests for cut tools

Design spec: docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
The vectordb package was a pure-Go TF-IDF index for semantic log search, backing one MCP tool (find_similar_logs, cut in the prior commit) and one HTTP endpoint (/api/logs/similar). With the kept search_logs MCP tool already routing through SQLite FTS5 / pg_trgm GIN, the in-memory TF-IDF index is no longer reachable by any survivor.

Removing it reclaims ~5-15% of resident heap on a 120-service SQLite deployment that the maxSize=100000 index + 5-minute snapshot loop + startup ReplayFromDB hydrator otherwise consume. That heap pressure contributes to the OOM-within-an-hour failure mode this refactor is solving for.

Deletions:
- internal/vectordb/: index.go, snapshot.go, replay.go + tests
- internal/api/similar_handler.go + test: the /api/logs/similar route
- internal/storage/log_repo_replay_test.go + LogsForVectorReplay() and ListRecentHighSeverityLogsAllTenants() (only the vectordb hydrator read these; no other caller)
- internal/graphrag/clustering.go::SimilarErrors(): vectordb-dependent, no production caller; Drain template clustering is the survivor
- Vector* fields on telemetry.Metrics + RecordVector* observer methods
- VectorIndexMaxEntries / VectorIndexSnapshotPath / VectorIndexSnapshotInterval on config.Config

Signature changes:
- graphrag.New(repo, tsdbAgg, ringBuf, cfg): vectordb arg removed
- mcp.New(defaultTenant, repo, metrics, svcGraph): vectordb arg removed
- ui.NewServer(repo, metrics, topo): vectordb arg removed
- api.Server.SetVectorIndex removed

Operator migration:
- The data/vectordb.snapshot file is left in place on disk; the loader that read it at boot is deleted, so it becomes a stale file that is safe to remove by hand. No automatic cleanup.
- MCP clients calling find_similar_logs already receive "unknown tool" after the prior commit; the HTTP /api/logs/similar route now 404s.
The `graph_snapshots` table backed exactly one MCP tool (get_graph_snapshot, cut earlier in this PR); no UI surface or REST endpoint reads it. With the tool gone the table is pure write amplification: at 15-minute cadence × ~100 tenants × per-row JSON nodes+edges blob it adds ~67k rows/week even after the 7-day age prune, and the row-count backstop only kicks in above 100k. On the SQLite OOM-within-an-hour deployment this contributes meaningfully to the 2 TB/day disk growth.

Deletions:
- internal/graphrag/snapshot.go (entire file): GraphSnapshot GORM model, takeSnapshot / takeSnapshotForTenant, pruneOldSnapshots, GetGraphSnapshot, maxSnapshotRows constant.
- views.GraphSnapshot type + GraphSnapshotFromModel converter (only used by the removed test).
- TestGraphRAG_GetGraphSnapshot_TenantScoped + the GraphSnapshot wire-shape leak test in views_test.go.

Updates:
- AutoMigrateGraphRAG no longer creates the table on fresh installs. The graphRAGTables slice drops "graph_snapshots", so tenant-backfill skips it and the test asserting the per-table backfill no longer expects the row.
- refresh.go::snapshotLoop now only calls persistDrainTemplates; the snapshotEvery field and the loop name are kept for wiring stability so external Config.SnapshotEvery still tunes the drain-persist cadence.

Operator migration: existing graph_snapshots tables are LEFT IN PLACE on upgrade. AutoMigrate's IF NOT EXISTS semantics mean a populated table is not touched. Operators wanting to reclaim disk should `DROP TABLE graph_snapshots; VACUUM;` after upgrading. The table will stop receiving new writes immediately.
Makes the platform survivable at 120 services on SQLite, the target the
prior commits in this PR have been shaving heap and disk pressure for.
Two coordinated changes:
1. SQLite PRAGMA stanza in factory.go is hardened from 3 to 8 settings
and made fail-closed:
PRAGMA journal_mode=WAL
PRAGMA synchronous=NORMAL
PRAGMA cache_size=-262144 # 256 MB page cache
PRAGMA temp_store=MEMORY
PRAGMA mmap_size=1073741824 # 1 GB mmap
PRAGMA wal_autocheckpoint=10000 # checkpoint after 10k pages
PRAGMA journal_size_limit=67108864 # cap WAL at 64 MB
PRAGMA busy_timeout=5000
Each PRAGMA failure now aborts startup with a wrapped error
(`sqlite pragma %q failed: %w`) so an unexpected SQLite build that
doesn't honour, e.g. mmap_size, can't silently regress the platform
to default-tuned behaviour.
2. config.Load now runs `applyDriverDefaults(cfg)` after constructing
the Config struct. When DBDriver=sqlite (case-insensitive) AND the
operator did not explicitly set the env var (detected via
os.LookupEnv presence — value comparison would falsely treat
operator-set Postgres-default values as "unset"), the following
defaults flip:
DB_MAX_OPEN_CONNS 50 → 1
DB_MAX_IDLE_CONNS 10 → 1
INGEST_PIPELINE_WORKERS 8 → 2
INGEST_PIPELINE_QUEUE_SIZE 50000 → 10000
METRIC_MAX_CARDINALITY 10000 → 3000
STORE_MIN_SEVERITY "" → "WARN"
SAMPLING_RATE 1.0 → 0.05
GRPC_MAX_CONCURRENT_STREAMS 1000 → 240
LOG_FTS_ENABLED false → true
Postgres/MSSQL/MySQL paths are unchanged bit-for-bit (early-return
in applyDriverDefaults).
The applyDriverDefaults override is unit-tested for: the all-flip path,
the "respect explicit operator override" path, the Postgres no-op path,
and case-insensitive driver matching.
Design rationale and per-default justification:
docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
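
The fail-closed PRAGMA loop from change 1 might look roughly like the sketch below. The PRAGMA values and the error format string are taken from the commit message; the `execer` interface, `applyPragmas`, and the `fakeDB` test double are hypothetical, and the real factory.go may apply the settings differently (for example through DSN parameters).

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// execer is the one method needed here; *sql.DB satisfies it.
type execer interface {
	Exec(query string, args ...any) (sql.Result, error)
}

var sqlitePragmas = []string{
	"journal_mode=WAL",
	"synchronous=NORMAL",
	"cache_size=-262144", // negative means KiB: 262144 KiB = 256 MiB
	"temp_store=MEMORY",
	"mmap_size=1073741824",        // 1 GiB
	"wal_autocheckpoint=10000",    // pages
	"journal_size_limit=67108864", // 64 MiB
	"busy_timeout=5000",           // milliseconds
}

// applyPragmas stops at the first failure so startup aborts instead of
// continuing with partially applied tuning.
func applyPragmas(db execer) error {
	for _, p := range sqlitePragmas {
		if _, err := db.Exec("PRAGMA " + p); err != nil {
			return fmt.Errorf("sqlite pragma %q failed: %w", p, err)
		}
	}
	return nil
}

// fakeDB is a test double that rejects a single pragma.
type fakeDB struct {
	reject string
	seen   []string
}

func (f *fakeDB) Exec(query string, _ ...any) (sql.Result, error) {
	f.seen = append(f.seen, query)
	if f.reject != "" && query == "PRAGMA "+f.reject {
		return nil, errors.New("not supported by this build")
	}
	return nil, nil
}

func main() {
	db := &fakeDB{reject: "mmap_size=1073741824"}
	fmt.Println(applyPragmas(db))
	fmt.Println("statements attempted:", len(db.seen))
}
```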
Updates the operator-facing documentation to reflect the refactor in this PR:
- CLAUDE.md "MCP Server" section rewritten to describe the 7-tool triage surface (kept + cut lists). The architecture diagram drops the legacy Vector accelerator layer. The "Storage Architecture", "GraphRAG Architecture" (background processes, persistence models, log clustering), and "Key Directories" sections drop their vectordb / graph_snapshots mentions. A new "SQLite per-driver defaults" section documents the nine env-var overrides flipped by applyDriverDefaults and the eight PRAGMAs applied at startup.
- LOG_FTS_ENABLED entry rewritten to document the new SQLite-default `true` (with the LIKE-fallback / drop_fts reclaim path preserved).
- STORE_MIN_SEVERITY entry notes the new SQLite-default `"WARN"`.
- README.md "Features" bullet swaps "21 tools" for the 7-tool triage surface and inlines the kept tool names.
- .env.example drops the VECTOR_INDEX_* block, adds a "SQLite Tuning" block listing every auto-flipped default, and notes the 7-tool MCP surface under the MCP section.
- The design spec at docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md is the canonical record of the refactor's rationale, decision matrix, per-default justification, migration notes, and risk/mitigation table.
The github.com/RandomCodeSpace/central-ops module is private and 404s for this account, breaking offline/air-gapped builds. Only two symbols were used, both trivially replaceable in-tree:
- main.go: version.Detect() -> local detectVersion() via runtime/debug (runtime/debug was already imported); falls back to "local".
- internal/mcp/server.go: httputil.CORSMiddleware -> local corsMiddleware that sets Access-Control-Allow-* and answers OPTIONS preflight.

go mod edit -droprequire + go.sum cleanup. go build and go vet pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sole

Full frontend overhaul on React 19 + @ossrandom/design-system.

Dashboard (new default view):
- 12-col bento grid, 5 bands: hero health gauge, traffic/errors, top failing services, recent anomalies, platform health.
- Uptime stat + readiness probe dots + DB size.
- Recent anomalies clamped to a 15-minute window, deduped by service|type to the 20 most recent.

Service map:
- Raw cytoscape (cose-bilkent) topology scaling 1–200 services with every node on screen; node size = degree (edge count).
- Hover/click reveals a node's edges + stats; label LOD past 120 nodes.
- Graph/List segmented toggle (list is the accessible default on touch/small screens).

MCP Trial console:
- List-detail layout over the 7-tool triage surface via JSON-RPC.
- Dynamic tool forms, result views, history, live SSE stream, settings.

Platform:
- TopNav segmented tabs (Dashboard/Services/MCP); removed Logs & Traces.
- ErrorBoundary mounted; theme race fixed in main.tsx.
- ServicesView and MCPConsoleView lazy-loaded to keep cytoscape out of the initial bundle (206KB / 65KB gz).
- Re-embedded internal/ui/dist.

tsc -b, vitest, and eslint pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The metrics middleware's responseWriter wraps http.ResponseWriter to capture the status code and already forwards Hijack (for WebSocket upgrades), but embedding the interface drops Flush from the method set. The MCP SSE handler's w.(http.Flusher) assertion therefore failed and GET /mcp returned 500 "SSE not supported", so the UI LiveStream looped on EventSource onerror. Add a Flush() forwarder mirroring the existing Hijack(). GET /mcp now returns 200 text/event-stream and pushes endpoint + resources/updated events. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
DashboardStats.AvgLatencyMs was milliseconds but P99Latency was microseconds (storage computes p99 in µs and its tests assert µs). Only avg got the µs→ms conversion, so the dashboard rendered the raw µs p99 under an "ms" label — e.g. 4,430,763 ms. Convert p99 to ms at the API view boundary (storage stays µs so its tests pass) and rename the field to p99_latency_ms for unit-explicit parity with avg_latency_ms. Frontend reads the renamed field. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
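
The unit fix amounts to a single division at the view boundary. In the sketch below the JSON field names follow the commit message, while the `storageStats` and `dashboardStatsView` types, their Go field names, and the `usToMs` helper are hypothetical.

```go
package main

import "fmt"

// storageStats stands in for the storage-layer result, which stays in µs.
type storageStats struct {
	AvgLatencyUs int64
	P99LatencyUs int64
}

// dashboardStatsView is the API shape; both fields are unit-explicit.
type dashboardStatsView struct {
	AvgLatencyMs float64 `json:"avg_latency_ms"`
	P99LatencyMs float64 `json:"p99_latency_ms"`
}

func usToMs(us int64) float64 { return float64(us) / 1000.0 }

// toView converts both latencies, not just the average.
func toView(s storageStats) dashboardStatsView {
	return dashboardStatsView{
		AvgLatencyMs: usToMs(s.AvgLatencyUs),
		P99LatencyMs: usToMs(s.P99LatencyUs),
	}
}

func main() {
	v := toView(storageStats{AvgLatencyUs: 120_500, P99LatencyUs: 4_430_763})
	fmt.Printf("avg=%.3f ms p99=%.3f ms\n", v.AvgLatencyMs, v.P99LatencyMs)
}
```

With the example value from the commit message, a raw 4,430,763 µs becomes roughly 4,431 ms (about 4.4 s) rather than being displayed as 4,430,763 ms.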
LiveStream dumped the raw JSON-RPC envelope (with an escaped graph blob). Add formatStreamEvent() to render concise lines ("graph · N svc · M edges · healthy/degraded/critical" and a handshake line) and listen for the named `endpoint` event EventSource won't route to onmessage.

The -15m/-1h/-24h quick presets existed only on datetime fields (since/start/end). The time_range duration field (root_cause_analysis, trace_graph) was a bare input; add matching 15m/1h/24h preset chips that fill the Go-duration string.

Rebuilds internal/ui/dist, which re-embeds these changes plus the p99→ms dashboard fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves conflicts in favour of main's canonical backend (PR #91 superseded this branch's older copies of the 7-tool MCP + SQLite survival work). Branch's net-new contribution is unchanged: frontend rebuild, MCP SSE Flush fix, p99 µs→ms, and the MCP console stream-formatting + time_range presets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Go stdlib 1.25.10 carries two advisories (GO-2026-5037, GO-2026-5039) flagged by OSV-Scanner; both are fixed in 1.25.11. CI installs the toolchain via go-version-file: go.mod, so bumping the directive clears the SCA gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
aksOps added a commit that referenced this pull request on Jun 5, 2026
Summary

Rebuilds the OtelContext frontend and lands the supporting backend changes that were staged on this branch. Single Go binary with embedded React UI; no new runtime deps.

Frontend (React 19 + @ossrandom/design-system)

Fixes in this batch
- The metrics middleware's responseWriter dropped `Flush` (only forwarded `Hijack`), so `GET /mcp` returned 500 and the LiveStream looped on `onerror`. Added a `Flush()` forwarder → 200 `text/event-stream`.
- Dashboard p99 latency is converted µs→ms at the API view boundary and exposed as `p99_latency_ms` (storage stays µs; its tests assert µs).
- MCP console: concise LiveStream event formatting and 15m/1h/24h presets for the `time_range` duration field.

Backend (earlier on branch)
- Dropped the private `central-ops` dependency (offline/air-gapped builds).

Validation
- `go build`, `go vet`, `go test ./...` pass.
- `tsc -b`, `vitest` (36), `eslint` pass.
- … (`ready=true`, topology + SSE + dashboard exercised by a 7-service chaos sim).

🤖 Generated with Claude Code