feat: frontend rebuild (dashboard, scalable service map, MCP console) + triage/SQLite hardening #98

Merged
aksOps merged 12 commits into main from feat/frontend-dashboard-mcp on Jun 5, 2026

aksOps (Contributor) commented Jun 5, 2026

Summary

Rebuilds the OtelContext frontend and lands the supporting backend changes that
were staged on this branch. Single Go binary with embedded React UI; no new
runtime deps.

Frontend (React 19 + @ossrandom/design-system)

  • Dashboard (new default view): 12-col bento — hero health gauge, traffic/errors, top failing services, recent anomalies (15m window, deduped to 20), platform health (uptime, readiness, DB size).
  • Service map: raw cytoscape (cose-bilkent) scaling 1–200 services with every node on screen; node size = degree; hover/click reveals edges + stats; Graph/List toggle (list is the accessible default on touch).
  • MCP Trial console: list-detail over the 7-tool triage surface (JSON-RPC), dynamic tool forms, result views, history, live SSE stream, settings.
  • Removed Logs/Traces views; ErrorBoundary mounted; cytoscape lazy-loaded (initial 206KB/65KB gz).

Fixes in this batch

  • MCP SSE: metrics middleware dropped Flush (only forwarded Hijack), so GET /mcp returned 500 and the LiveStream looped on onerror. Added a Flush() forwarder → 200 text/event-stream.
  • Dashboard p99: was microseconds rendered under an "ms" label (e.g. 4.4M). Converted to ms at the API view boundary and renamed to p99_latency_ms (storage stays µs; its tests assert µs).
  • MCP console UX: format the SSE stream into concise lines instead of raw JSON-RPC; add 15m/1h/24h presets to the time_range duration field.

Backend (earlier on branch)

  • Drop private central-ops dependency (offline/air-gapped builds).
  • 7-tool MCP triage surface; drop legacy vectordb + graph_snapshots; SQLite per-driver defaults + PRAGMA tuning.

Validation

  • Go: go build, go vet, go test ./... pass.
  • UI: tsc -b, vitest (36), eslint pass.
  • Verified live on a running instance (ready=true, topology + SSE + dashboard exercised by a 7-service chaos sim).

🤖 Generated with Claude Code

aksOps and others added 12 commits May 24, 2026 18:42
Reduces the MCP HTTP-streamable surface from 21 tools to 7 — the minimum
set needed for an LLM-driven incident-triage workflow on a 120-service
SQLite deployment that's currently OOMing within an hour.

Kept (7): get_anomaly_timeline, get_service_map, get_service_health,
root_cause_analysis, impact_analysis, trace_graph, search_logs.

Cut (14): get_system_graph, tail_logs, get_trace, search_traces,
get_metrics, get_dashboard_stats, get_storage_status, find_similar_logs,
get_alerts, correlated_signals, get_error_chains, get_investigations,
get_investigation, get_graph_snapshot.

The cut tools fall into three buckets: (a) duplicates of a kept tool with
a slightly different framing (get_system_graph ≈ get_service_map,
get_error_chains is folded into root_cause_analysis); (b) require
subsystems being dropped in follow-up commits (find_similar_logs →
vectordb, get_graph_snapshot → snapshot table); (c) belong to a separate
forensic-analytics workflow not part of active triage (get_investigations,
get_dashboard_stats). MCP clients calling cut tools receive an "unknown
tool" RPC error. There is no deprecation period: the cut is intentional
and immediate.
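
As an illustration of the behaviour described above, here is a minimal Go sketch of a dispatcher that accepts only the seven kept tools and rejects everything else. The `dispatch` signature, the `errUnknownTool` sentinel, and the stubbed handler bodies are assumptions made for this example, and the seven names are folded into one `case` for brevity; the real switch in tools.go may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownTool is a hypothetical sentinel; the real server's error
// type and JSON-RPC error code are not shown in this PR.
var errUnknownTool = errors.New("unknown tool")

// dispatch routes a tool call by name. Handlers are stubbed out here.
func dispatch(tool string) (string, error) {
	switch tool {
	case "get_anomaly_timeline",
		"get_service_map",
		"get_service_health",
		"root_cause_analysis",
		"impact_analysis",
		"trace_graph",
		"search_logs":
		return "ok: " + tool, nil
	default:
		// Cut tools land here immediately; there is no deprecation shim.
		return "", fmt.Errorf("%w: %q", errUnknownTool, tool)
	}
}

func main() {
	for _, name := range []string{"search_logs", "find_similar_logs"} {
		out, err := dispatch(name)
		fmt.Println(out, err)
	}
}
```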

Files touched: cache.go cacheable list re-sorted to mirror toolDefs;
dispatcher in tools.go collapsed to the 7-case switch; tools_ran20_test.go
(find_similar_logs only) deleted; server_ran22_test.go pared down to the
constructor-tenant signature test now that the HTTP find_similar_logs
flow is gone (the no-header default-tenant invariant is covered by
tenant_isolation_test.go); tenant_isolation_test.go drops subtests for
cut tools.

Design spec: docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
The vectordb package was a pure-Go TF-IDF index for semantic log search,
backing one MCP tool (find_similar_logs, cut in the prior commit) and one
HTTP endpoint (/api/logs/similar). With the kept search_logs MCP tool
already routing through SQLite FTS5 / pg_trgm GIN, the in-memory TF-IDF
index is no longer reachable by any survivor.

Removing it reclaims ~5-15% of resident heap on a 120-service SQLite
deployment that the maxSize=100000 index + 5-minute snapshot loop +
startup ReplayFromDB hydrator otherwise consume — heap pressure that
contributes to the OOM-within-an-hour failure mode this refactor is
solving for.

Deletions:
- internal/vectordb/ — index.go, snapshot.go, replay.go + tests
- internal/api/similar_handler.go + test — the /api/logs/similar route
- internal/storage/log_repo_replay_test.go + LogsForVectorReplay() and
  ListRecentHighSeverityLogsAllTenants() (only the vectordb hydrator
  read these; no other caller)
- internal/graphrag/clustering.go::SimilarErrors() — vectordb-dependent,
  no production caller; Drain template clustering is the survivor
- Vector* fields on telemetry.Metrics + RecordVector* observer methods
- VectorIndexMaxEntries / VectorIndexSnapshotPath /
  VectorIndexSnapshotInterval on config.Config

Signature changes:
- graphrag.New(repo, tsdbAgg, ringBuf, cfg) — vectordb arg removed
- mcp.New(defaultTenant, repo, metrics, svcGraph) — vectordb arg removed
- ui.NewServer(repo, metrics, topo) — vectordb arg removed
- api.Server.SetVectorIndex removed

Operator migration:
- The data/vectordb.snapshot file is left in place on disk; the loader
  that read it at boot is deleted, so it becomes a stale file that is
  safe to remove by hand. No automatic cleanup.
- MCP clients calling find_similar_logs already receive "unknown tool"
  after the prior commit; the HTTP /api/logs/similar route now 404s.
The `graph_snapshots` table backed exactly one MCP tool (get_graph_snapshot,
cut earlier in this PR) — no UI surface or REST endpoint reads it. With
the tool gone the table is pure write amplification: at 15-minute cadence
× ~100 tenants × per-row JSON nodes+edges blob it adds ~67k rows/week
even after the 7-day age prune, and the row-count backstop only kicks in
above 100k. On the SQLite OOM-within-an-hour deployment this contributes
meaningfully to the 2 TB/day disk growth.

Deletions:
- internal/graphrag/snapshot.go (entire file): GraphSnapshot GORM model,
  takeSnapshot / takeSnapshotForTenant, pruneOldSnapshots,
  GetGraphSnapshot, maxSnapshotRows constant.
- views.GraphSnapshot type + GraphSnapshotFromModel converter (only used
  by the removed test).
- TestGraphRAG_GetGraphSnapshot_TenantScoped + the GraphSnapshot wire-
  shape leak test in views_test.go.

Updates:
- AutoMigrateGraphRAG no longer creates the table on fresh installs.
  graphRAGTables slice drops "graph_snapshots" so tenant-backfill skips
  it and the test asserting the per-table backfill no longer expects
  the row.
- refresh.go::snapshotLoop now only calls persistDrainTemplates; the
  snapshotEvery field and the loop name are kept for wiring stability so
  external Config.SnapshotEvery still tunes the drain-persist cadence.

Operator migration: existing graph_snapshots tables are LEFT IN PLACE on
upgrade — AutoMigrate's IF NOT EXISTS semantics mean a populated table is
not touched. Operators wanting to reclaim disk should
`DROP TABLE graph_snapshots; VACUUM;` after upgrading. The table will
stop receiving new writes immediately.
Makes the platform survivable at 120 services on SQLite, the target the
prior commits in this PR have been shaving heap and disk pressure for.
Two coordinated changes:

1. SQLite PRAGMA stanza in factory.go is hardened from 3 to 8 settings
   and made fail-closed:

     PRAGMA journal_mode=WAL
     PRAGMA synchronous=NORMAL
     PRAGMA cache_size=-262144        # 256 MB page cache
     PRAGMA temp_store=MEMORY
     PRAGMA mmap_size=1073741824      # 1 GB mmap
     PRAGMA wal_autocheckpoint=10000  # checkpoint after 10k pages
     PRAGMA journal_size_limit=67108864  # cap WAL at 64 MB
     PRAGMA busy_timeout=5000

   Each PRAGMA failure now aborts startup with a wrapped error
   (`sqlite pragma %q failed: %w`), so an unexpected SQLite build that
   doesn't honour a setting (mmap_size, for example) can't silently
   regress the platform to default-tuned behaviour.

2. config.Load now runs `applyDriverDefaults(cfg)` after constructing
   the Config struct. When DBDriver=sqlite (case-insensitive) AND the
   operator did not explicitly set the env var (detected via
   os.LookupEnv presence — value comparison would falsely treat
   operator-set Postgres-default values as "unset"), the following
   defaults flip:

     DB_MAX_OPEN_CONNS           50    → 1
     DB_MAX_IDLE_CONNS           10    → 1
     INGEST_PIPELINE_WORKERS     8     → 2
     INGEST_PIPELINE_QUEUE_SIZE  50000 → 10000
     METRIC_MAX_CARDINALITY      10000 → 3000
     STORE_MIN_SEVERITY          ""    → "WARN"
     SAMPLING_RATE               1.0   → 0.05
     GRPC_MAX_CONCURRENT_STREAMS 1000  → 240
     LOG_FTS_ENABLED             false → true

   Postgres/MSSQL/MySQL paths are unchanged bit-for-bit (early-return
   in applyDriverDefaults).
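
The fail-closed PRAGMA handling from item 1 could look roughly like the Go sketch below. The `execFunc` type stands in for whatever database handle factory.go actually uses, and `applyPragmas` and `errUnsupported` are hypothetical names; only the PRAGMA statements and the error format string come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// execFunc stands in for a database handle's Exec call (an assumption;
// the real factory.go presumably goes through its SQLite driver).
type execFunc func(stmt string) error

// errUnsupported simulates a SQLite build rejecting a setting.
var errUnsupported = errors.New("not supported")

// pragmas lists the eight settings from the commit message, in order.
var pragmas = []string{
	"PRAGMA journal_mode=WAL",
	"PRAGMA synchronous=NORMAL",
	"PRAGMA cache_size=-262144",
	"PRAGMA temp_store=MEMORY",
	"PRAGMA mmap_size=1073741824",
	"PRAGMA wal_autocheckpoint=10000",
	"PRAGMA journal_size_limit=67108864",
	"PRAGMA busy_timeout=5000",
}

// applyPragmas runs each PRAGMA in order and stops at the first failure,
// wrapping the cause so callers can still inspect it with errors.Is.
func applyPragmas(exec execFunc) error {
	for _, p := range pragmas {
		if err := exec(p); err != nil {
			return fmt.Errorf("sqlite pragma %q failed: %w", p, err)
		}
	}
	return nil
}

func main() {
	// Simulate a build that rejects mmap_size; startup would abort here.
	err := applyPragmas(func(stmt string) error {
		if stmt == "PRAGMA mmap_size=1073741824" {
			return errUnsupported
		}
		return nil
	})
	fmt.Println(err)
}
```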

The applyDriverDefaults override is unit-tested for: the all-flip path,
the "respect explicit operator override" path, the Postgres no-op path,
and case-insensitive driver matching.
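
A minimal sketch of the presence-based override check, covering three of the nine defaults. The trimmed `Config` struct and the injected `lookup` parameter are assumptions added so the example can be exercised without touching the process environment; the commit describes a single-argument `applyDriverDefaults(cfg)` that uses os.LookupEnv directly.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config is a trimmed, hypothetical subset of the real config struct.
type Config struct {
	DBDriver       string
	DBMaxOpenConns int
	SamplingRate   float64
	LogFTSEnabled  bool
}

// lookupFunc matches the signature of os.LookupEnv so a fake
// environment can be injected (an assumption for this sketch).
type lookupFunc func(key string) (string, bool)

// applyDriverDefaults flips defaults only for SQLite, and only for
// variables the operator left unset.
func applyDriverDefaults(cfg *Config, lookup lookupFunc) {
	if !strings.EqualFold(cfg.DBDriver, "sqlite") {
		return // other drivers keep their defaults untouched
	}
	// Presence, not value, decides: an operator who explicitly sets
	// DB_MAX_OPEN_CONNS=50 must keep 50 even though it equals the default.
	isSet := func(key string) bool {
		_, ok := lookup(key)
		return ok
	}
	if !isSet("DB_MAX_OPEN_CONNS") {
		cfg.DBMaxOpenConns = 1
	}
	if !isSet("SAMPLING_RATE") {
		cfg.SamplingRate = 0.05
	}
	if !isSet("LOG_FTS_ENABLED") {
		cfg.LogFTSEnabled = true
	}
}

func main() {
	cfg := &Config{DBDriver: "SQLite", DBMaxOpenConns: 50, SamplingRate: 1.0}
	applyDriverDefaults(cfg, os.LookupEnv)
	fmt.Printf("%+v\n", *cfg)
}
```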

Design rationale and per-default justification:
docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
Updates the operator-facing documentation to reflect the refactor in
this PR:

- CLAUDE.md "MCP Server" section rewritten to describe the 7-tool
  triage surface (kept + cut lists). The architecture diagram drops the
  legacy Vector accelerator layer. The "Storage Architecture",
  "GraphRAG Architecture" (background processes, persistence models,
  log clustering), and "Key Directories" sections drop their vectordb /
  graph_snapshots mentions. A new "SQLite per-driver defaults" section
  documents the nine env-var overrides flipped by applyDriverDefaults
  and the eight PRAGMAs applied at startup.
- LOG_FTS_ENABLED entry rewritten to document the new SQLite-default
  `true` (with the LIKE-fallback / drop_fts reclaim path preserved).
- STORE_MIN_SEVERITY entry notes the new SQLite-default `"WARN"`.
- README.md "Features" bullet swaps "21 tools" for the 7-tool triage
  surface and inlines the kept tool names.
- .env.example drops the VECTOR_INDEX_* block, adds a "SQLite Tuning"
  block listing every auto-flipped default, and notes the 7-tool MCP
  surface under the MCP section.
- The design spec at
  docs/superpowers/specs/2026-05-24-mcp-7tool-sqlite-survival-design.md
  is the canonical record of the refactor's rationale, decision matrix,
  per-default justification, migration notes, and risk/mitigation table.
The github.com/RandomCodeSpace/central-ops module is private and 404s for
this account, breaking offline/air-gapped builds. Only two symbols were used,
both trivially replaceable in-tree:

- main.go: version.Detect() -> local detectVersion() via runtime/debug
  (runtime/debug was already imported); falls back to "local".
- internal/mcp/server.go: httputil.CORSMiddleware -> local corsMiddleware
  that sets Access-Control-Allow-* and answers OPTIONS preflight.
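
The two in-tree replacements might look something like the sketch below. The specific header values, the 204 preflight status, and the `preflightStatus` helper are assumptions for illustration; the commit only states that the middleware sets Access-Control-Allow-* headers and answers OPTIONS, and that the version lookup falls back to "local".

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
)

// detectVersion reads the module version embedded by the Go toolchain
// and falls back to "local" when none is available.
func detectVersion() string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "local"
	}
	if v := info.Main.Version; v != "" && v != "(devel)" {
		return v
	}
	return "local"
}

// corsMiddleware sets permissive CORS headers and short-circuits
// preflight requests. Header values here are illustrative guesses.
func corsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("Access-Control-Allow-Origin", "*")
		h.Set("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
		h.Set("Access-Control-Allow-Headers", "Content-Type")
		if r.Method == http.MethodOptions {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// preflightStatus sends an OPTIONS request through the middleware and
// reports the status code; the wrapped handler should never run.
func preflightStatus() int {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusTeapot)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodOptions, "/mcp", nil)
	corsMiddleware(inner).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("version:", detectVersion())
	fmt.Println("preflight status:", preflightStatus())
}
```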

go mod edit -droprequire + go.sum cleanup. go build and go vet pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sole

Full frontend overhaul on React 19 + @ossrandom/design-system.

Dashboard (new default view):
- 12-col bento grid, 5 bands: hero health gauge, traffic/errors,
  top failing services, recent anomalies, platform health.
- Uptime stat + readiness probe dots + DB size.
- Recent anomalies clamped to a 15-minute window, deduped by
  service|type to the 20 most recent.
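
The windowing and dedup rule above is implemented in the React frontend; the sketch below only restates the logic in Go for illustration. The `Anomaly` type, the `recentAnomalies` function, and the sample data in `demo` are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Anomaly is a hypothetical shape; the real payload has more fields.
type Anomaly struct {
	Service string
	Type    string
	At      time.Time
}

// recentAnomalies keeps the newest entry per service|type pair inside
// a 15-minute window, capped at 20 results, newest first.
func recentAnomalies(in []Anomaly, now time.Time) []Anomaly {
	cutoff := now.Add(-15 * time.Minute)
	sorted := make([]Anomaly, len(in))
	copy(sorted, in)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].At.After(sorted[j].At) })

	seen := make(map[string]bool)
	out := make([]Anomaly, 0, 20)
	for _, a := range sorted {
		if a.At.Before(cutoff) {
			break // sorted newest first, so the rest is older still
		}
		key := a.Service + "|" + a.Type
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, a)
		if len(out) == 20 {
			break
		}
	}
	return out
}

// demo runs the filter over a small fixed data set.
func demo() []Anomaly {
	now := time.Date(2026, 6, 5, 10, 0, 0, 0, time.UTC)
	ago := func(m int) time.Time { return now.Add(-time.Duration(m) * time.Minute) }
	return recentAnomalies([]Anomaly{
		{"cart", "errors", ago(20)},     // outside the window
		{"checkout", "latency", ago(5)}, // duplicate of the newer entry
		{"checkout", "latency", ago(1)},
		{"cart", "errors", ago(10)},
		{"auth", "latency", ago(30)}, // outside the window
	}, now)
}

func main() {
	for _, a := range demo() {
		fmt.Println(a.Service, a.Type, a.At.Format(time.RFC3339))
	}
}
```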

Service map:
- Raw cytoscape (cose-bilkent) topology scaling 1–200 services with
  every node on screen; node size = degree (edge count).
- Hover/click reveals a node's edges + stats; label LOD past 120 nodes.
- Graph/List segmented toggle (list is the accessible default on
  touch/small screens).

MCP Trial console:
- List-detail layout over the 7-tool triage surface via JSON-RPC.
- Dynamic tool forms, result views, history, live SSE stream, settings.

Platform:
- TopNav segmented tabs (Dashboard/Services/MCP); removed Logs & Traces.
- ErrorBoundary mounted; theme race fixed in main.tsx.
- ServicesView and MCPConsoleView lazy-loaded to keep cytoscape out of
  the initial bundle (206KB / 65KB gz).
- Re-embedded internal/ui/dist.

tsc -b, vitest, and eslint pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The metrics middleware's responseWriter wraps http.ResponseWriter to capture
the status code and already forwards Hijack (for WebSocket upgrades), but
embedding the interface drops Flush from the method set. The MCP SSE handler's
w.(http.Flusher) assertion therefore failed and GET /mcp returned 500
"SSE not supported", so the UI LiveStream looped on EventSource onerror.

Add a Flush() forwarder mirroring the existing Hijack(). GET /mcp now returns
200 text/event-stream and pushes endpoint + resources/updated events.
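
To make the method-set issue concrete, here is a small self-contained sketch. The `responseWriter` below is a simplified stand-in for the metrics middleware's wrapper, and `sseHandler`, `serveWrapped`, and the event payload are hypothetical. Deleting the `Flush` method makes the `w.(http.Flusher)` assertion fail and reproduces the 500 described above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// responseWriter captures the status code. Embedding the interface only
// promotes Header, Write, and WriteHeader, so Flush is forwarded by hand.
type responseWriter struct {
	http.ResponseWriter
	status int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.status = code
	rw.ResponseWriter.WriteHeader(code)
}

// Flush forwards to the underlying writer when it supports flushing.
func (rw *responseWriter) Flush() {
	if f, ok := rw.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// sseHandler mimics the handler-side check; the payload is a placeholder.
func sseHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "SSE not supported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "event: endpoint\ndata: placeholder\n\n")
	flusher.Flush()
}

// serveWrapped runs the handler behind the wrapper and reports what
// the client would see.
func serveWrapped() (status int, contentType string) {
	rec := httptest.NewRecorder()
	rw := &responseWriter{ResponseWriter: rec, status: http.StatusOK}
	sseHandler(rw, httptest.NewRequest(http.MethodGet, "/mcp", nil))
	return rw.status, rec.Header().Get("Content-Type")
}

func main() {
	fmt.Println(serveWrapped())
}
```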

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
DashboardStats.AvgLatencyMs was milliseconds but P99Latency was microseconds
(storage computes p99 in µs and its tests assert µs). Only avg got the µs→ms
conversion, so the dashboard rendered the raw µs p99 under an "ms" label —
e.g. 4,430,763 ms.

Convert p99 to ms at the API view boundary (storage stays µs so its tests
pass) and rename the field to p99_latency_ms for unit-explicit parity with
avg_latency_ms. Frontend reads the renamed field.
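
A rough sketch of converting at the view boundary, assuming hypothetical `storageStats` and `dashboardView` types (the real struct names and field types may differ). Storage keeps microseconds; only the API-facing struct carries milliseconds and the unit-explicit JSON name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storageStats mirrors the storage-layer shape: p99 in microseconds.
type storageStats struct {
	AvgLatencyMs float64
	P99LatencyUs int64
}

// dashboardView is the API-facing shape with unit-explicit JSON names.
type dashboardView struct {
	AvgLatencyMs float64 `json:"avg_latency_ms"`
	P99LatencyMs float64 `json:"p99_latency_ms"`
}

// toView converts at the boundary so storage and its tests stay in µs.
func toView(s storageStats) dashboardView {
	return dashboardView{
		AvgLatencyMs: s.AvgLatencyMs,
		P99LatencyMs: float64(s.P99LatencyUs) / 1000.0,
	}
}

func main() {
	// 4,430,763 µs is the value that was rendered under the "ms" label.
	b, err := json.Marshal(toView(storageStats{AvgLatencyMs: 12.5, P99LatencyUs: 4430763}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```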

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LiveStream dumped the raw JSON-RPC envelope (with an escaped graph blob).
Add formatStreamEvent() to render concise lines:
"graph · N svc · M edges · healthy/degraded/critical" and a handshake line.
Also listen for the named `endpoint` event, which EventSource does not route to onmessage.

The -15m/-1h/-24h quick presets existed only on datetime fields (since/start/
end). The time_range duration field (root_cause_analysis, trace_graph) was a
bare input; add matching 15m/1h/24h preset chips that fill the Go-duration
string.
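
Because the presets fill a Go-duration string, each one should parse with time.ParseDuration on the backend. The sketch below checks that; `presets` and `presetMinutes` are hypothetical names. Go's duration syntax has no day unit, which is consistent with the longest preset being written as 24h.

```go
package main

import (
	"fmt"
	"time"
)

// presets are the chip labels described above.
var presets = []string{"15m", "1h", "24h"}

// presetMinutes parses a Go-duration string and returns its length in minutes.
func presetMinutes(p string) (float64, error) {
	d, err := time.ParseDuration(p)
	if err != nil {
		return 0, err
	}
	return d.Minutes(), nil
}

func main() {
	for _, p := range presets {
		m, err := presetMinutes(p)
		fmt.Println(p, m, err)
	}
	// "1d" is rejected because ParseDuration does not accept a day unit.
	_, err := presetMinutes("1d")
	fmt.Println("1d:", err)
}
```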

Rebuilds internal/ui/dist, which re-embeds these changes plus the p99→ms
dashboard fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves conflicts in favour of main's canonical backend (PR #91 superseded
this branch's older copies of the 7-tool MCP + SQLite survival work). Branch's
net-new contribution is unchanged: frontend rebuild, MCP SSE Flush fix, p99
µs→ms, and the MCP console stream-formatting + time_range presets.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Go stdlib 1.25.10 carries two advisories (GO-2026-5037, GO-2026-5039) flagged
by OSV-Scanner; both are fixed in 1.25.11. CI installs the toolchain via
go-version-file: go.mod, so bumping the directive clears the SCA gate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sonarqubecloud Bot commented Jun 5, 2026

aksOps merged commit df1377d into main on Jun 5, 2026
17 checks passed
aksOps deleted the feat/frontend-dashboard-mcp branch on June 5, 2026 at 10:22
aksOps added a commit that referenced this pull request Jun 5, 2026
Move Unreleased into a dated [v0.2.0-beta.6] section; add #98/#99 work + go 1.25.11; refresh compare links.