feat(graphrag): retune ingest defaults for 100-200 services #49

Merged
aksOps merged 1 commit into main from chore/ingest-pipeline-phase0-defaults on Apr 27, 2026

aksOps commented on Apr 27, 2026

Summary

  • Raise GRAPHRAG_WORKER_COUNT default 4→16 and GRAPHRAG_EVENT_QUEUE_SIZE default 10k→100k. Sized for the documented 100–200 service operational target.
  • New "Edge Pre-processing (OTel Collector)" section in docs/OPERATIONS.md with a tail-sampling pipeline recipe and a double-sampling caveat.
  • Make TestOnSpanIngested_DropsIncrementMetric use defaultChannelSize+1000 instead of a hardcoded 11000 so future retuning doesn't silently invalidate it.

This is Phase 0 of a multi-phase robustness push for 150–200 component scale. Subsequent phases (already brainstormed and approved): async ingest pipeline with hybrid backpressure, per-tenant cardinality fairness, SQLite FTS5+BM25 for log search, Postgres partitioning as opt-in adapter, wire-level RESOURCE_EXHAUSTED/429 backpressure, DROP-PARTITION retention.

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test ./... — all 12 packages pass
  • TestOnSpanIngested_DropsIncrementMetric passes against new 100k buffer (validates the test is not silently no-op)
  • CLAUDE.md and OPERATIONS.md doc lines reflect the new defaults
  • Memory impact of bumped defaults: ~5MB channel + ~50KB goroutine stacks (verified by inspection of the event struct shape)

🤖 Generated with Claude Code

Raise GraphRAG worker pool from 4 to 16 and event channel buffer from
10k to 100k slots. The previous defaults were sized for a handful of
services; at the 100-200 service scale (the documented operational
target), loud services would saturate the buffer and trigger
`graphrag_events_dropped_total` increments under steady-state load.

Memory cost of the new defaults is ~5MB extra channel capacity plus
~50KB extra goroutine stacks — negligible at the deployment scale
where this matters.

Also adds an "Edge Pre-processing (OTel Collector)" section to
docs/OPERATIONS.md with a recommended Collector pipeline (memory_limiter
+ tail_sampling + batch) and notes on avoiding double-sampling between
the edge Collector and OtelContext's internal sampler.

TestOnSpanIngested_DropsIncrementMetric now uses defaultChannelSize+1000
instead of a hardcoded 11000 so it stays valid through future retuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps merged commit acf904d into main on Apr 27, 2026
17 checks passed
aksOps deleted the chore/ingest-pipeline-phase0-defaults branch on April 27, 2026 15:53
aksOps added a commit that referenced this pull request Apr 28, 2026
Five small follow-ups from the second-pass review of PRs #49#55:

- tsdb: fire cardinality-overflow callback AFTER releasing the
  Aggregator mutex. The callback is currently a Prometheus
  increment (atomic) but holding mu across an external function
  call is a footgun for any future hook. Capture the tenant
  under lock; invoke after Unlock.
- storage: use errors.Is(err, sql.ErrNoRows) in pgLogsRelkind
  instead of strings.Contains(err.Error(), "no rows"). Robust
  against driver wrapping.
- storage: convert Repository.logsPartitioned from plain bool
  to atomic.Bool. Removes the memory-model fragility of "the
  writer ran first" — read by retention.go from a separate
  goroutine.
- config: reject negative MCP_MAX_CONCURRENT / MCP_CALL_TIMEOUT_MS
  / MCP_CACHE_TTL_MS at Validate(). 0 stays the documented
  "disable" sentinel; negatives are typos that should fail loud.
- mcp: upgrade SetCallLimit doc to flag it startup-only — runtime
  resize leaks a slot in the old channel.

Skipped (with rationale, not silently dropped):
- M1 Submit TOCTOU on closed pipeline — cosmetic only, current
  ordering is documented.
- M2 ring/onIngest setter races — would require API change to
  fix properly; benign during normal startup-only usage.
- M4 FTS5 trigger throughput — needs a bulk-rebuild path, not
  a one-line tweak.
- M5 isQueueFull scope — hypothetical concern with no observed
  symptom; revisit only if metrics show drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>