feat(graphrag): retune ingest defaults for 100-200 services #49
Merged
Raise the GraphRAG worker pool from 4 to 16 and the event channel buffer from 10k to 100k slots. The previous defaults were sized for a handful of services; at the 100-200 service scale (the documented operational target), loud services would saturate the buffer and trigger `graphrag_events_dropped_total` increments under steady-state load. Memory cost of the new defaults is ~5MB of extra channel capacity plus ~50KB of extra goroutine stacks, negligible at the deployment scale where this matters.

Also adds an "Edge Pre-processing (OTel Collector)" section to docs/OPERATIONS.md with a recommended Collector pipeline (memory_limiter + tail_sampling + batch) and notes on avoiding double-sampling between the edge Collector and OtelContext's internal sampler.

TestOnSpanIngested_DropsIncrementMetric now uses defaultChannelSize+1000 instead of a hardcoded 11000 so it stays valid through future retuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 28, 2026
aksOps added a commit that referenced this pull request on Apr 28, 2026
Five small follow-ups from the second-pass review of PRs #49–#55:

- tsdb: fire cardinality-overflow callback AFTER releasing the Aggregator mutex. The callback is currently a Prometheus increment (atomic), but holding mu across an external function call is a footgun for any future hook. Capture the tenant under lock; invoke after Unlock.
- storage: use errors.Is(err, sql.ErrNoRows) in pgLogsRelkind instead of strings.Contains(err.Error(), "no rows"). Robust against driver wrapping.
- storage: convert Repository.logsPartitioned from plain bool to atomic.Bool. Removes the memory-model fragility of "the writer ran first": the flag is read by retention.go from a separate goroutine.
- config: reject negative MCP_MAX_CONCURRENT / MCP_CALL_TIMEOUT_MS / MCP_CACHE_TTL_MS at Validate(). 0 stays the documented "disable" sentinel; negatives are typos that should fail loud.
- mcp: upgrade SetCallLimit doc to flag it startup-only: a runtime resize leaks a slot in the old channel.

Skipped (with rationale, not silently dropped):

- M1 Submit TOCTOU on closed pipeline: cosmetic only, current ordering is documented.
- M2 ring/onIngest setter races: would require an API change to fix properly; benign during normal startup-only usage.
- M4 FTS5 trigger throughput: needs a bulk-rebuild path, not a one-line tweak.
- M5 isQueueFull scope: hypothetical concern with no observed symptom; revisit only if metrics show drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
- `GRAPHRAG_WORKER_COUNT` default 4→16 and `GRAPHRAG_EVENT_QUEUE_SIZE` default 10k→100k. Sized for the documented 100–200 service operational target.
- Adds an "Edge Pre-processing (OTel Collector)" section to `docs/OPERATIONS.md` with a tail-sampling pipeline recipe and a double-sampling caveat.
- `TestOnSpanIngested_DropsIncrementMetric` now uses `defaultChannelSize+1000` instead of a hardcoded 11000 so future retuning doesn't silently invalidate it.

This is Phase 0 of a multi-phase robustness push for 150–200 component scale. Subsequent phases (already brainstormed and approved): async ingest pipeline with hybrid backpressure, per-tenant cardinality fairness, SQLite FTS5+BM25 for log search, Postgres partitioning as opt-in adapter, wire-level RESOURCE_EXHAUSTED/429 backpressure, DROP-PARTITION retention.
Test plan
- `go build ./...` clean
- `go vet ./...` clean
- `go test ./...`: all 12 packages pass
- `TestOnSpanIngested_DropsIncrementMetric` passes against new 100k buffer (validates the test is not silently no-op)

🤖 Generated with Claude Code