feat(tsdb): per-tenant metric cardinality fairness #51
Merged
Phase 2 of the 150-200 component robustness work. Adds a per-tenant
series budget so a noisy tenant cannot exhaust the global TSDB
cardinality pool and starve siblings of fresh series.
Behavior is opt-in to preserve back-compat:
- METRIC_MAX_CARDINALITY (existing, default 10000) — global series cap.
- METRIC_MAX_CARDINALITY_PER_TENANT (new, default 0=unlimited) — when
set, each tenant gets its own series budget.
- Per-tenant cap is checked FIRST; global cap is the backstop.
- Per-tenant overflow buckets are tenant-scoped (key suffix |<tenant>)
so each tenant's overflow stats stay separate.
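
The precedence described above (per-tenant cap first, global cap as backstop) can be sketched as follows. This is a minimal illustration, not the code from the PR: the `limiter` type, the `admit` method, and the `__overflow__` key prefix are hypothetical names. Routing a global overflow to an unsuffixed bucket is also an assumption. Only the `|<tenant>` suffix and the `__global__` sentinel come from the description.

```go
package main

import "fmt"

// limiter is a hypothetical stand-in for the aggregator's cardinality
// bookkeeping; type, field, and method names are illustrative only.
type limiter struct {
	globalCap       int            // 0 = unlimited
	perTenantCap    int            // 0 = unlimited
	total           int            // distinct series admitted this window
	seriesPerTenant map[string]int // distinct series admitted per tenant
}

// overflowKey is an assumed prefix; the PR only specifies the |<tenant> suffix.
const overflowKey = "__overflow__"

// admit decides where a new series lands. It returns the bucket key to use
// and the overflow trigger ("" when the series is admitted normally).
func (l *limiter) admit(tenant, seriesKey string) (key, trigger string) {
	// Per-tenant cap is checked first, so a noisy tenant is stopped
	// before it can consume the shared pool.
	if l.perTenantCap > 0 && l.seriesPerTenant[tenant] >= l.perTenantCap {
		return overflowKey + "|" + tenant, tenant
	}
	// Global cap is the backstop (assumed here to use an unsuffixed bucket).
	if l.globalCap > 0 && l.total >= l.globalCap {
		return overflowKey, "__global__"
	}
	l.total++
	l.seriesPerTenant[tenant]++
	return seriesKey, ""
}

func main() {
	l := &limiter{globalCap: 3, perTenantCap: 2, seriesPerTenant: map[string]int{}}
	for _, s := range []struct{ tenant, key string }{
		{"a", "a1"}, {"a", "a2"}, {"a", "a3"}, {"b", "b1"}, {"b", "b2"},
	} {
		key, trigger := l.admit(s.tenant, s.key)
		fmt.Printf("%s/%s -> bucket=%q trigger=%q\n", s.tenant, s.key, key, trigger)
	}
}
```

In this run tenant a's third series hits the per-tenant cap, while tenant b's second series is rejected by the global cap, which is the case the `__global__` sentinel is meant to surface.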
Telemetry surface change:
- TSDBCardinalityOverflow (Counter) — kept for back-compat dashboards.
- TSDBCardinalityOverflowByTenant (CounterVec, label tenant_id) — new.
Sentinel "__global__" when the global cap (not per-tenant) triggered.
Lets operators identify noisy tenants:
    sum by (tenant_id) (
      rate(otelcontext_tsdb_cardinality_overflow_by_tenant_total[5m])
    )
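
One way the two counters could be driven from a single overflow callback is sketched below. It uses only the standard library, so `overflowStats` is a hypothetical stand-in for the Prometheus Counter and CounterVec rather than the real metric types. Incrementing the unlabeled counter on every overflow is an assumption inferred from "kept for back-compat dashboards".

```go
package main

import (
	"fmt"
	"sync"
)

// globalSentinel is the tenant_id value used when the global cap triggered.
const globalSentinel = "__global__"

// overflowStats is a stdlib-only stand-in for the two metrics: an unlabeled
// counter plus a counter keyed by tenant_id. Names are hypothetical.
type overflowStats struct {
	mu       sync.Mutex
	total    uint64            // mirrors the unlabeled TSDBCardinalityOverflow
	byTenant map[string]uint64 // mirrors TSDBCardinalityOverflowByTenant
}

// onOverflow matches the callback shape func(tenantID string).
func (s *overflowStats) onOverflow(tenantID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.total++ // assumption: the legacy counter still counts every overflow
	s.byTenant[tenantID]++
}

func main() {
	s := &overflowStats{byTenant: map[string]uint64{}}
	cb := s.onOverflow
	cb("tenant-a")
	cb("tenant-a")
	cb(globalSentinel)
	fmt.Println(s.total, s.byTenant["tenant-a"], s.byTenant[globalSentinel]) // prints: 3 2 1
}
```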
Aggregator API:
- SetCardinalityLimit signature changed to (global, perTenant int,
onOverflow func(tenantID string)). Sole external caller (main.go) is
updated. Old single-arg callback shape is gone.
- flush() resets seriesPerTenant alongside the buckets map so each
new window starts every tenant with a fresh budget.
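
The new setter shape and the per-window reset might look roughly like this. It is a reduced model under stated assumptions: the field names other than `seriesPerTenant`, the bucket value type, and the example key are hypothetical, and locking is omitted for brevity. Only the `SetCardinalityLimit` parameter list and the fact that `flush()` resets `seriesPerTenant` together with the buckets map come from the description.

```go
package main

import "fmt"

// Aggregator is a reduced, hypothetical model with only the fields needed
// to show the limit setter and the per-window reset.
type Aggregator struct {
	globalLimit     int
	perTenantLimit  int
	onOverflow      func(tenantID string)
	buckets         map[string]float64 // value type is an assumption
	seriesPerTenant map[string]int
}

// SetCardinalityLimit follows the signature given in the PR description.
func (a *Aggregator) SetCardinalityLimit(global, perTenant int, onOverflow func(tenantID string)) {
	a.globalLimit = global
	a.perTenantLimit = perTenant
	a.onOverflow = onOverflow
}

// flush hands off the current window and starts a fresh one. Resetting
// seriesPerTenant alongside buckets gives every tenant a full budget again.
func (a *Aggregator) flush() map[string]float64 {
	out := a.buckets
	a.buckets = make(map[string]float64)
	a.seriesPerTenant = make(map[string]int)
	return out
}

func main() {
	a := &Aggregator{buckets: map[string]float64{}, seriesPerTenant: map[string]int{}}
	// 10000 is the documented global default; 100 is an arbitrary example.
	a.SetCardinalityLimit(10000, 100, func(tenantID string) {
		fmt.Println("overflow for", tenantID)
	})
	a.buckets["cpu|tenant-a"] = 1 // hypothetical bucket key
	a.seriesPerTenant["tenant-a"] = 1
	flushed := a.flush()
	fmt.Println(len(flushed), len(a.buckets), a.seriesPerTenant["tenant-a"]) // prints: 1 0 0
}
```

If the per-tenant counts were not cleared in `flush()`, a tenant that exhausted its budget in one window would stay in overflow in every later window.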
Tests cover: zero-config baseline, global-only legacy behavior, per-tenant
fairness (tenant A exhausts budget, tenant B unaffected), per-tenant
overflow buckets stay separate (no merge regression), flush resets
counts, both caps coexist with correct precedence, default behavior
unchanged when only global is set, overflow bucket stat accumulation.
8 tests, all pass under -race; full suite (13 packages) green.
Docs updated in CLAUDE.md (env-var section) and docs/OPERATIONS.md
(defaults section + new alert query under Observability).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
Phase 2 of the multi-phase 150–200 component robustness push. Adds per-tenant fairness to the in-memory TSDB cardinality budget so a noisy tenant cannot starve siblings.
- METRIC_MAX_CARDINALITY_PER_TENANT env var (default 0 = unlimited; preserves single-tenant legacy behavior). METRIC_MAX_CARDINALITY becomes a backstop.
- Per-tenant overflow buckets are tenant-scoped (key suffix |<tenant>) so overflow stats don't merge across tenants.
- otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id}: sentinel __global__ when the global cap (not per-tenant) was the trigger. Existing unlabeled otelcontext_tsdb_cardinality_overflow_total is preserved for back-compat dashboards.

Test plan
- go build ./... clean
- go vet ./... clean
- go test -race ./... passes in all 13 packages (tsdb tests added)
- Aggregator.SetCardinalityLimit API change wired through main.go and the test file; no callers outside the package

Behavior matrix
METRIC_MAX_CARDINALITY | METRIC_MAX_CARDINALITY_PER_TENANT

Docs
- CLAUDE.md env-var section now documents the per-tenant cap and the new label
- docs/OPERATIONS.md defaults section updated; new alert query added under Observability for topk noisy tenants

Follow-ups (separate PRs)
🤖 Generated with Claude Code