Skip to content

fix+test: [timeseries] panic on ExtentList.Splice when shard count exceeds len(el)*4 hint#1016

Open
thinker0 wants to merge 5 commits into
trickstercache:mainfrom
thinker0:fix-splice-panic-high-shard-count
Open

fix+test: [timeseries] panic on ExtentList.Splice when shard count exceeds len(el)*4 hint#1016
thinker0 wants to merge 5 commits into
trickstercache:mainfrom
thinker0:fix-splice-panic-high-shard-count

Conversation

@thinker0
Copy link
Copy Markdown

@thinker0 thinker0 commented May 28, 2026

Description

pkg/timeseries.ExtentList's three Splice helpers — spliceByTime, spliceByTimeAligned, and spliceByPoints — all pre-allocate their result slice with a fixed len(el)*4 length and write via out[k] = ... indexed writes. When the requested time range produces more than 4 shards per input extent, the loop indexes past the slice length and panics with runtime error: index out of range [N] with length N.

The panic is wrapped by DeltaProxyCache's singleflight into a generic proxy-error, surfacing as HTTP 200 with an empty body — a silent failure under realistic shard configurations (7d/1h = 168 shards, 30d/1h = 720 shards, both common in Grafana dashboards backed by Prometheus when shard_max_size_time is enabled).

Reproduce

backends:
  example:
    provider: prometheus
    origin_url: https://example.com/prom
    shard_max_size_time: 1h

Any query_range where (end-start)/maxRange > 4. Minimal: 6h range, 60s step, 1h shard_max_size_time (boundary tick produces 7 shards). Trickster stderr shows the panic; the caller receives HTTP 200 with an empty body.

Fix

For all three Splice helpers:

  1. Compute an upper-bound shard count by summing int(extent.Duration / stride) + padding across all input extents (where stride = max(maxRange, step) for time-based helpers, step * maxPoints for the points helper).
  2. Clamp the total capacity to maxShardCount = 1 << 20 — if exceeded, return a Clone() of the input instead of attempting to splice. This guards against pathological inputs (e.g., nanosecond step over a long range) that would otherwise drive a multi-GB single make().
  3. Switch from length-based pre-allocation + indexed writes to capacity-based pre-allocation + append.

Test coverage

Added subtests to TestSplice in pkg/timeseries/extent_list_test.go:

  • spliceByTime high shard count (6h/1h, regression) — minimal panic case (7 shards) with explicit expected output extents.
  • spliceByTime stress 7d/1h (168 shards) — long-range no-panic + ~168 shards.
  • spliceByTime stress 30d/1h (720 shards) — production worst-case (30-day Grafana dashboard).
  • spliceByTime 4-shard boundary and 5-shard boundary — locks in the exact pre-patch overflow threshold.
  • spliceByTimeAligned stress 30d/1h — parity coverage for the aligned helper.
  • spliceByPoints high shard count — regression test for the points helper.
  • spliceByTime clamps pathological input — exercises the maxShardCount guard without OOM'ing the test runner.

All existing TestSplice subtests pass. pkg/proxy/engines (the consumer of Splice via DeltaProxyCacheRequest) also passes.

Production verification

Running on v2.0.x since the report was filed:

  • Pre-patch: 6h/1h, 7d/1h, 30d/1h all panicked deterministically (runtime error: index out of range [N] with length N).
  • Post-patch: identical requests return 200 OK with full data; result slice allocated once with no append growth in profiles.

Type of Change

    • Bug fix
    • Test coverage

AI Disclosure

    • This contribution DOES NOT include AI-generated changes
    • This contribution DOES include AI-generated changes, and I have reviewed the relevant contributing guidelines.

…ceeds len(el)*4 hint

spliceByTime and spliceByTimeAligned pre-allocated their result slice
with a fixed `len(el)*4` length and used `out[k] = ...` indexed writes.
When the requested time range produces more than 4 shards per input
extent — easily reached on any backend running with shard_max_size_time
(e.g., 6h range / 1h shard = 6 shards, 30d / 1h = 720 shards) — the
loop indexes past the slice length and panics with:

    runtime error: index out of range [N] with length N
    pkg/timeseries.ExtentList.spliceByTime (extent_list.go:249)

The panic surfaces as an upstream proxy-error from DeltaProxyCache
and an empty body to the caller; from a Grafana dashboard's perspective
the long-range query simply fails.

This patch:

1. Computes an upper-bound shard count from the requested range and
   maxRange before allocating, so the slice has enough capacity for
   every shard that the loop will produce (+ a small safety margin for
   the boundary tick at the end of each extent).

2. Switches from length-based pre-allocation + indexed writes to
   capacity-based pre-allocation + append. Combined with (1), large
   ranges (720 shards = 30d at 1h cadence) allocate once with no
   growth; small ranges allocate the same as before.

3. Adds regression coverage:
   - spliceByTime high shard count (6h / 1h = 7 shards) — the minimal
     case that panicked under the old fixed-4 hint.
   - spliceByTime stress 7d/1h (168 shards) and 30d/1h (720 shards) —
     real-world Grafana dashboard ranges that previously panicked.

All existing TestSplice subtests continue to pass.

Signed-off-by: thinker0 <thinker0@gmail.com>
@thinker0 thinker0 requested a review from a team as a code owner May 28, 2026 09:04
@thinker0
Copy link
Copy Markdown
Author

Quick follow-up — this fix has been running on our production deployment of v2.0.x since shortly after the report was filed.

Pre-patch reproducibility (deterministic, every request that crosses the boundary):

  • 6h range with shard_max_size_time: 1h (boundary tick produces 7 shards): runtime error: index out of range [7] with length 4
  • 7d / 1h (168 shards): same panic shape, different [N]
  • 30d / 1h (720 shards): same panic shape, different [N]

In every case DeltaProxyCache returned HTTP 200 with an empty body to the caller (the singleflight wraps the panic into a generic proxy-error), so the failure is invisible to dashboards until you look at the trickster stderr — which is exactly the silent-failure mode discussed under #1015 / #1012.

Post-patch verification (same backend, same config, same Grafana panels):

  • Identical query_range requests across 6h / 7d / 30d ranges: zero panics, 200 OK with full data
  • 7d and 30d responses arrive in a single allocation of the result slice (no append growth in profiles)
  • The +2 / +3 safety margins are effectively free at production cardinality

Test suite reminder (also in the PR body):

  • pkg/timeseries TestSplice: 17 subtests PASS (3 new regression cases for 6h/1h, 7d/1h, 30d/1h)
  • pkg/proxy/engines (the consumer of Splice via DeltaProxyCacheRequest): full pass

Happy to push a go test -race run or a small benchmark for the splice path if useful for review.

@crandles
Copy link
Copy Markdown
Contributor

Can you start with referencing our PR template?

Comment thread pkg/timeseries/extent_list.go Outdated
Comment thread pkg/timeseries/extent_list.go Outdated
@crandles
Copy link
Copy Markdown
Contributor

crandles commented May 28, 2026

I'm guessing this was missed by our integration tests because we're relying on a small set of data, at present. Will backlog something related to this.

There's probably a dev dashboard panel worth adding that would cover this as well. That may be worth adding to the current PR.

Comments referenced the panic this PR fixes; once merged the historical
context becomes noise. The patched math is self-explanatory.

Addresses review feedback on trickstercache#1016.

Signed-off-by: thinker0 <thinker0@gmail.com>
@thinker0
Copy link
Copy Markdown
Author

Thanks for the review @crandles. Addressed both items:

  • PR template: description rewritten to follow .github/pull_request_template.md.
  • Self-referential comments (lines 181 and 238): both removed in the follow-up commit (9b8da54). The patched math is self-explanatory; the historical panic context now lives only in the PR description and commit history.

TestSplice continues to pass after the comment removal — no behavioral change, just dropping noise.

@crandles
Copy link
Copy Markdown
Contributor

crandles commented May 28, 2026

The panic is wrapped by DeltaProxyCache's singleflight into a generic proxy-error, surfacing as HTTP 200 with an empty body — a silent failure under realistic shard configurations

This sounds like a separate bug worth tracking, a 200 does not seem entirely correct

@jranson
Copy link
Copy Markdown
Member

jranson commented May 28, 2026

we have one test on the GitHub runner OOMing. I will re-run the workflow action to see if that gets it to pass, but we should look at why this test (or one ran earlier) requires/holds on to so much memory and address:

=== RUN   TestSplice/spliceByTime_step_gt_maxRange
fatal error: runtime: out of memory
...
FAIL	github.com/trickstercache/trickster/v2/pkg/timeseries	0.126s

This test is also failing:

=== RUN   TestDeltaProxyCacheRequestRangeMissChunks
...
    deltaproxycache_chunk_test.go:732: invalid status, expected status=rmiss in [engine=HTTPProxy; status=proxy-only]
    deltaproxycache_chunk_test.go:737: invalid status, expected fetched=[1776218400000-1776225600000] in [engine=HTTPProxy; status=proxy-only]
--- FAIL: TestDeltaProxyCacheRequestRangeMissChunks (0.02s)

@jranson
Copy link
Copy Markdown
Member

jranson commented May 28, 2026

@thinker0 thanks for the find and helping get the PR adjusted per our standards. We really appreciate it.

I re-ran the workflow for unit tests and we're still seeing the Out-of-Memory error and the failed test in the CI. I have allowed this PR to auto-run CI tests on new commits, so you can push fixes to your branch and watch the check statuses to confirm the failures are corrected.

Locally run a full make test if you were previously just testing the changed packages.

The Splice helpers used range/maxRange to size the result buffer, but
the inner loop advances by 'step' per iteration. When step > maxRange,
that formula overestimates the shard count: the CI case used
maxRange=100ns, step=600s, which hinted ~18B slots and OOM'd the runner.

Use max(maxRange, step) as the divisor — matches actual forward
progress and keeps the existing 7d/30d allocations identical.

Signed-off-by: thinker0 <thinker0@gmail.com>
@thinker0
Copy link
Copy Markdown
Author

Thanks @jranson — great catch on the CI runs.

Pushed 63c8b98 to address the OOM. The capacity hint was range / maxRange, but the splice loop actually advances by step per iteration. When step > maxRange (the existing TestSplice/spliceByTime_step_gt_maxRange case uses step=600s, maxRange=100ns) the hint inflated to ~18 billion slots and the runner OOM'd. Switched to max(maxRange, step) as the divisor — actual forward progress — so 7d/30d allocations are unchanged and the small step_gt_maxRange case now hints 5 slots.

Local make gotest after the fix passes everywhere except TestDeltaProxyCacheRequestRangeMissChunks in pkg/proxy/engines, which fails with a wallclock-dependent fixture (fetched=[1776218400000-1776225600000]). That looks unrelated to this PR — it's the same wallclock-pin timebomb that fix-dpc-chunk-test-timebomb (and the bottom three commits of auto-probe-pool-members) appears to address. Happy to defer to whatever order you'd prefer to land those.

Re: @crandles's separate point about the silent 200 OK + empty body — agreed, that's a distinct singleflight/panic-recovery issue. I can file a separate issue with the panic trace and repro if it helps; let me know if you'd like me to. The dashboard panel idea is a good one too — happy to follow up in a separate PR once this lands.

thinker0 added 2 commits May 29, 2026 09:34
…iceByPoints

Addresses additional review concerns beyond the initial splice panic fix:

- Sum shard counts per extent instead of multiplying len(el) by the
  per-extent max — avoids O(len(el)) over-allocation when extents have
  widely varying durations.

- Clamp the total capacity hint to maxShardCount (1<<20). Pathological
  inputs (very fine step over a long range) would otherwise drive a
  multi-GB single make(); helpers now return a Clone of the input
  instead of splitting, preserving process stability.

- Apply the same capacity-based + clamped pattern to spliceByPoints,
  which carried the identical len(el)*4 anti-pattern that this PR
  originally fixed only in the time-based helpers.

- Use the Go 1.21+ max() builtin (go.mod is at 1.26.3) to replace the
  manual stride conditional.

Signed-off-by: thinker0 <thinker0@gmail.com>
…flow, clamp

- 4-shard and 5-shard boundary cases — locks in the pre-patch overflow
  threshold (len(el)*4 happened to accommodate exactly 4 shards plus a
  boundary tick).
- spliceByTimeAligned stress at 30d/1h — the time-based helper had a
  stress subtest already; the aligned helper now gets parity coverage.
- spliceByPoints high shard count — regression test for the same fix
  pattern applied to the points-based helper.
- Pathological input clamping — exercises the maxShardCount guard
  without OOM'ing the test runner.

Signed-off-by: thinker0 <thinker0@gmail.com>
@thinker0
Copy link
Copy Markdown
Author

Pushed deed7d8 (production code) and 271c01 (tests) addressing additional concerns surfaced during a deeper review of the splice path:

  1. Capacity-hint formula — was len(el) * maxShards (per-extent worst case × number of extents). For multi-extent inputs with widely varying durations this over-allocates by ~len(el). Switched to a true sum: Σ int(extent.Duration / stride) + padding across all extents. Capacity now matches actual output to within the per-extent rounding margin.

  2. DoS clamp on the capacity hintstride = max(maxRange, step) correctly bounds iteration count, but with attacker-controlled step (parsed from the HTTP query in DeltaProxyCache.ParseTimeRangeQuery), a tiny step over a long range can still drive totalCap into the billions. Added const maxShardCount = 1 << 20; if a single call's projected shard count exceeds this, the helper returns el.Clone() instead of allocating. Caller behavior degrades to "no splice" — slower at the backend but the trickster process stays up.

  3. spliceByPoints carried the same len(el)*4 anti-pattern that this PR originally fixed only in the time-based helpers — same panic was reachable via maxPoints > 0 paths. Migrated to the same capacity-based + clamped + append pattern. Caught locally; regression test added.

  4. max() builtin — go.mod is at 1.26.3, so the manual if step > stride conditional was replaced with stride := max(maxRange, step) in both time-based helpers.

New subtests added:

  • spliceByTime 4-shard boundary and 5-shard boundary — locks in the exact pre-patch threshold.
  • spliceByTimeAligned stress 30d/1h — parity with the spliceByTime stress.
  • spliceByPoints high shard count (regression).
  • spliceByTime clamps pathological input — exercises the maxShardCount guard without OOM'ing the test runner.

make gotest passes locally except for TestDeltaProxyCacheRequestRangeMissChunks in pkg/proxy/engines, which is the wallclock-pinned fixture I mentioned earlier — unrelated to this PR.

Re: the 200 OK + empty body observation from @crandles — happy to file that as a separate issue (with the panic trace and a minimal repro) once this lands, so it doesn't drag scope here.

@thinker0 thinker0 requested a review from crandles May 29, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants