fix+test: [timeseries] panic on ExtentList.Splice when shard count exceeds len(el)*4 hint#1016
thinker0 wants to merge 5 commits into …
…ceeds len(el)*4 hint
spliceByTime and spliceByTimeAligned pre-allocated their result slice
with a fixed `len(el)*4` length and used `out[k] = ...` indexed writes.
When the requested time range produces more than 4 shards per input
extent — easily reached on any backend running with shard_max_size_time
(e.g., 6h range / 1h shard = 6 shards, 30d / 1h = 720 shards) — the
loop indexes past the slice length and panics with:
    runtime error: index out of range [N] with length N
    pkg/timeseries.ExtentList.spliceByTime (extent_list.go:249)
The panic surfaces as an upstream proxy-error from DeltaProxyCache
and an empty body to the caller; from a Grafana dashboard's perspective
the long-range query simply fails.
This patch:
1. Computes an upper-bound shard count from the requested range and
maxRange before allocating, so the slice has enough capacity for
every shard that the loop will produce (+ a small safety margin for
the boundary tick at the end of each extent).
2. Switches from length-based pre-allocation + indexed writes to
capacity-based pre-allocation + append. Combined with (1), large
ranges (720 shards = 30d at 1h cadence) allocate once with no
growth; small ranges allocate the same as before.
3. Adds regression coverage:
- spliceByTime high shard count (6h / 1h = 7 shards, counting the
boundary tick): the minimal case that panicked under the old fixed-4 hint.
- spliceByTime stress 7d/1h (168 shards) and 30d/1h (720 shards) —
real-world Grafana dashboard ranges that previously panicked.
All existing TestSplice subtests continue to pass.
Signed-off-by: thinker0 <thinker0@gmail.com>
Quick follow-up: this fix has been running on our production deployment of …
Pre-patch reproducibility (deterministic, every request that crosses the boundary): …
In every case …
Post-patch verification (same backend, same config, same Grafana panels): …
Test suite reminder (also in the PR body): …
Happy to push a …
Can you start by referencing our PR template?
I'm guessing this was missed by our integration tests because we're relying on a small set of data at present. Will backlog something related to this. There's probably a dev dashboard panel worth adding that would cover this as well. That may be worth adding to the current PR.
Comments referenced the panic this PR fixes; once merged the historical context becomes noise. The patched math is self-explanatory.
Addresses review feedback on trickstercache#1016.
Signed-off-by: thinker0 <thinker0@gmail.com>
Thanks for the review @crandles. Addressed both items: …
This sounds like a separate bug worth tracking; a 200 does not seem entirely correct.
We have one test on the GitHub runner OOMing. I will re-run the workflow action to see if that gets it to pass, but we should look at why this test (or one run earlier) requires or holds on to so much memory, and address it: …
This test is also failing: …
@thinker0 thanks for the find and for helping get the PR adjusted per our standards. We really appreciate it. I re-ran the workflow for unit tests and we're still seeing the Out-of-Memory error and the failed test in the CI. I have allowed this PR to auto-run CI tests on new commits, so you can push fixes to your branch and watch the check statuses to confirm the failures are corrected. Locally run a full …
The Splice helpers used range/maxRange to size the result buffer, but the inner loop advances by 'step' per iteration. When step > maxRange, that formula overestimates the shard count: the CI case used maxRange=100ns, step=600s, which hinted ~18B slots and OOM'd the runner.
Use max(maxRange, step) as the divisor; it matches actual forward progress and keeps the existing 7d/30d allocations identical.
Signed-off-by: thinker0 <thinker0@gmail.com>
Thanks @jranson, great catch on the CI runs. Pushed …
Local …
Re: @crandles's separate point about the silent …
…iceByPoints
Addresses additional review concerns beyond the initial splice panic fix:
- Sum shard counts per extent instead of multiplying len(el) by the per-extent max; this avoids O(len(el)) over-allocation when extents have widely varying durations.
- Clamp the total capacity hint to maxShardCount (1<<20). Pathological inputs (very fine step over a long range) would otherwise drive a multi-GB single make(); helpers now return a Clone of the input instead of splitting, preserving process stability.
- Apply the same capacity-based + clamped pattern to spliceByPoints, which carried the identical len(el)*4 anti-pattern that this PR originally fixed only in the time-based helpers.
- Use the Go 1.21+ max() builtin (go.mod is at 1.26.3) to replace the manual stride conditional.
Signed-off-by: thinker0 <thinker0@gmail.com>
…flow, clamp
- 4-shard and 5-shard boundary cases: locks in the pre-patch overflow threshold (len(el)*4 happened to accommodate exactly 4 shards plus a boundary tick).
- spliceByTimeAligned stress at 30d/1h: the time-based helper had a stress subtest already; the aligned helper now gets parity coverage.
- spliceByPoints high shard count: regression test for the same fix pattern applied to the points-based helper.
- Pathological input clamping: exercises the maxShardCount guard without OOM'ing the test runner.
Signed-off-by: thinker0 <thinker0@gmail.com>
Pushed …
New subtests added: …
Re: the …
Description
`pkg/timeseries.ExtentList`'s three Splice helpers (`spliceByTime`, `spliceByTimeAligned`, and `spliceByPoints`) all pre-allocate their result slice with a fixed `len(el)*4` length and write via `out[k] = ...` indexed writes. When the requested time range produces more than 4 shards per input extent, the loop indexes past the slice length and panics with `runtime error: index out of range [N] with length N`.
The panic is wrapped by `DeltaProxyCache`'s singleflight into a generic `proxy-error`, surfacing as HTTP 200 with an empty body. This is a silent failure under realistic shard configurations (7d/1h = 168 shards, 30d/1h = 720 shards, both common in Grafana dashboards backed by Prometheus when `shard_max_size_time` is enabled).

Reproduce
Any `query_range` where `(end-start)/maxRange > 4`. Minimal: 6h range, 60s step, 1h `shard_max_size_time` (boundary tick produces 7 shards). Trickster stderr shows the panic; the caller receives HTTP 200 with an empty body.

Fix
For all three Splice helpers:
- Sum `int(extent.Duration / stride) + padding` across all input extents (where `stride = max(maxRange, step)` for time-based helpers, `step * maxPoints` for the points helper).
- Clamp the total to `maxShardCount = 1 << 20`; if exceeded, return a `Clone()` of the input instead of attempting to splice. This guards against pathological inputs (e.g., nanosecond step over a long range) that would otherwise drive a multi-GB single `make()`.
- Pre-allocate by capacity and write with `append`.

Test coverage
Added subtests to `TestSplice` in `pkg/timeseries/extent_list_test.go`:
- `spliceByTime high shard count (6h/1h, regression)`: minimal panic case (7 shards) with explicit expected output extents.
- `spliceByTime stress 7d/1h (168 shards)`: long-range no-panic + ~168 shards.
- `spliceByTime stress 30d/1h (720 shards)`: production worst-case (30-day Grafana dashboard).
- `spliceByTime 4-shard boundary` and `5-shard boundary`: locks in the exact pre-patch overflow threshold.
- `spliceByTimeAligned stress 30d/1h`: parity coverage for the aligned helper.
- `spliceByPoints high shard count`: regression test for the points helper.
- `spliceByTime clamps pathological input`: exercises the `maxShardCount` guard without OOM'ing the test runner.
All existing `TestSplice` subtests pass. `pkg/proxy/engines` (the consumer of `Splice` via `DeltaProxyCacheRequest`) also passes.

Production verification
Running on `v2.0.x` since the report was filed:
- … (`runtime error: index out of range [N] with length N`).
- … `append` growth in profiles.

Type of Change
AI Disclosure