Commit d1a9289
Fix two rare CI failures: ListPushPopStressTest host crash and VectorManager cleanup vs Reset() AVE (#1765)
Two independent rare CI failures, both surfacing as `Test host process
crashed` and aborting the whole test run.
## 1. `ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync` — fatal `AccessViolationException` in `VectorManager` cleanup task
### Symptom
```
Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
at Tsavorite.core.LogRecord.get_Info()
at Tsavorite.core.LogRecord.get_AllocatedSize()
at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed
```
The AVE is a Corrupted-State Exception — `catch (Exception)` in
`RunCleanupTaskAsync` cannot suppress it; the runtime fails fast and the
test host crashes.
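For context, a minimal sketch (loop shape and names are illustrative, not the actual `RunCleanupTaskAsync` body) of why the broad catch offers no protection:
```csharp
// Illustrative only: an AccessViolationException is a corrupted-state exception,
// so it is never delivered to this catch; the runtime fails fast instead.
static void RunCleanupLoopSketch(Action scanOnePass, ILogger logger, CancellationToken token)
{
    while (!token.IsCancellationRequested)
    {
        try
        {
            scanOnePass();                   // the store scan that raised the AVE
        }
        catch (Exception ex)                 // ordinary failures are retried...
        {
            logger.LogWarning(ex, "cleanup pass failed, will retry");
        }
        // ...but the AVE bypasses the catch entirely and kills the test host.
    }
}
```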
### Root cause
`Recovery.Reset()` → `hlogBase.Reset()` (in `AllocatorBase` and the
per-allocator overrides `SpanByte` / `Object` / `TsavoriteLog`) frees pages
by synchronously invoking `OnPagesClosed(...)` and a
`for (i in BufferSize) FreePage(i)` loop. Both paths ultimately call
`ReturnPage(index)`, which sets:
```csharp
pageArrays[index] = default;
pagePointers[index] = default; // ★ becomes 0
```
`Reset()`'s docstring warns: *"WARNING: assumes that threads have drained
out at this point."* But Garnet's cluster re-attach paths invoke it on a
running store:
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100`
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136`
In both files `storeWrapper.Reset()` is called **before**
`SuspendPrimaryOnlyTasksAsync()`, and even that suspend only drains
`TaskManager` tasks — `VectorManager.cleanupTask` is independent and never
drained.
Once `pagePointers[i] = 0`, the iterator's `GetPhysicalAddress` returns
`0 + offset`, a near-null address in the process's unmapped low pages, and
dereferencing it in `*(RecordInfo*)physicalAddress` raises a fatal AVE.
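A stripped-down sketch of that address computation (names and types simplified from the iterator internals quoted above):
```csharp
// Simplified from ObjectScanIterator / LogRecord (names condensed): the page base
// comes from pagePointers and the record offset is added on top.
static unsafe long GetPhysicalAddressSketch(long* pagePointers, int pageIndex, long offsetOnPage)
    => pagePointers[pageIndex] + offsetOnPage;

// After Reset() has run ReturnPage(pageIndex), pagePointers[pageIndex] == 0, so the
// result is just offsetOnPage, i.e. an address in the unmapped low pages. The first
// read through it, *(RecordInfo*)physicalAddress in LogRecord.get_Info, then faults.
```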
### The exact interleaving
Production scenario in `MigrateVectorSetWhileModifyingAsync`:
1. Source primary migrates a slot containing a vector set → drops the index → `CleanupDroppedIndex` queues a cleanup-task scan on the source primary.
2. The drop AOF entry replicates to the source's replica, which replays it and **also** queues a cleanup-task scan on the replica.
3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → `ReplicaDisklessSync.ReplicateAttachAsync` / `ReplicaDiskbasedSync.ReplicateAttachAsync` calls `storeWrapper.Reset()`.
4. The replica's cleanup task is still mid-iterate over the main store → AVE.
Thread-level interleaving:
```
Thread A: VectorManager cleanup task Thread B: storeWrapper.Reset()
───────────────────────────────────────── ─────────────────────────────────
loop session.Iterate(callbacks)
PushNext → ObjectScanIterator.GetNext()
epoch.Resume() ◄── enter at epoch E
headAddress = HeadAddress (still old value)
LoadPageIfNeeded(...) (cur >= head → in-mem)
physicalAddress =
pagePointers[pageIdx] + offset
Recovery.Reset()
hlogBase.Reset()
HeadAddress ← TailAddress
OnPagesClosed(...)
FreePage(p)
ReturnPage(p)
pagePointers[p] = 0 ◄── ★
// override loop:
for i in BufferSize:
FreePage(i)
ReturnPage(i)
pagePointers[i] = 0
*(RecordInfo*)physicalAddress ◄── ☠ AVE
(LogRecord.GetInfo /
LogRecord.AllocatedSize)
```
### Why epoch protection didn't catch this
Tsavorite's normal eviction path defers page-freeing through:
```csharp
epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));
```
`BumpCurrentEpoch` queues the action and only fires it after
`SafeToReclaimEpoch` has advanced past the prior epoch — i.e. after every
thread that was holding the prior epoch has either suspended or moved on.
That's why scan iterators are safe against normal eviction.
`Reset()` skipped that mechanism in two places:
1. `AllocatorBase.Reset()` invoked `OnPagesClosed(newBeginAddress)` directly.
2. The per-allocator overrides had a `for (i in BufferSize) FreePage(i)`
loop that ran **after** `base.Reset()` returned — also without epoch
protection. **This second loop is the actual point of failure**: even
if `OnPagesClosed` were deferred, the leftover (tail) page is freed by
the override loop while a reader could still be reading it.
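In rough outline (a simplified pseudo-shape, not the verbatim pre-fix code), the two unprotected paths looked like this:
```csharp
// (1) In AllocatorBase, pre-fix: pages are closed on the caller's thread, right now,
//     with no BumpCurrentEpoch deferral.
public virtual void Reset()
{
    // ...address bookkeeping...
    OnPagesClosed(newBeginAddress);
}

// (2) In each allocator implementation (SpanByte / Object / TsavoriteLog), pre-fix:
//     the leftover tail page is freed immediately after base.Reset() returns,
//     while a scan iterator may still be reading it.
public override void Reset()
{
    base.Reset();
    for (int index = 0; index < BufferSize; index++)
        if (IsAllocated(index)) FreePage(index);   // ReturnPage() zeroes pagePointers[index]
    Initialize();
}
```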
### The fix (Tsavorite layer)
`AllocatorBase.Reset()` defers ALL page-close + page-free work through
`BumpCurrentEpoch` and waits on a `ManualResetEventSlim` signalled by the
deferred action — no polling:
```csharp
using var resetComplete = new ManualResetEventSlim(initialState: false);

// If caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
    epoch.Resume();                      // BumpCurrentEpoch requires a protected caller
try
{
    epoch.BumpCurrentEpoch(() =>
    {
        try
        {
            if (headShifted) OnPagesClosed(newBeginAddress);
            FreeAllAllocatedPages();
        }
        finally { resetComplete.Set(); } // never deadlock if action throws
    });
}
finally { epoch.Suspend(); }             // unconditionally so the action can fire
resetComplete.Wait();
if (wasProtected) epoch.Resume();
```
Each per-allocator override (`SpanByte` / `Object` / `TsavoriteLog`) moves
its `FreePage(i)` loop into a new `FreeAllAllocatedPages()` virtual so the
loop runs inside the deferred action:
```csharp
public override void Reset() { base.Reset(); Initialize(); }

protected override void FreeAllAllocatedPages()
{
    for (int index = 0; index < BufferSize; index++)
        if (IsAllocated(index)) FreePage(index);
}
```
### Why this is safe
* The deferred action runs only after `SafeToReclaimEpoch ≥ priorEpoch`,
i.e. after every iterator that was inside `GetNext` at the moment
`Reset()` was called has either suspended or advanced. By the time
`pagePointers[i] = 0` executes, no thread is reading `pagePointers[i]`.
* Iterators that re-enter `GetNext` after `HeadAddress` was shifted see
`currentAddress < headAddress` and route through the buffered disk frame
instead of `pagePointers` — so they don't touch the cleared array.
* `Reset()` blocks until the deferred work has actually run, preserving
its synchronous contract (the override's `Initialize()` after `Reset()`
observes a fully freed page set).
### Test vs. product
Strictly, `Reset()`'s docstring put the burden on callers. The cluster
re-attach paths violate that — they call `Reset()` before draining the
`VectorManager` cleanup task, and `SuspendPrimaryOnlyTasksAsync()` doesn't
cover it. The alternative would be to drain every background reader at
every `Reset()` callsite, but we chose to make `Reset()` itself epoch-safe
because the contract was implicit, callsites are scattered, and Tsavorite
already has the right primitive (`epoch.BumpCurrentEpoch`) — the normal
eviction path uses it. This makes the safety property **enforced** rather
than **assumed**, and protects any future caller / background reader.
### Repro
`test/Garnet.test/VectorCleanupVsResetRaceTests.cs` — adds 4 000 vectors,
drops the set (queues a full-keyspace cleanup scan), then spams
`storeWrapper.Reset()` for 5 s.
* **Without the fix:** crashes the host on every iteration with the exact
production stack (`LogRecord.get_Info` → `ObjectScanIterator.GetNext` →
`VectorManager.RunCleanupTaskAsync`).
* **With the fix:** all 5 `[Repeat]` iterations pass (~2 700 resets per
iteration concurrent with the cleanup iterator), no AVE.
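In outline (identifiers condensed and assumed, not verbatim from the test):
```csharp
// Populate, drop, then hammer Reset() against the in-flight cleanup scan.
AddVectors(count: 4_000);                    // enough pages for the scan to be mid-flight
DropVectorSet();                             // VectorManager queues the full-keyspace scan

var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(5);
while (DateTime.UtcNow < deadline)
    storeWrapper.Reset();                    // pre-fix: AVE in ObjectScanIterator.GetNext
                                             // post-fix: ~2,700 resets in the window, no crash
```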
## 2. `RespListTests.ListPushPopStressTest` — host crash on rare `RedisTimeoutException`
### Symptom
```
Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed
```
### Root cause (two compounding issues)
1. **Worker threads created via `new Thread(() => ...)` had no try/catch.**
In modern .NET an unhandled exception in a manually-created `Thread`
terminates the process, so a single transient `RedisTimeoutException`
aborted the entire test run.
2. **All 20 sync workers shared a single `ConnectionMultiplexer`.** Every
command went through one socket and one background writer. Under CI
load plus lowMemory eviction overhead, the writer fell behind and queued
messages accumulated until SyncTimeout (30 s) tripped. The failure
diagnostics confirmed this: `mc: 1/1, qs: 20, bw: SpinningDown`.
### Fix
* Pre-create one `ConnectionMultiplexer` per worker on the main thread.
Each thread now owns its own socket, eliminating the single-writer
bottleneck. Pre-creating also avoids a 20-way connect storm racing
`ConnectTimeout`.
* Wrap each worker body in try/catch; capture exceptions into a
`ConcurrentBag`, signal stop, exit cleanly. No more host crash.
* Throw the aggregate **before** the post-checks so a real timeout isn't
masked by secondary "list not empty" assertion noise.
* Route the deadline-exceeded path through the failure bag too.
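A condensed sketch of the hardened worker pattern (names such as `workerCount`, `connectionString`, `key`, and `value` are placeholders, not the exact test code):
```csharp
using System.Collections.Concurrent;
using StackExchange.Redis;

var failures = new ConcurrentBag<Exception>();
var stop = new CancellationTokenSource();

// One multiplexer per worker, created up front on the main thread: each thread gets
// its own socket/writer, and there is no 20-way connect storm racing ConnectTimeout.
var connections = new ConnectionMultiplexer[workerCount];
for (var i = 0; i < workerCount; i++)
    connections[i] = ConnectionMultiplexer.Connect(connectionString);

var workers = new Thread[workerCount];
for (var i = 0; i < workerCount; i++)
{
    var db = connections[i].GetDatabase();
    workers[i] = new Thread(() =>
    {
        try
        {
            while (!stop.IsCancellationRequested)
                db.ListLeftPush(key, value);   // or the pop side, depending on worker role
        }
        catch (Exception ex)
        {
            failures.Add(ex);                  // capture instead of crashing the host
            stop.Cancel();                     // let the other workers wind down
        }
    });
    workers[i].Start();
}

foreach (var w in workers) w.Join();

// Surface worker failures before the post-run list assertions so a real timeout
// isn't masked by secondary "list not empty" noise.
if (!failures.IsEmpty)
    throw new AggregateException(failures);
```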
## Files
```
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | 7 ++-
test/Garnet.test/RespListTests.cs | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs | new
```
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>