fix: snapshot recording fails for multi-arm experiments #54

Merged: jjroelofs merged 5 commits into 1.x from fix/snapshot-many-arms on Jun 30, 2026

@jjroelofs

Summary

  • Fixes snapshot sampling for experiments where recordTurns() is called with many arm IDs (e.g. 208 arms for ai_sorting views). The old ($total_turns % $interval) === 0 check assumed single-step increments and never aligned with multi-arm jumps, producing zero snapshots regardless of traffic volume.
  • Replaces exact modulo with range-crossing detection: floor(total_turns / interval) != floor(previous_turns / interval). This correctly detects when a multi-step jump crosses a sampling boundary.
  • Passes the actual $step_size through recordSnapshot() and maybeRecordSnapshots() so the sampler knows the jump width.

Root cause

SnapshotStorage::shouldRecordSnapshot() uses a modulo check to sample snapshots at regular intervals. When ExperimentDataStorage::recordTurns() is called with N arm IDs, total_turns jumps by N per request. Two problems:

  1. First window skipped entirely. With 208 arms the first call sets total_turns to 208, instantly exceeding the first_window of 19 (floor(10000/208) * 0.4).
  2. Middle interval never aligns. The modulo (total_turns % interval) never hits zero because total_turns only ever lands on multiples of 208, and none of those coincides with a multiple of the growing interval. Simulation across all 82 page views (17,160 total turns) confirms zero hits.

The same bug affects rl_sorting (which batches all visible arms via a JS IntersectionObserver into a single action=turns POST) and any experiment with more than ~50 arms.
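The difference between the two checks can be sketched in a few lines. This is an illustration only: it assumes a fixed interval of 1000 turns (the real interval grows with traffic), and the constant and variable names are hypothetical rather than taken from the module.

```php
<?php
// Hypothetical values for illustration: 208 arms per request, 82 requests,
// and a FIXED interval of 1000 turns (the real interval is not fixed).
const ARM_COUNT = 208;
const REQUESTS = 82;
const INTERVAL = 1000;

$modulo_hits = 0;
$crossings = 0;
$total_turns = 0;

for ($request = 1; $request <= REQUESTS; $request++) {
  $previous_turns = $total_turns;
  $total_turns += ARM_COUNT;

  // Old check: fires only when total_turns lands exactly on a multiple.
  if ($total_turns % INTERVAL === 0) {
    $modulo_hits++;
  }

  // Range-crossing check: fires when the step passes over a boundary.
  if (intdiv($total_turns, INTERVAL) !== intdiv($previous_turns, INTERVAL)) {
    $crossings++;
  }
}

// First-window arithmetic quoted in the description above.
$first_window = (int) floor(floor(10000 / ARM_COUNT) * 0.4);

printf(
  "modulo hits: %d, crossings: %d, first window: %d\n",
  $modulo_hits,
  $crossings,
  $first_window
);
// prints "modulo hits: 0, crossings: 17, first window: 19"
```

With these assumed numbers the first request already sets total_turns to 208, well past the first window of 19, and the exact-modulo check never fires while the range-crossing check fires once per boundary.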

Changes

| File | Change |
| --- | --- |
| SnapshotStorageInterface.php | Add optional $step_size param to recordSnapshot() |
| SnapshotStorage.php | shouldRecordSnapshot() and isMilestone() use range-crossing instead of modulo |
| ExperimentDataStorage.php | recordTurns() passes $arm_count as step_size; maybeRecordSnapshots() propagates it |

Test plan

  • Verify fix with simulation: run shouldRecordSnapshot across a 208-arm experiment's traffic; confirm hits > 0
  • Verify existing behaviour preserved for small experiments (step_size=1 degenerates to the old modulo check)
  • Rsync to DDEV site and confirm the experiment detail page no longer shows "No data yet"
  • Run docker compose --profile lint run --rm drupal-lint (passed locally)

Fixes #53

When recordTurns() is called with many arm IDs (e.g. 208 for
ai_sorting), total_turns jumps by the arm count per request.
The shouldRecordSnapshot() modulo check ($total_turns % $interval === 0)
assumes increments of 1 and never aligns with these large jumps,
producing zero snapshots regardless of traffic volume.

Replace exact modulo with range-crossing detection: check whether the
step [previous_turns, total_turns] crosses a sampling boundary via
floor(total_turns / interval) != floor(previous_turns / interval).

Pass the actual step_size through the recording chain so the snapshot
sampler knows the jump width.

Fixes #53

GuzzleHttp\ClientInterface does not define post(); that method only
exists on the concrete Client class. PHPStan correctly flags this
as method.notFound. Use request('POST', ...) which is part of the
interface contract.

Comment thread src/Storage/SnapshotStorage.php Outdated
  // For middle section, check if step crossed an interval boundary.
  $interval = $this->calculateMiddleInterval($snapshots_per_arm, $total_turns);
- return ($total_turns % $interval) === 0;
+ return (int) floor($total_turns / $interval) !== (int) floor($previous_turns / $interval);

[P1] This moving interval makes every large batch a milestone. calculateMiddleInterval() derives the interval from the current $total_turns, so the boundary moves forward on every request. For 208 arms at request 82, total=17056, previous=16848, and interval=1701, yielding buckets 10 and 9; the same comparison succeeds on every preceding request too. Because recordTurns() then writes one row per arm and isMilestone() repeats this predicate, 82 requests create 17,056 permanent rows against the 9,984-row experiment budget, and cleanup cannot remove them because it only deletes is_milestone = 0. Please use a stable sampling threshold/schedule and add a regression test asserting batched traffic does not snapshot every request.
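The effect described in this comment can be reproduced with a small simulation. The sketch below is not the module's code: moving_interval() is a hypothetical stand-in that only shares the property under discussion (it is derived from the current total), and the stable interval of 1705 is an arbitrary fixed value chosen for comparison.

```php
<?php
// Hypothetical stand-in for calculateMiddleInterval(). The real formula is
// not shown in this thread; only its dependence on the current total is.
function moving_interval(int $total_turns): int {
  return max(1, intdiv($total_turns, 10));
}

// Counts how many requests satisfy the range-crossing predicate.
function count_crossings(int $arm_count, int $requests, callable $interval_for): int {
  $crossings = 0;
  $total = 0;
  for ($i = 0; $i < $requests; $i++) {
    $previous = $total;
    $total += $arm_count;
    $interval = $interval_for($total);
    if (intdiv($total, $interval) !== intdiv($previous, $interval)) {
      $crossings++;
    }
  }
  return $crossings;
}

// Interval recomputed from the current total on every request.
$moving = count_crossings(208, 82, 'moving_interval');

// Stable schedule: the interval is fixed and does not move with traffic.
$stable = count_crossings(208, 82, fn (int $total): int => 1705);

printf("moving: %d of 82 requests, stable: %d of 82 requests\n", $moving, $stable);
printf("rows with moving interval: %d (budget %d)\n", $moving * 208, 48 * 208);
// prints "moving: 82 of 82 requests, stable: 10 of 82 requests"
// prints "rows with moving interval: 17056 (budget 9984)"
```

A regression test along these lines could assert that the number of snapshotting requests stays well below the number of batched requests.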

   * interval boundary crossings.
   */
- public function recordSnapshot(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns): void;
+ public function recordSnapshot(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns, int $step_size = 1): void;

[P2] Adding this optional parameter breaks existing interface implementers. It is backward-compatible for callers, but a custom class implementing the previous five-argument signature now fails at class loading with an incompatible declaration fatal error. Since this targets the stable 1.1.x line, please preserve the existing contract or introduce a secondary range-aware capability with a fallback for current implementations.
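One possible shape for the "secondary range-aware capability with a fallback" is an opt-in interface that callers probe with instanceof. The sketch below is only an assumption about how that could look: RangeAwareSnapshotStorageInterface, recordSnapshotRange(), record_with_fallback() and both storage classes are hypothetical names, not part of the module.

```php
<?php
// The original five-argument contract, left untouched.
interface SnapshotStorageInterface {
  public function recordSnapshot(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns): void;
}

// Hypothetical opt-in capability for storages that understand step sizes.
interface RangeAwareSnapshotStorageInterface extends SnapshotStorageInterface {
  public function recordSnapshotRange(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns, int $step_size): void;
}

// Hypothetical caller-side helper: prefer the range-aware method when the
// storage offers it, otherwise fall back to the original method.
function record_with_fallback(SnapshotStorageInterface $storage, string $experiment_id, string $arm_id, int $turns, int $rewards, int $total, int $step_size): void {
  if ($storage instanceof RangeAwareSnapshotStorageInterface) {
    $storage->recordSnapshotRange($experiment_id, $arm_id, $turns, $rewards, $total, $step_size);
    return;
  }
  $storage->recordSnapshot($experiment_id, $arm_id, $turns, $rewards, $total);
}

// A pre-existing custom implementation keeps loading unchanged.
final class LegacyStorage implements SnapshotStorageInterface {
  public array $calls = [];

  public function recordSnapshot(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns): void {
    $this->calls[] = 'recordSnapshot';
  }
}

// A newer implementation that opts in to the range-aware method.
final class RangeAwareStorage implements RangeAwareSnapshotStorageInterface {
  public array $calls = [];

  public function recordSnapshot(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns): void {
    $this->calls[] = 'recordSnapshot';
  }

  public function recordSnapshotRange(string $experiment_id, string $arm_id, int $turns, int $rewards, int $total_experiment_turns, int $step_size): void {
    $this->calls[] = 'recordSnapshotRange';
  }
}

$legacy = new LegacyStorage();
$aware = new RangeAwareStorage();
record_with_fallback($legacy, 'exp', 'arm_1', 3, 1, 208, 208);
record_with_fallback($aware, 'exp', 'arm_1', 3, 1, 208, 208);
```

Because the original interface is unchanged, classes written against the five-argument signature still load, and only storages that opt in receive the step size.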

Address PR #54 review feedback:

P1: Sampling based on total_experiment_turns made every large batch
cross an interval boundary, marking all snapshots as permanent
milestones that cleanup could not remove. Switch to per-arm turns
which always increment by 1 regardless of batch size, making the
modulo check reliable. Milestones now use a coarser multiple of
the sampling interval so they always land on recorded snapshots.

P2: Revert the step_size parameter added to SnapshotStorageInterface,
preserving the original 5-argument contract for existing implementers.

Also:
- Raise MAX_ROWS_PER_EXPERIMENT from 10k to 100k for better chart
  resolution in many-arm experiments (208 arms: 48 -> 250 per arm).
- Lower calculateSnapshotsPerArm floor from 20 to 2 so large arm
  counts stay within the per-experiment row budget.
- Cleanup now enforces per-arm budgets by removing oldest rows
  (including milestones) when arm count grows and quotas shrink.
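A minimal sketch of the per-arm approach described in P1, assuming a sampling interval of 5 per-arm turns and a milestone every 4th sample. Both values and the function names are hypothetical; the actual intervals are computed by the module.

```php
<?php
// Hypothetical values: sample every 5th per-arm turn, and mark every
// 4th sample (every 20th per-arm turn) as a milestone.
const SAMPLE_INTERVAL = 5;
const MILESTONE_EVERY = 4;

function should_record(int $arm_turns): bool {
  // Per-arm turns advance by exactly 1, so exact modulo is reliable.
  return $arm_turns % SAMPLE_INTERVAL === 0;
}

function is_milestone(int $arm_turns): bool {
  // A multiple of the sampling interval, so it always lands on a snapshot.
  return $arm_turns % (SAMPLE_INTERVAL * MILESTONE_EVERY) === 0;
}

$arm_ids = range(1, 208);
$arm_turns = array_fill_keys($arm_ids, 0);
$snapshots = 0;
$milestones = 0;
$orphan_milestones = 0;

// 82 batched requests, each recording one turn for all 208 arms.
for ($request = 1; $request <= 82; $request++) {
  foreach ($arm_ids as $arm_id) {
    $arm_turns[$arm_id]++;
    $turns = $arm_turns[$arm_id];
    if (should_record($turns)) {
      $snapshots++;
    }
    if (is_milestone($turns)) {
      $milestones++;
      // Counts milestones that do not coincide with a recorded snapshot.
      if (!should_record($turns)) {
        $orphan_milestones++;
      }
    }
  }
}

printf("snapshots: %d, milestones: %d, orphans: %d\n", $snapshots, $milestones, $orphan_milestones);
// prints "snapshots: 3328, milestones: 832, orphans: 0"
```

Under these assumptions each arm records 16 snapshots over 82 requests, 4 of them milestones, instead of one permanent row per arm per request.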
@jjroelofs

Critical follow-up before merge

The original batching and interface issues are addressed, but the revised cleanup introduces two critical retention problems:

  1. The configured global 100k row limit is not enforceable. MAX_ROWS_PER_EXPERIMENT now permits each experiment to retain 100k rows, while the global cleanup still deletes only is_milestone = 0 rows. In the 1,000-arm / 1,000-turn simulation, each experiment reaches the 100k per-arm-budget cap with roughly 60k milestones and 40k non-milestones. With only two such experiments, global cleanup can delete all 80k non-milestones but still leaves roughly 120k milestone rows, permanently above the configured 100k limit. Additional experiments make this grow linearly.

  2. The per-arm hard-cap fallback destroys the early history it promises to preserve. When an arm remains over budget, cleanup orders by total_experiment_turns ASC and deletes the oldest rows regardless of milestone status. Those are precisely the first-window snapshots documented as permanent. For 1,000 arms at 1,000 turns, the current policy deletes the earliest six snapshots from every arm. The fallback should compact middle history first while explicitly preserving the allocated early and recent windows.

These need to be resolved before merge: global cleanup requires a final hard-cap path that can actually enforce the configured limit, and per-arm cleanup needs an explicit keep-set for early, middle, and recent history rather than deleting oldest-first. Please also add regression tests covering multiple max-size experiments and preservation of early snapshots during quota shrinkage.
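As a rough illustration of the keep-set idea, the sketch below keeps a fixed early window and recent window and thins only the middle to fit the budget. compact_history() and its parameters are hypothetical, and the sketch assumes the budget is at least as large as the two protected windows combined; it does not cover the case where they must be trimmed too.

```php
<?php
/**
 * Hypothetical per-arm compaction. $rows must be sorted oldest first.
 * Keeps the first $early and last $recent rows, and evenly thins the middle.
 */
function compact_history(array $rows, int $budget, int $early, int $recent): array {
  if ($budget < $early + $recent) {
    throw new InvalidArgumentException('Budget smaller than protected windows.');
  }
  $count = count($rows);
  if ($count <= $budget) {
    return $rows;
  }

  $head = array_slice($rows, 0, $early);
  $tail = array_slice($rows, $count - $recent);
  $middle = array_slice($rows, $early, $count - $early - $recent);

  // Whatever budget remains after the protected windows goes to the middle.
  $quota = $budget - $early - $recent;
  $middle_count = count($middle);
  $kept_middle = [];
  for ($i = 0; $i < $quota; $i++) {
    // Evenly spaced picks across the middle section.
    $kept_middle[] = $middle[intdiv($i * $middle_count, $quota)];
  }

  return array_merge($head, $kept_middle, $tail);
}

// 30 snapshots identified by turn number, budget shrinks to 12 rows.
$kept = compact_history(range(1, 30), 12, 4, 4);
echo implode(',', $kept), "\n";
// prints "1,2,3,4,5,10,16,21,27,28,29,30"
```

In this example the first four and last four snapshots survive the quota shrink, and only the middle is sampled down.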

Address follow-up review on PR #54:

1. Per-arm cleanup now compacts middle history first, explicitly
   preserving the first-window (early learning) and recent-window
   snapshots. Only falls back to trimming early/recent rows when
   the middle section is fully exhausted.

2. Global cleanup now enforces the configured row limit even when
   milestone rows alone exceed it: a second pass removes oldest
   rows regardless of milestone status after the non-milestone
   pass is insufficient.
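The two-pass behaviour in item 2 can be sketched on in-memory rows. This is an assumption-laden illustration, not the module's query logic: enforce_global_limit() and the row shape ('created', 'is_milestone') are hypothetical, and the data is the two-experiment scenario from the review scaled down by 1,000 (120 milestones, 80 non-milestones, limit 100).

```php
<?php
/**
 * Hypothetical two-pass cleanup over rows shaped like
 * ['created' => int, 'is_milestone' => bool].
 */
function enforce_global_limit(array $rows, int $limit): array {
  // Oldest first, so deletions always start with the oldest rows.
  usort($rows, fn (array $a, array $b): int => $a['created'] <=> $b['created']);

  $excess = count($rows) - $limit;
  if ($excess <= 0) {
    return $rows;
  }

  // Pass 1: drop the oldest non-milestone rows while still over the limit.
  $kept = [];
  foreach ($rows as $row) {
    if ($excess > 0 && !$row['is_milestone']) {
      $excess--;
      continue;
    }
    $kept[] = $row;
  }

  // Pass 2: still over the limit, so drop the oldest rows regardless.
  if ($excess > 0) {
    $kept = array_slice($kept, $excess);
  }

  return $kept;
}

// Scaled-down scenario: 3 of every 5 rows are milestones (120 of 200).
$rows = [];
for ($i = 0; $i < 200; $i++) {
  $rows[] = ['created' => $i, 'is_milestone' => ($i % 5) < 3];
}

$kept = enforce_global_limit($rows, 100);
$remaining_milestones = count(array_filter($kept, fn (array $r): bool => $r['is_milestone']));

printf("kept: %d, milestones among them: %d\n", count($kept), $remaining_milestones);
// prints "kept: 100, milestones among them: 100"
```

Pass 1 alone would leave 120 rows here, which is the situation the follow-up review describes; pass 2 is what brings the table back to the configured limit.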
jjroelofs merged commit 35e3477 into 1.x on Jun 30, 2026
3 checks passed
jjroelofs deleted the fix/snapshot-many-arms branch on June 30, 2026 12:14

Development

Successfully merging this pull request may close these issues.

Snapshot recording silently fails for experiments with many arms
