fix(bench): correct matrix-harness reuse measurement, thread guard, resume #189

Merged
wesleyscholl merged 1 commit into main from bench/matrix-harness-fixes on Jun 30, 2026

@wesleyscholl

Collaborator

Why

The ollama-vs-squish benchmark matrix (benchmarks/ollama_vs_squish/matrix/) had four issues that blocked a real run. All four are harness-only — no change to squish/ runtime.

Fixes

  • Reuse mismeasured. This Ollama build reports the full prompt_eval_count even when the KV prefix is reused, and Squish's reuse counters miss the prefix-slot path. Both engines now fall back to the prefill-time collapse, 1 - warm/cold, where cold is a cold-prefill reference measured per (system, ctx); this keeps the head-to-head comparison apples-to-apples. A sentinel run-index (10_000_000) ensures the cold reference shares no cache prefix with the primed block. (cache_probe.py, cell.py)
  • Thread crash. RSSSampler/TemperatureSampler shadowed a threading.Thread internal with self._stop; renamed to _stop_event. (memory.py, thermal.py)
  • Governor false-positive. "compressed"/"swap" matched benign "free_swap=0 B" log lines and falsely flagged degradation; narrowed to "swapping"/"paged out". (memory.py)
  • Wrong metrics URL. The harness requested Squish metrics at /metrics; they are served at /v1/metrics. (systems.py)
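
For readers who want the reuse fallback spelled out, here is a minimal sketch of the decision rule, not the code in cache_probe.py. The function name, its parameters, and the test for a "reliable" count (fewer evaluated tokens than the prompt holds) are assumptions made for illustration; the PR only states that a genuine partial count wins and that an unreliable count defers to 1 - warm/cold.

```python
from typing import Optional


def reuse_fraction(
    prompt_tokens: int,
    reported_eval_count: Optional[int],
    warm_prefill_s: float,
    cold_prefill_s: float,
) -> float:
    """Estimate the fraction of the prompt served from cache, clamped to [0, 1].

    Hypothetical helper: names and the reliability check are illustrative.
    """
    # Count-based estimate: trusted only when the engine reports a genuine
    # partial count, i.e. it evaluated fewer tokens than the prompt contains.
    if (
        prompt_tokens > 0
        and reported_eval_count is not None
        and 0 <= reported_eval_count < prompt_tokens
    ):
        return 1.0 - reported_eval_count / prompt_tokens

    # Fallback: prefill-time collapse against the cold-prefill reference.
    if cold_prefill_s <= 0:
        return 0.0  # no usable reference, so report no measurable reuse
    ratio = 1.0 - warm_prefill_s / cold_prefill_s
    return min(1.0, max(0.0, ratio))
```

With a full count reported (1000 of 1000 tokens) and a warm prefill of 0.2 s against a cold 1.0 s, the sketch returns the time-based estimate; with a partial count (250 of 1000), it returns the count-based one regardless of timing.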

Also

  • --resume <dir> for run_matrix: cells whose <cell_id>.json already exists are loaded and skipped → crash-safe / overnight-resumable.
  • Two unit tests for the Ollama prefill-ratio fallback (count-unreliable → ratio wins; genuine partial count → count wins).
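
The resume behaviour can be sketched roughly as follows. This is an illustration under assumptions, not run_matrix itself: `run_cells`, `run_cell`, and the write-to-temp-then-rename step are hypothetical; only the "skip a cell when <cell_id>.json exists" rule comes from the description above.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_cells(
    cell_ids: Iterable[str],
    run_cell: Callable[[str], dict],
    out_dir: Path,
) -> Dict[str, dict]:
    """Run every cell, loading and skipping those with a saved result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, dict] = {}
    for cell_id in cell_ids:
        path = out_dir / f"{cell_id}.json"
        if path.exists():
            # A previous run already finished this cell: load it and move on.
            results[cell_id] = json.loads(path.read_text())
            continue
        result = run_cell(cell_id)
        # Assumption: write to a temp file and rename, so a crash mid-write
        # never leaves a truncated <cell_id>.json that would be skipped later.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(result))
        tmp.replace(path)
        results[cell_id] = result
    return results
```

Calling `run_cells` a second time with the same output directory executes nothing and returns the stored results, which is the property that makes an overnight run restartable.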

Test

CI=1 .venv/bin/python -m pytest tests/test_benchmark_matrix.py -q   # 58 passed
ruff check / ruff format --check                                    # clean

🤖 Generated with Claude Code

…esume

Four issues blocked a real ollama-vs-squish matrix run:

- Reuse mismeasured. This Ollama build reports the full prompt_eval_count even
  when the KV prefix is reused, and Squish's reuse counters miss the prefix-slot
  path. Both engines now fall back to the prefill-time collapse (1 - warm/cold)
  against a cold-prefill reference measured per system+ctx, keeping the
  head-to-head apples-to-apples. (cache_probe.py, cell.py)
- Thread crash. RSSSampler/TemperatureSampler shadowed a Thread internal with
  self._stop; renamed to _stop_event. (memory.py, thermal.py)
- Governor false-positive. "compressed"/"swap" matched benign "free_swap=0 B"
  log lines; narrowed to "swapping"/"paged out". (memory.py)
- Squish metrics URL was /metrics; it is served at /v1/metrics. (systems.py)

Also adds --resume to run_matrix (per-cell JSON skip -> crash-safe) and two unit
tests for the Ollama prefill-ratio fallback.
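
As an illustration of the thread-guard fix, here is a minimal sampler that keeps its stop flag in `_stop_event`. It assumes the samplers subclass threading.Thread; `PeriodicSampler` and what it records are hypothetical stand-ins for RSSSampler/TemperatureSampler. On CPython versions where Thread defines a private `_stop()` method, assigning an Event to `self._stop` can make `join()` fail because the Event is not callable, which is why the attribute gets a distinct name.

```python
import threading
import time
from typing import List


class PeriodicSampler(threading.Thread):
    """Hypothetical background sampler that records a timestamp per tick."""

    def __init__(self, interval_s: float = 0.01) -> None:
        super().__init__(daemon=True)
        self.interval_s = interval_s
        self.samples: List[float] = []
        # Named _stop_event rather than _stop, so it cannot shadow anything
        # threading.Thread uses internally.
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Event.wait returns True once the event is set, which ends the loop.
        while not self._stop_event.wait(self.interval_s):
            self.samples.append(time.monotonic())

    def stop(self) -> None:
        self._stop_event.set()
        self.join()
```

Typical use is `s = PeriodicSampler(); s.start(); ...; s.stop()`, after which `s.samples` holds whatever was collected.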

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wesleyscholl wesleyscholl merged commit ccb7dd4 into main Jun 30, 2026
17 checks passed
@wesleyscholl wesleyscholl deleted the bench/matrix-harness-fixes branch June 30, 2026 02:52