fix(bench): correct matrix-harness reuse measurement, thread guard, resume by wesleyscholl · Pull Request #189 · konjoai/squish

wesleyscholl · 2026-06-29T18:17:47Z

Why

The ollama-vs-squish benchmark matrix (benchmarks/ollama_vs_squish/matrix/) had four issues that blocked a real run. All four are harness-only — no change to squish/ runtime.

Fixes

Reuse mismeasured. This Ollama build reports the full prompt_eval_count even when the KV prefix is reused, and Squish's reuse counters miss the prefix-slot path. Both engines now fall back to the prefill-time collapse, 1 - warm/cold, where cold is a cold-prefill reference measured per (system, ctx); this keeps the head-to-head comparison apples-to-apples. A sentinel run-index (10_000_000) ensures the cold reference shares no cache prefix with the primed block. (cache_probe.py, cell.py)
Thread crash. RSSSampler/TemperatureSampler shadowed a threading.Thread internal with self._stop; renamed to _stop_event. (memory.py, thermal.py)
Governor false-positive. "compressed"/"swap" matched benign "free_swap=0 B" log lines and falsely flagged degradation; narrowed to "swapping"/"paged out". (memory.py)
Metrics URL. The harness requested Squish metrics at /metrics, but they are served at /v1/metrics. (systems.py)
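
The reuse fallback from the first fix can be sketched roughly as follows. This is an illustration only, not the code in cache_probe.py: the names `prefill_collapse` and `reuse_fraction`, their parameters, and the rule that a count equal to the full prompt length is treated as unreliable are all assumptions made for the example.

```python
from typing import Optional


def prefill_collapse(warm_prefill_s: float, cold_prefill_s: float) -> float:
    """Estimate prefix reuse as 1 - warm/cold, clamped to [0, 1]."""
    if cold_prefill_s <= 0:
        # No usable cold reference: report no measurable reuse.
        return 0.0
    return min(1.0, max(0.0, 1.0 - warm_prefill_s / cold_prefill_s))


def reuse_fraction(
    prompt_tokens: int,
    evaluated_tokens: Optional[int],
    warm_prefill_s: float,
    cold_prefill_s: float,
) -> float:
    """Prefer a genuine partial token count; otherwise use the timing ratio."""
    # Assumed rule: a count strictly below the prompt length is a genuine
    # partial count, so the count wins.
    if evaluated_tokens is not None and 0 <= evaluated_tokens < prompt_tokens:
        return 1.0 - evaluated_tokens / prompt_tokens
    # Count missing or equal to the full prompt (unreliable): the ratio wins.
    return prefill_collapse(warm_prefill_s, cold_prefill_s)
```

With a full count of 1000 tokens, a warm prefill of 0.2 s, and a cold reference of 2.0 s, the sketch reports roughly 0.9 reuse from timing alone; with a partial count of 250 out of 1000 it reports 0.75 from the count and ignores the timings.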

Also

--resume <dir> for run_matrix: cells whose <cell_id>.json already exists are loaded and skipped → crash-safe / overnight-resumable.
Two unit tests for the Ollama prefill-ratio fallback (count-unreliable → ratio wins; genuine partial count → count wins).
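
A minimal sketch of the per-cell resume behaviour, assuming each cell writes its result to `<cell_id>.json` in the output directory. The `run_cells` function, its `run_cell` callback, and the temp-file rename are hypothetical choices for this example; the actual run_matrix code may differ.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_cells(
    cell_ids: Iterable[str],
    out_dir: Path,
    run_cell: Callable[[str], dict],
) -> Dict[str, dict]:
    """Run each cell once, reusing any result already saved in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results: Dict[str, dict] = {}
    for cell_id in cell_ids:
        path = out_dir / f"{cell_id}.json"
        if path.exists():
            # Finished in an earlier run: load the saved result and skip.
            results[cell_id] = json.loads(path.read_text())
            continue
        result = run_cell(cell_id)
        # Write to a temp file, then rename, so a crash mid-write does not
        # leave a partial <cell_id>.json that a later resume would trust.
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(result))
        tmp.replace(path)
        results[cell_id] = result
    return results
```

Calling `run_cells` a second time on the same directory re-runs only the cells that have no saved JSON file.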

Test

CI=1 .venv/bin/python -m pytest tests/test_benchmark_matrix.py -q   # 58 passed
ruff check / ruff format --check                                    # clean

🤖 Generated with Claude Code

…esume

Four issues blocked a real ollama-vs-squish matrix run:

- Reuse mismeasured. This Ollama build reports the full prompt_eval_count even when the KV prefix is reused, and Squish's reuse counters miss the prefix-slot path. Both engines now fall back to the prefill-time collapse (1 - warm/cold) against a cold-prefill reference measured per system+ctx, keeping the head-to-head apples-to-apples. (cache_probe.py, cell.py)
- Thread crash. RSSSampler/TemperatureSampler shadowed a Thread internal with self._stop; renamed to _stop_event. (memory.py, thermal.py)
- Governor false-positive. "compressed"/"swap" matched benign "free_swap=0 B" log lines; narrowed to "swapping"/"paged out". (memory.py)
- Squish metrics URL was /metrics; it is served at /v1/metrics. (systems.py)

Also adds --resume to run_matrix (per-cell JSON skip -> crash-safe) and two unit tests for the Ollama prefill-ratio fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wesleyscholl merged commit ccb7dd4 into main Jun 30, 2026
17 checks passed

wesleyscholl deleted the bench/matrix-harness-fixes branch June 30, 2026 02:52

wesleyscholl merged 1 commit into main from bench/matrix-harness-fixes
