An end-to-end LLM evaluation pipeline connecting lm-evaluation-harness to a local Ollama server — benchmark scoring, load testing, prompt optimization, and output guardrails with zero cloud API calls.
Runs on Mac M3 8 GB. No cloud API keys required.
A custom lm-eval adapter built from scratch — implementing the full LM interface against Ollama's native API rather than relying on any off-the-shelf integration. The core challenge is loglikelihood scoring: lm-eval needs log-probabilities over token continuations for multiple-choice tasks like MMLU and HellaSwag, and getting those out of Ollama required working through several endpoints before landing on a working approach. See eval_runner/DESIGN.md for the full technical walkthrough.
The hardware constraint — Mac M3 with 8 GB unified memory — is an intentional design parameter. Phi-3 Mini (3.8B, ~2.5 GB) is the largest model that fits without paging, keeping performance measurements meaningful.
Results run on phi3 with temperature=0.0, seed=42, num_ctx=2048.
| Benchmark | Evaluation Mode | n | Score |
|---|---|---|---|
| MMLU (57 subjects) | loglikelihood | 20 / subject | 68.3% accuracy |
| HellaSwag | loglikelihood | 20 | 60.0% acc_norm |
| Custom QA (factual) | generate_until | 15 | 46.7% exact match |
Four inference-time configurations tested on a 250-example MMLU subset (5 subjects, 50 each). 95% CIs via 1000-iteration bootstrap resampling (seed=42).
| Config | Accuracy | 95% CI | Δ vs Baseline |
|---|---|---|---|
| Baseline (0-shot) | 63.6% | [57.6%, 69.6%] | — |
| Few-shot (5-shot) | 66.4% | [60.4%, 72.0%] | +2.8 pp |
| Format hint only | 63.6% | [58.0%, 69.6%] | +0.0 pp |
| Few-shot + format hint | 65.6% | [60.0%, 71.2%] | +2.0 pp |
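The confidence intervals above can be approximated with a percentile bootstrap over per-question 0/1 outcomes. The sketch below is illustrative only: the function name `bootstrap_accuracy_ci` and the percentile method are assumptions, and the resampling code actually used in improve/infer.py may differ, so the interval it prints need not match the table exactly.

```python
import random

def bootstrap_accuracy_ci(correct, iterations=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for mean accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)          # seeded so the interval is repeatable
    n = len(correct)
    means = []
    for _ in range(iterations):
        # Resample n outcomes with replacement and record the resampled accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical outcome vector: 159 correct out of 250 (63.6%, the baseline row).
outcomes = [1] * 159 + [0] * 91
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.1%}  95% CI=[{lower:.1%}, {upper:.1%}]")
```

With n=250, the interval is roughly ±6 pp wide, which is why the four configurations are hard to separate statistically.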
Concurrent load test using ThreadPoolExecutor. Note: Ollama processes requests serially, so concurrency > 1 measures server-side queue latency rather than parallel throughput. This distinction matters when sizing inference infrastructure.
| Concurrency | Prompt | p50 latency | p95 latency | Mean TPS |
|---|---|---|---|---|
| 1 | short | 2,847 ms | 4,160 ms | 47.4 tok/s |
| 1 | long | 6,226 ms | 7,382 ms | 34.3 tok/s |
| 2 | short | 5,882 ms | 6,859 ms | 47.5 tok/s |
| 4 | short | 10,086 ms | 12,145 ms | 30.5 tok/s |
TTFT (Time To First Token): ~97 ms for short prompts, ~430 ms for long prompts — consistent across all concurrency levels, confirming the bottleneck is generation not prompt processing.
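The queueing effect in the table can be reproduced without a model. The following sketch is not the code in perf/load_test.py: `fake_generate` is a hypothetical stand-in that uses a lock and a sleep to imitate a server that handles one request at a time, and `run_load` drives it through ThreadPoolExecutor and reports percentiles.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles

# Stands in for a server that processes one request at a time (assumption).
_server_lock = threading.Lock()

def fake_generate(service_time=0.02):
    """Hypothetical request: returns end-to-end latency in seconds."""
    start = time.perf_counter()
    with _server_lock:              # concurrent callers queue here
        time.sleep(service_time)    # simulated generation time
    return time.perf_counter() - start

def run_load(concurrency, requests=8):
    """Fire `requests` calls with `concurrency` workers; return p50/p95 latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: fake_generate(), range(requests)))
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}
```

Calling `run_load(1)` and `run_load(4)` should show per-request latency growing with concurrency even though the simulated service time is constant, which is the same pattern as the measured p50 column.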
The pipeline has four layers, each with a single responsibility:
| Layer | File | Key Design Decision |
|---|---|---|
| Runner | eval_runner/run_eval.py | Wraps lm_eval.simple_evaluate(); full compatibility with all 300+ lm-eval tasks |
| Adapter | eval_runner/ollama_lm_eval_adapter.py | @register_model("ollama"), token-by-token logprob accumulation, -100.0 penalty for tokens outside top-20 |
| Client | ollama_client.py | Thin HTTP wrapper, raw-mode vs generation-mode selection, single retry on timeout |
| Cache | eval_runner/cache.py | SHA-256 keyed, disk-backed JSON, atexit flush; makes re-runs nearly instant |
The cache is essential: a single MMLU run (57 subjects × 20 examples × 4 answer choices × 1–3 tokens each) makes thousands of API calls. Without caching, the ablation study in Part E would be impractical to iterate on.
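A cache with the properties listed in the table (SHA-256 keys, disk-backed JSON, flush at exit) might look roughly like the sketch below. The class name `PromptCache`, its method names, and the choice to hash the prompt together with the generation parameters are assumptions made for illustration; eval_runner/cache.py may be organised differently.

```python
import atexit
import hashlib
import json
import os

class PromptCache:
    """Illustrative disk-backed cache keyed by SHA-256 of prompt + parameters."""

    def __init__(self, path):
        self.path = path
        self.hits = 0
        self.misses = 0
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._data = json.load(f)
        atexit.register(self.flush)   # persist whatever was added during the run

    @staticmethod
    def make_key(prompt, params):
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt, params):
        value = self._data.get(self.make_key(prompt, params))
        if value is None:
            self.misses += 1
        else:
            self.hits += 1
        return value

    def set(self, prompt, params, value):
        self._data[self.make_key(prompt, params)] = value

    def flush(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._data, f)
```

Including the parameters in the key means that changing temperature or num_ctx produces a miss rather than a stale hit.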
What it does: Starts Ollama, verifies the server is healthy, confirms the model is available, and runs sample generations to validate end-to-end connectivity.
make serve # health check + 3 sample generations
make client # demonstrate OllamaClient usage patterns
make smoke # determinism, instruction following, latency checks

Technical detail: serve.py registers a process cleanup handler via atexit, so the server is gracefully terminated when the script exits rather than left orphaned.
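The atexit cleanup pattern can be sketched as follows. This is a generic illustration, not the contents of serve/serve.py: the helper name `start_with_cleanup` and the grace period are assumptions.

```python
import atexit
import subprocess

def start_with_cleanup(cmd, grace_seconds=5.0):
    """Start a child process and register a handler that stops it at interpreter exit."""
    proc = subprocess.Popen(cmd)

    def _cleanup():
        if proc.poll() is None:          # child is still running
            proc.terminate()             # ask it to stop first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()              # force-stop if it ignores the request
                proc.wait()

    atexit.register(_cleanup)
    return proc, _cleanup
```

For example, `start_with_cleanup(["ollama", "serve"])` would launch the server and tie its lifetime to the script. Returning `_cleanup` as well lets a caller stop the server early.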
What it does: Runs lm-eval benchmarks against the local Ollama server via a custom @register_model("ollama") adapter.
make eval-mmlu # MMLU (57 subjects, 20 examples each)
make eval-hellaswag # HellaSwag (20 examples)
make eval-custom # Custom QA benchmark (15 examples, exact match)

The adapter uses token-by-token logprob accumulation to score each answer choice. See eval_runner/DESIGN.md for the full implementation walkthrough.
Results are saved to eval_runner/results/ with timestamps.
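As a rough illustration of token-by-token accumulation, the sketch below scores a continuation one token at a time against a caller-supplied lookup. The names `score_continuation` and `get_top_logprobs` are hypothetical, the lookup is assumed to return a dict of the top candidate tokens and their logprobs for a given prefix, and the way `is_greedy` is derived here is an assumption; the real adapter is described in eval_runner/DESIGN.md.

```python
import math

OUT_OF_TOP_K_PENALTY = -100.0  # used when the target token is not among the candidates

def score_continuation(context, continuation_tokens, get_top_logprobs):
    """Return (sum of logprobs, is_greedy) for the continuation given the context."""
    if not continuation_tokens:
        return 0.0, True                       # empty continuation edge case
    total, is_greedy = 0.0, True
    prefix = context
    try:
        for token in continuation_tokens:
            top = get_top_logprobs(prefix)     # {token_text: logprob}
            if token in top:
                total += top[token]
                if token != max(top, key=top.get):
                    is_greedy = False          # a different token was more likely
            else:
                total += OUT_OF_TOP_K_PENALTY  # true logprob unknown; apply penalty
                is_greedy = False
            prefix += token                    # condition the next step on this token
    except Exception:
        return -math.inf, False                # treat any failure as a non-match
    return total, is_greedy
```

For a multiple-choice question, the caller would score each answer choice this way and pick the one with the highest total.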
What it does: Runs a concurrent load test across 24 configurations (3 concurrency levels × 2 prompt types × 2 stop-sequence settings × 2 cache states) and records per-request TTFT, total latency, and tokens/second.
make perf # runs load test → perf/metrics.csv + perf/summary.csv

Key finding: TTFT stays flat at ~97 ms regardless of concurrency, but total latency scales linearly with concurrent requests, which confirms that Ollama queues requests server-side. This means a single Ollama instance cannot benefit from request parallelism for latency-sensitive workloads; a pool of instances (or a batching-capable runtime like vLLM) would be needed for production.
The Jupyter notebook perf/analysis.ipynb walks through the full analysis with charts.
What it does: Validates output reliability across three dimensions: determinism, format correctness, and JSON schema adherence.
make guardrails # writes validation results to guardrails/results.json

- Determinism: Runs the same 3 prompts 3 times each and asserts identical outputs. Guaranteed by temperature=0.0; this check verifies the server is configured correctly before a full benchmark run.
- Format validation: Regex-based checks on expected answer types (single letter, integer, city name, etc.).
- JSON schema: Validates that structured outputs are valid JSON containing the expected keys (manual key-set validation, no external library).
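The three checks above can be sketched in a few lines. The function names, the pattern table, and the `generate` callable are all hypothetical; guardrails/validate.py may structure these checks differently.

```python
import json
import re

# Illustrative patterns for two of the expected answer types.
FORMAT_PATTERNS = {
    "single_letter": re.compile(r"[A-D]"),
    "integer": re.compile(r"-?\d+"),
}

def check_format(output, kind):
    """True if the whole (stripped) output matches the pattern for `kind`."""
    return FORMAT_PATTERNS[kind].fullmatch(output.strip()) is not None

def check_json_keys(output, expected_keys):
    """True if output parses as a JSON object containing every expected key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(expected_keys) <= set(parsed)

def check_determinism(generate, prompt, runs=3):
    """True if `generate(prompt)` returns the same text on every run."""
    return len({generate(prompt) for _ in range(runs)}) == 1
```

Using `fullmatch` rather than `search` rejects outputs such as "The answer is B", which is usually the stricter behaviour wanted before a benchmark run.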
Note: top_p cannot be set to a custom value via Ollama's REST API options dict — it requires a Modelfile. It is stored in the config for forward compatibility but has no runtime effect. The determinism guarantee relies on temperature=0.0 alone. See guardrails/guardrails.md for details.
What it does: Tests inference-time accuracy improvements on a 250-example MMLU subset using two levers: few-shot prompting and format hints.
make improve # downloads data, runs all 4 configs, prints comparison table

Or step by step:

python improve/prepare_data.py # download MMLU subset → dev.jsonl / test.jsonl
bash improve/eval.sh phi3 # run all 4 configs, write improve/results/

Best result: 5-shot prompting → +2.8 pp (63.6% → 66.4%)
The gain is not because the model "learns" from the examples — it's because the formatted shots anchor the response pattern (Answer: X), shifting the token probability distribution away from non-letter tokens at the answer position. This is especially important for logprob-based evaluation, where a single tokenization difference can flip a correct answer to incorrect.
Full analysis in improve/report.md.
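To make the anchoring effect concrete, here is a minimal sketch of how a few-shot prompt with a fixed "Answer: X" pattern could be assembled. The function names and the exact template are assumptions; the template actually used lives in improve/optimize_prompt.py and may differ.

```python
LETTERS = "ABCD"

def format_question(question, choices, answer=None):
    """Render one question; leave 'Answer:' open when no answer is given."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)

def build_few_shot_prompt(shots, question, choices):
    """Concatenate solved examples, then the unsolved target question."""
    blocks = [format_question(s["question"], s["choices"], s["answer"]) for s in shots]
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)

# Hypothetical dev-set example used as a single shot.
shots = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
prompt = build_few_shot_prompt(shots, "3 + 3 = ?", ["5", "6", "7", "8"])
print(prompt)
```

Because every shot ends with "Answer: " followed by a single letter, the open "Answer:" at the end of the prompt makes a letter token the most natural continuation.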
All published results can be reproduced exactly on any Apple Silicon Mac with 8 GB of unified memory. The key reproducibility guarantees are built into the design:
| Factor | Value | Where set |
|---|---|---|
| Model | phi3 (3.8B, Ollama tag phi3:latest) | Makefile, all scripts |
| Inference | temperature=0.0, seed=42, top_p=1.0 | ollama_client.py, improve/infer.py |
| Context window | num_ctx=2048 | All scripts |
| Bootstrap seed | seed=42, 1000 iterations | improve/infer.py |
| Eval data | committed to repo | improve/dev.jsonl, improve/test.jsonl |
| Results | committed to repo | eval_runner/results/, improve/results/ |
| Python env | pinned in requirements-lock.txt | n/a |
# Install exact environment used to generate published results
pip install -r requirements-lock.txt

Results in eval_runner/results/ and improve/results/ are the reference outputs. Running the pipeline again should produce identical scores. The prompt cache at eval_runner/cache.json is excluded from the repo (auto-regenerated), so the first re-run will be slower, but subsequent runs are served from the cache and are nearly instant.
- macOS (tested on M3; Metal acceleration used automatically)
- Ollama ≥ 0.17.0 installed and on your PATH
- Python 3.14 (pyenv recommended for version management)
git clone https://github.com/your-username/local-llm-eval.git
cd local-llm-eval
# Pull the model
ollama pull phi3
# Create and activate a virtual environment
python3.14 -m venv venv
source venv/bin/activate
# Install exact pinned environment (recommended) or loose spec
pip install -r requirements-lock.txt # reproducible
# pip install -r requirements.txt # if 3.14 isn't available

make serve # start Ollama, verify health, run sample generations
make test # unit tests (no Ollama server required)
make eval-mmlu # MMLU benchmark (~10 min, 57 subjects)
make eval-hellaswag # HellaSwag benchmark
make eval-custom # custom QA benchmark (exact match)
make perf # load test + latency/throughput charts
make guardrails # output validation (determinism, format, JSON schema)
make improve # prompt optimization pipeline

| Subject | Baseline | Few-Shot | Δ |
|---|---|---|---|
| High School Biology | 70.0% | 80.0% | +10.0 pp |
| College Computer Science | 54.0% | 62.0% | +8.0 pp |
| High School Mathematics | 32.0% | 34.0% | +2.0 pp |
| Marketing | 88.0% | 88.0% | 0.0 pp |
| Philosophy | 74.0% | 68.0% | −6.0 pp |
Biology and CS benefit most from few-shot context. Mathematics barely moves — phi3's symbolic reasoning ceiling is the binding constraint, not prompt format. Marketing is already near ceiling (88%). Philosophy regresses slightly because the dev-set examples introduce misleading context for edge-case ethical reasoning questions, illustrating that few-shot is not universally beneficial.
| # | Subject | Question (truncated) | Correct | Baseline | Few-Shot |
|---|---|---|---|---|---|
| 1 | Math | "Powerful" integer: 392, 336, 300, or 297? | A | D | A |
| 2 | Math | Arithmetic sequences of odd integers summing to 240 | B | A | B |
| 3 | Math | Polynomial expansion coefficient | B | D | B |
| 4 | Math | 4 balls into 2 indistinguishable boxes | D | B | D |
| 5 | Math | Parallelogram angle B=110°, find degrees | D | A | D |
| 6 | Math | Probability drawing black-white pairs in order | D | A | D |
| 7 | Biology | Pike-cichlid predation — selective pressure on algae-eaters | C | D | C |
| 8 | Biology | What is a Barr body and its significance? | C | B | C |
| 9 | Biology | Mink fur genetics (brown dominant) — parental cross | C | D | C |
| 10 | Biology | Heterotroph hypothesis — event before oxygen photosynthesis | B | C | B |
| 11 | Biology | Which statement about variation is true? | D | B | D |
| 12 | CS | NoNicks OS — file-read time ratio comparison | B | D | B |
| 13 | CS | Which is NOT characteristic of good software design? | — | wrong | correct |
| 14 | CS | Database normalisation concept | — | wrong | correct |
| Metric | Baseline | Few-Shot | Change |
|---|---|---|---|
| Elapsed time (250 examples) | 79 s | 84 s | +6% |
| Approx. prompt tokens / question | ~80 | ~450 | ~5.6× |
| API calls | 250 | 250 | — |
Few-shot adds ~370 tokens per prompt but only +6% wall-clock because the bottleneck is generation latency (one token at a time), not prompt processing. The accuracy gain is essentially free at inference time.
All unit tests mock OllamaClient — no Ollama server required.
make test # runs pytest on all test_*.py

| Test File | What It Covers |
|---|---|
| tests/test_ollama_client.py | HTTP client: generate, generate_raw, get_token_logprobs, timeout handling, retry logic |
| tests/test_ollama_lm_eval.py | Adapter: generate_until, loglikelihood, multi-token accumulation, out-of-top-20 fallback, empty continuation edge case |
| tests/test_cache.py | Cache: SHA-256 keying, hit/miss tracking, disk persistence, param-sensitivity |
Key edge cases covered: empty continuation returns (0.0, True), token not in top-20 applies -100.0 penalty, exception returns (-inf, False), multi-token continuation accumulates logprobs correctly.
ollamaeval/
├── ollama_client.py # HTTP client for Ollama /api/generate
├── eval_smoke_test.py # Pre-integration smoke tests
├── Makefile # One-liner entry points for every stage
├── requirements.txt # Loose dependency spec (broad compatibility)
├── requirements-lock.txt # Exact pinned env used to generate published results
│
├── tests/
│ ├── test_ollama_client.py # Unit tests — OllamaClient
│ ├── test_ollama_lm_eval.py # Unit tests — lm-eval adapter
│ └── test_cache.py # Unit tests — prompt cache
│
├── serve/
│ ├── serve.py # Part A: server lifecycle + sample generations
│ └── client.py # Part A: OllamaClient usage examples
│
├── eval_runner/
│ ├── run_eval.py # Part B: CLI wrapper around lm_eval.simple_evaluate()
│ ├── ollama_lm_eval_adapter.py # lm-eval LM interface (generate_until + loglikelihood)
│ ├── cache.py # SHA-256 keyed prompt cache (disk-backed)
│ ├── DESIGN.md # Architecture + loglikelihood implementation notes
│ ├── architecture.svg
│ ├── tasks/
│ │ ├── custom_qa.jsonl # 15-question factual QA benchmark
│ │ └── custom_qa.yaml # lm-eval task config (generate_until, exact_match)
│ └── results/
│ ├── summary.md # Aggregated benchmark scores
│ └── *.json # Per-run result files
│
├── perf/
│ ├── load_test.py # Part C: concurrent request load test
│ ├── analysis.ipynb # Latency / throughput analysis notebook
│ ├── metrics.csv # Raw per-request timing data
│ ├── summary.csv # Aggregated percentile stats
│ └── *.png # Charts (embedded above)
│
├── guardrails/
│ ├── validate.py # Part D: determinism + format + JSON schema checks
│ ├── guardrails.md # Design notes and top_p implementation caveat
│ └── results.json # Validation results
│
└── improve/
    ├── prepare_data.py # Part E: download MMLU subset from HuggingFace
    ├── infer.py # Run one inference config over test.jsonl
    ├── optimize_prompt.py # Prompt building utilities
    ├── eval.sh # Orchestrate all configs end-to-end
    ├── dev.jsonl # 25 dev examples (5/subject, used for few-shot)
    ├── test.jsonl # 250 test examples (50/subject, used for scoring)
    ├── report.md # Full improvement report with analysis
    └── results/
        ├── baseline.json
        ├── fewshot.json
        ├── format.json
        └── fewshot_format.json
- loglikelihood_rolling is not implemented. No standard benchmarks in this pipeline require it.
- top_p cannot be set to a custom value via Ollama's REST API options dict; it requires a Modelfile. The config stores top_p=1.0 for forward compatibility but it has no effect at runtime.
- Bootstrap CIs overlap at n=250. The [57.6%, 69.6%] and [60.4%, 72.0%] intervals for baseline and few-shot share significant range; a larger test set (n≥1000) would produce statistically separable estimates.
- Ollama is single-threaded for inference: concurrency > 1 measures queue latency, not parallel throughput. A batching-capable runtime (e.g., vLLM) would be needed for production horizontal scaling.
- The -100.0 penalty logprob is an approximation for tokens outside the top-20 candidates. The true logprob is unknown; this value is large enough to penalize the choice effectively while keeping the accumulated score finite, so the scoring loop can continue over the remaining tokens.



