OllamaEval

An end-to-end LLM evaluation pipeline connecting lm-evaluation-harness to a local Ollama server — benchmark scoring, load testing, prompt optimization, and output guardrails with zero cloud API calls.

Runs on Mac M3 8 GB. No cloud API keys required.

What This Is (and Why It's Interesting)

A custom lm-eval adapter built from scratch — implementing the full LM interface against Ollama's native API rather than relying on any off-the-shelf integration. The core challenge is loglikelihood scoring: lm-eval needs log-probabilities over token continuations for multiple-choice tasks like MMLU and HellaSwag, and getting those out of Ollama required working through several endpoints before landing on a working approach. See eval_runner/DESIGN.md for the full technical walkthrough.

The hardware constraint — Mac M3 with 8 GB unified memory — is an intentional design parameter. Phi-3 Mini (3.8B, ~2.5 GB) is the largest model that fits without paging, keeping performance measurements meaningful.

Key Results

Benchmark Scores

Results run on phi3 with temperature=0.0, seed=42, num_ctx=2048.

Benchmark	Evaluation Mode	n	Score
MMLU (57 subjects)	loglikelihood	20 / subject	68.3% accuracy
HellaSwag	loglikelihood	20	60.0% acc_norm
Custom QA (factual)	generate_until	15	46.7% exact match

Prompt Optimization Results

Four inference-time configurations tested on a 250-example MMLU subset (5 subjects, 50 each). 95% CIs via 1000-iteration bootstrap resampling (seed=42).

Config	Accuracy	95% CI	Δ vs Baseline
Baseline (0-shot)	63.6%	[57.6%, 69.6%]	—
Few-shot (5-shot)	66.4%	[60.4%, 72.0%]	+2.8 pp
Format hint only	63.6%	[58.0%, 69.6%]	+0.0 pp
Few-shot + format hint	65.6%	[60.0%, 71.2%]	+2.0 pp
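
The intervals in the table above come from bootstrap resampling. A minimal sketch of a percentile bootstrap over per-question 0/1 outcomes follows, assuming the same 1000 iterations and seed=42. The function name `bootstrap_ci` and the use of Python's `random` module are illustrative assumptions, not necessarily what `improve/infer.py` does.

```python
import random


def bootstrap_ci(correct, n_iter=1000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    means = []
    for _ in range(n_iter):
        # Resample n outcomes with replacement and record the accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(round((alpha / 2) * n_iter))]
    hi = means[int(round((1 - alpha / 2) * n_iter)) - 1]
    return sum(correct) / n, lo, hi


# 159 correct out of 250 gives the 63.6% baseline point estimate.
outcomes = [1] * 159 + [0] * 91
acc, lo, hi = bootstrap_ci(outcomes)
print(f"{acc:.1%} [{lo:.1%}, {hi:.1%}]")
```

The exact interval endpoints depend on the random number generator, so this sketch is not expected to match the published bounds digit for digit.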

Performance Profile (Phi-3 Mini, Mac M3 8 GB)

Concurrent load test using ThreadPoolExecutor. Note: Ollama processed requests one at a time in this setup, so concurrency > 1 measures server-side queue latency rather than parallel throughput. This is an important distinction when sizing inference infrastructure.

Concurrency	Prompt	p50 latency	p95 latency	Mean TPS
1	short	2,847 ms	4,160 ms	47.4 tok/s
1	long	6,226 ms	7,382 ms	34.3 tok/s
2	short	5,882 ms	6,859 ms	47.5 tok/s
4	short	10,086 ms	12,145 ms	30.5 tok/s

TTFT (Time To First Token): ~97 ms for short prompts, ~430 ms for long prompts. These values are consistent across all concurrency levels, which indicates that the bottleneck is generation, not prompt processing.

Architecture

The pipeline has four layers, each with a single responsibility:

Layer	File	Key Design Decision
Runner	`eval_runner/run_eval.py`	Wraps `lm_eval.simple_evaluate()` — full compatibility with all 300+ lm-eval tasks
Adapter	`eval_runner/ollama_lm_eval_adapter.py`	`@register_model("ollama")`, token-by-token logprob accumulation, -100.0 penalty for tokens outside top-20
Client	`ollama_client.py`	Thin HTTP wrapper, raw-mode vs generation-mode selection, single retry on timeout
Cache	`eval_runner/cache.py`	SHA-256 keyed, disk-backed JSON, `atexit` flush — makes re-runs nearly instant

The cache is essential: a single MMLU run (57 subjects × 20 examples × 4 answer choices × 1–3 tokens each) makes thousands of API calls. Without caching, the ablation study in Part E would be impractical to iterate on.
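
For scale: 57 × 20 × 4 = 4,560 scored continuations, and at 1–3 tokens each that is roughly 4,560 to 13,680 requests per MMLU run. The sketch below shows one way a SHA-256 keyed, disk-backed JSON cache with an `atexit` flush could look. The class name `PromptCache`, its methods, and the key layout are assumptions for illustration, not the contents of `eval_runner/cache.py`.

```python
import atexit
import hashlib
import json
import os
import tempfile


class PromptCache:
    """Hypothetical disk-backed cache keyed by a SHA-256 of the request."""

    def __init__(self, path):
        self.path = path
        self.hits = 0
        self.misses = 0
        self._data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self._data = json.load(f)
        atexit.register(self.flush)  # write to disk once, at interpreter exit

    @staticmethod
    def make_key(model, prompt, params):
        # sort_keys makes the key independent of dict ordering; any change
        # to model, prompt, or sampling params yields a different key.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = value

    def flush(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._data, f)


# Usage with a throwaway file so the example does not touch the repo.
cache = PromptCache(os.path.join(tempfile.mkdtemp(), "cache.json"))
key = PromptCache.make_key("phi3", "2+2=", {"temperature": 0.0, "seed": 42})
if cache.get(key) is None:
    cache.put(key, {"response": "4"})  # placeholder for a real model response
print(cache.get(key), cache.hits, cache.misses)
```

Including the sampling parameters in the key is what makes the cache param-sensitive: a run with a different seed or context size cannot silently reuse stale results.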

Pipeline

A — Serve

What it does: Starts Ollama, verifies the server is healthy, confirms the model is available, and runs sample generations to validate end-to-end connectivity.

make serve    # health check + 3 sample generations
make client   # demonstrate OllamaClient usage patterns
make smoke    # determinism, instruction following, latency checks

Technical detail: serve.py registers a process cleanup handler via atexit, so the server is gracefully terminated when the script exits rather than left orphaned.
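
The cleanup pattern described above can be sketched as follows. This is an illustration under assumptions, not the code in serve.py: `start_managed` is a hypothetical helper, and a sleeping Python child process stands in for the server so the example runs without Ollama installed.

```python
import atexit
import subprocess
import sys


def start_managed(cmd):
    """Start a child process and register a handler that stops it at exit."""
    proc = subprocess.Popen(cmd)

    def _cleanup():
        if proc.poll() is None:  # still running
            proc.terminate()     # ask politely first
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()      # force-stop if it ignores terminate
                proc.wait()

    atexit.register(_cleanup)
    return proc, _cleanup


# Stand-in child process; for the real server the command would be
# ["ollama", "serve"].
proc, cleanup = start_managed(
    [sys.executable, "-c", "import time; time.sleep(60)"]
)
cleanup()  # normally runs automatically at interpreter exit
print("child exited:", proc.poll() is not None)
```

Calling the handler twice is harmless because it checks `poll()` first, so an explicit shutdown and the `atexit` hook can coexist.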

B — Eval Runner

What it does: Runs lm-eval benchmarks against the local Ollama server via a custom @register_model("ollama") adapter.

make eval-mmlu        # MMLU (57 subjects, 20 examples each)
make eval-hellaswag   # HellaSwag (20 examples)
make eval-custom      # Custom QA benchmark (15 examples, exact match)

The adapter uses token-by-token logprob accumulation to score each answer choice. See eval_runner/DESIGN.md for the full implementation walkthrough.
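
A simplified sketch of token-by-token accumulation is shown here, including the -100.0 fallback and the edge cases covered by the unit tests. The callable `top_logprobs_fn` is a hypothetical stand-in for whatever client call returns the top candidate tokens and their logprobs, and treating `is_greedy` as "every token was the top candidate" is an assumption.

```python
import math

PENALTY = -100.0  # used when a token is missing from the returned candidates


def score_continuation(context, tokens, top_logprobs_fn):
    """Sum logprobs of `tokens` following `context`.

    top_logprobs_fn(prefix) -> {candidate_token: logprob} (hypothetical).
    Returns (total_logprob, is_greedy).
    """
    if not tokens:
        return 0.0, True  # empty continuation
    total, is_greedy, prefix = 0.0, True, context
    try:
        for tok in tokens:
            candidates = top_logprobs_fn(prefix)
            total += candidates.get(tok, PENALTY)
            # Greedy only if this token was the most likely candidate.
            if not candidates or max(candidates, key=candidates.get) != tok:
                is_greedy = False
            prefix += tok  # extend the prefix and score the next token
    except Exception:
        return -math.inf, False
    return total, is_greedy


def fake_top_logprobs(prefix):
    """Toy lookup table standing in for a live model."""
    table = {
        "Q: 2+2? Answer:": {" 4": -0.1, " 5": -3.0},
        "Q: 2+2? Answer: 4": {".": -0.2, "!": -2.0},
    }
    return table.get(prefix, {})


good = score_continuation("Q: 2+2? Answer:", [" 4", "."], fake_top_logprobs)
bad = score_continuation("Q: 2+2? Answer:", [" 7"], fake_top_logprobs)
print(good, bad)
```

For a multiple-choice task, each answer choice is scored this way and the choice with the highest total is taken as the prediction.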

Results are saved to eval_runner/results/ with timestamps.

C — Performance Profiling

What it does: Runs a concurrent load test across 24 configurations (3 concurrency levels × 2 prompt types × 2 stop-sequence settings × 2 cache states) and records per-request TTFT, total latency, and tokens/second.

make perf    # runs load test → perf/metrics.csv + perf/summary.csv

Key finding: TTFT stays flat at ~97 ms regardless of concurrency, but total latency scales linearly with concurrent requests — confirming that Ollama queues requests server-side. This means a single Ollama instance cannot benefit from request parallelism for latency-sensitive workloads; a pool of instances (or a batching-capable runtime like vLLM) would be needed for production.
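
The queueing effect can be reproduced without a model. In the sketch below, a lock stands in for a server that handles one request at a time and `time.sleep` stands in for generation. The names `fake_generate` and `run_load` are hypothetical and this is not the code in perf/load_test.py.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median

_server_lock = threading.Lock()  # models a server that serves one request at a time


def fake_generate(service_time=0.02):
    """Return client-observed latency: queue wait plus service time."""
    start = time.perf_counter()
    with _server_lock:
        time.sleep(service_time)  # stand-in for token generation
    return time.perf_counter() - start


def run_load(concurrency, n_requests=8):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: fake_generate(), range(n_requests)))
    return mean(latencies), median(latencies)


for c in (1, 2, 4):
    avg, p50 = run_load(c)
    print(f"concurrency={c} mean={avg * 1000:.0f} ms p50={p50 * 1000:.0f} ms")
```

Total wall-clock time is about the same at every concurrency level in this simulation; only the per-request latency grows, which mirrors the pattern in the measurements above.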

The Jupyter notebook perf/analysis.ipynb walks through the full analysis with charts.

D — Guardrails

What it does: Validates output reliability across three dimensions: determinism, format correctness, and JSON schema adherence.

make guardrails    # writes validation results to guardrails/results.json

Determinism: Runs the same 3 prompts 3 times each and asserts identical outputs. With temperature=0.0 the outputs are expected to be identical; this check verifies that the server is configured correctly before a full benchmark run.
Format validation: Regex-based checks on expected answer types (single letter, integer, city name, etc.).
JSON schema: Validates that structured outputs are valid JSON containing the expected keys (manual key-set validation, no external library).
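
The three checks can be sketched as small pure functions. The regex patterns, function names, and the choice to accept extra JSON keys are illustrative assumptions, not the contents of guardrails/validate.py.

```python
import json
import re

# Hypothetical patterns for two of the expected answer types.
FORMAT_PATTERNS = {
    "single_letter": re.compile(r"[A-D]"),
    "integer": re.compile(r"-?\d+"),
}


def check_format(output, kind):
    """True if the stripped output matches the pattern for `kind` in full."""
    return FORMAT_PATTERNS[kind].fullmatch(output.strip()) is not None


def check_json_keys(output, expected_keys):
    """True if output parses as a JSON object containing every expected key."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(expected_keys) <= set(obj)


def check_determinism(generate, prompt, runs=3):
    """True if `generate` returns the same text on every run."""
    return len({generate(prompt) for _ in range(runs)}) == 1


print(check_format(" B\n", "single_letter"))
print(check_json_keys('{"city": "Paris", "country": "France"}', ["city", "country"]))
print(check_determinism(lambda p: p.upper(), "hello"))
```

Here `generate` is any callable that maps a prompt to text, so the determinism check can be exercised with a stub in unit tests and with the real client before a benchmark run.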

Note: top_p cannot be set to a custom value via Ollama's REST API options dict — it requires a Modelfile. It is stored in the config for forward compatibility but has no runtime effect. The determinism guarantee relies on temperature=0.0 alone. See guardrails/guardrails.md for details.

E — Prompt Optimization

What it does: Tests inference-time accuracy improvements on a 250-example MMLU subset using two levers: few-shot prompting and format hints.

make improve    # downloads data, runs all 4 configs, prints comparison table

Or step by step:

python improve/prepare_data.py    # download MMLU subset → dev.jsonl / test.jsonl
bash improve/eval.sh phi3         # run all 4 configs, write improve/results/

Best result: 5-shot prompting → +2.8 pp (63.6% → 66.4%)

The gain is not because the model "learns" from the examples — it's because the formatted shots anchor the response pattern (Answer: X), shifting the token probability distribution away from non-letter tokens at the answer position. This is especially important for logprob-based evaluation, where a single tokenization difference can flip a correct answer to incorrect.
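
To make the anchoring effect concrete, here is a sketch of how a few-shot prompt with a consistent "Answer: X" pattern might be assembled. The template, the hint wording, and the function names are assumptions for illustration and may differ from improve/optimize_prompt.py.

```python
def format_question(question, choices, answer=None):
    """Render one question; a solved shot ends with 'Answer: X'."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(shots, question, choices, format_hint=False):
    """Concatenate an optional hint, solved shots, and the open question."""
    parts = []
    if format_hint:
        # Hypothetical hint text.
        parts.append("Respond with a single letter: A, B, C, or D.")
    parts += [format_question(q, c, a) for q, c, a in shots]
    parts.append(format_question(question, choices))
    return "\n\n".join(parts)


shots = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?", ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
prompt = build_prompt(
    shots, "What is 3 × 3?", ["6", "9", "12", "8"], format_hint=True
)
print(prompt)
```

Every shot ends in the same "Answer: X" form, and the prompt stops right after the final "Answer:", so the next token the model produces lands at the position where a letter is expected.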

Full analysis in improve/report.md.

Reproducing Results

All published results can be reproduced exactly on any Apple Silicon Mac with 8 GB of unified memory. The key reproducibility guarantees are built into the design:

Factor	Value	Where set
Model	`phi3` (3.8B, Ollama tag `phi3:latest`)	`Makefile`, all scripts
Inference	`temperature=0.0`, `seed=42`, `top_p=1.0`	`ollama_client.py`, `improve/infer.py`
Context window	`num_ctx=2048`	All scripts
Bootstrap seed	`seed=42`, 1000 iterations	`improve/infer.py`
Eval data	committed to repo	`improve/dev.jsonl`, `improve/test.jsonl`
Results	committed to repo	`eval_runner/results/`, `improve/results/`
Python env	pinned in `requirements-lock.txt`	—

# Install exact environment used to generate published results
pip install -r requirements-lock.txt

Results in eval_runner/results/ and improve/results/ are the reference outputs. Running the pipeline again should produce identical scores. The prompt cache at eval_runner/cache.json is excluded from the repo and regenerated automatically, so the first re-run will be slower; subsequent runs hit the cache and are nearly instant.

Quick Start

Prerequisites

macOS (tested on M3; Metal acceleration used automatically)
Ollama ≥ 0.17.0 installed and on your PATH
Python 3.14 (pyenv recommended for version management)

Setup

git clone https://github.com/your-username/local-llm-eval.git
cd local-llm-eval

# Pull the model
ollama pull phi3

# Create and activate a virtual environment
python3.14 -m venv venv
source venv/bin/activate

# Install exact pinned environment (recommended) or loose spec
pip install -r requirements-lock.txt   # reproducible
# pip install -r requirements.txt      # if 3.14 isn't available

Run

make serve          # start Ollama, verify health, run sample generations
make test           # unit tests (no Ollama server required)
make eval-mmlu      # MMLU benchmark (~10 min, 57 subjects)
make eval-hellaswag # HellaSwag benchmark
make eval-custom    # custom QA benchmark (exact match)
make perf           # load test + latency/throughput charts
make guardrails     # output validation (determinism, format, JSON schema)
make improve        # prompt optimization pipeline

Detailed Analysis

Per-Subject MMLU Results (Baseline vs Few-Shot)

Subject	Baseline	Few-Shot	Δ
High School Biology	70.0%	80.0%	+10.0 pp
College Computer Science	54.0%	62.0%	+8.0 pp
High School Mathematics	32.0%	34.0%	+2.0 pp
Marketing	88.0%	88.0%	0.0 pp
Philosophy	74.0%	68.0%	−6.0 pp

Biology and CS benefit most from few-shot context. Mathematics barely moves — phi3's symbolic reasoning ceiling is the binding constraint, not prompt format. Marketing is already near ceiling (88%). Philosophy regresses slightly because the dev-set examples introduce misleading context for edge-case ethical reasoning questions, illustrating that few-shot is not universally beneficial.

Before / After: Questions That Flipped Correct (21 flips, 14 regressions, net +7)

#	Subject	Question (truncated)	Correct	Baseline	Few-Shot
1	Math	"Powerful" integer: 392, 336, 300, or 297?	A	D	A
2	Math	Arithmetic sequences of odd integers summing to 240	B	A	B
3	Math	Polynomial expansion coefficient	B	D	B
4	Math	4 balls into 2 indistinguishable boxes	D	B	D
5	Math	Parallelogram angle B=110°, find degrees	D	A	D
6	Math	Probability drawing black-white pairs in order	D	A	D
7	Biology	Pike-cichlid predation — selective pressure on algae-eaters	C	D	C
8	Biology	What is a Barr body and its significance?	C	B	C
9	Biology	Mink fur genetics (brown dominant) — parental cross	C	D	C
10	Biology	Heterotroph hypothesis — event before oxygen photosynthesis	B	C	B
11	Biology	Which statement about variation is true?	D	B	D
12	CS	NoNicks OS — file-read time ratio comparison	B	D	B
13	CS	Which is NOT characteristic of good software design?	—	wrong	correct
14	CS	Database normalisation concept	—	wrong	correct
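
The flip and regression counts in the heading can be derived by comparing per-question correctness between two runs. The sketch below uses synthetic flags constructed only to match the published totals (159 baseline correct, 21 flips, 14 regressions); they are not the actual per-question results, and `count_flips` is a hypothetical helper.

```python
def count_flips(baseline_correct, fewshot_correct):
    """Return (flips, regressions, net) between two aligned runs."""
    pairs = list(zip(baseline_correct, fewshot_correct))
    flips = sum(1 for b, f in pairs if not b and f)        # wrong -> right
    regressions = sum(1 for b, f in pairs if b and not f)  # right -> wrong
    return flips, regressions, flips - regressions


# Synthetic flags, ordered for convenience, that reproduce the totals only.
baseline = [True] * 159 + [False] * 91
fewshot = [False] * 14 + [True] * 145 + [True] * 21 + [False] * 70

flips, regressions, net = count_flips(baseline, fewshot)
print(flips, regressions, net)
print(f"net gain: {net / len(baseline):.1%}")
```

A net of +7 questions out of 250 corresponds to the +2.8 pp gain reported for few-shot prompting.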

Cost / Latency Trade-Off

Metric	Baseline	Few-Shot	Change
Elapsed time (250 examples)	79 s	84 s	+6%
Approx. prompt tokens / question	~80	~450	~5.6×
API calls	250	250	—

Few-shot adds ~370 tokens per prompt but only +6% wall-clock because the bottleneck is generation latency (one token at a time), not prompt processing. The accuracy gain is essentially free at inference time.

Testing

All unit tests mock OllamaClient — no Ollama server required.

make test    # runs pytest on all test_*.py

Test File	What It Covers
`tests/test_ollama_client.py`	HTTP client: generate, generate_raw, get_token_logprobs, timeout handling, retry logic
`tests/test_ollama_lm_eval.py`	Adapter: generate_until, loglikelihood, multi-token accumulation, out-of-top-20 fallback, empty continuation edge case
`tests/test_cache.py`	Cache: SHA-256 keying, hit/miss tracking, disk persistence, param-sensitivity

Key edge cases covered: empty continuation returns (0.0, True), token not in top-20 applies -100.0 penalty, exception returns (-inf, False), multi-token continuation accumulates logprobs correctly.

Repository Structure

ollamaeval/
├── ollama_client.py              # HTTP client for Ollama /api/generate
├── eval_smoke_test.py            # Pre-integration smoke tests
├── Makefile                      # One-liner entry points for every stage
├── requirements.txt              # Loose dependency spec (broad compatibility)
├── requirements-lock.txt         # Exact pinned env used to generate published results
│
├── tests/
│   ├── test_ollama_client.py     # Unit tests — OllamaClient
│   ├── test_ollama_lm_eval.py    # Unit tests — lm-eval adapter
│   └── test_cache.py             # Unit tests — prompt cache
│
├── serve/
│   ├── serve.py                  # Part A: server lifecycle + sample generations
│   └── client.py                 # Part A: OllamaClient usage examples
│
├── eval_runner/
│   ├── run_eval.py               # Part B: CLI wrapper around lm_eval.simple_evaluate()
│   ├── ollama_lm_eval_adapter.py # lm-eval LM interface (generate_until + loglikelihood)
│   ├── cache.py                  # SHA-256 keyed prompt cache (disk-backed)
│   ├── DESIGN.md                 # Architecture + loglikelihood implementation notes
│   ├── architecture.svg
│   ├── tasks/
│   │   ├── custom_qa.jsonl       # 15-question factual QA benchmark
│   │   └── custom_qa.yaml        # lm-eval task config (generate_until, exact_match)
│   └── results/
│       ├── summary.md            # Aggregated benchmark scores
│       └── *.json                # Per-run result files
│
├── perf/
│   ├── load_test.py              # Part C: concurrent request load test
│   ├── analysis.ipynb            # Latency / throughput analysis notebook
│   ├── metrics.csv               # Raw per-request timing data
│   ├── summary.csv               # Aggregated percentile stats
│   └── *.png                     # Charts (embedded above)
│
├── guardrails/
│   ├── validate.py               # Part D: determinism + format + JSON schema checks
│   ├── guardrails.md             # Design notes and top_p implementation caveat
│   └── results.json              # Validation results
│
└── improve/
    ├── prepare_data.py           # Part E: download MMLU subset from HuggingFace
    ├── infer.py                  # Run one inference config over test.jsonl
    ├── optimize_prompt.py        # Prompt building utilities
    ├── eval.sh                   # Orchestrate all configs end-to-end
    ├── dev.jsonl                 # 25 dev examples (5/subject, used for few-shot)
    ├── test.jsonl                # 250 test examples (50/subject, used for scoring)
    ├── report.md                 # Full improvement report with analysis
    └── results/
        ├── baseline.json
        ├── fewshot.json
        ├── format.json
        └── fewshot_format.json

Limitations

loglikelihood_rolling is not implemented. No standard benchmarks in this pipeline require it.
top_p cannot be set to a custom value via Ollama's REST API options dict — it requires a Modelfile. The config stores top_p=1.0 for forward compatibility but it has no effect at runtime.
Bootstrap CIs overlap at n=250. The [57.6%, 69.6%] and [60.4%, 72.0%] intervals for baseline and few-shot overlap substantially, so the +2.8 pp gain cannot be distinguished from noise at this sample size; a larger test set (n≥1000) would narrow the intervals.
Ollama served inference requests one at a time in these tests: concurrency > 1 measures queue latency, not parallel throughput. A pool of instances or a batching-capable runtime (e.g., vLLM) would be needed for production throughput.
The -100.0 penalty logprob is an approximation for tokens outside the top-20 candidates. The true logprob is unknown; this value is large enough to penalize the choice effectively while keeping the accumulated sum finite, so the accumulation loop can continue over the remaining tokens.
