goyalmus/local-llm-eval

OllamaEval

An end-to-end LLM evaluation pipeline connecting lm-evaluation-harness to a local Ollama server — benchmark scoring, load testing, prompt optimization, and output guardrails with zero cloud API calls.

Runs on Mac M3 8 GB. No cloud API keys required.


What This Is (and Why It's Interesting)

A custom lm-eval adapter built from scratch — implementing the full LM interface against Ollama's native API rather than relying on any off-the-shelf integration. The core challenge is loglikelihood scoring: lm-eval needs log-probabilities over token continuations for multiple-choice tasks like MMLU and HellaSwag, and getting those out of Ollama required working through several endpoints before landing on a working approach. See eval_runner/DESIGN.md for the full technical walkthrough.

The hardware constraint — Mac M3 with 8 GB unified memory — is an intentional design parameter. Phi-3 Mini (3.8B, ~2.5 GB) is the largest model that fits without paging, keeping performance measurements meaningful.


Key Results

Benchmark Scores

Results were obtained with phi3 at temperature=0.0, seed=42, num_ctx=2048.

| Benchmark | Evaluation mode | n | Score |
|---|---|---|---|
| MMLU (57 subjects) | loglikelihood | 20 / subject | 68.3% accuracy |
| HellaSwag | loglikelihood | 20 | 60.0% acc_norm |
| Custom QA (factual) | generate_until | 15 | 46.7% exact match |
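
For orientation, the decoding settings listed above correspond to the `options` field of a request to Ollama's `/api/generate` endpoint. The following is a minimal sketch, not the repository's `ollama_client.py`: the helper names `build_generate_payload` and `generate` are hypothetical, and the default local address is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_payload(prompt: str, model: str = "phi3") -> dict:
    """Build a non-streaming request body with the decoding settings listed above."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.0, "seed": 42, "num_ctx": 2048},
    }


def generate(prompt: str, timeout: float = 60.0) -> str:
    """POST the payload and return the generated text (requires a running server)."""
    body = json.dumps(build_generate_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Keeping payload construction separate from the HTTP call makes the settings easy to unit-test without a server.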

Prompt Optimization Results

Four inference-time configurations tested on a 250-example MMLU subset (5 subjects, 50 each). 95% CIs via 1000-iteration bootstrap resampling (seed=42).

| Config | Accuracy | 95% CI | Δ vs Baseline |
|---|---|---|---|
| Baseline (0-shot) | 63.6% | [57.6%, 69.6%] | |
| Few-shot (5-shot) | 66.4% | [60.4%, 72.0%] | +2.8 pp |
| Format hint only | 63.6% | [58.0%, 69.6%] | +0.0 pp |
| Few-shot + format hint | 65.6% | [60.0%, 71.2%] | +2.0 pp |
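
The intervals above come from bootstrap resampling. As an illustration of the general technique, here is a percentile-bootstrap sketch using only the standard library. It is an assumption about the method, not a copy of `improve/infer.py`; the function name `bootstrap_ci` is hypothetical, and because the random number generator may differ from the one used in the repository, the bounds it prints need not match the table digit for digit.

```python
import random


def bootstrap_ci(outcomes, iterations=1000, seed=42, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(iterations):
        sample = rng.choices(outcomes, k=n)  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


# 159 correct out of 250 corresponds to the 63.6% baseline accuracy.
baseline = [1] * 159 + [0] * 91
low, high = bootstrap_ci(baseline)
print(f"accuracy={sum(baseline) / len(baseline):.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```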

Performance Profile (Phi-3 Mini, Mac M3 8 GB)

Concurrent load test using ThreadPoolExecutor. Note: Ollama processed requests one at a time in this setup, so concurrency > 1 measures server-side queue latency rather than parallel throughput. This distinction matters when sizing inference infrastructure.

| Concurrency | Prompt | p50 latency | p95 latency | Mean TPS |
|---|---|---|---|---|
| 1 | short | 2,847 ms | 4,160 ms | 47.4 tok/s |
| 1 | long | 6,226 ms | 7,382 ms | 34.3 tok/s |
| 2 | short | 5,882 ms | 6,859 ms | 47.5 tok/s |
| 4 | short | 10,086 ms | 12,145 ms | 30.5 tok/s |

TTFT (Time To First Token): ~97 ms for short prompts and ~430 ms for long prompts, consistent across all concurrency levels. This indicates that the bottleneck is generation, not prompt processing.
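
A load test of this shape can be sketched with `ThreadPoolExecutor` and `statistics.quantiles`. The code below is illustrative and is not `perf/load_test.py`: `send_request` is a hypothetical callable that performs one request, and `fake_request` is a stand-in that sleeps so the example runs without a server. It therefore does not reproduce server-side queueing.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request, prompt):
    """Return wall-clock latency in milliseconds for one request."""
    start = time.perf_counter()
    send_request(prompt)
    return (time.perf_counter() - start) * 1000.0


def run_load_test(send_request, prompt, concurrency, total_requests):
    """Issue total_requests calls with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(send_request, prompt), range(total_requests))
        )
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "count": len(latencies)}


def fake_request(prompt):
    """Stand-in for a real HTTP call so the sketch runs without a server."""
    time.sleep(0.01)


print(run_load_test(fake_request, "short prompt", concurrency=2, total_requests=8))
```

Against a real server, `send_request` would wrap the HTTP client call, and latency would be measured per request exactly as here.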

Charts (perf/*.png): Latency Distribution, TTFT Comparison, Throughput, Cache & Stop Impact.

Architecture

Architecture diagram: eval_runner/architecture.svg

The pipeline has four layers, each with a single responsibility:

| Layer | File | Key Design Decision |
|---|---|---|
| Runner | eval_runner/run_eval.py | Wraps lm_eval.simple_evaluate(): full compatibility with all 300+ lm-eval tasks |
| Adapter | eval_runner/ollama_lm_eval_adapter.py | @register_model("ollama"), token-by-token logprob accumulation, -100.0 penalty for tokens outside top-20 |
| Client | ollama_client.py | Thin HTTP wrapper, raw-mode vs generation-mode selection, single retry on timeout |
| Cache | eval_runner/cache.py | SHA-256 keyed, disk-backed JSON, atexit flush; makes re-runs nearly instant |

The cache is essential: a single MMLU run (57 subjects × 20 examples × 4 answer choices × 1–3 tokens each) makes thousands of API calls. Without caching, the ablation study in Part E would be impractical to iterate on.
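
To make the cache design concrete, here is a minimal sketch of a SHA-256 keyed, disk-backed JSON cache with an `atexit` flush. It is written under assumptions and is not the code in `eval_runner/cache.py`: the class name `PromptCache`, its method names, and the choice of key fields (model, prompt, params) are all hypothetical.

```python
import atexit
import hashlib
import json
import os


class PromptCache:
    """Disk-backed JSON cache keyed by a SHA-256 hash of the request."""

    def __init__(self, path="cache.json"):
        self.path = path
        self.hits = 0
        self.misses = 0
        self._store = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self._store = json.load(handle)
        atexit.register(self.flush)  # persist on interpreter exit

    @staticmethod
    def make_key(model, prompt, params):
        # sort_keys makes the key independent of dict ordering
        blob = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._store[key] = value

    def flush(self):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self._store, handle)
```

Hashing the decoding parameters together with the prompt means a change to, say, the temperature produces a different key rather than a stale hit.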


Pipeline

A — Serve

What it does: Starts Ollama, verifies the server is healthy, confirms the model is available, and runs sample generations to validate end-to-end connectivity.

make serve    # health check + 3 sample generations
make client   # demonstrate OllamaClient usage patterns
make smoke    # determinism, instruction following, latency checks

Technical detail: serve.py registers a process cleanup handler via atexit, so the server is gracefully terminated when the script exits rather than left orphaned.
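
The cleanup pattern described above can be sketched as follows. This is an illustration under assumptions, not the contents of `serve.py`: the function names `start_server` and `stop_server` and the 5-second grace period are hypothetical.

```python
import atexit
import subprocess


def start_server(command=("ollama", "serve")):
    """Launch the server as a child process and register cleanup for exit."""
    process = subprocess.Popen(list(command))

    def stop_server():
        if process.poll() is None:  # still running
            process.terminate()  # polite shutdown request (SIGTERM on POSIX)
            try:
                process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                process.kill()  # force-stop if it ignores the request
                process.wait()

    atexit.register(stop_server)
    return process, stop_server
```

Returning `stop_server` alongside the process lets a caller shut the server down early; the `atexit` hook then becomes a no-op because `poll()` no longer returns `None`.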


B — Eval Runner

What it does: Runs lm-eval benchmarks against the local Ollama server via a custom @register_model("ollama") adapter.

make eval-mmlu        # MMLU (57 subjects, 20 examples each)
make eval-hellaswag   # HellaSwag (20 examples)
make eval-custom      # Custom QA benchmark (15 examples, exact match)

The adapter uses token-by-token logprob accumulation to score each answer choice. See eval_runner/DESIGN.md for the full implementation walkthrough.
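
As a rough illustration of token-by-token accumulation, the sketch below scores one continuation. It is a simplified model of the idea, not the adapter's code: `get_top_logprobs` is a hypothetical callable that returns the top candidate next tokens and their log-probabilities for a given prefix, and tokens are represented as plain strings.

```python
import math


def score_continuation(context_tokens, continuation_tokens, get_top_logprobs, penalty=-100.0):
    """Sum per-token logprobs of a continuation; return (logprob, is_greedy).

    get_top_logprobs(prefix_tokens) is a hypothetical callable returning a dict
    that maps each top-k candidate next token to its log-probability.
    """
    if not continuation_tokens:
        return 0.0, True  # nothing to score
    total = 0.0
    is_greedy = True
    prefix = list(context_tokens)
    try:
        for token in continuation_tokens:
            candidates = get_top_logprobs(prefix)
            if token in candidates:
                total += candidates[token]
            else:
                total += penalty  # true logprob unknown outside the top-k list
            # greedy only if every target token is the most likely candidate
            if not candidates or max(candidates, key=candidates.get) != token:
                is_greedy = False
            prefix.append(token)  # condition the next step on the target token
    except Exception:
        return -math.inf, False  # a failed request scores the choice as impossible
    return total, is_greedy
```

For a multiple-choice item, each answer choice is scored this way and the choice with the highest total is selected.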

Results are saved to eval_runner/results/ with timestamps.


C — Performance Profiling

What it does: Runs a concurrent load test across 24 configurations (3 concurrency levels × 2 prompt types × 2 stop-sequence settings × 2 cache states) and records per-request TTFT, total latency, and tokens/second.

make perf    # runs load test → perf/metrics.csv + perf/summary.csv
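
The 24 configurations are the Cartesian product of the four factors. A small sketch of how such a grid can be enumerated follows; the concurrency levels are taken from the results table, while the stop-sequence and cache-state labels are assumed names for illustration.

```python
from itertools import product

CONCURRENCY_LEVELS = (1, 2, 4)      # levels shown in the results table
PROMPT_TYPES = ("short", "long")
STOP_SETTINGS = (False, True)       # assumed: without / with a stop sequence
CACHE_STATES = ("cold", "warm")     # assumed labels for the two cache states

configs = [
    {"concurrency": c, "prompt": p, "use_stop": s, "cache": k}
    for c, p, s, k in product(CONCURRENCY_LEVELS, PROMPT_TYPES, STOP_SETTINGS, CACHE_STATES)
]
print(len(configs))  # 3 * 2 * 2 * 2 = 24
```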

Key finding: TTFT stays flat at ~97 ms regardless of concurrency, but total latency scales linearly with concurrent requests — confirming that Ollama queues requests server-side. This means a single Ollama instance cannot benefit from request parallelism for latency-sensitive workloads; a pool of instances (or a batching-capable runtime like vLLM) would be needed for production.

The Jupyter notebook perf/analysis.ipynb walks through the full analysis with charts.


D — Guardrails

What it does: Validates output reliability across three dimensions: determinism, format correctness, and JSON schema adherence.

make guardrails    # writes validation results to guardrails/results.json

  • Determinism: Runs the same 3 prompts 3 times each and asserts identical outputs. Identical outputs are expected with temperature=0.0; this check verifies the server is configured correctly before a full benchmark run.
  • Format validation: Regex-based checks on expected answer types (single letter, integer, city name, etc.).
  • JSON schema: Validates that structured outputs are valid JSON containing the expected keys (manual key-set validation, no external library).
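
The three checks above can be sketched with the standard library alone. This is an illustration, not `guardrails/validate.py`: the function names, the `generate` callable, and the two example patterns are hypothetical.

```python
import json
import re


def check_determinism(generate, prompt, runs=3):
    """True if repeated calls with the same prompt return identical text."""
    outputs = {generate(prompt) for _ in range(runs)}
    return len(outputs) == 1


def check_format(text, pattern):
    """True if the stripped output fully matches the expected pattern."""
    return re.fullmatch(pattern, text.strip()) is not None


def check_json_keys(text, expected_keys):
    """True if text parses to a JSON object containing all expected keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(expected_keys) <= set(data)


SINGLE_LETTER = r"[A-D]"   # hypothetical pattern for multiple-choice answers
INTEGER = r"-?\d+"         # hypothetical pattern for integer answers
```

Using `re.fullmatch` rather than `re.search` rejects outputs that contain the expected answer surrounded by extra text.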

Note: top_p cannot be set to a custom value via Ollama's REST API options dict — it requires a Modelfile. It is stored in the config for forward compatibility but has no runtime effect. The determinism guarantee relies on temperature=0.0 alone. See guardrails/guardrails.md for details.


E — Prompt Optimization

What it does: Tests inference-time accuracy improvements on a 250-example MMLU subset using two levers: few-shot prompting and format hints.

make improve    # downloads data, runs all 4 configs, prints comparison table

Or step by step:

python improve/prepare_data.py    # download MMLU subset → dev.jsonl / test.jsonl
bash improve/eval.sh phi3         # run all 4 configs, write improve/results/

Best result: 5-shot prompting → +2.8 pp (63.6% → 66.4%)

The gain is not because the model "learns" from the examples — it's because the formatted shots anchor the response pattern (Answer: X), shifting the token probability distribution away from non-letter tokens at the answer position. This is especially important for logprob-based evaluation, where a single tokenization difference can flip a correct answer to incorrect.
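
To show what "formatted shots" means in practice, here is a minimal prompt-building sketch. The layout (lettered choices followed by an `Answer:` line) is an assumption based on the `Answer: X` pattern mentioned above; the function names are hypothetical and this is not the code in `improve/optimize_prompt.py`.

```python
LETTERS = "ABCD"


def format_question(question, choices, answer=None):
    """Render one multiple-choice item; include the answer only for shots."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(shots, question, choices):
    """Concatenate solved shots, then the unsolved target question."""
    blocks = [format_question(s["question"], s["choices"], s["answer"]) for s in shots]
    blocks.append(format_question(question, choices))
    return "\n\n".join(blocks)


shots = [{"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
print(build_prompt(shots, "3 + 3 = ?", ["5", "6", "7", "8"]))
```

Because every shot ends with `Answer: <letter>` and the target ends with a bare `Answer:`, the most natural next token is a single answer letter.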

Full analysis in improve/report.md.


Reproducing Results

All published results are designed to be reproducible on any Apple Silicon Mac with 8 GB of unified memory. The key reproducibility controls are built into the design:

| Factor | Value | Where set |
|---|---|---|
| Model | phi3 (3.8B, Ollama tag phi3:latest) | Makefile, all scripts |
| Inference | temperature=0.0, seed=42, top_p=1.0 | ollama_client.py, improve/infer.py |
| Context window | num_ctx=2048 | All scripts |
| Bootstrap seed | seed=42, 1000 iterations | improve/infer.py |
| Eval data | committed to repo | improve/dev.jsonl, improve/test.jsonl |
| Results | committed to repo | eval_runner/results/, improve/results/ |
| Python env | pinned in requirements-lock.txt | |

# Install exact environment used to generate published results
pip install -r requirements-lock.txt

Results in eval_runner/results/ and improve/results/ are the reference outputs. Running the pipeline again should produce identical scores. The prompt cache at eval_runner/cache.json is excluded from the repo and regenerated automatically, so the first re-run will be slower, but subsequent runs will be nearly instant.


Quick Start

Prerequisites

  • macOS (tested on M3; Metal acceleration used automatically)
  • Ollama ≥ 0.17.0 installed and on your PATH
  • Python 3.14 (pyenv recommended for version management)

Setup

git clone https://github.com/your-username/local-llm-eval.git
cd local-llm-eval

# Pull the model
ollama pull phi3

# Create and activate a virtual environment
python3.14 -m venv venv
source venv/bin/activate

# Install exact pinned environment (recommended) or loose spec
pip install -r requirements-lock.txt   # reproducible
# pip install -r requirements.txt      # if 3.14 isn't available

Run

make serve          # start Ollama, verify health, run sample generations
make test           # unit tests (no Ollama server required)
make eval-mmlu      # MMLU benchmark (~10 min, 57 subjects)
make eval-hellaswag # HellaSwag benchmark
make eval-custom    # custom QA benchmark (exact match)
make perf           # load test + latency/throughput charts
make guardrails     # output validation (determinism, format, JSON schema)
make improve        # prompt optimization pipeline

Detailed Analysis

Per-Subject MMLU Results (Baseline vs Few-Shot)

| Subject | Baseline | Few-Shot | Δ |
|---|---|---|---|
| High School Biology | 70.0% | 80.0% | +10.0 pp |
| College Computer Science | 54.0% | 62.0% | +8.0 pp |
| High School Mathematics | 32.0% | 34.0% | +2.0 pp |
| Marketing | 88.0% | 88.0% | 0.0 pp |
| Philosophy | 74.0% | 68.0% | −6.0 pp |

Biology and CS benefit most from few-shot context. Mathematics barely moves, which suggests phi3's symbolic reasoning ceiling is the binding constraint, not prompt format. Marketing is already near ceiling (88%). Philosophy regresses by 6 pp, plausibly because the dev-set examples introduce misleading context for edge-case ethical reasoning questions; few-shot prompting is not universally beneficial.

Before / After: Questions That Flipped Correct (21 flips, 14 regressions, net +7)

| # | Subject | Question (truncated) | Correct | Baseline | Few-Shot |
|---|---|---|---|---|---|
| 1 | Math | "Powerful" integer: 392, 336, 300, or 297? | A | D | A |
| 2 | Math | Arithmetic sequences of odd integers summing to 240 | B | A | B |
| 3 | Math | Polynomial expansion coefficient | B | D | B |
| 4 | Math | 4 balls into 2 indistinguishable boxes | D | B | D |
| 5 | Math | Parallelogram angle B=110°, find degrees | D | A | D |
| 6 | Math | Probability drawing black-white pairs in order | D | A | D |
| 7 | Biology | Pike-cichlid predation: selective pressure on algae-eaters | C | D | C |
| 8 | Biology | What is a Barr body and its significance? | C | B | C |
| 9 | Biology | Mink fur genetics (brown dominant): parental cross | C | D | C |
| 10 | Biology | Heterotroph hypothesis: event before oxygen photosynthesis | B | C | B |
| 11 | Biology | Which statement about variation is true? | D | B | D |
| 12 | CS | NoNicks OS: file-read time ratio comparison | B | D | B |
| 13 | CS | Which is NOT characteristic of good software design? | | wrong | correct |
| 14 | CS | Database normalisation concept | | wrong | correct |

Cost / Latency Trade-Off

| Metric | Baseline | Few-Shot | Change |
|---|---|---|---|
| Elapsed time (250 examples) | 79 s | 84 s | +6% |
| Approx. prompt tokens / question | ~80 | ~450 | ~5.6× |
| API calls | 250 | 250 | |

Few-shot adds ~370 tokens per prompt but only +6% wall-clock because the bottleneck is generation latency (one token at a time), not prompt processing. The accuracy gain is essentially free at inference time.


Testing

All unit tests mock OllamaClient — no Ollama server required.

make test    # runs pytest on all test_*.py

| Test File | What It Covers |
|---|---|
| tests/test_ollama_client.py | HTTP client: generate, generate_raw, get_token_logprobs, timeout handling, retry logic |
| tests/test_ollama_lm_eval.py | Adapter: generate_until, loglikelihood, multi-token accumulation, out-of-top-20 fallback, empty continuation edge case |
| tests/test_cache.py | Cache: SHA-256 keying, hit/miss tracking, disk persistence, param-sensitivity |

Key edge cases covered: empty continuation returns (0.0, True), token not in top-20 applies -100.0 penalty, exception returns (-inf, False), multi-token continuation accumulates logprobs correctly.


Repository Structure

ollamaeval/
├── ollama_client.py              # HTTP client for Ollama /api/generate
├── eval_smoke_test.py            # Pre-integration smoke tests
├── Makefile                      # One-liner entry points for every stage
├── requirements.txt              # Loose dependency spec (broad compatibility)
├── requirements-lock.txt         # Exact pinned env used to generate published results
│
├── tests/
│   ├── test_ollama_client.py     # Unit tests — OllamaClient
│   ├── test_ollama_lm_eval.py    # Unit tests — lm-eval adapter
│   └── test_cache.py             # Unit tests — prompt cache
│
├── serve/
│   ├── serve.py                  # Part A: server lifecycle + sample generations
│   └── client.py                 # Part A: OllamaClient usage examples
│
├── eval_runner/
│   ├── run_eval.py               # Part B: CLI wrapper around lm_eval.simple_evaluate()
│   ├── ollama_lm_eval_adapter.py # lm-eval LM interface (generate_until + loglikelihood)
│   ├── cache.py                  # SHA-256 keyed prompt cache (disk-backed)
│   ├── DESIGN.md                 # Architecture + loglikelihood implementation notes
│   ├── architecture.svg
│   ├── tasks/
│   │   ├── custom_qa.jsonl       # 15-question factual QA benchmark
│   │   └── custom_qa.yaml        # lm-eval task config (generate_until, exact_match)
│   └── results/
│       ├── summary.md            # Aggregated benchmark scores
│       └── *.json                # Per-run result files
│
├── perf/
│   ├── load_test.py              # Part C: concurrent request load test
│   ├── analysis.ipynb            # Latency / throughput analysis notebook
│   ├── metrics.csv               # Raw per-request timing data
│   ├── summary.csv               # Aggregated percentile stats
│   └── *.png                     # Charts (embedded above)
│
├── guardrails/
│   ├── validate.py               # Part D: determinism + format + JSON schema checks
│   ├── guardrails.md             # Design notes and top_p implementation caveat
│   └── results.json              # Validation results
│
└── improve/
    ├── prepare_data.py           # Part E: download MMLU subset from HuggingFace
    ├── infer.py                  # Run one inference config over test.jsonl
    ├── optimize_prompt.py        # Prompt building utilities
    ├── eval.sh                   # Orchestrate all configs end-to-end
    ├── dev.jsonl                 # 25 dev examples (5/subject, used for few-shot)
    ├── test.jsonl                # 250 test examples (50/subject, used for scoring)
    ├── report.md                 # Full improvement report with analysis
    └── results/
        ├── baseline.json
        ├── fewshot.json
        ├── format.json
        └── fewshot_format.json

Limitations

  • loglikelihood_rolling is not implemented. No standard benchmarks in this pipeline require it.
  • top_p cannot be set to a custom value via Ollama's REST API options dict — it requires a Modelfile. The config stores top_p=1.0 for forward compatibility but it has no effect at runtime.
  • Bootstrap CIs overlap at n=250. The [57.6%, 69.6%] and [60.4%, 72.0%] intervals for baseline and few-shot share most of their range, so the +2.8 pp gain is not statistically distinguishable from zero at this sample size; a larger test set (n≥1000) would narrow the intervals.
  • Ollama handled inference requests serially in this setup: concurrency > 1 measures queue latency, not parallel throughput. A batching-capable runtime (e.g., vLLM) would be needed for production horizontal scaling.
  • The -100.0 penalty logprob is an approximation for tokens outside the top-20 candidates. The true logprob is unknown; this value is large enough to penalize the choice effectively while keeping the running sum finite, so the accumulation loop can continue.
