Lightweight NLP/LLM evaluation toolkit — metrics, LLM-as-Judge, significance testing, prompt robustness
Updated Jun 7, 2026 - Python
Evaluation framework for self-hosted LLMs. Systematic prompt ablation (baseline, CoT, few-shot, self-consistency voting) on Llama 3.1 8B via lm-evaluation-harness, with Wilson confidence-interval analysis, determinism validation, and load testing under concurrency. Found that chain-of-thought prompting degrades accuracy by 25 percentage points at small scale.
Custom evaluation harness with domain-specific benchmarks for enterprise LLMs
Task-dependent benchmark gap between /v1/completions and /v1/chat/completions on instruction-tuned LLMs -- two case studies on Qwen3.x GGUFs, reproduction recipe, and probes.
A custom lm-eval adapter built from scratch — implementing the full LM interface against Ollama's native API rather than relying on any off-the-shelf integration.
Audit toolkit for trustworthy LLM-merging evaluation. Companion to the 'Silent Contamination' paper.
Apple Silicon LLM benchmark harness for local runtimes and OpenAI-compatible endpoints: speed, memory, eval coverage, and public reports.
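One of the descriptions above mentions Wilson confidence intervals for benchmark accuracy. As a rough illustration only (not code from any listed repository), the Wilson score interval for a binomial accuracy estimate can be sketched as follows. The function name `wilson_interval` and the default `z = 1.96` (roughly a 95% interval) are assumptions chosen for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z=1.96 approximates a 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials          # observed accuracy
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z2 / (4.0 * trials * trials))
    ) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Example: 50 correct answers out of 100 benchmark items.
    low, high = wilson_interval(50, 100)
    print(f"accuracy 0.50, interval [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when accuracy is near 0 or 1 or when the benchmark has few items, which is a common reason to prefer it for small evaluation sets.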