lm-eval-harness

Here are 7 public repositories matching this topic...

liodon-ai / juryeval

Lightweight NLP/LLM evaluation toolkit — metrics, LLM-as-Judge, significance testing, prompt robustness

python nlp bootstrap benchmark metrics evaluation llm prompt-testing llm-judge lm-eval-harness

Updated Jun 7, 2026
Python

jameswniu / self-hosted-llm-evals-lab

Evaluation framework for self-hosted LLMs. Systematic prompt ablation (baseline, CoT, few-shot, self-consistency voting) on Llama 3.1 8B via lm-evaluation-harness, with Wilson confidence-interval analysis, determinism validation, and load testing under concurrency. Reports that chain-of-thought prompting degrades accuracy by 25 percentage points at small scale.

benchmark natural-language-processing load-testing self-hosted statistical-analysis llama self-consistency determinism ablation-study prompt-engineering chain-of-thought ollama llm-evaluation lm-eval-harness

Updated Mar 9, 2026
Python

fikreab-s / eval-harness-custom-benchmarks

Custom evaluation harness with domain-specific benchmarks for enterprise LLMs

nlp benchmarking evaluation llm lm-eval-harness

Updated May 9, 2026
Python

bit-incarnas / chat-vs-raw-methodology

Task-dependent benchmark gap between /v1/completions and /v1/chat/completions on instruction-tuned LLMs -- two case studies on Qwen3.x GGUFs, reproduction recipe, and probes.

benchmarking methodology instruction-following llm llama-cpp qwen gguf chat-completions lm-eval-harness ifeval

Updated May 16, 2026
Python

goyalmus / local-llm-eval

A custom lm-eval adapter built from scratch — implementing the full LM interface against Ollama's native API rather than relying on any off-the-shelf integration.

evaluation llm ollama lm-eval-harness

Updated Mar 13, 2026
Python

telleroutlook / evomerge-framework

Audit toolkit for trustworthy LLM-merging evaluation. Companion to the 'Silent Contamination' paper.

benchmark paper audit reproducibility conformal-prediction model-merging gsm8k llm-evaluation mcnemar-test lm-eval-harness

Updated Jun 5, 2026
Python

haxlys / llm-bench

Apple Silicon LLM benchmark harness for local runtimes and OpenAI-compatible endpoints: speed, memory, eval coverage, and public reports.

python macos benchmark gemma mlx huggingface streamlit apple-silicon llm llama-cpp gguf lm-eval-harness

Updated May 22, 2026
Python
