WBChain3/eval_guard

eval-guard

How do you know your AI isn't getting worse at reading your documents?

What this is

eval-guard runs your AI model through four tests on a schedule and tracks the scores over time. If something changes, such as a model update, configuration drift, or a new document type, you see it in the dashboard before your clients do.

You run a single command, the tests run against your local model, and the results are saved automatically. Open the dashboard whenever you want to check the trend.

What it measures

  • Factuality — Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
  • Consistency — Measures whether the model gives similar answers when the same question is asked in several different ways.
  • Context Adherence — Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
  • Latency — Measures how fast the model responds and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric.
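To make the Consistency idea concrete, here is a minimal sketch of one way to score agreement between answers to paraphrased questions. It is an illustration only: the function name consistency_score and the use of character-level similarity from Python's difflib are assumptions, not a description of how eval-guard computes its score.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across answers to paraphrased questions.

    Hypothetical helper for illustration; not part of eval-guard.
    """
    # Normalize case and whitespace so trivial differences do not count.
    normalized = [" ".join(a.lower().split()) for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

Character-level similarity is crude: two answers can mean the same thing with very different wording. An embedding-based comparison would probably track meaning better, at the cost of an extra model dependency.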

What you need to run it

  • Python 3.12
  • Ollama installed and running on your machine
  • At least one model pulled locally (e.g. ollama pull llama3.1:8b)

How to run it

Install eval-guard and its dependencies:

pip install -e .

Run all four benchmarks:

python run_eval.py --model llama3.1:8b --benchmark all
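If you want to repeat that command at a fixed interval from Python, the following is a minimal sketch. It assumes run_eval.py is in the current working directory and uses only the two flags shown above; build_command and run_on_interval are illustrative names, not part of eval-guard.

```python
import subprocess
import sys
import time


def build_command(model: str, benchmark: str = "all") -> list[str]:
    """Build the run_eval.py invocation using only the flags shown in this README."""
    return [sys.executable, "run_eval.py", "--model", model, "--benchmark", benchmark]


def run_on_interval(model: str, interval_seconds: float, runs: int) -> list[int]:
    """Run the evaluation `runs` times, sleeping between runs; return the exit codes."""
    exit_codes = []
    for i in range(runs):
        exit_codes.append(subprocess.run(build_command(model)).returncode)
        if i < runs - 1:
            time.sleep(interval_seconds)  # no sleep needed after the final run
    return exit_codes


# Hypothetical usage: once a day for a week.
# run_on_interval("llama3.1:8b", interval_seconds=24 * 60 * 60, runs=7)
```

A cron job or systemd timer that calls the same command would work equally well and survives reboots, which a long-running Python loop does not.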

Launch the dashboard to see results over time:

streamlit run run_dashboard.py

What the scores mean

| Benchmark | Good score | Bad score | What a bad score means for your business |
| --- | --- | --- | --- |
| Factuality | 1.0 | Below 0.8 | Your model is giving wrong answers. Clients lose trust. |
| Consistency | Above 0.5 | Below 0.3 | Your model gives different answers to the same question depending on how you ask. Confusing and unreliable. |
| Context Adherence | Above 0.8 | Below 0.2 | Your model invents facts that were never in your documents. This is how hallucinations become client-facing. |
| Latency | Above 0.8 | Below 0.3 | Your model is responding slowly. Clients are waiting. |
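
The thresholds listed above can be applied mechanically. The sketch below is a hypothetical helper, not part of eval-guard; the "borderline" label for scores that fall between the good and bad thresholds is an assumption, since the table does not name that band.

```python
# Thresholds copied from the table above: (good threshold, bad threshold).
THRESHOLDS = {
    "Factuality": (1.0, 0.8),
    "Consistency": (0.5, 0.3),
    "Context Adherence": (0.8, 0.2),
    "Latency": (0.8, 0.3),
}


def classify(benchmark: str, score: float) -> str:
    """Return "good", "bad", or "borderline" (assumed label) for a score."""
    good, bad = THRESHOLDS[benchmark]
    if score < bad:
        return "bad"
    # Factuality's good score is exactly 1.0; the others must be above the threshold.
    is_good = score >= good if benchmark == "Factuality" else score > good
    return "good" if is_good else "borderline"
```

For example, classify("Consistency", 0.4) falls between the two thresholds and comes back as "borderline".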

The database is local

The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
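
To check what has been stored without opening the dashboard, you could inspect the file directly. The sketch below assumes eval_guard.db is a SQLite database, which the .db extension suggests but this README does not state. It only lists table names and makes no assumption about the schema; list_tables is an illustrative name.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str = "eval_guard.db") -> list[str]:
    """Return table names in the database, or [] if the file does not exist yet."""
    path = Path(db_path)
    if not path.exists():
        return []  # no benchmark has been run yet
    # Open read-only so inspecting never creates or modifies the file.
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```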

About

Scheduled evaluation harness that tracks factuality, consistency, context adherence, and latency for local LLM deployments.
