eval-guard

How do you know your AI isn't getting worse at reading your documents?

What this is

eval-guard runs your AI model through four tests on a schedule and tracks the scores over time. If something changes — a model update, a configuration drift, a new document type — you see it in the dashboard before your clients do.

You run a single command, the tests run against your local model, and the results are saved automatically. Open the dashboard whenever you want to check the trend.

What it measures

Factuality — Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
Consistency — Measures whether the model gives similar answers when the same question is asked in several different ways.
Context Adherence — Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
Latency — Measures how fast the model responds and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric.
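
To make the Consistency idea concrete, here is a minimal sketch that averages pairwise text similarity across the answers a model gives to rephrased versions of one question. It is an illustration only: the function name consistency_score is hypothetical, and using difflib for similarity is an assumption, not necessarily how eval-guard scores this benchmark.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across a list of answers.

    Hypothetical helper for illustration; eval-guard's real metric may differ.
    """
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not count as disagreement.
    normalized = [" ".join(a.lower().split()) for a in answers]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        # Fewer than two answers: there is nothing to disagree with.
        return 1.0
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)
```

Character-level similarity is a crude proxy: two answers can be worded differently and still agree, so a production metric would more likely compare meaning than surface text.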

What you need to run it

Python 3.12
Ollama installed and running on your machine
At least one model pulled locally (e.g. ollama pull llama3.1:8b)
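
If you want to confirm the last two requirements before running anything, the sketch below asks the local Ollama server which models have been pulled. It assumes Ollama is listening on its default address (http://localhost:11434) and that its /api/tags endpoint returns a JSON object with a "models" list; the helper names are hypothetical and are not part of eval-guard.

```python
import json
from urllib.request import urlopen

# Assumption: Ollama's default local address and its model-listing endpoint.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def pulled_models(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_pulled_models(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> list[str]:
    """Query the local Ollama server; raises URLError if it is not running."""
    with urlopen(url, timeout=timeout) as response:
        return pulled_models(json.load(response))
```

Calling fetch_pulled_models() and checking that "llama3.1:8b" is in the result covers both "Ollama is running" and "a model is pulled" in one step.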

How to run it

From the repository root, install eval-guard and its dependencies:

pip install -e .

Run all four benchmarks:

python run_eval.py --model llama3.1:8b --benchmark all

Launch the dashboard to see results over time:

streamlit run run_dashboard.py

What the scores mean

Benchmark	Good score	Bad score	What a bad score means for your business
Factuality	1.0	Below 0.8	Your model is giving wrong answers. Clients lose trust.
Consistency	Above 0.5	Below 0.3	Your model gives different answers to the same question depending on how you ask. Confusing and unreliable.
Context Adherence	Above 0.8	Below 0.2	Your model invents facts that were never in your documents. This is how hallucinations become client-facing.
Latency	Above 0.8	Below 0.3	Your model is responding slowly. Clients are waiting.
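
The thresholds in the table can be turned into a simple check, sketched below. The threshold values are copied from the table; everything else is an assumption for illustration: the classify function is hypothetical, the "watch" label for scores between the two thresholds is invented here, and a score exactly on the good threshold is treated as good because the table does not say how boundaries are handled.

```python
# (good_at_or_above, bad_below) per benchmark, copied from the table above.
THRESHOLDS = {
    "Factuality": (1.0, 0.8),
    "Consistency": (0.5, 0.3),
    "Context Adherence": (0.8, 0.2),
    "Latency": (0.8, 0.3),
}


def classify(benchmark: str, score: float) -> str:
    """Return "good", "bad", or "watch" for a benchmark score.

    Hypothetical helper; "watch" covers the gap between the two thresholds.
    """
    good, bad = THRESHOLDS[benchmark]
    if score < bad:
        return "bad"
    if score >= good:
        return "good"
    return "watch"
```

For example, classify("Factuality", 0.9) falls between the two thresholds and returns "watch", while classify("Context Adherence", 0.1) returns "bad".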

The database is local

The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
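
If you want to look inside the file yourself, the sketch below lists its tables. It assumes eval_guard.db is a SQLite database, which the .db extension suggests but this README does not confirm, and it makes no assumption about table or column names. The list_tables helper is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro prevents sqlite3 from creating or modifying the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Run list_tables("eval_guard.db") after your first benchmark run; if the file is not SQLite, the query raises sqlite3.DatabaseError.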
