How do you know your AI isn't getting worse at reading your documents?
eval-guard runs your AI model through four benchmarks on a schedule and tracks the scores over time. If something changes (a model update, configuration drift, or a new document type), you see it in the dashboard before your clients do.
You run a single command, the tests run against your local model, and the results are saved automatically. Open the dashboard whenever you want to check the trend.
- Factuality — Checks whether the model gives correct factual answers to straightforward questions about geography, science, and math.
- Consistency — Measures whether the model gives similar answers when the same question is asked in several different ways.
- Context Adherence — Tests whether the model admits it does not know the answer when the information is not present in the provided text, rather than making something up.
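- Latency — Measures how fast the model responds and how much GPU memory it uses. This is an infrastructure health metric, not a quality metric. Like the other benchmarks, it is reported as a score where higher is better, so a low score means slow responses.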
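
To make the Consistency and Context Adherence ideas concrete, here is a minimal sketch of how such scores could be computed. It is not eval-guard's actual scoring code: the token-overlap (Jaccard) similarity, the function names, and the list of refusal phrases are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty answers are trivially identical
    return len(ta & tb) / len(ta | tb)


def consistency_score(answers: list[str]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # fewer than two answers: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical refusal phrases (an assumption); a real check needs a broader list.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not mentioned", "not stated")


def abstained(answer: str) -> bool:
    """True if the answer contains one of the refusal phrases."""
    lowered = answer.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)
```

In this sketch, `consistency_score` takes the answers the model gave to several phrasings of one question, and `abstained` is applied to an answer whose supporting text was deliberately left out of the context. Token overlap is a crude proxy; an embedding-based similarity would tolerate paraphrase better.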
Requirements:
- Python 3.12
- Ollama installed and running on your machine
- At least one model pulled locally (e.g. ollama pull llama3.1:8b)
Install eval-guard and its dependencies:
pip install -e .
Run all four benchmarks:
python run_eval.py --model llama3.1:8b --benchmark all
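
If you want to trigger runs from your own scheduler or script, the command above can be wrapped in a few lines of Python. This is only a sketch: it assumes it is started from the repository root, and the helper names `build_command` and `run_benchmarks` are hypothetical. Only the `--model` and `--benchmark` flags come from the command shown above, and what exit code `run_eval.py` returns on failure is not documented here.

```python
import subprocess
import sys


def build_command(model: str, benchmark: str = "all") -> list[str]:
    """Build the argument list for one run_eval.py invocation."""
    return [sys.executable, "run_eval.py", "--model", model, "--benchmark", benchmark]


def run_benchmarks(models: list[str]) -> dict[str, int]:
    """Run the benchmarks for each model and return the exit code per model."""
    exit_codes = {}
    for model in models:
        # Each run is a separate process, exactly like typing the command by hand.
        completed = subprocess.run(build_command(model))
        exit_codes[model] = completed.returncode
    return exit_codes
```

Calling `run_benchmarks(["llama3.1:8b"])` would then behave like the single command above, using the same Python interpreter that runs the script.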
Launch the dashboard to see results over time:
streamlit run run_dashboard.py
| Benchmark | Good score | Bad score | What a bad score means for your business |
|---|---|---|---|
| Factuality | 1.0 | Below 0.8 | Your model is giving wrong answers. Clients lose trust. |
| Consistency | Above 0.5 | Below 0.3 | Your model gives different answers to the same question depending on how you ask. Confusing and unreliable. |
| Context Adherence | Above 0.8 | Below 0.2 | Your model invents facts that were never in your documents. This is how hallucinations become client-facing. |
| Latency | Above 0.8 | Below 0.3 | Your model is responding slowly. Clients are waiting. |
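
The thresholds in the table can be turned into a simple check. The sketch below is an illustration, not part of eval-guard: the `classify` function and the "borderline" label for scores between the two thresholds are assumptions, since the table only names the good and bad ranges.

```python
# (good_threshold, bad_threshold) per benchmark, copied from the table above.
THRESHOLDS = {
    "Factuality": (1.0, 0.8),
    "Consistency": (0.5, 0.3),
    "Context Adherence": (0.8, 0.2),
    "Latency": (0.8, 0.3),
}


def classify(benchmark: str, score: float) -> str:
    """Return 'good', 'bad', or 'borderline' (an assumed label) for a score."""
    good, bad = THRESHOLDS[benchmark]
    if score < bad:
        return "bad"
    if benchmark == "Factuality":
        # The table lists exactly 1.0 as the good Factuality score.
        return "good" if score >= good else "borderline"
    # The other benchmarks are good only strictly above their threshold.
    return "good" if score > good else "borderline"
```

For example, `classify("Context Adherence", 0.1)` returns `"bad"`, while `classify("Consistency", 0.4)` falls between the two thresholds and returns `"borderline"`.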
The database file (eval_guard.db) is created automatically the first time you run a benchmark. It is not included in this repository.
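
To check that the file was created and see what it contains, you can list its tables without knowing the schema. This sketch assumes eval_guard.db is a SQLite database, which the .db extension suggests but this README does not state. The `list_tables` helper is hypothetical.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: str = "eval_guard.db") -> list[str]:
    """Return the table names in the database, or [] if the file is absent."""
    path = Path(db_path)
    if not path.exists():
        return []  # no benchmark has been run yet
    # Open read-only so that inspecting the file can never modify it.
    uri = path.resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

An empty list before the first benchmark run is expected, because the file does not exist yet.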