Open-source runtime guardrail that catches RAG hallucinations sentence-by-sentence, with per-claim source spans that show exactly which chunk supported (or didn't support) each sentence.

    from athena_verify import verify

    result = verify(
        question="What is the indemnification cap?",
        answer="The cap is $1M per incident, with a $5M annual aggregate.",
        context=retrieved_chunks,
    )
    for s in result.sentences:
        status = "✓" if s.supported else "✗"
        print(f"{status} {s.text}")
        for span in s.supporting_spans:
            print(f" ← chunk[{span.chunk_idx}] {span.start}–{span.end}: {span.text!r}")

No document ingestion. No chunking. No agents. No database. Works identically on GPT, Claude, Gemini, Llama, Qwen, or any other model: provider-neutral by design.
In a multi-step agent, each step's output feeds the next — a single fabricated
figure propagates straight into the final answer. verify_step() is a circuit
breaker that halts the chain the moment a claim stops being grounded in the
evidence:

    from athena_verify import verify_step

    step = verify_step(claim=reasoning_step, evidence=retrieved_chunks, threshold=0.5)
    if step.action == "halt":
        raise RuntimeError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

Run it yourself: examples/agent_circuit_breaker.py.
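
To see why halting early matters, here is a minimal, self-contained sketch of a multi-step chain guarded by a circuit breaker. It does not call athena_verify: `toy_verify_step` is a hypothetical stand-in that scores a claim by word overlap with the evidence, and `StepResult` only mirrors the `action` and `trust_score` fields used above.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    # Mirrors only the two fields used in the snippet above (assumed shape).
    trust_score: float
    action: str  # "continue" or "halt"


def toy_verify_step(claim: str, evidence: list[str], threshold: float) -> StepResult:
    """Hypothetical stand-in for verify_step: fraction of claim words found in evidence."""
    claim_words = set(claim.lower().split())
    evidence_words = set(" ".join(evidence).lower().split())
    score = len(claim_words & evidence_words) / len(claim_words) if claim_words else 0.0
    return StepResult(score, "continue" if score >= threshold else "halt")


def run_chain(steps: list[str], evidence: list[str], threshold: float = 0.5) -> list[str]:
    """Accept steps in order and stop at the first one that is not grounded."""
    accepted = []
    for claim in steps:
        result = toy_verify_step(claim, evidence, threshold)
        if result.action == "halt":
            break  # circuit breaker: later steps never build on the bad claim
        accepted.append(claim)
    return accepted


evidence = ["the cap is $1m per incident"]
steps = [
    "the cap is $1m per incident",     # grounded in the evidence
    "annual revenue doubled in 2023",  # not in the evidence
    "the cap is $1m",                  # grounded, but never reached
]
accepted = run_chain(steps, evidence)
print(accepted)
```

Only the first step is accepted; the third step is grounded but is never evaluated, because the chain stops at the first ungrounded claim.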
Verification pipeline:

              LLM Answer
                   │
                   ▼
         Split into sentences
                   │
                   ▼
    ┌──────────────────────────────────────────┐
    │ For each answer sentence:                │
    │                                          │
    │ 1. Split context into sentences too      │
    │ 2. NLI: each ctx sentence vs answer ──► │ max entailment score
    │ 3. Lexical overlap vs context        ──► │ token F1 score
    │ 4. [optional] LLM judge              ──► │ SUPPORTED / UNSUPPORTED
    │ 5. Combine → trust score                 │
    └──────────────┬───────────────────────────┘
                   │
                   ▼
          VerificationResult
          ├─ trust_score: 0.0–1.0
          ├─ supported: [sentences that passed]
          ├─ unsupported: [sentences that failed]
          └─ verification_passed: bool
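
Step 3 of the diagram names token F1 as the lexical signal. The sketch below shows one common way to compute it, plus a weighted blend with an entailment score. The whitespace tokenizer, the `combine` helper, and the 0.7/0.3 weights are illustrative assumptions; this document does not state how Athena actually combines its signals.

```python
from collections import Counter


def token_f1(claim: str, context: str) -> float:
    """Token-level F1 between a claim and a context passage (whitespace tokens)."""
    claim_tokens = claim.lower().split()
    context_tokens = context.lower().split()
    if not claim_tokens or not context_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it appears in both.
    overlap = sum((Counter(claim_tokens) & Counter(context_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(claim_tokens)
    recall = overlap / len(context_tokens)
    return 2 * precision * recall / (precision + recall)


def combine(entailment: float, lexical: float, w_nli: float = 0.7) -> float:
    """Hypothetical blend of the two signals into a 0-1 trust score."""
    return w_nli * entailment + (1.0 - w_nli) * lexical


lexical = token_f1("the cap is $1M", "the cap is $1M per incident")
trust = combine(entailment=0.9, lexical=lexical)
print(f"token F1 = {lexical:.2f}, trust = {trust:.2f}")
```

A linear blend is only one option; a real verifier might instead gate on the NLI score and use lexical overlap as a tie-breaker.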
Two modes:
- NLI-only (default): ~20ms per sentence, catches fabricated claims, out-of-context info, number swaps, and negation flips
- NLI + LLM judge: escalate borderline cases to a local LLM for additional verification when accuracy is critical
Hard latency budget. Athena is the first open-source verifier with an explicit latency_budget_ms knob:

    verify(..., latency_budget_ms=50)    # pure NLI+lexical only (voice AI / agent fast path)
    verify(..., latency_budget_ms=500)   # escalate borderline cases if budget allows
    verify(..., latency_budget_ms=None)  # always escalate (default)

Install:

    pip install athena-verify

The NLI model (DeBERTa-v3, ~1.2 GB) downloads automatically on first use.
For LLM-judge support (optional, local models via LM Studio or API):

    pip install "athena-verify[all]"

Evaluated on 100 synthetic cases across 6 hallucination categories and four domains (legal, medical, technical, general). Real-world benchmarks against RAGTruth and HaluEval are in progress; download instructions are in benchmarks/RESULTS.md.
Each row is the per-category F1 for catching hallucinations. The faithful-text row is intentionally excluded here — it contains no hallucinations, so its F1 is undefined; we report its false-positive rate separately below, which is the number that actually matters for clean text.
| Category | Precision | Recall | F1 |
|---|---|---|---|
| Fabricated claims | 100% | 96% | 97.9% ✓ |
| Out-of-context | 100% | 97% | 98.3% ✓ |
| Subtle contradictions | 100% | 97% | 98.3% ✓ |
| Partial support | 95% | 91% | 93.0% |
| Number substitutions | 82% | 96% | 88.5% |
| Overall | 95% | 96% | 95.0% (synthetic) |
False-positive rate on faithful text: 4.6% (4 of 87 genuinely-supported
sentences flagged) on the base model, 3.4% on the large model — down from 17%
before calibration. Latency: p50 22.5 ms, p95 34.5 ms per verification on the
base model. Numbers are reproducible with python benchmarks/run_full_eval.py.
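
For readers checking the arithmetic, this is a small sketch of how the two headline metrics are defined. The function names are illustrative and not part of the library. Because the table reports rounded precision and recall, recomputing F1 from those rounded values can differ from the published F1 in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(flagged: int, total_supported: int) -> float:
    """Share of genuinely supported sentences that were wrongly flagged."""
    return flagged / total_supported


# Partial-support row: precision 95%, recall 91%.
partial_f1 = f1(0.95, 0.91)
# Faithful text on the base model: 4 of 87 supported sentences flagged.
base_fpr = false_positive_rate(4, 87)

print(f"F1  = {partial_f1:.1%}")
print(f"FPR = {base_fpr:.1%}")
```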
Standalone NLI scores many faithful paraphrases as "neutral" (entailment ≈ 0) even when the claim is fully supported. Athena recovers these without letting hallucinations through, using three guarded signals:
- Anaphora windowing — a sentence opening with a referent ("This cap…", "It also…") is scored together with its predecessor, restoring the antecedent.
- Contradiction-aware rescue — a not-entailed claim is only rescued when the most on-topic context unit does not contradict it, so reversals and subtle contradictions stay flagged.
- Numeric gate — rescue requires every number in the claim to appear in the context, so number-substitution hallucinations ("$5M" vs a "$2M" context) are never rescued.
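
The numeric gate is the easiest of the three signals to picture in code. The sketch below is an assumption about how such a gate could work, not Athena's implementation: the regex, the normalization, and the names `numbers_in` and `passes_numeric_gate` are all hypothetical.

```python
import re

# Matches integers, decimals, and comma-grouped numbers, with an optional K/M/B suffix.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?[kmb]?", re.IGNORECASE)


def numbers_in(text: str) -> set[str]:
    """Extract normalized number tokens, e.g. '$5M' -> '5m', '1,000' -> '1000'."""
    return {m.group().replace(",", "").lower() for m in NUMBER_RE.finditer(text)}


def passes_numeric_gate(claim: str, context: str) -> bool:
    """Allow rescue only if every number in the claim also appears in the context."""
    return numbers_in(claim) <= numbers_in(context)


context = "Liability is capped at $2M per incident."
same_number = passes_numeric_gate("The cap is $2M.", context)
swapped_number = passes_numeric_gate("The cap is $5M.", context)
print(same_number, swapped_number)
```

A claim with no numbers passes trivially, so a gate like this can only block a rescue, never trigger one. It also does not equate "2 million" with "$2M"; a production gate would need unit and format normalization.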
The remaining false positives are heavily-paraphrased claims with little lexical
overlap (e.g. "olive oil is drizzled on top"); enable the optional LLM-judge
escalation (use_llm_judge=True) for those. Athena still biases toward catching
hallucinations over passing every clean sentence — treat it as a guardrail.
LettuceDetect beats Athena on span-level F1 on real-world benchmarks: it reports 79.2% F1 on annotated spans, while Athena's real-world score is not yet validated. Athena's advantages are latency bounds, provider-neutrality, offline execution, and the spans-in-library integration story, not raw F1.
Recommendation: Use Athena as a guardrail, not a final gate. Flag suspicious statements for human review rather than silently dropping them.
| Tool | Runs locally | Provider-neutral | Latency budget knob | Per-claim spans | F1 (real) |
|---|---|---|---|---|---|
| Athena | Yes | Yes | Yes | Yes | TBD¹ |
| LettuceDetect | Yes | Yes | No | No | 79.2% |
| HHEM-2.1 | Yes | Yes | No | No | ~82% |
| Ragas | Yes | No (LLM calls) | No | No | ~75% |
| Azure Groundedness | No (cloud only) | No (GPT-4o only) | No | No | ~90% |
| Vertex Grounding | No (cloud only) | No (Gemini only) | No | No | ~88% |
| Anthropic Citations | No (cloud only) | No (Claude only) | No | No | — |
¹ RAGTruth and HaluEval benchmarks pending; see benchmarks/RESULTS.md for download instructions.
Full methodology: benchmarks/RESULTS.md
| | Athena | Ragas | Azure Groundedness | LettuceDetect |
|---|---|---|---|---|
| Runtime detection | Yes | Offline eval | Yes | Offline eval |
| Sentence-level + spans | Yes | Answer-level | No | Span-level |
| Works offline / local | Yes | Yes | No (cloud) | Yes |
| Provider-neutral | Yes | Partial | No (GPT-4o only) | Yes |
| Latency budget knob | Yes | No | No | No |
| Open source | Yes | Yes | No | Yes |
| No external API required | Yes | No (LLM calls) | No | Yes |
Ragas and TruLens are great for offline evaluation. Azure/Vertex/Anthropic detectors work only in their own cloud. Athena is the runtime guardrail for everywhere else — any model, any stack, fully offline.
LangChain:

    from athena_verify.integrations.langchain import VerifyingLLM

    # RetrievalQA, llm, and retriever come from your existing LangChain setup
    chain = RetrievalQA.from_llm(
        VerifyingLLM(llm, retriever=retriever, on_unsupported="re-retrieve", max_retries=2),
        retriever=retriever,
    )

LlamaIndex:

    from athena_verify.integrations.llamaindex import VerifyingPostprocessor

    # index is your existing LlamaIndex index
    engine = index.as_query_engine(
        response_postprocessors=[VerifyingPostprocessor()]
    )

LangGraph:

    from athena_verify.integrations.langgraph import VerifyStepNode

    # graph is your existing LangGraph graph
    graph.add_node("verify", VerifyStepNode(threshold=0.8))

Direct completion with verification:

    from athena_verify import verified_completion

    result = verified_completion(
        model="gpt-4o",
        question="What is the indemnification cap?",
        context=retrieved_chunks,
    )

verify() parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `question` | `str` | required | The original question |
| `answer` | `str` | required | The LLM-generated answer |
| `context` | `list[str]` | required | Retrieved context chunks |
| `nli_model` | `str` | `nli-deberta-v3-base` | Cross-encoder model |
| `use_llm_judge` | `bool` | `False` | Enable LLM judge for all sentences |
| `trust_threshold` | `float` | `0.70` | Minimum trust to pass |
| `latency_budget_ms` | `int \| None` | `None` | Hard latency cap; ≤100 skips LLM judge entirely |

Returns VerificationResult with trust_score, sentences (each with supporting_spans), supported, unsupported, and verification_passed.
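
How the latency budget limits escalation to the LLM judge can be pictured as a small decision function. This is a sketch under stated assumptions, not the library's logic: `should_escalate`, `elapsed_ms`, and the `judge_cost_ms` estimate are hypothetical; only the 100 ms cutoff and the meaning of `None` come from the parameter descriptions above.

```python
from typing import Optional

FAST_PATH_CUTOFF_MS = 100  # from the table: budgets <= 100 ms skip the LLM judge


def should_escalate(
    borderline: bool,
    latency_budget_ms: Optional[int],
    elapsed_ms: float,
    judge_cost_ms: float = 300.0,  # assumed judge latency, purely illustrative
) -> bool:
    """Decide whether a sentence may be sent to the LLM judge."""
    if not borderline:
        return False  # confident NLI verdicts do not need the judge
    if latency_budget_ms is None:
        return True  # no cap: borderline cases always escalate
    if latency_budget_ms <= FAST_PATH_CUTOFF_MS:
        return False  # fast path: NLI + lexical only
    # Otherwise escalate only if the judge call still fits in the remaining budget.
    return elapsed_ms + judge_cost_ms <= latency_budget_ms


fast_path = should_escalate(True, 50, elapsed_ms=20)
fits_budget = should_escalate(True, 500, elapsed_ms=20)
over_budget = should_escalate(True, 500, elapsed_ms=250)
no_cap = should_escalate(True, None, elapsed_ms=20)
print(fast_path, fits_budget, over_budget, no_cap)
```

Checking the budget before the judge call, rather than cancelling it midway, keeps the cap hard: a request either fits or is never started.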
Circuit-breaker primitive for agent pipelines:

    from athena_verify import verify_step

    step = verify_step(
        claim="The contract was signed on 2024-01-15",
        evidence=retrieved_chunks,
        threshold=0.8,
    )
    # step.passed: bool, step.trust_score: float, step.action: "continue" | "halt"
    if step.action == "halt":
        raise ValueError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

See examples/agent_circuit_breaker.py for a full LangGraph example.
| Example | Description |
|---|---|
| examples/quickstart.py | 5-minute getting started |
| examples/langchain_example.py | LangChain RetrievalQA with self-healing re-retrieve |
| examples/llamaindex_example.py | LlamaIndex query engine |
| examples/agent_circuit_breaker.py | LangGraph agent with verify_step halt |
- Security & Data Privacy: What data leaves your machine? How to stay fully offline?
- Threshold Tuning: How to pick trust_threshold for your domain (legal, support, etc.)
- NLI Model Trade-offs: Speed vs accuracy, and which model to use (DeBERTa, Lightweight, etc.)
See CONTRIBUTING.md. PRs welcome — especially benchmark results, new integrations, and NLI model improvements.
