athena-verify

Open-source runtime guardrail that catches RAG hallucinations sentence-by-sentence — with per-claim source spans that show exactly which chunk supported (or didn't support) each sentence.

from athena_verify import verify

result = verify(
    question="What is the indemnification cap?",
    answer="The cap is $1M per incident, with a $5M annual aggregate.",
    context=retrieved_chunks,
)

for s in result.sentences:
    status = "✓" if s.supported else "✗"
    print(f"{status} {s.text}")
    for span in s.supporting_spans:
        print(f"  ← chunk[{span.chunk_idx}] {span.start}–{span.end}: {span.text!r}")

No document ingestion. No chunking. No agents. No database. Works identically on GPT, Claude, Gemini, Llama, Qwen, or any other model — provider-neutral by design.

Stop hallucinations before they cascade

In a multi-step agent, each step's output feeds the next — a single fabricated figure propagates straight into the final answer. verify_step() is a circuit breaker that halts the chain the moment a claim stops being grounded in the evidence:

from athena_verify import verify_step

step = verify_step(claim=reasoning_step, evidence=retrieved_chunks, threshold=0.5)
if step.action == "halt":
    raise RuntimeError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

Run it yourself: examples/agent_circuit_breaker.py.
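The halting pattern can be sketched without the library installed. The snippet below is illustrative only: `StepCheck`, `run_chain`, and `toy_check` are hypothetical names, and `toy_check` is a verbatim-match stand-in for `verify_step()`, not Athena's scoring.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StepCheck:
    """Stand-in for the object verify_step() returns (field names from this README)."""
    trust_score: float
    action: str  # "continue" or "halt"


def run_chain(
    steps: Iterable[str],
    evidence: list[str],
    check: Callable[[str, list[str]], StepCheck],
) -> list[str]:
    """Accept steps in order and stop at the first one the checker halts on."""
    accepted: list[str] = []
    for claim in steps:
        result = check(claim, evidence)
        if result.action == "halt":
            break  # later steps would build on an ungrounded claim
        accepted.append(claim)
    return accepted


def toy_check(claim: str, evidence: list[str]) -> StepCheck:
    """Hypothetical checker: grounded only if the claim appears verbatim in a chunk."""
    grounded = any(claim in chunk for chunk in evidence)
    return StepCheck(
        trust_score=1.0 if grounded else 0.0,
        action="continue" if grounded else "halt",
    )


evidence = ["The cap is $1M per incident."]
steps = ["The cap is $1M per incident.", "The cap is $9M per year.", "The cap is $1M"]
accepted_steps = run_chain(steps, evidence, toy_check)
```

In a real pipeline the `check` argument would presumably wrap the library call, for example `lambda c, e: verify_step(claim=c, evidence=e, threshold=0.5)`. Note that the third step is never evaluated, because the chain stops at the second.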

How It Works

  LLM Answer
      │
      ▼
  Split into sentences
      │
      ▼
  ┌──────────────────────────────────────────┐
  │  For each answer sentence:               │
  │                                          │
  │  1. Split context into sentences too     │
  │  2. NLI: each ctx sentence vs answer ──► │ max entailment score
  │  3. Lexical overlap vs context       ──► │ token F1 score
  │  4. [optional] LLM judge             ──► │ SUPPORTED / UNSUPPORTED
  │  5. Combine → trust score                │
  └──────────────┬───────────────────────────┘
                 │
                 ▼
    VerificationResult
    ├─ trust_score: 0.0–1.0
    ├─ supported: [sentences that passed]
    ├─ unsupported: [sentences that failed]
    └─ verification_passed: bool
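Steps 2, 3, and 5 of the diagram can be sketched as follows. This is a minimal illustration under stated assumptions: the entailment scores are made-up numbers standing in for NLI model output, and the weighted average in `combine` (including its `nli_weight` parameter) is hypothetical, since the README does not specify how Athena combines the signals.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercased word tokens; punctuation and currency symbols are dropped."""
    return re.findall(r"\w+", text.lower())


def token_f1(claim: str, context: str) -> float:
    """Token-level F1 between a claim and a context string (multiset overlap)."""
    claim_counts = Counter(tokens(claim))
    context_counts = Counter(tokens(context))
    overlap = sum((claim_counts & context_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(claim_counts.values())
    recall = overlap / sum(context_counts.values())
    return 2 * precision * recall / (precision + recall)


def combine(entailment: float, lexical: float, nli_weight: float = 0.7) -> float:
    """Hypothetical weighted average; the library's actual rule may differ."""
    return nli_weight * entailment + (1.0 - nli_weight) * lexical


claim = "The cap is $1M per incident"
context = "The indemnification cap is $1M per incident."
entailment_scores = [0.05, 0.92, 0.10]  # made-up NLI outputs, one per context sentence

# Step 2 keeps the best-matching context sentence; step 5 merges it with lexical overlap.
trust = combine(max(entailment_scores), token_f1(claim, context))
```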

Two modes:

NLI-only (default): ~20ms per sentence, catches fabricated claims, out-of-context info, number swaps, and negation flips
NLI + LLM judge: escalate borderline cases to a local LLM for additional verification when accuracy is critical

Hard latency budget — the first open-source verifier with an explicit latency_budget_ms knob:

verify(..., latency_budget_ms=50)   # pure NLI+lexical only — voice AI / agent fast-path
verify(..., latency_budget_ms=500)  # escalate borderline cases if budget allows
verify(..., latency_budget_ms=None) # no cap (default): escalate whenever the LLM judge is enabled
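One way to read the three settings above is as a gate in front of the LLM judge. The sketch below assumes the judge is enabled and that the caller has some estimate of how long a judge call takes. `may_escalate`, `elapsed_ms`, and `judge_estimate_ms` are hypothetical names; only the `≤100` cutoff comes from the API table in this README.

```python
from typing import Optional


def may_escalate(
    latency_budget_ms: Optional[int],
    elapsed_ms: float,
    judge_estimate_ms: float,
) -> bool:
    """Decide whether a borderline sentence may be sent to the LLM judge."""
    if latency_budget_ms is None:
        return True  # no cap configured
    if latency_budget_ms <= 100:
        return False  # fast path: NLI + lexical only
    # Escalate only if the judge call is expected to fit in the remaining budget.
    return elapsed_ms + judge_estimate_ms <= latency_budget_ms
```

With a 500 ms budget, a sentence scored in 25 ms would be escalated if the judge is expected to take 300 ms, but not if it is expected to take 600 ms.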

Install

pip install athena-verify

The NLI model (DeBERTa-v3, ~1.2 GB) downloads automatically on first use.

For LLM-judge support (optional, local models via LM Studio or API):

pip install "athena-verify[all]"

Benchmarks

Evaluated on 100 synthetic cases across 6 categories (the five hallucination types in the table below, plus faithful text), drawn from legal, medical, technical, and general domains. Real-world benchmarks against RAGTruth and HaluEval are in progress; download instructions are in benchmarks/RESULTS.md.

Hallucination Detection (NLI-only, synthetic, nli-deberta-v3-base)

Each row is the per-category F1 for catching hallucinations. The faithful-text row is intentionally excluded here — it contains no hallucinations, so its F1 is undefined; we report its false-positive rate separately below, which is the number that actually matters for clean text.

Category	Precision	Recall	F1
Fabricated claims	100%	96%	97.9% ✓
Out-of-context	100%	97%	98.3% ✓
Subtle contradictions	100%	97%	98.3% ✓
Partial support	95%	91%	93.0%
Number substitutions	82%	96%	88.5%
Overall	95%	96%	95.0% (synthetic)

False-positive rate on faithful text: 4.6% (4 of 87 genuinely-supported sentences flagged) on the base model, 3.4% on the large model — down from 17% before calibration. Latency: p50 22.5 ms, p95 34.5 ms per verification on the base model. Numbers are reproducible with python benchmarks/run_full_eval.py.
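For readers checking the arithmetic, the two headline metrics follow the standard definitions below. This is a sketch of those definitions, not the benchmark script. Because the table's precision and recall are already rounded, recomputing F1 from them can differ from the listed value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(flagged: int, total_supported: int) -> float:
    """Share of genuinely supported sentences that were wrongly flagged."""
    return flagged / total_supported


partial_support_f1 = f1_score(0.95, 0.91)     # "Partial support" row
faithful_fp_rate = false_positive_rate(4, 87)  # base model, faithful text
```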

How false positives are kept low

Standalone NLI scores many faithful paraphrases as "neutral" (entailment ≈ 0) even when the claim is fully supported. Athena recovers these without letting hallucinations through, using three guarded signals:

Anaphora windowing — a sentence opening with a referent ("This cap…", "It also…") is scored together with its predecessor, restoring the antecedent.
Contradiction-aware rescue — a not-entailed claim is only rescued when the most on-topic context unit does not contradict it, so reversals and subtle contradictions stay flagged.
Numeric gate — rescue requires every number in the claim to appear in the context, so number-substitution hallucinations ("$5M" vs a "$2M" context) are never rescued.
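Two of these signals are simple enough to sketch in a few lines. The code below is a rough illustration, not the library's implementation: the regex, the `REFERENT_OPENERS` list, and both function names are assumptions, and no number normalization is attempted (so "1,000" and "1000" would not match).

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
REFERENT_OPENERS = ("this", "that", "it", "these", "those", "they")  # assumed list


def passes_numeric_gate(claim: str, context: str) -> bool:
    """True only if every number in the claim also appears in the context."""
    claim_numbers = set(NUMBER.findall(claim))
    context_numbers = set(NUMBER.findall(context))
    return claim_numbers <= context_numbers


def windowed(sentences: list[str], i: int) -> str:
    """Prepend the previous sentence when sentence i opens with a referent."""
    words = sentences[i].split()
    opens_with_referent = bool(words) and words[0].lower().strip(",.;:") in REFERENT_OPENERS
    if i > 0 and opens_with_referent:
        return sentences[i - 1] + " " + sentences[i]
    return sentences[i]


answer = ["The cap is $1M per incident.", "This cap resets annually."]
scored_text = windowed(answer, 1)  # the second sentence is scored with its antecedent
blocked = not passes_numeric_gate("The aggregate is $5M.", "The annual aggregate is $2M.")
```

A claim with no numbers passes the gate trivially, so the gate only ever blocks a rescue and never grants one by itself.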

The remaining false positives are heavily-paraphrased claims with little lexical overlap (e.g. "olive oil is drizzled on top"); enable the optional LLM-judge escalation (use_llm_judge=True) for those. Athena still biases toward catching hallucinations over passing every clean sentence — treat it as a guardrail.

LettuceDetect currently leads on real-world span-level F1: it reports 79.2% on annotated spans, while Athena has no validated real-world score yet. Athena competes on latency bounds, provider neutrality, offline execution, and the spans-in-library integration story, not on raw F1.

Recommendation: Use Athena as a guardrail, not a final gate. Flag suspicious statements for human review rather than silently dropping them.
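A flag-instead-of-drop policy might look like the sketch below. `Sentence` is a stand-in dataclass that mirrors the `text` and `supported` fields shown in the quickstart; `annotate` and its marker string are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Stand-in for a verified sentence (fields mirror the quickstart example)."""
    text: str
    supported: bool


def annotate(sentences: list[Sentence], marker: str = "[needs review]") -> str:
    """Keep every sentence, tagging unsupported ones rather than removing them."""
    return " ".join(
        s.text if s.supported else f"{s.text} {marker}" for s in sentences
    )


reviewed = annotate([
    Sentence("The cap is $1M per incident.", True),
    Sentence("It renews every five years.", False),
])
```

The reader still sees the full answer, and a reviewer can search for the marker to find what needs checking.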

Comparison

Tool	Runs locally	Provider-neutral	Latency budget knob	Per-claim spans	F1 (real-world)
Athena	Yes	Yes	Yes	Yes	TBD¹
LettuceDetect	Yes	Yes	No	No	79.2%
HHEM-2.1	Yes	Yes	No	No	~82%
Ragas	Yes	No (LLM calls)	No	No	~75%
Azure Groundedness	No (cloud only)	No (GPT-4o only)	No	No	~90%
Vertex Grounding	No (cloud only)	No (Gemini only)	No	No	~88%
Anthropic Citations	No (cloud only)	No (Claude only)	No	No	—

¹ RAGTruth and HaluEval benchmarks pending; see benchmarks/RESULTS.md for download instructions.

Full methodology: benchmarks/RESULTS.md

How We Compare

	Athena	Ragas	Azure Groundedness	LettuceDetect
Runtime detection	Yes	Offline eval	Yes	Offline eval
Sentence-level + spans	Yes	Answer-level	No	Span-level
Works offline / local	Yes	Yes	No (cloud)	Yes
Provider-neutral	Yes	Partial	No (GPT-4o only)	Yes
Latency budget knob	Yes	No	No	No
Open source	Yes	Yes	No	Yes
No external API required	Yes	No (LLM calls)	No	Yes

Ragas and TruLens are great for offline evaluation. Azure/Vertex/Anthropic detectors work only in their own cloud. Athena is the runtime guardrail for everywhere else — any model, any stack, fully offline.

Integrations

LangChain

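from langchain.chains import RetrievalQA
from athena_verify.integrations.langchain import VerifyingLLM

# `llm` and `retriever` are your existing LangChain model and retriever.
chain = RetrievalQA.from_llm(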
    VerifyingLLM(llm, retriever=retriever, on_unsupported="re-retrieve", max_retries=2),
    retriever=retriever,
)

LlamaIndex

from athena_verify.integrations.llamaindex import VerifyingPostprocessor

engine = index.as_query_engine(
    response_postprocessors=[VerifyingPostprocessor()]
)

LangGraph (agent circuit-breaker)

from athena_verify.integrations.langgraph import VerifyStepNode

graph.add_node("verify", VerifyStepNode(threshold=0.8))

OpenAI / Anthropic

from athena_verify import verified_completion

result = verified_completion(
    model="gpt-4o",
    question="What is the indemnification cap?",
    context=retrieved_chunks,
)

API

`verify(question, answer, context, ...)`

Parameter	Type	Default	Description
`question`	`str`	required	The original question
`answer`	`str`	required	The LLM-generated answer
`context`	`list[str]`	required	Retrieved context chunks
`nli_model`	`str`	`nli-deberta-v3-base`	Cross-encoder model
`use_llm_judge`	`bool`	`False`	Enable LLM judge for all sentences
`trust_threshold`	`float`	`0.70`	Minimum trust to pass
`latency_budget_ms`	`int | None`	`None`	Hard latency cap; `≤100` skips LLM judge entirely

Returns VerificationResult with trust_score, sentences (each with supporting_spans), supported, unsupported, and verification_passed.

`verify_step(claim, evidence, threshold)`

Circuit-breaker primitive for agent pipelines:

from athena_verify import verify_step

step = verify_step(
    claim="The contract was signed on 2024-01-15",
    evidence=retrieved_chunks,
    threshold=0.8,
)
# step.passed: bool, step.trust_score: float, step.action: "continue" | "halt"
if step.action == "halt":
    raise ValueError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

See examples/agent_circuit_breaker.py for a full LangGraph example.

Examples

Example	Description
`examples/quickstart.py`	5-minute getting started
`examples/langchain_example.py`	LangChain RetrievalQA with self-healing re-retrieve
`examples/llamaindex_example.py`	LlamaIndex query engine
`examples/agent_circuit_breaker.py`	LangGraph agent with `verify_step` halt

Documentation

Security & Data Privacy — What data leaves your machine? How to stay fully offline?
Threshold Tuning — How to pick trust_threshold for your domain (legal, support, etc.)
NLI Model Trade-offs — Speed vs accuracy: which model to use (DeBERTa, Lightweight, etc.)

Contributing

See CONTRIBUTING.md. PRs welcome — especially benchmark results, new integrations, and NLI model improvements.

License

MIT
