RahulModugula/athena

athena-verify

Open-source runtime guardrail that catches RAG hallucinations sentence-by-sentence — with per-claim source spans that show exactly which chunk supported (or didn't support) each sentence.

from athena_verify import verify

result = verify(
    question="What is the indemnification cap?",
    answer="The cap is $1M per incident, with a $5M annual aggregate.",
    context=retrieved_chunks,
)

for s in result.sentences:
    status = "✓" if s.supported else "✗"
    print(f"{status} {s.text}")
    for span in s.supporting_spans:
        print(f"  ← chunk[{span.chunk_idx}] {span.start}:{span.end}: {span.text!r}")

No document ingestion. No chunking. No agents. No database. Works identically on GPT, Claude, Gemini, Llama, Qwen, or any other model — provider-neutral by design.

Python 3.12 | License: MIT

Stop hallucinations before they cascade

In a multi-step agent, each step's output feeds the next — a single fabricated figure propagates straight into the final answer. verify_step() is a circuit breaker that halts the chain the moment a claim stops being grounded in the evidence:

from athena_verify import verify_step

step = verify_step(claim=reasoning_step, evidence=retrieved_chunks, threshold=0.5)
if step.action == "halt":
    raise RuntimeError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

Run it yourself: examples/agent_circuit_breaker.py.
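
To make the cascade argument concrete, here is a minimal sketch of a chain runner that stops at the first ungrounded step. It does not call athena_verify: `StepCheck`, `run_chain`, and `fake_check` are hypothetical stand-ins that only mirror the `action` and `trust_score` fields shown above, so the control flow can be run without the NLI model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StepCheck:
    """Hypothetical stand-in mirroring the fields used in the snippet above."""
    trust_score: float
    action: str  # "continue" or "halt"


def run_chain(steps: Iterable[str], check: Callable[[str], StepCheck]) -> list[str]:
    """Accept reasoning steps in order and stop at the first ungrounded one."""
    accepted: list[str] = []
    for claim in steps:
        result = check(claim)
        if result.action == "halt":
            # Stop here so later steps never build on the ungrounded claim.
            break
        accepted.append(claim)
    return accepted


def fake_check(claim: str) -> StepCheck:
    """Toy checker for the demo: any claim mentioning '$9M' counts as ungrounded."""
    grounded = "$9M" not in claim
    return StepCheck(
        trust_score=0.9 if grounded else 0.1,
        action="continue" if grounded else "halt",
    )


steps = [
    "The cap is $1M per incident.",
    "The aggregate is $9M.",
    "So total exposure is $10M.",
]
accepted = run_chain(steps, fake_check)
```

In a real pipeline the `check` argument would wrap `verify_step` with the retrieved evidence; the third step is never evaluated because it depends on the blocked second step.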

How It Works

  LLM Answer
      │
      ▼
  Split into sentences
      │
      ▼
  ┌──────────────────────────────────────────┐
  │  For each answer sentence:               │
  │                                          │
  │  1. Split context into sentences too     │
  │  2. NLI: each ctx sentence vs answer ──► │ max entailment score
  │  3. Lexical overlap vs context       ──► │ token F1 score
  │  4. [optional] LLM judge             ──► │ SUPPORTED / UNSUPPORTED
  │  5. Combine → trust score                │
  └──────────────┬───────────────────────────┘
                 │
                 ▼
    VerificationResult
    ├─ trust_score: 0.0–1.0
    ├─ supported: [sentences that passed]
    ├─ unsupported: [sentences that failed]
    └─ verification_passed: bool
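
Steps 3 and 5 of the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not Athena's implementation: the tokenizer, the `combine` function, and its `nli_weight` of 0.7 are hypothetical, and the entailment score is passed in as a plain number rather than computed by the NLI model.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercased word tokens; a deliberately simple tokenizer for the sketch."""
    return re.findall(r"\w+", text.lower())


def token_f1(claim: str, context: str) -> float:
    """Token-level F1 between an answer sentence and a context passage."""
    claim_counts = Counter(tokens(claim))
    context_counts = Counter(tokens(context))
    overlap = sum((claim_counts & context_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(claim_counts.values())
    recall = overlap / sum(context_counts.values())
    return 2 * precision * recall / (precision + recall)


def combine(entailment: float, lexical_f1: float, nli_weight: float = 0.7) -> float:
    """Weighted average of the two signals; the weight is an assumed value."""
    return nli_weight * entailment + (1 - nli_weight) * lexical_f1


lexical = token_f1("The cap is $1M", "The cap is $1M per incident")
trust = combine(entailment=0.9, lexical_f1=lexical)
```

Here the claim's four tokens all appear in the six-token context, so precision is 1.0 and recall is 4/6, giving a token F1 of 0.8.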

Two modes:

  • NLI-only (default): ~20ms per sentence, catches fabricated claims, out-of-context info, number swaps, and negation flips
  • NLI + LLM judge: escalate borderline cases to a local LLM for additional verification when accuracy is critical

Hard latency budget — the first open-source verifier with an explicit latency_budget_ms knob:

verify(..., latency_budget_ms=50)   # pure NLI+lexical only — voice AI / agent fast-path
verify(..., latency_budget_ms=500)  # escalate borderline cases if budget allows
verify(..., latency_budget_ms=None) # no cap (default): escalation is never skipped for lack of time
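
One way to picture the budget decision is as a small gate in front of the LLM judge. The sketch below is an assumption about how such a gate could work, not the library's code: `should_escalate`, the `judge_cost_ms` estimate, and the `borderline` trust band are hypothetical. Only the rule that budgets of 100 ms or less skip the judge comes from the API table in this README.

```python
from typing import Optional


def should_escalate(
    trust: float,
    elapsed_ms: float,
    latency_budget_ms: Optional[int],
    judge_cost_ms: float = 300.0,                   # assumed cost of one judge call
    borderline: tuple[float, float] = (0.4, 0.7),   # assumed "borderline" trust band
) -> bool:
    """Decide whether a sentence should be sent to the LLM judge."""
    low, high = borderline
    if not (low <= trust < high):
        return False  # the NLI + lexical score is already decisive
    if latency_budget_ms is None:
        return True   # no cap: borderline cases can always be escalated
    if latency_budget_ms <= 100:
        return False  # documented rule: budgets of 100 ms or less skip the judge
    # Escalate only if the judge call is expected to fit in the remaining budget.
    return elapsed_ms + judge_cost_ms <= latency_budget_ms
```

With these assumed numbers, a borderline sentence scored after 25 ms is escalated under a 500 ms budget but not under a 50 ms one.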

Install

pip install athena-verify

The NLI model (DeBERTa-v3, ~1.2 GB) downloads automatically on first use.

For LLM-judge support (optional, local models via LM Studio or API):

pip install "athena-verify[all]"

Benchmarks

Evaluated on 100 synthetic cases across 6 categories (five hallucination types plus faithful text), drawn from legal, medical, technical, and general domains. Real-world benchmarks against RAGTruth and HaluEval are in progress; download instructions are in benchmarks/RESULTS.md.

Hallucination Detection (NLI-only, synthetic, nli-deberta-v3-base)

Each row is the per-category F1 for catching hallucinations. The faithful-text row is intentionally excluded here — it contains no hallucinations, so its F1 is undefined; we report its false-positive rate separately below, which is the number that actually matters for clean text.

| Category | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Fabricated claims | 100% | 96% | 97.9% |
| Out-of-context | 100% | 97% | 98.3% |
| Subtle contradictions | 100% | 97% | 98.3% |
| Partial support | 95% | 91% | 93.0% |
| Number substitutions | 82% | 96% | 88.5% |
| Overall | 95% | 96% | 95.0% (synthetic) |

False-positive rate on faithful text: 4.6% (4 of 87 genuinely-supported sentences flagged) on the base model, 3.4% on the large model — down from 17% before calibration. Latency: p50 22.5 ms, p95 34.5 ms per verification on the base model. Numbers are reproducible with python benchmarks/run_full_eval.py.
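
For readers checking the arithmetic, the two metrics used in this section follow the standard definitions. The helper names below are ours, not part of the benchmark script. Because the published precision and recall are themselves rounded, an F1 recomputed from them can differ from the listed value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(flagged: int, total_faithful: int) -> float:
    """Share of genuinely supported sentences that were wrongly flagged."""
    return flagged / total_faithful


fabricated_f1 = f1_score(precision=1.00, recall=0.96)  # "Fabricated claims" row
base_fpr = false_positive_rate(flagged=4, total_faithful=87)
print(f"F1 = {fabricated_f1:.4f}, FPR = {base_fpr:.4f}")
```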

How false positives are kept low

Standalone NLI scores many faithful paraphrases as "neutral" (entailment ≈ 0) even when the claim is fully supported. Athena recovers these without letting hallucinations through, using three guarded signals:

  • Anaphora windowing — a sentence opening with a referent ("This cap…", "It also…") is scored together with its predecessor, restoring the antecedent.
  • Contradiction-aware rescue — a not-entailed claim is only rescued when the most on-topic context unit does not contradict it, so reversals and subtle contradictions stay flagged.
  • Numeric gate — rescue requires every number in the claim to appear in the context, so number-substitution hallucinations ("$5M" vs a "$2M" context) are never rescued.
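
Two of these signals are simple enough to sketch without the NLI model. The code below is an illustration under stated assumptions, not Athena's implementation: the number regex, the `REFERENT_OPENERS` word list, and both function names are hypothetical.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
# Assumed list of sentence openers that point back to the previous sentence.
REFERENT_OPENERS = {"this", "that", "these", "those", "it", "they", "such"}


def numbers_in(text: str) -> set[str]:
    """All numbers in the text, with commas stripped."""
    return {match.replace(",", "") for match in NUMBER.findall(text)}


def passes_numeric_gate(claim: str, context: str) -> bool:
    """True only if every number in the claim also appears in the context."""
    return numbers_in(claim) <= numbers_in(context)


def with_anaphora_window(sentences: list[str]) -> list[str]:
    """Prefix a sentence that opens with a referent by its predecessor."""
    windowed = []
    for i, sentence in enumerate(sentences):
        words = sentence.split()
        first = words[0].lower().strip(",.;:") if words else ""
        if i > 0 and first in REFERENT_OPENERS:
            windowed.append(f"{sentences[i - 1]} {sentence}")
        else:
            windowed.append(sentence)
    return windowed


blocked = passes_numeric_gate(
    "The annual aggregate is $5M.",
    "Liability is subject to a $2M annual aggregate.",
)
windowed = with_anaphora_window(["The cap is $1M.", "This cap resets annually."])
```

The gate only decides whether a rescue is allowed; a claim that passes it still has to clear the contradiction check described above.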

The remaining false positives are heavily-paraphrased claims with little lexical overlap (e.g. "olive oil is drizzled on top"); enable the optional LLM-judge escalation (use_llm_judge=True) for those. Athena still biases toward catching hallucinations over passing every clean sentence — treat it as a guardrail.

LettuceDetect leads on span-level F1 on real-world benchmarks: it reports 79.2% F1 on annotated spans, while Athena has no validated real-world score yet. Athena's advantages are latency bounds, provider neutrality, offline execution, and source spans built into the library, not raw F1.

Recommendation: Use athena as a guardrail, not a final gate. Flag suspicious statements for human review rather than silently dropping them.
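
A minimal sketch of the "flag, don't drop" pattern follows. `Sentence` is a stand-in dataclass with the two fields the quickstart reads from `result.sentences` (`text` and `supported`); `annotate_for_review` and its marker string are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """Stand-in for the items in result.sentences (text + supported only)."""
    text: str
    supported: bool


def annotate_for_review(sentences: list[Sentence], marker: str = "[needs review]") -> str:
    """Keep every sentence, but tag the unsupported ones for a human reviewer."""
    parts = [s.text if s.supported else f"{s.text} {marker}" for s in sentences]
    return " ".join(parts)


annotated = annotate_for_review([
    Sentence("The cap is $1M per incident.", supported=True),
    Sentence("The annual aggregate is $5M.", supported=False),
])
```

Keeping the flagged sentence visible lets a reviewer see what the model claimed and decide whether it was a genuine hallucination or a false positive.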

Comparison

| Tool | Runs locally | Provider-neutral | Latency budget knob | Per-claim spans | F1 (real) |
| --- | --- | --- | --- | --- | --- |
| Athena | Yes | Yes | Yes | Yes | TBD¹ |
| LettuceDetect | Yes | Yes | No | No | 79.2% |
| HHEM-2.1 | Yes | Yes | No | No | ~82% |
| Ragas | Yes | No (LLM calls) | No | No | ~75% |
| Azure Groundedness | No (cloud only) | No (GPT-4o only) | No | No | ~90% |
| Vertex Grounding | No (cloud only) | No (Gemini only) | No | No | ~88% |
| Anthropic Citations | No (cloud only) | No (Claude only) | No | No | |

¹ RAGTruth and HaluEval benchmarks pending; see benchmarks/RESULTS.md for download instructions.

Full methodology: benchmarks/RESULTS.md

How We Compare

| | Athena | Ragas | Azure Groundedness | LettuceDetect |
| --- | --- | --- | --- | --- |
| Runtime detection | Yes | Offline eval | Yes | Offline eval |
| Sentence-level + spans | Yes | Answer-level | No | Span-level |
| Works offline / local | Yes | Yes | No (cloud) | Yes |
| Provider-neutral | Yes | Partial | No (GPT-4o only) | Yes |
| Latency budget knob | Yes | No | No | No |
| Open source | Yes | Yes | No | Yes |
| No external API required | Yes | No (LLM calls) | No | Yes |

Ragas and TruLens are great for offline evaluation. Azure/Vertex/Anthropic detectors work only in their own cloud. Athena is the runtime guardrail for everywhere else — any model, any stack, fully offline.

Integrations

LangChain

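from langchain.chains import RetrievalQA
from athena_verify.integrations.langchain import VerifyingLLM

# llm and retriever are your existing LangChain LLM and retriever objects.
chain = RetrievalQA.from_llm(
    VerifyingLLM(llm, retriever=retriever, on_unsupported="re-retrieve", max_retries=2),
    retriever=retriever,
)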
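
The `on_unsupported="re-retrieve"` and `max_retries` options suggest a retry loop around retrieval and generation. The sketch below shows one plausible shape for such a loop; it is an assumption, not the integration's actual code, and `answer_with_retries` plus its three callables are hypothetical.

```python
from typing import Callable


def answer_with_retries(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str, list[str]], str],
    is_grounded: Callable[[str, list[str]], bool],
    max_retries: int = 2,
) -> tuple[str, bool]:
    """Generate an answer, re-retrieving up to max_retries times if it is ungrounded."""
    answer = ""
    for attempt in range(max_retries + 1):
        chunks = retrieve(question, attempt)   # attempt number lets retrieval widen
        answer = generate(question, chunks)
        if is_grounded(answer, chunks):
            return answer, True
    return answer, False  # still ungrounded after all retries


# Toy callables: the first retrieval misses, the second finds the right chunk.
def toy_retrieve(question: str, attempt: int) -> list[str]:
    return ["cap is $1M"] if attempt >= 1 else ["unrelated"]


def toy_generate(question: str, chunks: list[str]) -> str:
    return "cap is $1M"


def toy_grounded(answer: str, chunks: list[str]) -> bool:
    return answer in chunks


answer, ok = answer_with_retries("What is the cap?", toy_retrieve, toy_generate, toy_grounded)
```

Returning the grounded flag alongside the answer leaves the caller free to surface a still-unsupported answer for review instead of discarding it.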

LlamaIndex

from athena_verify.integrations.llamaindex import VerifyingPostprocessor

engine = index.as_query_engine(
    response_postprocessors=[VerifyingPostprocessor()]
)

LangGraph (agent circuit-breaker)

from athena_verify.integrations.langgraph import VerifyStepNode

graph.add_node("verify", VerifyStepNode(threshold=0.8))

OpenAI / Anthropic

from athena_verify import verified_completion

result = verified_completion(
    model="gpt-4o",
    question="What is the indemnification cap?",
    context=retrieved_chunks,
)

API

verify(question, answer, context, ...)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| question | str | required | The original question |
| answer | str | required | The LLM-generated answer |
| context | list[str] | required | Retrieved context chunks |
| nli_model | str | nli-deberta-v3-base | Cross-encoder model |
| use_llm_judge | bool | False | Enable LLM judge for all sentences |
| trust_threshold | float | 0.70 | Minimum trust to pass |
| latency_budget_ms | int \| None | None | Hard latency cap; ≤100 skips LLM judge entirely |

Returns VerificationResult with trust_score, sentences (each with supporting_spans), supported, unsupported, and verification_passed.

verify_step(claim, evidence, threshold)

Circuit-breaker primitive for agent pipelines:

from athena_verify import verify_step

step = verify_step(
    claim="The contract was signed on 2024-01-15",
    evidence=retrieved_chunks,
    threshold=0.8,
)
# step.passed: bool, step.trust_score: float, step.action: "continue" | "halt"
if step.action == "halt":
    raise ValueError(f"Ungrounded claim blocked (trust={step.trust_score:.2f})")

See examples/agent_circuit_breaker.py for a full LangGraph example.

Examples

| Example | Description |
| --- | --- |
| examples/quickstart.py | 5-minute getting started |
| examples/langchain_example.py | LangChain RetrievalQA with self-healing re-retrieve |
| examples/llamaindex_example.py | LlamaIndex query engine |
| examples/agent_circuit_breaker.py | LangGraph agent with verify_step halt |

Contributing

See CONTRIBUTING.md. PRs welcome — especially benchmark results, new integrations, and NLI model improvements.

License

MIT
