Fix/false positive calibration by RahulModugula · Pull Request #5 · RahulModugula/athena

RahulModugula · 2026-06-28T03:04:38Z

No description provided.

Sentence-level NLI collapsed to ~0 on faithful answers in two common cases: a sentence opening with an anaphor ("This cap applies...") lost its antecedent once the answer was split, and facts spread across several context sentences matched no single sentence-unit premise.

- Prepend the previous sentence when a hypothesis starts with a pronoun or discourse marker, restoring the referent before NLI scoring.
- Score each sentence against individual context sentences *and* the whole chunk, taking the max.
- Share one _ground_sentences helper across verify / verify_async / verify_batch / verify_batch_async / verify_stream, removing the old concatenate-all-chunks premise that silently truncated at the model's token limit. All entry points now return identical results and populate supporting_spans.
- Resolve the entailment class index from the model's id2label instead of hardcoding it, so non-default NLI checkpoints aren't scored on the wrong class.
- Make the regex sentence splitter abbreviation-aware (Dr., U.S., Inc.).

Synthetic benchmark: faithful false-positive rate 16.9% -> 11.5%, overall F1 91.3% -> 93.5%, recall unchanged at 96.7%.
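The first two fixes can be illustrated with a minimal sketch. It is not the code from `athena_verify/core.py`: the opener list, the function names, and the `toy_score` stand-in (a word-overlap ratio used here in place of a real NLI model) are all assumptions made for the example.

```python
import re

# Hypothetical opener list: the commit does not enumerate the words it checks.
ANAPHORIC_OPENERS = {
    "this", "that", "these", "those", "it", "they", "he", "she",
    "however", "therefore", "additionally",
}


def with_antecedent(sentences):
    """Prepend the previous sentence to any sentence that opens with an anaphor."""
    hypotheses = []
    for i, sent in enumerate(sentences):
        match = re.match(r"[A-Za-z']+", sent.strip())
        first_word = match.group(0).lower() if match else ""
        if i > 0 and first_word in ANAPHORIC_OPENERS:
            hypotheses.append(f"{sentences[i - 1]} {sent}")
        else:
            hypotheses.append(sent)
    return hypotheses


def best_support(hypothesis, context_sentences, chunk, score):
    """Score against each context sentence and the whole chunk; keep the max."""
    premises = [*context_sentences, chunk]
    return max(score(premise, hypothesis) for premise in premises)


def toy_score(premise, hypothesis):
    """Stand-in for an NLI entailment score: share of hypothesis words in the premise."""
    hyp = set(re.findall(r"\w+", hypothesis.lower()))
    prem = set(re.findall(r"\w+", premise.lower()))
    return len(hyp & prem) / len(hyp) if hyp else 0.0
```

With this stand-in scorer, a claim whose facts are split across two context sentences scores higher against the whole chunk than against either sentence alone, which is the situation the max-over-premises change targets.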

- Annotate print_table and import VerificationResult in the CLI.
- Treat crewai as an optional dependency in the mypy config and drop a now-unused type-ignore.
- Sort imports (ruff I001) across the package and tests.

ruff, mypy --strict, and the full test suite (140 tests) all pass.

Regenerate the synthetic eval with the false-positive fixes and refresh the README and RESULTS.md tables to match:

- Faithful false-positive rate: 17% -> 11.5% (base), 9.2% (large).
- Overall F1: 91.3% -> 93.6% (base), 93.8% (large).
- Replace the misleading "0% F1 on faithful" row (F1 is undefined with zero hallucinations) with a stated false-positive rate.

Also tidy .gitignore (add .ruff_cache, scratchpad, local settings dirs).
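The reasoning behind dropping the "0% F1 on faithful" row can be checked with a few lines of arithmetic. The sketch below treats "hallucination" as the positive class; the function names and the counts in the usage note are illustrative only and are not taken from the benchmark.

```python
def f1_or_none(tp, fp, fn):
    """F1 for the hallucination class, or None when it is undefined.

    With no true hallucinations in the split (tp + fn == 0), recall is 0/0,
    so F1 has no meaningful value.
    """
    if tp + fn == 0:
        return None
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp, tn):
    """Share of faithful sentences that were wrongly flagged."""
    total = fp + tn
    return fp / total if total else None
```

On a faithful-only split with, say, 2 wrongly flagged sentences out of 10, `f1_or_none(0, 2, 0)` returns `None` while `false_positive_rate(2, 8)` returns `0.2`, which is the quantity that row can meaningfully report.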

Standalone NLI scores many faithful paraphrases as neutral (entailment ~0) even when fully supported, which drove the bulk of the remaining false positives. Recover them without admitting hallucinations:

- Expose the contradiction class alongside entailment (batch_compute_nli), read from the model's label map so it is model-agnostic.
- For each sentence, take the contradiction of the most lexically on-topic context unit, not the global max, so an unrelated unit can't veto a faithful claim while genuine reversals still fire.
- Add lexical containment and numeric-consistency signals (overlap.py).
- apply_grounding_rescue lifts a not-entailed sentence to PARTIAL only when it is not contradicted, every number in it appears in the context, and most of its content words are grounded, which gates out number swaps and contradictions.
- Share the logic across all five verify entry points.

Synthetic benchmark (base model): faithful false-positive rate 17% -> 4.6% (3.4% on large), overall F1 91.3% -> 95.0%, recall ~96%. Adds tests/test_rescue.py; README and RESULTS.md updated to match.
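The rescue gate described above can be sketched roughly as follows. This is an illustration of the guard order, not the contents of `overlap.py` or `calibration.py`: the function names, the stopword list, every threshold, and the `(score, rescued)` return shape are assumptions.

```python
import re

# Hypothetical stopword list, used only to approximate "content words".
STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "on", "is", "are", "was", "were",
    "and", "or", "for", "by", "with", "its", "it", "at",
}


def content_words(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def containment(claim, context):
    """Fraction of the claim's content words that also occur in the context."""
    words = content_words(claim)
    if not words:
        return 0.0
    return len(words & content_words(context)) / len(words)


def numbers_grounded(claim, context):
    """True when every number in the claim also appears in the context."""
    def numbers(text):
        return set(re.findall(r"\d+(?:\.\d+)?", text))
    return numbers(claim) <= numbers(context)


def rescue(entail, contra, claim, context, *, strong_entail=0.5,
           max_contra=0.5, min_containment=0.7, partial_score=0.5):
    """Return (score, rescued). The score is only ever raised, never lowered."""
    if entail >= strong_entail:        # already entailed: nothing to rescue
        return entail, False
    if contra >= max_contra:           # contradicted: never rescue
        return entail, False
    if not numbers_grounded(claim, context):   # blocks number swaps
        return entail, False
    if containment(claim, context) < min_containment:
        return entail, False
    return max(entail, partial_score), True
```

Under these assumed thresholds, a paraphrase that repeats the context's figure is lifted, while the same sentence with a swapped number or a high contradiction score keeps its original low score.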

The NLI cross-encoder is the core of the library, but sentence-transformers lived in the optional [nli] extra, so a fresh `pip install athena-verify` followed by verify() raised ImportError — contradicting the README. Ship it by default so the documented one-liner install works cold. The [nli] extra is kept (now empty) for backwards compatibility. Verified: clean build (twine check passes) and a from-scratch venv install + examples/quickstart.py run end-to-end.

Rework examples/agent_circuit_breaker.py into a realistic 4-step financial research agent: a hallucinated "35% net margin" (the filing says 22%) trips the verify_step() circuit breaker at step 3, so the BUY recommendation built on it is never produced. Silence the ML stack's load report / progress bars and warm the model up front so the run is clean. Add assets/circuit_breaker.gif (rendered from the demo) and feature it near the top of the README — the cascade-prevention story is the launch narrative. Also sort imports in the LangChain example.

Copilot

Pull request overview

This PR reduces false-positive “unsupported” flags by improving grounding signals beyond plain entailment: it adds contradiction-aware NLI outputs, a guarded lexical/numeric “rescue” path for neutral-but-grounded paraphrases, anaphora windowing, and more robust sentence splitting; it also updates tests, docs/benchmarks, and examples to match.

Changes:

- Add 2D NLI scoring (entailment, contradiction) plus entailment/contradiction label-index resolution, and wire it through core verification paths.
- Introduce grounding-rescue calibration using lexical containment + numeric consistency gates, and apply it consistently across verify entry points.
- Update docs/benchmarks and add an “agent circuit breaker” example/demo assets; refresh tests to patch the new NLI API.

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `athena_verify/core.py` | Refactors grounding to use per-unit + whole-chunk premises, anaphora windowing, contradiction signal, supporting spans, and rescue-aware trust/status. |
| `athena_verify/nli.py` | Adds label-map–based entailment/contradiction index resolution and introduces `batch_compute_nli()` returning `(entail, contra)`. |
| `athena_verify/calibration.py` | Adds rescue thresholds and `apply_grounding_rescue()` to lift neutral-but-grounded sentences. |
| `athena_verify/overlap.py` | Adds `containment_score()` and `numeric_consistency()` used by the rescue path. |
| `athena_verify/parser.py` | Improves regex fallback sentence splitter with abbreviation awareness. |
| `athena_verify/cli.py` | Adds type annotation for `print_table` argument. |
| `athena_verify/__init__.py` | Re-formats imports for readability/consistency. |
| `athena_verify/integrations/langgraph.py` | Switches `Callable` import to `collections.abc`. |
| `athena_verify/integrations/crewai.py` | Tweaks typing ignores / fallback behavior for optional dependency import. |
| `tests/test_verify.py` | Updates autouse NLI mocking and tightens the latency-budget test to avoid unpatched calls. |
| `tests/test_new_features.py` | Updates autouse NLI mocking for the new `batch_compute_nli` shape. |
| `tests/test_supporting_spans.py` | Updates span tests to patch `batch_compute_nli` with `(entail, contra)` tuples. |
| `tests/test_rescue.py` | Adds new unit tests for containment/numeric gate and rescue behavior. |
| `tests/test_nli.py` | Updates model-cache fixture to work with `@lru_cache`’d loaders/indexes. |
| `README.md` | Adds circuit-breaker section and updates performance/false-positive claims and explanation. |
| `benchmarks/RESULTS.md` | Updates benchmark date, metrics, and methodology description to match new grounding logic. |
| `examples/agent_circuit_breaker.py` | Expands the circuit-breaker demo and silences ML stack output for cleaner UX. |
| `examples/langchain_example.py` | Reorders imports. |
| `assets/circuit_breaker.tape` | Adds VHS tape script to regenerate the demo GIF. |
| `.gitignore` | Ignores additional tooling caches and demo recording artifacts/settings dirs. |
| `pyproject.toml` | Promotes `sentence-transformers` to a core dependency and keeps `[nli]` extra for backwards compatibility. |
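The label-map-based index resolution listed for `athena_verify/nli.py` might look something like the sketch below. Hugging Face model configs expose an `id2label` mapping from class index to label string; the helper name `label_index` and the two example mappings are hypothetical and only show that class order can differ between checkpoints.

```python
def label_index(id2label, wanted):
    """Find the class index whose label matches `wanted`, ignoring case.

    Raises ValueError when the label map has no such class (for example a
    checkpoint that only exposes generic LABEL_0 / LABEL_1 names).
    """
    for idx, label in id2label.items():
        if str(label).strip().lower() == wanted.lower():
            return int(idx)
    raise ValueError(f"no {wanted!r} class in label map: {id2label}")


# Two illustrative label maps with different class orders.
checkpoint_a = {0: "contradiction", 1: "entailment", 2: "neutral"}
checkpoint_b = {0: "ENTAILMENT", 1: "NEUTRAL", 2: "CONTRADICTION"}
```

A hardcoded entailment index of 1 would be right for the first mapping and would silently read the neutral class for the second, which is the failure mode the lookup avoids.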


+    nli_pairs = [(unit, hyp) for hyp in hypotheses for unit in units]
+    flat = batch_compute_nli(nli_pairs, model_name=nli_model)
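Because the pairs in the two lines above are built hypothesis-major, the result for hypothesis `i` and unit `j` sits at position `i * len(units) + j` in the flat output. A small sketch of regrouping that output, assuming (this is not confirmed by the snippet) that `flat` is a list with one `(entail, contra)` tuple per pair; the helper name is hypothetical:

```python
def regroup(flat, n_hypotheses, n_units):
    """Split a flat, hypothesis-major result list into one row per hypothesis."""
    if len(flat) != n_hypotheses * n_units:
        raise ValueError("flat result length does not match the pair grid")
    return [flat[i * n_units:(i + 1) * n_units] for i in range(n_hypotheses)]


def best_entailment(rows):
    """Highest entailment score per hypothesis across its context units."""
    return [max(entail for entail, _contra in row) for row in rows]
```

Only entailment is reduced with a plain max here; the commit message describes picking contradiction from the most lexically on-topic unit instead, so the per-unit rows need to be kept around for that step.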


+    Only ever raises the score, and only when all guards pass:
+      - the claim is not contradicted by any context unit,
+      - it is not already strongly entailed (nothing to rescue),
+      - its content words are heavily present in the context, and
+      - every number in it appears in the context.


+    get_nli_model and entailment_index are both @lru_cache'd, so clear them
+    around the patch to keep tests isolated.
+    """
+    nli_module.get_nli_model.cache_clear()
+    nli_module.entailment_index.cache_clear()
+    models: dict[str, object] = {}
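Why the fixture clears the caches can be seen in a standalone sketch. `functools.lru_cache` wrappers expose `cache_clear()`, and without it a value cached by one test stays visible to the next even after the underlying loader is patched. The `load_model` function and `CALLS` list below are hypothetical stand-ins, not the library's loader.

```python
from functools import lru_cache

CALLS = []  # records every real (uncached) load, for demonstration


@lru_cache(maxsize=None)
def load_model(name):
    """Pretend to load an expensive model; cached per name."""
    CALLS.append(name)
    return f"model:{name}"


def demo():
    CALLS.clear()
    load_model.cache_clear()
    load_model("base")
    load_model("base")            # cache hit: no second real load
    loads_before_clear = len(CALLS)
    load_model.cache_clear()      # what the fixture does around the patch
    load_model("base")            # cache miss again: a fresh load
    return loads_before_clear, len(CALLS)
```

Calling `demo()` shows one real load before the clear and two after it, so clearing on both sides of a patch keeps a mocked model from leaking into later tests.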


+    nli_module.get_nli_model.cache_clear()
+    nli_module.entailment_index.cache_clear()


+"""Tests for the grounding-rescue path: containment, numeric gate, and the
+contradiction-vetoed rescue that recovers faithful paraphrases NLI scores low.
+"""


RahulModugula added 6 commits June 27, 2026 18:47

Copilot AI review requested due to automatic review settings June 28, 2026 03:04

Copilot started reviewing on behalf of RahulModugula June 28, 2026 03:04

Copilot AI reviewed Jun 28, 2026


RahulModugula merged commit a32f6f5 into main Jun 28, 2026
4 checks passed

RahulModugula merged 6 commits into main from fix/false-positive-calibration
