Fix/false positive calibration #5
Merged
Sentence-level NLI collapsed to ~0 on faithful answers in two common
cases: a sentence opening with an anaphor ("This cap applies...") lost
its antecedent once the answer was split, and facts spread across
several context sentences matched no single sentence-unit premise.
- Prepend the previous sentence when a hypothesis starts with a pronoun
or discourse marker, restoring the referent before NLI scoring.
- Score each sentence against individual context sentences *and* the
whole chunk, taking the max.
- Share one _ground_sentences helper across verify / verify_async /
verify_batch / verify_batch_async / verify_stream, removing the old
concatenate-all-chunks premise that silently truncated at the model's
token limit. All entry points now return identical results and
populate supporting_spans.
- Resolve the entailment class index from the model's id2label instead
of hardcoding it, so non-default NLI checkpoints aren't scored on the
wrong class.
- Make the regex sentence splitter abbreviation-aware (Dr., U.S., Inc.).
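The abbreviation-aware split in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in parser.py: the `ABBREVIATIONS` set and the `split_sentences` name are assumptions, and splitting on terminal punctuation and then re-joining pieces that end in a known abbreviation is only one simple way to get the described behaviour.

```python
import re

# Illustrative list only; the set actually used by the library is not shown here.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "u.s.", "e.g.", "i.e."}


def split_sentences(text: str) -> list[str]:
    """Split on ., ! or ? followed by whitespace, then undo splits after abbreviations."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences: list[str] = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            # The "sentence end" was really an abbreviation: keep accumulating.
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "Dr. Smith joined Acme Inc. in 2020. She moved to the U.S. last year."
sentences = split_sentences(text)
```

A known limitation of this sketch: a sentence that genuinely ends in an abbreviation is merged with the one after it.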
Synthetic benchmark: faithful false-positive rate 16.9% -> 11.5%,
overall F1 91.3% -> 93.5%, recall unchanged at 96.7%.
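The first two changes in this commit (anaphora windowing and max-over-premises scoring) can be sketched as below. This is a hedged illustration rather than the library's `_ground_sentences`: `ANAPHORIC_OPENERS`, `with_antecedent`, `best_support`, and `toy_score` are hypothetical names, and `toy_score` is a token-overlap stand-in for the real NLI model so the example runs without one.

```python
import re
from collections.abc import Callable

# Illustrative opener list (pronouns and discourse markers); an assumption.
ANAPHORIC_OPENERS = {
    "this", "that", "these", "those", "it", "they", "he", "she",
    "however", "therefore", "moreover",
}


def with_antecedent(sentences: list[str]) -> list[str]:
    """Prepend the previous sentence when a sentence opens with an anaphor."""
    hypotheses: list[str] = []
    for i, sent in enumerate(sentences):
        words = re.findall(r"[A-Za-z']+", sent)
        first = words[0].lower() if words else ""
        if i > 0 and first in ANAPHORIC_OPENERS:
            hypotheses.append(f"{sentences[i - 1]} {sent}")
        else:
            hypotheses.append(sent)
    return hypotheses


def best_support(
    hypothesis: str,
    units: list[str],
    chunk: str,
    score: Callable[[str, str], float],
) -> float:
    """Score against each context sentence and the whole chunk; keep the max."""
    return max(score(premise, hypothesis) for premise in [*units, chunk])


def toy_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of hypothesis tokens present in the premise."""
    hyp = set(re.findall(r"\w+", hypothesis.lower()))
    prem = set(re.findall(r"\w+", premise.lower()))
    return len(hyp & prem) / len(hyp) if hyp else 0.0


units = ["Alice founded Acme.", "Acme is based in Oslo."]
chunk = " ".join(units)
claim = "Alice founded Acme in Oslo."
per_unit_best = max(toy_score(u, claim) for u in units)
combined = best_support(claim, units, chunk, toy_score)
```

Here no single unit covers the claim, but the whole chunk does, which is the multi-sentence case the commit message describes.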
- Annotate print_table and import VerificationResult in the CLI.
- Treat crewai as an optional dependency in the mypy config and drop a
now-unused type-ignore.
- Sort imports (ruff I001) across the package and tests.
ruff, mypy --strict, and the full test suite (140 tests) all pass.
Regenerate the synthetic eval with the false-positive fixes and refresh
the README and RESULTS.md tables to match:
- Faithful false-positive rate: 17% -> 11.5% (base), 9.2% (large).
- Overall F1: 91.3% -> 93.6% (base), 93.8% (large).
- Replace the misleading "0% F1 on faithful" row (F1 is undefined with
zero hallucinations) with a stated false-positive rate.
Also tidy .gitignore (add .ruff_cache, scratchpad, local settings dirs).
Standalone NLI scores many faithful paraphrases as neutral (entailment
~0) even when fully supported, which drove the bulk of the remaining
false positives. Recover them without admitting hallucinations:
- Expose the contradiction class alongside entailment
(batch_compute_nli), read from the model's label map so it is
model-agnostic.
- For each sentence, take the contradiction of the most lexically
on-topic context unit (not the global max), so an unrelated unit can't
veto a faithful claim while genuine reversals still fire.
- Add lexical containment and numeric-consistency signals (overlap.py).
- apply_grounding_rescue lifts a not-entailed sentence to PARTIAL only
when it is not contradicted, every number in it appears in the context,
and most of its content words are grounded, gating out number swaps and
contradictions.
- Share the logic across all five verify entry points.
Synthetic benchmark (base model): faithful false-positive rate 17% ->
4.6% (3.4% on large), overall F1 91.3% -> 95.0%, recall ~96%.
Adds tests/test_rescue.py; README and RESULTS.md updated to match.
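The rescue gate described in this commit can be sketched as below. Treat it as an assumption-laden illustration: `containment_score`, `numeric_consistency`, and the PARTIAL status are named in this PR, but their signatures, the stopword list, the thresholds, the other status strings, and the helper names `rescue_status` and `on_topic_contradiction` are all hypothetical.

```python
import re

# Illustrative stopword list; an assumption, not the library's.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "is", "was",
             "were", "are", "and", "or", "it", "its", "that", "this"}


def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def numbers(text: str) -> set[str]:
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def containment_score(sentence: str, context: str) -> float:
    """Fraction of the sentence's content words that occur in the context."""
    words = content_words(sentence)
    return len(words & content_words(context)) / len(words) if words else 0.0


def numeric_consistency(sentence: str, context: str) -> bool:
    """True when every number in the sentence also appears in the context."""
    return numbers(sentence) <= numbers(context)


def on_topic_contradiction(sentence: str, units: list[str], contra: list[float]) -> float:
    """Contradiction score of the most lexically on-topic unit, not the global max."""
    if not units:
        return 0.0
    best = max(range(len(units)), key=lambda i: containment_score(sentence, units[i]))
    return contra[best]


def rescue_status(sentence: str, context: str, entail: float, contra: float, *,
                  entail_min: float = 0.5, contra_max: float = 0.5,
                  containment_min: float = 0.7) -> str:
    """Lift a not-entailed sentence to PARTIAL only when every guard passes."""
    if entail >= entail_min:
        return "SUPPORTED"  # nothing to rescue
    rescued = (
        contra < contra_max
        and numeric_consistency(sentence, context)
        and containment_score(sentence, context) >= containment_min
    )
    return "PARTIAL" if rescued else "UNSUPPORTED"


context = "The company reported a net margin of 22% for fiscal 2023."
paraphrase = rescue_status("Net margin for fiscal 2023 was 22%.", context, 0.02, 0.01)
number_swap = rescue_status("Net margin for fiscal 2023 was 35%.", context, 0.02, 0.01)
```

The number swap fails the numeric gate even though its wording overlaps heavily with the context, which is the property the commit relies on to avoid admitting hallucinations.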
The NLI cross-encoder is the core of the library, but sentence-transformers lived in the optional [nli] extra, so a fresh `pip install athena-verify` followed by verify() raised ImportError — contradicting the README. Ship it by default so the documented one-liner install works cold. The [nli] extra is kept (now empty) for backwards compatibility. Verified: clean build (twine check passes) and a from-scratch venv install + examples/quickstart.py run end-to-end.
Rework examples/agent_circuit_breaker.py into a realistic 4-step financial research agent: a hallucinated "35% net margin" (the filing says 22%) trips the verify_step() circuit breaker at step 3, so the BUY recommendation built on it is never produced. Silence the ML stack's load report / progress bars and warm the model up front so the run is clean. Add assets/circuit_breaker.gif (rendered from the demo) and feature it near the top of the README — the cascade-prevention story is the launch narrative. Also sort imports in the LangChain example.
Pull request overview
This PR reduces false-positive “unsupported” flags by improving grounding signals beyond plain entailment: it adds contradiction-aware NLI outputs, a guarded lexical/numeric “rescue” path for neutral-but-grounded paraphrases, anaphora windowing, and more robust sentence splitting; it also updates tests, docs/benchmarks, and examples to match.
Changes:
- Add 2D NLI scoring (entailment, contradiction) plus entailment/contradiction label-index resolution, and wire it through core verification paths.
- Introduce grounding-rescue calibration using lexical containment + numeric consistency gates, and apply it consistently across verify entry points.
- Update docs/benchmarks and add an “agent circuit breaker” example/demo assets; refresh tests to patch the new NLI API.
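For readers unfamiliar with label-index resolution, a minimal sketch follows. It assumes the model config exposes an `id2label` mapping from class index to label name (as Hugging Face model configs do); the `label_index` helper and the two label maps are hypothetical and not taken from any specific checkpoint.

```python
def label_index(id2label: dict[int, str], wanted: str) -> int:
    """Return the class index whose label matches `wanted`, case-insensitively."""
    for idx, label in id2label.items():
        if label.strip().lower() == wanted.lower():
            return int(idx)
    raise ValueError(f"no {wanted!r} class in label map: {id2label}")


# Two hypothetical checkpoints that order their classes differently.
map_a = {0: "contradiction", 1: "entailment", 2: "neutral"}
map_b = {0: "ENTAILMENT", 1: "NEUTRAL", 2: "CONTRADICTION"}

entail_a = label_index(map_a, "entailment")
entail_b = label_index(map_b, "entailment")
contra_b = label_index(map_b, "contradiction")
```

A hardcoded index would read the wrong column of logits for one of these two maps, which is the failure mode the change guards against.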
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `athena_verify/core.py` | Refactors grounding to use per-unit + whole-chunk premises, anaphora windowing, contradiction signal, supporting spans, and rescue-aware trust/status. |
| `athena_verify/nli.py` | Adds label-map–based entailment/contradiction index resolution and introduces batch_compute_nli() returning (entail, contra). |
| `athena_verify/calibration.py` | Adds rescue thresholds and apply_grounding_rescue() to lift neutral-but-grounded sentences. |
| `athena_verify/overlap.py` | Adds containment_score() and numeric_consistency() used by the rescue path. |
| `athena_verify/parser.py` | Improves regex fallback sentence splitter with abbreviation awareness. |
| `athena_verify/cli.py` | Adds type annotation for print_table argument. |
| `athena_verify/__init__.py` | Re-formats imports for readability/consistency. |
| `athena_verify/integrations/langgraph.py` | Switches Callable import to collections.abc. |
| `athena_verify/integrations/crewai.py` | Tweaks typing ignores / fallback behavior for optional dependency import. |
| `tests/test_verify.py` | Updates autouse NLI mocking and tightens the latency-budget test to avoid unpatched calls. |
| `tests/test_new_features.py` | Updates autouse NLI mocking for the new batch_compute_nli shape. |
| `tests/test_supporting_spans.py` | Updates span tests to patch batch_compute_nli with (entail, contra) tuples. |
| `tests/test_rescue.py` | Adds new unit tests for containment/numeric gate and rescue behavior. |
| `tests/test_nli.py` | Updates model-cache fixture to work with @lru_cache’d loaders/indexes. |
| `README.md` | Adds circuit-breaker section and updates performance/false-positive claims and explanation. |
| `benchmarks/RESULTS.md` | Updates benchmark date, metrics, and methodology description to match new grounding logic. |
| `examples/agent_circuit_breaker.py` | Expands the circuit-breaker demo and silences ML stack output for cleaner UX. |
| `examples/langchain_example.py` | Reorders imports. |
| `assets/circuit_breaker.tape` | Adds VHS tape script to regenerate the demo GIF. |
| `.gitignore` | Ignores additional tooling caches and demo recording artifacts/settings dirs. |
| `pyproject.toml` | Promotes sentence-transformers to a core dependency and keeps [nli] extra for backwards compatibility. |
Comment on lines +175 to +176

    nli_pairs = [(unit, hyp) for hyp in hypotheses for unit in units]
    flat = batch_compute_nli(nli_pairs, model_name=nli_model)
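Because the comprehension above iterates hypotheses in the outer loop and units in the inner loop, the flat result is hypothesis-major: the scores for hypothesis `h` occupy a contiguous slice of length `len(units)`. A small sketch of regrouping it, assuming `flat` has one entry per pair in input order (the `regroup` helper is hypothetical, and the pairs themselves stand in for real scores):

```python
from typing import TypeVar

T = TypeVar("T")


def regroup(flat: list[T], n_hypotheses: int, n_units: int) -> list[list[T]]:
    """Turn a hypothesis-major flat list into one row of unit scores per hypothesis."""
    if len(flat) != n_hypotheses * n_units:
        raise ValueError("flat length does not match hypotheses x units")
    return [flat[h * n_units:(h + 1) * n_units] for h in range(n_hypotheses)]


hypotheses = ["h0", "h1"]
units = ["u0", "u1", "u2"]
nli_pairs = [(unit, hyp) for hyp in hypotheses for unit in units]

# Stand-in for model output: echo the pairs so the ordering is visible.
rows = regroup(nli_pairs, len(hypotheses), len(units))
```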
Comment on lines +80 to +84

    Only ever raises the score, and only when all guards pass:
    - the claim is not contradicted by any context unit,
    - it is not already strongly entailed (nothing to rescue),
    - its content words are heavily present in the context, and
    - every number in it appears in the context.
Comment on lines +75 to +80

    get_nli_model and entailment_index are both @lru_cache'd, so clear them
    around the patch to keep tests isolated.
    """
    nli_module.get_nli_model.cache_clear()
    nli_module.entailment_index.cache_clear()
    models: dict[str, object] = {}
Comment on lines +88 to +89

    nli_module.get_nli_model.cache_clear()
    nli_module.entailment_index.cache_clear()
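The clear-before-and-after pattern in these excerpts can be shown with a stand-alone sketch. `functools.lru_cache` wrappers expose `cache_clear()` and `cache_info()`; everything else here (`get_model`, the `cleared` context manager) is a hypothetical stand-in for the test fixture, not the code in tests/test_nli.py.

```python
from collections.abc import Iterator
from contextlib import contextmanager
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model(name: str) -> object:
    """Stand-in for an expensive, cached model loader."""
    return object()


@contextmanager
def cleared(*cached_funcs) -> Iterator[None]:
    """Clear the given lru_cache'd functions on entry and again on exit."""
    for func in cached_funcs:
        func.cache_clear()
    try:
        yield
    finally:
        # Clearing on exit keeps objects created inside the block (for
        # example mocks) from leaking into later tests through the cache.
        for func in cached_funcs:
            func.cache_clear()


first = get_model("base")
same = get_model("base")       # served from the cache

with cleared(get_model):
    inside = get_model("base")  # cache was emptied, so this is a new object

after = get_model("base")       # cache emptied again on exit
```

Without the second clear, whatever the patched loader returned inside the block would still be served to the next test that asks for the same model name.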
Comment on lines +1 to +3

    """Tests for the grounding-rescue path: containment, numeric gate, and the
    contradiction-vetoed rescue that recovers faithful paraphrases NLI scores low.
    """