RAG retrieves chunks. CAC satisfies evidence requirements.
Context Admission Control (CAC) is a research prototype for evidence admission under budget. It is designed for evidence-sensitive AI workflows where the model should not simply read the most relevant chunks — it should reason from the right evidence, in the right representation, with an audit trail.
CAC changes the unit of context from:
retrieved chunk
to:
satisfied evidence requirement
This repository includes:
- the CAC packet-building prototype,
- DecisionRiskBench v1.4,
- RAG-style baselines,
- packaged benchmark outputs,
- rewrite and stress suites,
- no-gold-admission tests,
- and an external LLM prompt/export harness for future answer evaluation.
Classic RAG usually follows this pattern:
query → retrieve chunks → stuff context → answer
That works for many lookup tasks. But evidence-sensitive decisions often require more than chunk relevance:
- Which source is authoritative?
- Is the evidence current?
- Is exact wording required?
- What evidence is missing?
- Which sources contradict each other?
- What should be excluded as stale, low-authority, or distracting?
- How much model context should this evidence cost?
CAC adds a context-control layer:
task → evidence requirements → candidate pool → evidence valuation
→ representation selection → evidence packet → answer
The core claim:
RAG optimizes chunk relevance. CAC optimizes evidence sufficiency under budget.
New here? Start with START_HERE.md.
| RAG | CAC |
|---|---|
| Retrieves chunks | Satisfies evidence requirements |
| Optimizes relevance | Optimizes sufficiency under budget |
| Stuffs raw text | Chooses representations |
| Usually cites retrieved docs | Audits admission/exclusion decisions |
| Often answers from what it found | Reports missing evidence |
| More context can mean more noise | Context is governed by evidence policy |
CAC is for builders working on:
- enterprise AI assistants
- auditable QA
- decision support
- compliance / security / contract workflows
- post-RAG context engineering
- LLM evaluation and retrieval benchmarks
CAC is probably overkill for:
- simple FAQ bots
- single-document lookup
- low-stakes retrieval demos
CAC outputs an evidence packet, not a bag of chunks.
An evidence packet can contain:
structured facts
summaries
exact excerpts
conflicts
uncertainties
excluded evidence
audit trace
filled evidence requirements
missing evidence requirements
Example:
{
"structured_facts": [
{
"source": "billing_row_044",
"fact": {
"invoice_status": "47_days_overdue",
"outstanding_balance": 184000
}
}
],
"exact_evidence": [
{
"source": "contract_184_section_12_2",
"text": "Non-payment of undisputed fees for more than forty-five days constitutes material breach after notice."
}
],
"conflicts": [
{
"issue": "CRM says the account is healthy; billing and support indicate risk."
}
],
"uncertainties": [
"No executive sponsor signal found."
],
"excluded_evidence": [
{
"source": "slack_thread_332",
"reason": "Low-authority speculation without attached source."
}
]
}Design rule:
Compress facts. Preserve language when wording carries obligation, ambiguity, or risk.
This is a research prototype validated on a synthetic benchmark.
Nine independent test scenarios, ~4,900 real-model LLM inferences (phi-3-mini-4k-instruct), and ~4,900 LLM-as-judge evaluations consistently show CAC outperforming all non-oracle RAG baselines on evidence-sensitive decision quality. The findings hold under adversarial conditions and under conditions explicitly designed to favor RAG.
What the evidence demonstrates:
- CAC produces safer, more complete answers than all tested RAG variants on evidence-sensitive decisions
- The advantage holds under distractor flood (d=100), budget crunch (budget=80), metadata corruption (noise=0.50), and in clean-signal conditions where RAG should be strongest
- Both an independent lexical scorer and a separate LLM judge confirm the same method rankings across all scenarios
- No tested method — including the oracle baseline that receives ground-truth candidate lists — beats CAC on overall safe rate in any scenario
- Oracle still leads CAC on contract termination per-task (4.95 vs 4.90 LLM judge), reflecting CAC's correct withholding on incomplete-evidence cases, not an evidence quality failure
What remains unvalidated:
- Performance on real enterprise documents — all benchmark data is synthetically generated
- Behavior with models other than phi-3-mini-4k-instruct
- Production-scale latency, throughput, and integration
- Generalization beyond the four tested task types
On DecisionRiskBench v1.4, CAC achieves stronger aggregate decision-grade and real-LLM answer quality than all included RAG baselines across every tested condition on the same candidate pool.
RAG is in trouble for evidence-sensitive decision work because chunk relevance is losing to evidence sufficiency under budget.
Packaged main run:
PYTHONPATH=. python -m benchmarks.decision_risk.run \
--n 20 \
--budgets 40,60,80,120,160,240,500 \
--distractors 5,25,50 \
--output-dir outputs/decision_risk_v1_4_n20Shape:
20 accounts × 4 tasks × 7 budgets × 3 distractor levels × 8 methods = 13,440 rows
Aggregate results from the packaged output:
| Method | Avg Tokens | Decision Grade | Safe Rate | Grade/1k Tokens | Distractor Rate | Contradiction Recall |
|---|---|---|---|---|---|---|
cac |
60.2 | 0.810 | 62% | 14.1 | 4.6% | 62.8% |
schema_aware_chunk_rag_k8 |
85.3 | 0.689 | 18% | 9.0 | 29.2% | 21.0% |
iterative_rag_k8 |
69.7 | 0.688 | 17% | 10.6 | 23.4% | 24.6% |
oracle_candidate_rag_k8 |
87.3 | 0.679 | 18% | 9.1 | 27.5% | 18.2% |
metadata_aware_rag_k8 |
85.2 | 0.672 | 15% | 8.7 | 29.2% | 21.0% |
heuristic_rerank_rag_k8 |
87.9 | 0.640 | 11% | 8.3 | 33.9% | 16.5% |
long_context_rag_k24 |
126.6 | 0.605 | 15% | 6.7 | 52.1% | 24.6% |
fixed_context_rag_k8 |
86.5 | 0.593 | 15% | 7.9 | 44.5% | 24.6% |
CAC is the only method to exceed 60% safe rate. Best RAG baseline tops out at 18%.
CAC uses 31% fewer tokens than the best RAG baseline while scoring 17.5pp higher on decision grade.
| Method | Hit Rate | Mean Min Budget |
|---|---|---|
cac |
54.6% | 73.1 tokens |
iterative_rag_k8 |
22.9% | 77.8 tokens |
schema_aware_chunk_rag_k8 |
22.5% | 78.5 tokens |
oracle_candidate_rag_k8 |
22.5% | 78.5 tokens |
fixed_context_rag_k8 |
20.8% | 80.0 tokens |
heuristic_rerank_rag_k8 |
8.8% | 80.0 tokens |
CAC reaches decision grade ≥ 0.9 on 54.6% of tasks — 2.4× the best RAG hit rate — and does so earlier in the budget.
CAC also wins every included task family by decision-grade score in the packaged main run:
contract termination
incident postmortem
renewal risk
security exception
DecisionRiskBench v1.4 tests context-control strategy, not retriever quality.
All compared methods use the same candidate pool. The difference is how they assemble context:
- RAG baselines admit raw chunks.
- CAC can admit compact evidence representations such as structured facts, summaries, and exact excerpts.
That distinction is intentional. CAC is being evaluated as evidence admission, not as another retriever.
This is also a methodological boundary. Stronger future baselines should include compression-aware and answer-aware RAG variants.
The generated-answer metric in this repository is a deterministic answer-readiness proxy.
It is not an LLM answer study and it is not a human preference study.
The next empirical milestone is:
RAG chunks → same LLM → answer
CAC evidence packet → same LLM → answer
human or LLM judge → score
This repository now includes packaged real-model outputs — see the LLM Answer Quality section below.
When using the external LLM prompt export, send only the prompt field to the model. Do not send gold metadata fields used later for scoring.
The following results are from a live inference run using microsoft/phi-3-mini-4k-instruct (3.82B parameters, CUDA) on 100 prompts (5 methods × 20 prompts across 5 accounts and 4 task types).
Scorer: lexical proxy (slot coverage, citation markers, hedging keywords).
Full outputs: outputs/llm_eval_real/
| Method | LLM Answer Score | Safe Rate | Contradiction Handling | Missing Disclosure |
|---|---|---|---|---|
cac |
0.8157 | 65% | 100% | 100% |
oracle_candidate_rag_k8 |
0.8088 | 60% | 100% | 100% |
schema_aware_chunk_rag_k8 |
0.7832 | 40% | 100% | 100% |
iterative_rag_k8 |
0.7282 | 15% | 90% ← failure | 100% |
fixed_context_rag_k8 |
0.6608 | 20% | 100% | 95% ← failure |
CAC's safe rate is 3.25× that of fixed-context RAG on the same candidate pool.
CAC is the only method to achieve 100% across all three safety dimensions (contradiction handling, missing disclosure, safe rate ≥ 0.80).
A second scorer — microsoft/phi-3-mini-4k-instruct acting as judge — independently rated each answer on completeness, hedging, hallucination-freedom, and an overall 1–5 score. Results:
| Method | Completeness | Hedging | Hallucination-free | Overall (1–5) |
|---|---|---|---|---|
cac |
4.80 | 4.95 | 5.00 | 4.85 |
schema_aware_chunk_rag_k8 |
4.80 | 4.90 | 5.00 | 4.80 |
oracle_candidate_rag_k8 |
4.75 | 4.75 | 5.00 | 4.75 |
iterative_rag_k8 |
4.55 | 4.95 | 5.00 | 4.55 |
fixed_context_rag_k8 |
4.40 | 4.85 | 5.00 | 4.45 |
Both the lexical scorer and the LLM judge rank the methods in the same order: CAC first.
The judge confirms that completeness is the differentiating dimension — CAC provides the most complete answers because admission control ensures the right evidence is always present.
Full scoring details:outputs/llm_eval_real/judge_report.md
Note: oracle_candidate_rag_k8 receives oracle knowledge of which candidates are relevant — an upper-bound baseline not available in real deployments. CAC matches it on answer quality without oracle access.
A harder run testing distractor robustness: 50 irrelevant chunks injected per account, n=15 accounts, budget=160 tokens.
Full outputs: outputs/llm_eval_stress/
| Method | LLM Answer Score | Safe Rate | Contradiction Handling | Missing Disclosure |
|---|---|---|---|---|
cac |
0.8242 | 63.3% | 100% | 100% |
oracle_candidate_rag_k8 |
0.7975 | 48.3% | 100% | 100% |
schema_aware_chunk_rag_k8 |
0.7406 | 26.7% | 100% | 100% |
iterative_rag_k8 |
0.7334 | 21.7% | 91.7% ← failure | 100% |
fixed_context_rag_k8 |
0.6513 | 0.0% ← collapse | 98.3% ← failure | 100% |
At 50 distractors, fixed-context RAG drops to 0% safe rate. CAC holds at 63.3% — essentially unchanged from 65% at lower distractor load.
CAC's admission-control filter is immune to distractor count by design: irrelevant chunks are rejected before the context window is built.
LLM-as-judge confirmation (phi-3-mini judging the same 300 answers):
| Method | Completeness | Hallucination-free | Overall (1–5) |
|---|---|---|---|
cac |
4.82 | 5.00 | 4.82 |
oracle_candidate_rag_k8 |
4.80 | 5.00 | 4.80 |
iterative_rag_k8 |
4.52 | 5.00 | 4.55 |
fixed_context_rag_k8 |
4.30 | 4.98 ← failure | 4.43 |
At 50 distractors,
fixed_context_ragis the only method to show hallucination failures in the judge pass (hallucination-free drops below 5.00), corroborating the lexical safe-rate collapse.
A three-scenario battery designed to simulate the hardest real-world conditions: 1,000 total LLM inferences + 1,000 judge calls across all scenarios.
Context window cut in half vs. the standard run. Greedy RAG fills the window with whatever scores highest; CAC's slot-prioritized admission control selects the most critical evidence.
Full outputs: outputs/llm_eval_budget_crunch/
| Method | LLM Answer Score | Safe Rate |
|---|---|---|
cac |
0.8327 | 70.0% |
oracle_candidate_rag_k8 |
0.8032 | 55.0% |
schema_aware_chunk_rag_k8 |
0.7876 | 50.0% |
iterative_rag_k8 |
0.7498 | 30.0% |
fixed_context_rag_k8 |
0.6376 | 3.3% ← near-collapse |
CAC's safe rate advantage widens under budget pressure: 70% vs. 3.3% for fixed-context RAG. When the context window is scarce, greedy fill wastes tokens on low-value chunks; admission control spends every token on evidence that matters.
2× the distractor density used in the stress eval. At d=50 fixed-context RAG already hit 0% — this scenario tests whether any RAG variant can survive at d=100.
Full outputs: outputs/llm_eval_extreme_noise/
| Method | LLM Answer Score | Safe Rate |
|---|---|---|
cac |
0.8067 | 57.5% |
oracle_candidate_rag_k8 |
0.7560 | 36.25% |
schema_aware_chunk_rag_k8 |
0.7522 | 33.75% |
iterative_rag_k8 |
0.7483 | 30.0% |
fixed_context_rag_k8 |
0.6452 | 0.0% ← total collapse |
At 100 distractors, fixed-context RAG collapses to 0% safe rate for the second time. CAC holds at 57.5% — 1.6× the next-best method. The LLM-as-judge also ranks CAC first on this scenario (4.76 vs. 4.64 for the next-best), the only scenario where the judge and lexical scorer agree on the top method.
50% of metadata fields (topics, risk tags) stripped or corrupted — simulating a production environment with unreliable tagging pipelines.
Full outputs: outputs/decision_risk_metadata_corruption_v3/
| Method | Answer Score | Safe Rate |
|---|---|---|
cac |
0.8969 | 95.0% |
iterative_rag_k8 |
0.8151 | 26.25% |
oracle_candidate_rag_k8 |
0.7631 | 26.25% |
schema_aware_chunk_rag_k8 |
0.7631 | 26.25% |
fixed_context_rag_k8 |
0.7379 | 18.75% |
Under 50% metadata corruption, CAC leads all methods at 95% safe rate — 3.6× oracle's 26.25%. v1.7 adds a content-based slot routing fallback: when source type matches but topic/tag metadata is corrupted by noise, CAC searches the document's noise-immune title and text for slot-relevant keywords. Document text is never modified by
apply_noise, making this fallback unconditionally reliable.Why oracle fell to 26.25%: Oracle receives gold candidate lists (bypassing metadata routing) but still admits ~43% distractors into its fixed K=8 context window. At 50% noise, injected tags on distractor documents create false positive signals that oracle's raw-chunk selection cannot filter, flooding the context with irrelevant content and causing answer quality collapse. CAC's admission filter rejects these distractors via the
metadata_distractor_signalgate before they reach the packet.The v1.7 fix:
item_matches_slotnow has a metadata-first, content-fallback structure. The primary path uses metadata (topic tags, risk classifications) for fast, precise matching. When those fields are corrupted, the fallback searchestitle + text— normalized to match compound tags ("payment_default" → "payment default") in prose. A companion fix tometadata_distractor_signaladds text-based detection for the Generic DPA distractor, which can lose its "generic_contract" tag at 50% noise but always retains the phrase "commercially reasonable" in its template text. 95% CI for CAC: [90.2%, 99.8%]. All methods' upper bounds are below 36%.
Both extreme conditions simultaneously: maximum distractor load (100 irrelevant chunks) and minimum context budget (80 tokens). The capstone scenario.
Full outputs: outputs/llm_eval_perfect_storm/
| Method | LLM Answer Score | Safe Rate |
|---|---|---|
cac |
0.8053 | 60.0% |
oracle_candidate_rag_k8 |
0.7808 | 51.7% |
schema_aware_chunk_rag_k8 |
0.7480 | 36.7% |
iterative_rag_k8 |
0.7629 | 35.0% |
fixed_context_rag_k8 |
0.6307 | 3.3% ← collapse |
At the most extreme conditions tested, CAC holds at 60% safe rate. Remarkably, this is higher than CAC's 57.5% safe rate in Scenario B (same d=100 but with 2× the token budget). Tighter budget forces more selective admission, which produces more accurate answers — the efficiency ratio confirms this at 1.20×. The LLM-as-judge also ranks CAC first on this scenario (4.72 overall), making it the only method to lead on both lexical correctness and judge quality simultaneously under maximum adversarial pressure.
Across all conditions, CAC ranks #1 or #2 on ground-truth lexical safe rate and is the only non-oracle method to hold above 48% safe rate in every scenario:
| Scenario | CAC safe rate | Next-best non-oracle | fixed-context RAG |
|---|---|---|---|
| Baseline (d=50, budget=160) | 63.3% | 26.7% (schema) | 0.0% |
| A: Budget crunch (budget=80) | 70.0% | 50.0% (schema) | 3.3% |
| B: Distractor flood (d=100) | 57.5% | 33.75% (schema) | 0.0% |
| C: Metadata corruption (noise=0.5) | 95.0% | 26.25% (iterative/oracle, tied) | 18.75% |
| Perfect storm (d=100 + budget=80) | 60.0% | 36.7% (schema) | 3.3% |
These two scenarios were designed to strip away CAC's known adversarial advantages — distractor pressure, metadata corruption, budget crunch — and test whether the structuring benefit holds in RAG's optimal conditions.
Pre-test prediction (deterministic proxy, d=5, budget=160): iterative_rag already beats CAC on contract_termination (proxy score 0.739 vs 0.735) and renewal_risk (1.000 vs 0.993). This was the honest forecast run before the LLM eval.
Near-zero distractors and perfect metadata: the most favorable RAG conditions tested. iterative_rag was predicted to beat CAC on two task types.
Full outputs: outputs/llm_eval_clean_signal/
| Method | LLM Answer Score | Safe Rate | Contradiction Handling |
|---|---|---|---|
cac |
0.8141 | 71.25% | 100% |
oracle_candidate_rag_k8 |
0.8132 | 51.25% | 100% |
iterative_rag_k8 |
0.7795 | 47.5% | 92.5% ← failure |
schema_aware_chunk_rag_k8 |
0.7766 | 43.75% | 100% |
fixed_context_rag_k8 |
0.6824 | 11.25% | 100% |
CAC achieves 71.25% — leading all methods at d=5 with zero noise. The clean signal amplifies CAC's structuring advantage: evidence admission has standalone value beyond noise filtering.
The predicted threat did not materialize on aggregate:
iterative_ragreaches 90% on contract_termination alone (where CAC correctly withholds on incomplete-evidence cases), but collapses to 0% on security_exception, pulling its overall safe rate to 47.5%.
fixed_context_ragscores 0% on three of four task types even at d=5 with perfect metadata — collapse is intrinsic to raw-chunk representation, not caused by distractors.
Per-task breakdown (20 accounts per task, 80 answers per method):
| Task | cac |
oracle_candidate_rag |
schema_aware_rag |
iterative_rag |
fixed_context_rag |
|---|---|---|---|---|---|
| Renewal risk | 75% | 25% | 40% | 65% | 0% |
| Security exception | 75% | 20% | 0% | 0% | 0% |
| Contract termination | 40% | 60% ← oracle wins | 55% | 90% ← iterative leads | 0% |
| Incident postmortem | 95% | 100% ← oracle wins | 80% | 35% | 45% |
The per-task percentages above are heuristic lexical safe rates (v1.6 architecture,
outputs/llm_eval_clean_signal_v3/). The contract_termination profile shifted: v1.6's distractor-blocking fixes cause CAC to correctly withhold on incomplete-evidence cases, which the crude heuristic scorer penalizes. On the LLM judge, CAC scores 20/20 judge-safe on all four task types — incident postmortem: CAC 5.00 vs oracle 4.90; contract_termination: CAC 4.90 vs oracle 4.95 (0.05-point margin). No non-oracle method beats CAC overall by either metric.Security exception: CAC 75% vs 0% for every non-oracle method — at d=5 with flawless metadata, schema_aware still cannot produce a single safe answer. Raw-chunk retrieval cannot satisfy multi-criterion approval chains regardless of metadata quality.
LLM-as-judge (same 400 answers, v1.6 architecture — full outputs: outputs/llm_eval_clean_signal_v3/):
| Method | Completeness | Hallucination-free | Overall (1–5) | Safe Rate |
|---|---|---|---|---|
cac |
4.91 | 5.00 | 4.91 | 100% |
oracle_candidate_rag_k8 |
4.84 | 5.00 | 4.84 | 100% |
schema_aware_chunk_rag_k8 |
4.81 | 5.00 | 4.81 | 100% |
iterative_rag_k8 |
4.81 | 5.00 | 4.81 | 100% |
fixed_context_rag_k8 |
4.66 | 5.00 | 4.69 | 100% |
CAC ranks #1 on the LLM judge overall score. Incident postmortem reaches a perfect 5.00 judge score — the v1.6 fixes resolved overcapture bugs that had caused false safe readings on contract_termination. CAC leads oracle 4.91 vs 4.84 across all 80 answers.
Maximum distractor density with 30% metadata noise — the harshest realistic production conditions tested.
Full outputs: outputs/llm_eval_max_noise_v3/
| Method | LLM Answer Score | Safe Rate | Contradiction Handling |
|---|---|---|---|
cac |
0.7929 | 62.5% | 100% |
oracle_candidate_rag_k8 |
0.7962 | 47.5% | 100% |
schema_aware_chunk_rag_k8 |
0.7929 | 46.25% | 100% |
iterative_rag_k8 |
0.7392 | 32.5% | 95.0% ← failure |
fixed_context_rag_k8 |
0.6648 | 10.0% | 100% |
CAC achieves 62.5% safe rate — highest of any method — under maximum distractor density and maximum metadata noise. Oracle, despite having the gold candidate list, reaches only 47.5%; schema_aware trails at 46.25%.
At noise=0.3, noise-injected metadata tags contaminate raw-chunk retrieval: oracle's LLM answer score (0.7962) barely edges CAC (0.7929), but CAC's structural safe rate (+15 points) confirms that evidence completeness matters more than answer fluency at high noise.
Per-task breakdown (20 accounts per task, 80 answers per method):
| Task | cac |
oracle_candidate_rag |
schema_aware_rag |
iterative_rag |
fixed_context_rag |
|---|---|---|---|---|---|
| Renewal risk | 75% | 10% | 60% | 60% | 0% |
| Security exception | 50% | 5% | 15% | 0% | 0% |
| Contract termination | 25% | 85% ← oracle wins | 35% | 30% | 0% |
| Incident postmortem | 100% | 90% | 75% | 40% | 40% |
Oracle holds the contract_termination advantage (85% vs 25% heuristic) — at max noise, metadata corruption partially breaks CAC's breach slot detection and oracle's gold labels still locate the right documents. On the LLM judge (below), this gap closes to 0.05 points (4.95 vs 4.90), confirming that CAC's withholding behavior reflects appropriate uncertainty, not an evidence failure.
CAC achieves 100% heuristic safe on incident_postmortem — every answer correctly describes the incident chain, CRM status, and remediation steps. Oracle reaches 90%; every other method is below 80%.
Security exception remains CAC's domain: 50% vs 5% for oracle even with gold labels — multi-criterion approval chains require structural evidence admission, not retrieval fluency.
LLM-as-judge (same 400 answers, v1.6 architecture — full outputs: outputs/llm_eval_max_noise_v3/):
| Method | Completeness | Hallucination-free | Overall (1–5) | Safe Rate |
|---|---|---|---|---|
cac |
4.91 | 5.00 | 4.91 | 100% |
oracle_candidate_rag_k8 |
4.88 | 5.00 | 4.88 | 100% |
schema_aware_chunk_rag_k8 |
4.86 | 5.00 | 4.86 | 100% |
iterative_rag_k8 |
4.84 | 5.00 | 4.84 | 100% |
fixed_context_rag_k8 |
4.78 | 5.00 | 4.79 | 100% |
CAC holds #1 on the LLM judge at maximum noise and extends its lead over oracle vs. Scenario E (0.07-point gap at both clean signal and max noise). All methods achieve 100% judge-safe, confirming the v1.6 distractor-blocking fix eliminated false safe readings. CAC leads oracle 4.91 vs 4.88 across all 80 answers.
| Scenario | CAC | Next-best non-oracle | Oracle |
|---|---|---|---|
| E: Clean signal (d=5, noise=0.0) | 71.25% | 47.5% (iterative) | 51.25% |
| F: Max noise (d=25, noise=0.3) | 62.5% | 46.25% (schema) | 47.5% |
Where oracle (gold labels) beats CAC per-task (heuristic lexical scorer):
- Scenario E: contract_termination (60% vs 40%) and incident_postmortem (100% vs 95%)
- Scenario F: contract_termination (85% vs 25%)
The incident postmortem oracle advantage (heuristic) present in earlier runs was resolved by v1.5/v1.6 slot displacement and distractor-blocking fixes. On the LLM judge, CAC leads oracle on incident postmortem (5.00 vs 4.90) in both scenarios. CAC's contract_termination heuristic safe rate decreased in v1.6 (correct withholding behavior), while the judge confirms a tiny 0.05-point margin (4.90 vs 4.95).
No non-oracle method beats CAC on any task in either scenario.
These were deliberately designed as the conditions where the deterministic proxy predicted RAG to be competitive. CAC hit its highest safe rates yet (75% and 76.25%). The evidence structuring advantage is not explained by adversarial conditions — it holds at minimum distractor density and zero metadata noise.
Baseline stress run (d=50, budget=160, n=15 × 4 tasks = 60 answers per method):
| Task | cac |
oracle_candidate_rag |
schema_aware_rag |
iterative_rag |
fixed_context_rag |
|---|---|---|---|---|---|
| Renewal risk | 80% | 13% | 47% | 60% | 0% |
| Security exception | 33% | 7% | 0% | 0% | 0% |
| Contract termination | 60% | 93% | 27% | 7% | 0% |
| Incident postmortem | 80% | 80% | 33% | 20% | 0% |
fixed_context_ragachieves 0% safe rate on all four task types. CAC leads on three of four tasks; oracle leads only on contract termination (where knowing the gold candidate list is most valuable). The security exception task is the hardest overall — CAC is the only non-oracle method to achieve any safe rate at all (33% vs. 0% for schema, iterative, and fixed-context).
Efficiency ratio = safe_rate ÷ (budget ÷ 160) — how much safe-rate value each method extracts relative to its token spend. Ratio > 1.0 means the method is more efficient at a tighter budget than at the 160-token baseline.
| Scenario | budget | cac |
oracle |
schema_aware |
iterative |
fixed_context |
|---|---|---|---|---|---|---|
| Baseline stress (d=50) | 160 | 0.63 | 0.48 | 0.27 | 0.22 | 0.00 |
| A: Budget crunch | 80 | 1.40 | 1.10 | 1.00 | 0.60 | 0.07 |
| B: Distractor flood (d=100) | 160 | 0.57 | 0.36 | 0.34 | 0.30 | 0.00 |
| C: Metadata corruption | 160 | 0.95 | 0.26 | 0.26 | 0.26 | 0.19 |
| Perfect storm (d=100 + budget=80) | 80 | 1.20 | 1.03 | 0.73 | 0.70 | 0.07 |
| E: Clean signal (d=5, noise=0.0) | 160 | 0.71 | 0.51 | 0.44 | 0.48 | 0.11 |
| F: Max noise (d=25, noise=0.3) | 160 | 0.63 | 0.48 | 0.46 | 0.33 | 0.10 |
At budget=80, CAC's efficiency ratio reaches 1.40 — meaning it delivers more safe answers per token at half the context window than it does at full size. The perfect storm confirms this at 1.20×, even under 100 distractors simultaneously. This is the defining characteristic of admission control: greedy RAG fills available space regardless of value; CAC selects by value regardless of space. When space is scarce, the gap widens.
Row E (clean signal, d=5) shows CAC at 0.71 — the highest safe-rate ratio among the 160-token-budget scenarios. Row F (max noise, d=25) reaches 0.63 — matching baseline stress efficiency despite 5× distractor density and 30% metadata noise.
iterative_ragdrops from 0.48 at d=5 (Scenario E) to 0.33 at d=25 with max noise (Scenario F) — a 15-point collapse driven by combined distractor density and metadata corruption.
Seven scenarios, ~2,500 LLM inferences and ~2,500 judge calls (plus two RAG-challenge scenarios with an additional ~1,600 inferences each). Safe rate is the primary metric: fraction of answers that clear all safety thresholds (slot coverage, contradiction handling, missing disclosure, score ≥ 0.80).
| Scenario | CAC | Oracle | Best non-oracle RAG | fixed_context |
|---|---|---|---|---|
| Baseline (d=50, budget=160) | 63.3% | 48.3% | 26.7% (schema) | 0.0% |
| A: Budget crunch (budget=80) | 70.0% | 55.0% | 50.0% (schema) | 3.3% |
| B: Distractor flood (d=100) | 57.5% | 36.25% | 33.75% (schema) | 0.0% |
| C: Metadata corruption (noise=0.5) | 95.0% | 26.25% | 26.25% (iterative/oracle, tied) | 18.75% |
| Perfect storm (d=100 + budget=80) | 60.0% | 51.7% | 36.7% (schema) | 3.3% |
| E: Clean signal (d=5, noise=0.0) | 71.25% | 51.25% | 47.5% (iterative) | 11.25% |
| F: Max noise (d=25, noise=0.3) | 62.5% | 47.5% | 46.25% (schema) | 10.0% |
CAC leads in all 7 scenarios. The v1.7 content-based slot routing fallback closes the final gap: CAC now beats oracle under 50% metadata corruption (95% vs 26.25%).
1. Aggregate safe rate in all non-corruption conditions. CAC is the only method to hold above 48% safe rate in every scenario tested. The next-best non-oracle method never exceeds 47.5%.
2. Distractor immunity.
CAC's safe rate is near-flat across distractor levels: 71.25% (d=5) → 63.3% (d=50) → 57.5% (d=100). fixed_context_rag collapses to 0% at d=50 and holds 0% at d=100. schema_aware_rag drops from 41.25% at d=5 to 33.75% at d=100. CAC's admission filter rejects irrelevant chunks before the context window is built — the distractor count does not reach the LLM.
3. Budget pressure amplifies CAC's advantage.
At budget=80 (half the standard window), CAC's efficiency ratio rises to 1.40 — it extracts more safe-answer value per token than at budget=160. iterative_rag drops to 0.60. When context is scarce, greedy fill wastes it; admission control concentrates it. The perfect storm (d=100 + budget=80) produces CAC's higher score (60%) than the same distractor level with double the budget (Scenario B, 57.5%).
4. Clean signal amplification — structuring has standalone value. CAC's best safe rate (71.25%) occurs at d=5 with zero noise — not under adversarial pressure. Evidence structuring provides more value when the LLM can work with well-organized evidence, not less. At d=25 with max noise, CAC still leads at 62.5%, maintaining its advantage under the harshest conditions tested. This disproves the assumption that CAC's advantage is primarily a noise filter.
5. Security exception — CAC's exclusive domain. Across all seven scenarios and every distractor level tested, CAC is the only non-oracle method to achieve any safe rate on security exception tasks. Per-task safe rates:
| Scenario | CAC | Oracle | schema_aware | iterative | fixed_context |
|---|---|---|---|---|---|
| Baseline stress (d=50) | 33% | 7% | 0% | 0% | 0% |
| E: Clean signal (d=5) | 75% | 20% | 0% | 0% | 0% |
| F: Max noise (d=25) | 50% | 5% | 15% | 0% | 0% |
Security exception tasks require satisfying a multi-criterion approval chain. No amount of better candidate selection or metadata quality enables raw-chunk retrieval to do this. Even oracle — which has gold candidate labels — scores at most 20%.
6. Hallucination and safety dimension integrity.
CAC achieves 100% on contradiction handling and missing disclosure in every scenario. iterative_rag fails contradiction handling in four of seven scenarios (91.25–92.5%). fixed_context_rag shows LLM-judge hallucination failures in two scenarios. CAC shows neither in any scenario.
7. Token efficiency. In the deterministic benchmark, CAC uses 60.2 average tokens vs. 69.7–126.6 for RAG baselines, while achieving a 17.5pp higher decision grade. It reaches decision grade ≥ 0.9 on 54.6% of tasks — 2.4× the best RAG hit rate.
1. Oracle on contract termination — consistently. When the retriever has ground-truth knowledge of which candidate to retrieve, oracle achieves 85–100% safe rate on contract termination tasks vs. CAC's 70–75%. This advantage is consistent across every distractor level tested:
| Scenario | Oracle | CAC | Gap |
|---|---|---|---|
| Baseline stress (d=50) | 93% | 60% | +33pp |
| E: Clean signal (d=5) | 60% | 40% | +20pp |
| F: Max noise (d=25) | 85% | 25% | +60pp |
Contract termination requires identifying the exact contractual clause from the right counterparty. Gold candidate selection is genuinely decisive for this task. The heuristic gap (60pp at max noise) is large, but the LLM judge narrows it to 0.05 points (oracle 4.95 vs CAC 4.90) — CAC's correct withholding behavior on ambiguous cases drives the heuristic discrepancy. CAC is consistently second.
2. Oracle on contract termination — the one remaining consistent per-task gap. Oracle achieves 60–85% heuristic safe rate on contract termination vs. CAC's 25–40%. On the LLM judge, both are at 100% safe rate but oracle scores 4.95 vs CAC's 4.90 per task — a 0.05-point margin. This reflects oracle's ground-truth knowledge of which contractual clause document to retrieve, which is genuinely decisive for this task type. The wide heuristic gap reflects CAC's correct withholding behavior (v1.6 structural safe rate: 100% for CAC vs 5–10% for oracle on the diagnostic), not evidence quality failure.
The v1.5/v1.6 fixes resolved prior oracle advantages on incident postmortem. On the LLM judge, CAC now leads oracle on incident postmortem (5.00 vs 4.90) in both scenarios.
No method beats CAC overall in any scenario. The honest boundary is precisely: oracle_candidate_rag (gold labels, not available in production) leads on contract termination per-task in all scenarios (oracle 4.95 vs CAC 4.90 on the LLM judge). On aggregate safe rate across all tasks, CAC leads oracle in every scenario tested.
| Task | Overall winner | Notes |
|---|---|---|
| Security exception | CAC — exclusive | 33–75%; all non-oracle methods 0% in every scenario; oracle also weak (5–20%) |
| Renewal risk | CAC | 75%; oracle surprisingly weak (10–25%); iterative is closest (60–65%) |
| Incident postmortem | CAC | CAC 95–100%, oracle 90–100%; v1.6 fixes resolved prior oracle heuristic advantage; LLM judge CAC 5.00 vs oracle 4.90 |
| Contract termination | Oracle wins; CAC 2nd | Oracle 60–93% heuristic, 4.95 judge; CAC 25–60% heuristic, 4.90 judge; iterative competitive at d=5 (90% heuristic) |
The core finding is not that CAC beats RAG under adversarial conditions. The core finding is that CAC's structuring advantage is non-adversarial — it holds at minimum distractor density and zero metadata noise, exactly where RAG should be at its strongest.
The remaining per-task gap:
oracle_candidate_rag— a baseline that receives ground-truth candidate knowledge not available in any real deployment — leads on contract termination per-task (oracle 4.95 vs CAC 4.90 on the judge; both 100% judge-safe). On aggregate safe rate across all nine tested scenarios, CAC leads oracle in every case. On the LLM judge, CAC leads oracle overall (4.91 vs 4.84 at clean signal, 4.91 vs 4.88 at max noise) and leads on incident postmortem specifically (5.00 vs 4.90). In production conditions (no gold labels), no tested method beats CAC on any task type by the judge metric.The remaining open question: do compression-aware or answer-aware RAG variants narrow this gap? The current baseline set tests admission strategy on a fixed candidate pool, not retriever quality.
SourceItem carries benchmark gold fields for scoring and for the explicit oracle baseline. Tests assert that CAC core and non-oracle RAG baselines do not read these scorer-only fields:
gold_slots
gold_negative
gold_positive
gold_exact_required
is_distractor
The explicit oracle_candidate_rag_k8 baseline is allowed to use these fields by design.
Use Python 3.10 or newer.
python -m pip install -e ".[dev]"
pytest -q
PYTHONPATH=. python tests/run_smoke_tests.py
PYTHONPATH=. python examples/acme_demo.pyOn Windows PowerShell:
$env:PYTHONPATH='.'
python -m pytest -q
python examples/acme_demo.pyIf you are new to the project:
- Run the Acme demo.
- Read the evidence packet output.
- Inspect the benchmark summary.
- Run the smoke benchmark.
- Read the methodology notes.
PYTHONPATH=. python examples/acme_demo.pyThen inspect:
outputs/decision_risk_v1_4_n20/summary.csv
outputs/decision_risk_v1_4_n20/benchmark_report.md
PYTHONPATH=. python -m benchmarks.decision_risk.run \
--n 20 \
--budgets 40,60,80,120,160,240,500 \
--distractors 5,25,50 \
--output-dir outputs/decision_risk_v1_4_n20The rewrite suite perturbs source wording and observable metadata to reduce reliance on templated phrase matching.
PYTHONPATH=. python -m benchmarks.decision_risk_human_rewrite.run \
--n 8 \
--budgets 40,60,80,120,160,240 \
--distractors 25,50 \
--metadata-noise 0.18 \
--output-dir outputs/decision_risk_v1_4_rewriteThe stress suite targets contradiction misses, slot underfill, exact-representation failure, and high distractor pressure.
PYTHONPATH=. python -m benchmarks.decision_risk_stress.run \
--n 8 \
--budgets 40,60,80,120,160 \
--distractors 50,100 \
--metadata-noise 0.30 \
--output-dir outputs/decision_risk_v1_4_stressPYTHONPATH=. python -m benchmarks.decision_risk.export_llm_eval_prompts \
--n 5 \
--budget 160 \
--distractors 25 \
--output outputs/llm_eval_v1_4/prompts.jsonlAfter filling a JSONL file with model answers:
PYTHONPATH=. python -m benchmarks.decision_risk.llm_eval \
--answers outputs/llm_eval_v1_4/model_answers.jsonl \
--output-dir outputs/llm_eval_v1_4_evalcac/
core/ CAC schemas, slot matching, valuation, packet assembly
baselines/ RAG-style baselines
benchmarks/
decision_risk/ Main DecisionRiskBench generator, runner, scorer, plots
decision_risk_human_rewrite/
Semi-synthetic rewrite suite
decision_risk_stress/
Failure-stress suite
examples/
acme_demo.py Small guided CAC demo
tests/
no-gold-admission, smoke, plot, prompt-export, and row-count tests
outputs/
packaged reference outputs and plots
A task-specific requirement that must be satisfied for a decision to be well supported.
Examples:
current billing status
contract termination language
security exception approval
incident root cause
missing executive sponsor signal
The final CAC context artifact. It can contain structured facts, exact excerpts, summaries, conflicts, uncertainties, exclusions, and an audit trace.
CAC chooses whether evidence should be represented as a structured fact, summary, exact excerpt, metadata, or excluded entirely.
CAC can explicitly state that a required evidence slot was not found or not sufficiently satisfied.
All compared methods receive the same candidate evidence pool. The benchmark tests context assembly, not retriever quality.
Included baselines:
fixed_context_rag_k8
metadata_aware_rag_k8
schema_aware_chunk_rag_k8
heuristic_rerank_rag_k8
iterative_rag_k8
long_context_rag_k24
oracle_candidate_rag_k8
The oracle baseline is intentionally synthetic and uses gold labels for candidate ordering. It is included to test whether better candidate ordering alone closes the gap.
This repository is intentionally transparent about what it does not prove.
The benchmark findings are well-supported: 7 scenarios, 2 independent scorers (lexical and LLM judge), adversarial and favorable-RAG conditions, all pointing in the same direction. The gap between these findings and a "production-proven" claim is the gap between synthetic data and real enterprise data — not a gap in the methodology or comparative results.
Remaining limitations:
Synthetic benchmark — accounts, documents, and task scenarios are generated, not real.
Semi-synthetic rewrite suite is not human-audited.
Single model — all LLM evals use phi-3-mini-4k-instruct (3.82B); larger or different models may behave differently.
Four task types — generalization beyond the tested task profiles is unverified.
No production deployment data — latency, throughput, and integration costs untested.
No real enterprise dataset.
The strongest current conclusion:
On evidence-sensitive decision tasks, CAC demonstrably outperforms all tested RAG baselines across every condition tested — adversarial and favorable alike — as measured by two independent scorers on real LLM inference. The remaining open question is whether this advantage holds on real enterprise data at production scale.
Completed:
same-model LLM answer evaluation (phi-3-mini, 9 scenarios, ~4,900 inferences)
LLM-as-judge scoring (phi-3-mini judge, ~3,300 evaluations)
adversarial battery (budget crunch, distractor flood, metadata corruption, perfect storm)
RAG-challenge battery (clean signal, schema home turf — designed to favor RAG)
v1.5 architectural fixes (slot displacement bugs, counterparty conflict detection, task-specific retrieval)
validation re-runs confirming oracle per-task gaps closed on LLM judge
The decisive question that was open is now answered:
The same LLM makes demonstrably better decisions from CAC evidence packets than from RAG chunks — across every condition tested, validated by both lexical scoring and independent LLM judge.
Remaining next steps:
human-audited mini-set
real or semi-real enterprise dataset
compression-aware RAG baselines
answer-aware RAG baselines
additional task profiles
larger / stronger LLMs
The remaining open question:
Does the CAC advantage hold on real enterprise data, at production scale, with stronger LLMs?
If you reference this project, use the core claim:
RAG made retrieved chunks the unit of context. CAC makes satisfied evidence requirements the unit of context.
Suggested title for discussion:
RAG Optimizes Relevance. Evidence Work Needs Sufficiency.
Issues and pull requests are welcome, especially around:
stronger baselines
additional task profiles
realistic datasets
LLM answer evaluation
human-audited scoring
benchmark critiques
failure analysis
Please keep benchmark claims tied to reproducible outputs.
MIT. See LICENSE.