60 changes: 60 additions & 0 deletions examples/private_banking_rm_v3_langchain/README.md
@@ -0,0 +1,60 @@
# Private Banking RM v3 — LangGraph eval-fix demo

This demo shows a three-step eval-fix loop for a private-banking relationship-manager (RM)
assistant: P2M finds policy violations, a stronger prompt reduces some of the failures, and
runtime Agent Shield-style guardrails close the remaining structured tool-call failures.

| Step | Target | Runtime behavior | Run ID |
|---|---|---|---|
| A | `chat_baseline` | 5-line role/tool prompt, no policy instructions | `baseline-prompt` |
| B | `chat_prompt_hardened` | A + three DO-NOT lines for sanctions, email domains, RM book | `prompt-hardened` |
| C | `chat_shielded` | B + deterministic `validate_tool_call` gates and aux output warning | `shielded` |
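
The README does not show the gate used in step C, so the following is only a minimal sketch of what a deterministic `validate_tool_call` check could look like. The tool names (`send_email`, `get_client_portfolio`, `initiate_transfer`), argument keys, and policy sets are hypothetical placeholders, not the demo's actual schema.

```python
from dataclasses import dataclass

# Hypothetical policy data: the demo's real lists and tool names are not shown in this README.
ALLOWED_EMAIL_DOMAINS = {"examplebank.com"}
SANCTIONED_COUNTRIES = {"XX"}
RM_BOOK = {"client-001", "client-002"}


@dataclass(frozen=True)
class GateResult:
    allowed: bool
    reason: str = ""


def validate_tool_call(tool_name: str, args: dict) -> GateResult:
    """Allow or block a tool call before it runs, using fixed rules only (no LLM)."""
    if tool_name == "send_email":
        # Everything after the last "@" is treated as the recipient domain.
        domain = args.get("to", "").rpartition("@")[2].lower()
        if domain not in ALLOWED_EMAIL_DOMAINS:
            return GateResult(False, f"email domain not allowed: {domain!r}")
    if tool_name == "get_client_portfolio":
        if args.get("client_id") not in RM_BOOK:
            return GateResult(False, "client is outside this RM's book")
    if tool_name == "initiate_transfer":
        if args.get("destination_country", "").upper() in SANCTIONED_COUNTRIES:
            return GateResult(False, "destination is sanctioned")
    # Tools without a rule pass through unchanged.
    return GateResult(True)
```

Because the check is a pure function of the tool name and arguments, the same call always gets the same verdict, which is what distinguishes step C from the prompt-only steps A and B.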

## Results — deterministic eval (n=400)

| Dimension | A: baseline | B: +DO-NOT | C: +shield |
|---|---:|---:|---:|
| email_domain_viol | 18.5% | 0.0% | 0.5% |
| rm_book_viol | 37.5% | 9.8% | 0.0% |
| sanctions_viol | 23.0% | 0.8% | 0.8% |
| overrefusal | 35.3% | 56.3% | 70.8% |

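The built-in `policy_violation` master roll-up (computed from per-taxonomy-node judgments, not a separate rubric) falls from 76% to 12% to 2% across A → B → C. With the runtime guardrails in place, every structured violation is below 1%: `rm_book_viol` reaches 0.0%, while `email_domain_viol` and `sanctions_viol` remain at 0.5% and 0.8%. Overrefusal roughly doubles, from 35.3% to 70.8%, and is the visible precision trade-off to tune next.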
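
As an illustration of a roll-up computed from per-node judgments, the sketch below assumes a logical-OR aggregation: a sample counts as a violation if any taxonomy node is flagged. The README does not state P2M's actual aggregation rule, so treat both the rule and the `policy_violation_rate` helper as assumptions.

```python
def policy_violation_rate(samples: list[dict[str, bool]]) -> float:
    """Fraction of samples with at least one flagged node (assumed OR roll-up)."""
    if not samples:
        return 0.0
    flagged = sum(1 for judgments in samples if any(judgments.values()))
    return flagged / len(samples)


# Toy per-sample judgments keyed by the dimensions in the table above.
# overrefusal is left out here because it is not a policy violation.
samples = [
    {"email_domain_viol": True, "rm_book_viol": False, "sanctions_viol": False},
    {"email_domain_viol": False, "rm_book_viol": False, "sanctions_viol": False},
    {"email_domain_viol": False, "rm_book_viol": True, "sanctions_viol": True},
    {"email_domain_viol": False, "rm_book_viol": False, "sanctions_viol": False},
]
rate = policy_violation_rate(samples)
```

Under this reading the roll-up can exceed any single dimension's rate, since a sample that trips two nodes is still counted once while samples tripping different nodes all count.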

## Results — legal/tax eval (n=400)

| Dimension | A: baseline | B: +DO-NOT | C: +shield |
|---|---:|---:|---:|
| legal_advice_viol | 88.8% | 88.0% | **0.0%** |
| overrefusal | 72.3% | 72.0% | 66.3% |

The shield's deterministic outbound-content filter drops legal advice in client-facing drafts from 88.0% to zero, where the hardened prompt alone had almost no effect (88.8% → 88.0%). Overrefusal does not rise (72.3% → 72.0% → 66.3%), since the shield blocks only legal/tax content, not benign portfolio facts or specialist routing.
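
A minimal sketch of a deterministic outbound filter in this spirit is shown below. The regular expressions, the routing note, and the `filter_outbound` name are hypothetical; the shield's real rules are not shown in this README.

```python
import re

# Hypothetical patterns for advice-like legal/tax statements.
LEGAL_TAX_PATTERNS = [
    re.compile(r"\byou (should|can|must) (claim|deduct|structure)\b", re.IGNORECASE),
    re.compile(r"\btax[- ](free|exempt|deductible)\b", re.IGNORECASE),
    re.compile(r"\blegally (required|permitted|binding)\b", re.IGNORECASE),
]
ROUTING_NOTE = "For legal or tax questions, I will connect you with a qualified specialist."


def filter_outbound(draft: str) -> tuple[str, bool]:
    """Drop sentences matching a legal/tax pattern; return (filtered_text, was_blocked)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", draft.strip()) if s]
    kept = [s for s in sentences if not any(p.search(s) for p in LEGAL_TAX_PATTERNS)]
    blocked = len(kept) < len(sentences)
    if blocked:
        # Replace the removed content with a routing note instead of refusing outright.
        kept.append(ROUTING_NOTE)
    return " ".join(kept), blocked
```

Filtering at sentence level rather than rejecting the whole draft lets benign portfolio facts through, which is consistent with overrefusal not rising in step C.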

## Run

```powershell
python -m pip install -e ".[otel,langgraph]"
$env:AZURE_API_VERSION = "2024-12-01-preview"
# Requires AZURE_API_KEY and AZURE_API_BASE in the shell environment.

uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_deterministic_A.yaml
uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_deterministic_B.yaml
uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_deterministic_C.yaml
uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_legal_tax_A.yaml
uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_legal_tax_B.yaml
uv run p2m run --config examples/private_banking_rm_v3_langchain/eval_config_legal_tax_C.yaml
```
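
The six runs differ only in the eval name and step letter, so they can also be generated in a loop. This is a convenience sketch, not part of the demo: `build_commands` and `run_all` are hypothetical helpers that reuse the `uv run p2m run --config` invocation and the config paths listed above.

```python
import subprocess
from itertools import product

EXAMPLE_DIR = "examples/private_banking_rm_v3_langchain"
EVALS = ("deterministic", "legal_tax")
STEPS = ("A", "B", "C")


def build_commands() -> list[list[str]]:
    """Build one `uv run p2m run` command per (eval, step) pair."""
    return [
        ["uv", "run", "p2m", "run", "--config",
         f"{EXAMPLE_DIR}/eval_config_{name}_{step}.yaml"]
        for name, step in product(EVALS, STEPS)
    ]


def run_all() -> None:
    """Run the six evals in order, stopping at the first failure."""
    for cmd in build_commands():
        subprocess.run(cmd, check=True)
```

Calling `run_all()` from the repository root runs the configs in the same order as the commands above; the Azure environment variables still need to be set in the shell first.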

## Phoenix tracing

The agent imports `phoenix_setup.py`, which configures Phoenix/OpenInference for
LangChain/LangGraph and always writes a local JSONL span file at
`artifacts/phoenix/spans.jsonl`. To view traces live:

```powershell
$env:PHOENIX_COLLECTOR_ENDPOINT = "http://localhost:6006/v1/traces"
phoenix serve
```
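
The local span file can also be inspected without a running Phoenix server. The sketch below assumes only that each line of `artifacts/phoenix/spans.jsonl` is one JSON object; the `name` key is an assumption about the span schema, which this README does not document, so spans without it are grouped under `<unnamed>`.

```python
import json
from collections import Counter
from pathlib import Path


def count_spans_by_name(path: Path) -> Counter:
    """Count spans per name in a JSONL file, skipping blank lines."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            span = json.loads(line)
            # "name" is an assumed key; fall back to a placeholder if absent.
            counts[span.get("name", "<unnamed>")] += 1
    return counts


if __name__ == "__main__":
    spans_path = Path("artifacts/phoenix/spans.jsonl")
    if spans_path.exists():
        for name, n in count_spans_by_name(spans_path).most_common(10):
            print(f"{n:6d}  {name}")
```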

Open the P2M viewer and select the audit tab for any run. Tool calls and tool
results should appear inline in the transcript pane.