feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) by changliu2 · Pull Request #81 · responsibleai/ASSERT

changliu2 · 2026-05-22T15:30:10Z

Depends on #79 for the --override flag used in the README run snippets. Will rebase to no-op after that merges.

What

A new examples/private_banking_rm_v3_langchain/ demo showing the same 3-step eval-fix loop framing as the existing incident-triage demo, applied to a private-banking relationship-manager agent built on LangChain + Phoenix.

The agent has 4 tools (get_client_portfolio, check_sanctions_list, get_rm_book, send_email) and 3 progressively-hardened variants:

Variant	Target	What changes
A	`chat_baseline`	5-line role/tool prompt, no policy text
B	`chat_prompt_hardened`	A + three DO-NOT lines (sanctions, email domains, RM book)
C	`chat_shielded`	B + deterministic `validate_tool_call` gates + aux output warning

Two eval specs cover both failure surfaces:

eval_config_deterministic_*.yaml — structured-tool gates (sanctioned-country wires, off-book client lookups, external email domains)
eval_config_legal_tax_*.yaml — content-policy gates (legal-advice and tax-advice avoidance)

Why

Mehrnoosh''s RM agent in PR #69 was the right business scenario but had two issues for the //build joint demo:

The agent state machine didn''t actually loop through tool calls in multi-turn conversations, so all 8 turns of a scenario returned the same first response.
The deterministic vs LLM-classifier story wasn''t visually separable in the viewer.

This rebuild fixes (1) by switching the runtime to LangChain''s ToolCalling loop with persistent state, and addresses (2) by splitting the demo into two eval specs with non-overlapping dim sets so the deterministic-control story (Roni''s ask) and the LLM-judged content story (Sandeep''s ask) each have their own visualization.

Results (n=400)

Deterministic eval

Dim	A: baseline	B: +DO-NOT	C: +shield
email_domain_viol	18.5%	0.0%	0.5%
rm_book_viol	37.5%	9.8%	0.0%
sanctions_viol	23.0%	0.8%	0.8%
overrefusal	35.3%	56.3%	70.8%

Built-in policy_violation master: 76% → 12% → 2%.

Legal/tax eval

Dim	A: baseline	B: +DO-NOT	C: +shield
legal_advice_viol	88.8%	88.0%	0.0%
overrefusal	72.3%	72.0%	66.3%

The DO-NOT prompt barely helps on legal/tax (88% → 88%) — runtime guardrails are the one thing that drives it to zero. This is the headline finding for the //build "promise of structured controls" pitch.

Validation

200 prompts + 200 scenarios per variant per eval (n=400 × 6 configs)
All 25,999 tool-call spans captured in Phoenix and viewable in the viewer''s audit tab
741 pytest pass (unchanged from main)
Boundary audit clean

Scope

New: examples/private_banking_rm_v3_langchain/ (agent, tools, 6 configs, README)
No changes to core p2m/ runtime
No changes to viewer

Known follow-ups (not blockers)

overrefusal rate on C is 71% deterministic / 66% legal-tax — visible precision trade-off to tune in a future pass (likely judge rubric over-flagging shield-block messages as refusals).

LangGraph/LangChain RM agent with three variants: A: baseline 5-line system prompt (no policy) B: A + 3 DO-NOT lines for sanctions / domain / RM-book C: B + Agent Shield-style deterministic runtime gates Phoenix tracing via openinference-instrumentation-langchain shows tool calls in the viewer transcript pane and writes artifacts/phoenix/spans.jsonl. Verdict dimensions were renamed to negative connotation (*_viol) so 0% = green in the viewer. Adds p2m run --override support for smoke runs such as test_set.sample_size=10. Results (full eval): email_domain_viol: A=16.5% B=0.0% C=0.0% rm_book_viol: A=31.5% B=6.5% C=0.0% sanctions_viol: A=20.5% B=0.0% C=0.5% legal_advice_viol: A=83.0% B=79.0% C=0.0% Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ADME with n=400 The per-eval custom `policy_viol` LLM-judged dim duplicated P2M's built-in `policy_violation` master roll-up (auto-computed from per-taxonomy-node judgments in judge_normalization.py). The custom version routinely contradicted the deterministic gate evidence in shielded runs: - legal_tax C: built-in policy_violation ~0%, but custom policy_viol stuck ~92% - deterministic C: built-in policy_violation 2%, but custom policy_viol 0.5% (the two were already close on deterministic; the drift was concentrated on legal/tax content where the judge over-triggered on shield-block messages) Removing the custom dim: - keeps the built-in policy_violation column (master), which is internally consistent because it derives from per-node `violated` flags rather than a single-shot rubric judgment - eliminates the contradictory rubric and the duplicate column in the viewer's policy-violation tabs README updated with n=400 validation numbers showing the clean trend: - deterministic: policy_violation 76% → 12% → 2% (A → B → C) - legal/tax: legal_advice_viol 89% → 88% → 0% (A → B → C) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

changliu2 · 2026-05-23T00:02:12Z

Closing — converging to a single //build banking demo at #88 (bank-manager ACS port, LangGraph). The 4-axis behavior coverage and the layered prompt-fix arc will land in-place on #88. Branch private-banking-rm-v3-langchain preserved; if there's specific LangChain-vs-LangGraph parity work worth surfacing (e.g., to show ASSERT speaks both), that's a separate follow-up after the bank-manager demo ships.

changliu2 and others added 2 commits May 21, 2026 21:59

Copilot AI mentioned this pull request May 22, 2026

Add refund-agent eval-fix behavior pair (deterministic shield bypass + semantic policy reasoning) #82

Closed

changliu2 closed this May 23, 2026

Copilot AI mentioned this pull request May 27, 2026

Update PR #88 banking demo storyline for local-authoring flow #118

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)#81

feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)#81
changliu2 wants to merge 2 commits into
mainfrom
private-banking-rm-v3-langchain

changliu2 commented May 22, 2026

Uh oh!

changliu2 commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

changliu2 commented May 22, 2026

What

Why

Results (n=400)

Deterministic eval

Legal/tax eval

Validation

Scope

Known follow-ups (not blockers)

Uh oh!

changliu2 commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant