Skip to content

feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)#81

Closed
changliu2 wants to merge 2 commits into
mainfrom
private-banking-rm-v3-langchain
Closed

feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)#81
changliu2 wants to merge 2 commits into
mainfrom
private-banking-rm-v3-langchain

Conversation

@changliu2

Copy link
Copy Markdown
Collaborator

Depends on #79 for the --override flag used in the README run snippets. Will rebase to no-op after that merges.

What

A new examples/private_banking_rm_v3_langchain/ demo showing the same 3-step eval-fix loop framing as the existing incident-triage demo, applied to a private-banking relationship-manager agent built on LangChain + Phoenix.

The agent has 4 tools (get_client_portfolio, check_sanctions_list, get_rm_book, send_email) and 3 progressively-hardened variants:

Variant Target What changes
A chat_baseline 5-line role/tool prompt, no policy text
B chat_prompt_hardened A + three DO-NOT lines (sanctions, email domains, RM book)
C chat_shielded B + deterministic validate_tool_call gates + aux output warning

Two eval specs cover both failure surfaces:

  • eval_config_deterministic_*.yaml — structured-tool gates (sanctioned-country wires, off-book client lookups, external email domains)
  • eval_config_legal_tax_*.yaml — content-policy gates (legal-advice and tax-advice avoidance)

Why

Mehrnoosh''s RM agent in PR #69 was the right business scenario but had two issues for the //build joint demo:

  1. The agent state machine didn''t actually loop through tool calls in multi-turn conversations, so all 8 turns of a scenario returned the same first response.
  2. The deterministic vs LLM-classifier story wasn''t visually separable in the viewer.

This rebuild fixes (1) by switching the runtime to LangChain''s ToolCalling loop with persistent state, and addresses (2) by splitting the demo into two eval specs with non-overlapping dim sets so the deterministic-control story (Roni''s ask) and the LLM-judged content story (Sandeep''s ask) each have their own visualization.

Results (n=400)

Deterministic eval

Dim A: baseline B: +DO-NOT C: +shield
email_domain_viol 18.5% 0.0% 0.5%
rm_book_viol 37.5% 9.8% 0.0%
sanctions_viol 23.0% 0.8% 0.8%
overrefusal 35.3% 56.3% 70.8%

Built-in policy_violation master: 76% → 12% → 2%.

Legal/tax eval

Dim A: baseline B: +DO-NOT C: +shield
legal_advice_viol 88.8% 88.0% 0.0%
overrefusal 72.3% 72.0% 66.3%

The DO-NOT prompt barely helps on legal/tax (88% → 88%) — runtime guardrails are the one thing that drives it to zero. This is the headline finding for the //build "promise of structured controls" pitch.

Validation

  • 200 prompts + 200 scenarios per variant per eval (n=400 × 6 configs)
  • All 25,999 tool-call spans captured in Phoenix and viewable in the viewer''s audit tab
  • 741 pytest pass (unchanged from main)
  • Boundary audit clean

Scope

  • New: examples/private_banking_rm_v3_langchain/ (agent, tools, 6 configs, README)
  • No changes to core p2m/ runtime
  • No changes to viewer

Known follow-ups (not blockers)

  • overrefusal rate on C is 71% deterministic / 66% legal-tax — visible precision trade-off to tune in a future pass (likely judge rubric over-flagging shield-block messages as refusals).

changliu2 and others added 2 commits May 21, 2026 21:59
LangGraph/LangChain RM agent with three variants:
  A: baseline 5-line system prompt (no policy)
  B: A + 3 DO-NOT lines for sanctions / domain / RM-book
  C: B + Agent Shield-style deterministic runtime gates

Phoenix tracing via openinference-instrumentation-langchain shows tool calls in the viewer transcript pane and writes artifacts/phoenix/spans.jsonl. Verdict dimensions were renamed to negative connotation (*_viol) so 0% = green in the viewer.

Adds p2m run --override support for smoke runs such as test_set.sample_size=10.

Results (full eval):
  email_domain_viol: A=16.5% B=0.0% C=0.0%
  rm_book_viol:      A=31.5% B=6.5% C=0.0%
  sanctions_viol:    A=20.5% B=0.0% C=0.5%
  legal_advice_viol: A=83.0% B=79.0% C=0.0%

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ADME with n=400

The per-eval custom `policy_viol` LLM-judged dim duplicated P2M's built-in
`policy_violation` master roll-up (auto-computed from per-taxonomy-node
judgments in judge_normalization.py). The custom version routinely
contradicted the deterministic gate evidence in shielded runs:

- legal_tax C: built-in policy_violation ~0%, but custom policy_viol stuck ~92%
- deterministic C: built-in policy_violation 2%, but custom policy_viol 0.5%
  (the two were already close on deterministic; the drift was concentrated on
  legal/tax content where the judge over-triggered on shield-block messages)

Removing the custom dim:
- keeps the built-in policy_violation column (master), which is internally
  consistent because it derives from per-node `violated` flags rather than
  a single-shot rubric judgment
- eliminates the contradictory rubric and the duplicate column in the viewer's
  policy-violation tabs

README updated with n=400 validation numbers showing the clean trend:
- deterministic: policy_violation 76% → 12% → 2% (A → B → C)
- legal/tax: legal_advice_viol 89% → 88% → 0% (A → B → C)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@changliu2

Copy link
Copy Markdown
Collaborator Author

Closing — converging to a single //build banking demo at #88 (bank-manager ACS port, LangGraph). The 4-axis behavior coverage and the layered prompt-fix arc will land in-place on #88. Branch private-banking-rm-v3-langchain preserved; if there's specific LangChain-vs-LangGraph parity work worth surfacing (e.g., to show ASSERT speaks both), that's a separate follow-up after the bank-manager demo ships.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant