feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) #81
Closed
changliu2 wants to merge 2 commits into
LangGraph/LangChain RM agent with three variants:
- A: baseline 5-line system prompt (no policy)
- B: A + 3 DO-NOT lines for sanctions / domain / RM-book
- C: B + Agent Shield-style deterministic runtime gates

Phoenix tracing via openinference-instrumentation-langchain shows tool calls in the viewer transcript pane and writes artifacts/phoenix/spans.jsonl. Verdict dimensions were renamed to negative connotation (*_viol) so 0% = green in the viewer. Adds p2m run --override support for smoke runs such as test_set.sample_size=10.

Results (full eval):
- email_domain_viol: A=16.5% B=0.0% C=0.0%
- rm_book_viol: A=31.5% B=6.5% C=0.0%
- sanctions_viol: A=20.5% B=0.0% C=0.5%
- legal_advice_viol: A=83.0% B=79.0% C=0.0%

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
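The `--override` smoke-run support mentioned in this commit takes dotted keys such as `test_set.sample_size=10`. The P2M implementation is not shown in this PR, so the following is only a minimal sketch of how such an override could be applied to a nested config dictionary. The function names `parse_override` and `apply_overrides` and the sample config are hypothetical.

```python
import ast
from copy import deepcopy
from typing import Any


def parse_override(item: str) -> tuple[list[str], Any]:
    """Split 'a.b.c=value' into a key path and a parsed value (hypothetical helper)."""
    key, sep, raw = item.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {item!r}")
    try:
        # Parses Python literals: numbers, quoted strings, lists, True/False.
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw  # anything else is kept as a plain string
    return key.split("."), value


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with each dotted override applied."""
    result = deepcopy(config)  # never mutate the caller's config
    for item in overrides:
        path, value = parse_override(item)
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                raise TypeError(f"{part!r} in {item!r} is not a mapping")
        node[path[-1]] = value
    return result


# Assumed config shape, for illustration only.
base = {"test_set": {"sample_size": 400, "seed": 7}}
smoke = apply_overrides(base, ["test_set.sample_size=10"])
```

Returning a copy keeps the full-eval config intact, so a smoke run and a full run can share one base file.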
…ADME with n=400

The per-eval custom `policy_viol` LLM-judged dim duplicated P2M's built-in `policy_violation` master roll-up (auto-computed from per-taxonomy-node judgments in judge_normalization.py). The custom version routinely contradicted the deterministic gate evidence in shielded runs:
- legal_tax C: built-in policy_violation ~0%, but custom policy_viol stuck ~92%
- deterministic C: built-in policy_violation 2%, but custom policy_viol 0.5% (the two were already close on deterministic; the drift was concentrated on legal/tax content where the judge over-triggered on shield-block messages)

Removing the custom dim:
- keeps the built-in policy_violation column (master), which is internally consistent because it derives from per-node `violated` flags rather than a single-shot rubric judgment
- eliminates the contradictory rubric and the duplicate column in the viewer's policy-violation tabs

README updated with n=400 validation numbers showing the clean trend:
- deterministic: policy_violation 76% → 12% → 2% (A → B → C)
- legal/tax: legal_advice_viol 89% → 88% → 0% (A → B → C)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
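To illustrate why a roll-up derived from per-node `violated` flags stays consistent with the per-node columns, here is a minimal sketch. It assumes the master flag is a logical OR over the node flags. The actual logic in judge_normalization.py is not shown in this PR, and `NodeJudgment`, `policy_violation`, and `violation_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeJudgment:
    """One judge verdict for one taxonomy node (hypothetical shape)."""
    node: str       # e.g. "sanctions", "rm_book", "email_domain"
    violated: bool


def policy_violation(judgments: list[NodeJudgment]) -> bool:
    """Assumed roll-up: a sample violates policy if any node is violated."""
    return any(j.violated for j in judgments)


def violation_rate(samples: list[list[NodeJudgment]]) -> float:
    """Fraction of samples whose roll-up flag is set."""
    if not samples:
        return 0.0
    return sum(policy_violation(s) for s in samples) / len(samples)


# Four illustrative samples; only the second has a violated node.
samples = [
    [NodeJudgment("sanctions", False), NodeJudgment("rm_book", False)],
    [NodeJudgment("sanctions", False), NodeJudgment("rm_book", True)],
    [NodeJudgment("sanctions", False), NodeJudgment("rm_book", False)],
    [NodeJudgment("sanctions", False), NodeJudgment("rm_book", False)],
]
rate = violation_rate(samples)
```

Under this assumption the master column can never report a violation that no per-node column shows, which is the kind of contradiction the separate single-shot rubric produced.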
Closing — converging to a single //build banking demo at #88 (bank-manager ACS port, LangGraph). The 4-axis behavior coverage and the layered prompt-fix arc will land in-place on #88. Branch |
What

A new `examples/private_banking_rm_v3_langchain/` demo showing the same 3-step eval-fix loop framing as the existing incident-triage demo, applied to a private-banking relationship-manager agent built on LangChain + Phoenix.

The agent has 4 tools (`get_client_portfolio`, `check_sanctions_list`, `get_rm_book`, `send_email`) and 3 progressively-hardened variants:

- `chat_baseline`
- `chat_prompt_hardened`
- `chat_shielded`: `validate_tool_call` gates + aux output warning

Two eval specs cover both failure surfaces:

- `eval_config_deterministic_*.yaml`: structured-tool gates (sanctioned-country wires, off-book client lookups, external email domains)
- `eval_config_legal_tax_*.yaml`: content-policy gates (legal-advice and tax-advice avoidance)

Why
Mehrnoosh's RM agent in PR #69 was the right business scenario but had two issues for the //build joint demo:
This rebuild fixes (1) by switching the runtime to LangChain's ToolCalling loop with persistent state, and addresses (2) by splitting the demo into two eval specs with non-overlapping dim sets so the deterministic-control story (Roni's ask) and the LLM-judged content story (Sandeep's ask) each have their own visualization.
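To make the deterministic-control story concrete, a runtime gate in the spirit of the shielded variant's `validate_tool_call` checks might look like the sketch below. Only the function name and the tool names come from this PR. The signature, the argument keys (`client_id`, `to`), the `GateResult` type, and the allow-list are assumptions. The sanctions gate is left out because the shape of a wire request is not shown here.

```python
from dataclasses import dataclass
from typing import Any

# Placeholder allow-list; a real deployment would load this from policy config.
ALLOWED_EMAIL_DOMAINS = frozenset({"bank.example"})


@dataclass(frozen=True)
class GateResult:
    """Outcome of a deterministic pre-execution check (hypothetical type)."""
    allowed: bool
    reason: str = ""


def validate_tool_call(
    tool: str, args: dict[str, Any], rm_book: frozenset[str]
) -> GateResult:
    """Block off-book lookups and external email before the tool runs."""
    if tool == "get_client_portfolio" and args.get("client_id") not in rm_book:
        return GateResult(False, "client is not in this RM's book")
    if tool == "send_email":
        # Take everything after the last '@' as the recipient domain.
        domain = str(args.get("to", "")).rpartition("@")[2].lower()
        if domain not in ALLOWED_EMAIL_DOMAINS:
            return GateResult(False, f"external email domain: {domain!r}")
    return GateResult(True)


# Illustrative calls with a one-client RM book.
book = frozenset({"C-001"})
on_book = validate_tool_call("get_client_portfolio", {"client_id": "C-001"}, book)
off_book = validate_tool_call("get_client_portfolio", {"client_id": "C-999"}, book)
external = validate_tool_call("send_email", {"to": "client@gmail.com"}, book)
```

Because the check inspects only the tool name and arguments, its verdict does not depend on how the model was prompted. That is the property the deterministic eval spec is meant to expose.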
Results (n=400)
Deterministic eval
Built-in `policy_violation` master: 76% → 12% → 2%.
Legal/tax eval
The DO-NOT prompt barely helps on legal/tax (88% → 88%); only the runtime guardrails drive the rate to zero. This is the headline finding for the //build "promise of structured controls" pitch.
Validation
Scope
- `examples/private_banking_rm_v3_langchain/` (agent, tools, 6 configs, README)
- `p2m/runtime`

Known follow-ups (not blockers)

- `overrefusal` rate on C is 71% deterministic / 66% legal-tax: a visible precision trade-off to tune in a future pass (likely the judge rubric over-flagging shield-block messages as refusals).