Add refund-agent eval-fix behavior pair (deterministic shield bypass + semantic policy reasoning) by Copilot · Pull Request #82 · responsibleai/ASSERT

Copilot · 2026-05-22T15:33:07Z

This PR adds a demo-ready behavior pair for the refund-agent eval-fix loop: one deterministic behavior intended to be fully mitigated by Agent Shield, and one semantic behavior that captures policy reasoning quality.

What this enables
- Introduces a clean 2-behavior storyline for refund-agent demos:
  - deterministic, shield-fixable control bypass
  - semantic policy-application errors
Behavior specs added
- examples/behavior_specs/refund_authorization_bypass.md
  - Instruction-hijack / policy-bypass attempts that should be blocked consistently.
  - Focuses on forced refund execution paths (skip verification, bypass eligibility, user-supplied internal instructions).
- examples/behavior_specs/refund_policy_reasoning_errors.md
  - Non-adversarial policy interpretation failures.
  - Covers eligibility-window mistakes, wrong policy branch selection, partial-refund edge cases, and cross-turn inconsistency.
Reference index update
- examples/behavior_specs/README.md
  - Adds both new specs to the behavior catalog so they are discoverable and reusable.

Example usage in eval config:

behavior:
  name: refund_eval
  description: |-
    # Refund Authorization Bypass (Deterministic)
    ...

Original prompt

follow the demo storyline, find 2 beahvors (1 detemrinistic that agent shield can fix 100% and 1 semantics) to build a eval-fix loop demo on theis refud agent scenario. git@github.com:changliu2/refund-agent-a365.git

**Chronological Review:** 1. Session began with summarized prior context: //build 2026 demo prep, with prior work on validating RM-v3 and AS-MCP demos at n=400, and a planned 4-PR sequence. 2. User had just confirmed "1a yes 3 use the latest repo version and yes 4 yes" — green light on all 4 PRs, plus two viewer issues (duplicate policy_viol column, BY BEHAVIOR header overlap). 3. Previous work in this session had: removed `policy_viol` from 9 YAMLs, fixed viewer compare page label/truncation, added `agent_shield` extra to pyproject.toml, rewrote AS-MCP README. 4. Current session resumed by: updating RM-v3 README with n=400 numbers, creating cli-override-flag branch via cherry-pick from RM-v3, smoke-testing the override logic, committing + pushing 4 branches in parallel, drafting 4 PR bodies, getting user approval, opening 4 PRs via `gh pr create`, then cleaning up worktrees.

Intent Mapping:

User's final confirmed direction: "1a yes 3 use the latest repo version and yes 4 yes" + fix duplicate policy_viol column + fix header overlap
Earlier framing: open 4 PRs in sequence (viewer → CLI override → RM-v3 → AS-MCP)

Technical Inventory:

Git worktrees for parallel branch work
gh pr create for PR opening
PowerShell here-strings for PR body files
LF/CRLF git warnings on YAML files (cosmetic)
Repo rename: microsoft/adaptive-eval → microsoft/ASSERT (GitHub redirects transparent)
Personal rules enforced: boundary audit, no git add -A, no public PR posting without user approval, Copilot co-author trailer

Code Archaeology:

README.md RM-v3: replaced old table with n=400 numbers, removed policy_viol column
4 commits across 4 branches, all with detailed body messages
CLI override smoke test verified test_set.sample_size=10 → 5/5 split

Progress Assessment:

All 4 PRs opened: fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels #78, feat(cli): add --override flag for nested config key overrides #79, feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix) #80, feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) #81
Worktrees cleaned up (all 4 removed; only main checkout remains)
All boundary audits passed
All commits include Copilot co-author trailer

Context Validation:

PR numbers captured: fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels #78 viewer, feat(cli): add --override flag for nested config key overrides #79 cli-override, feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix) #80 AS-MCP, feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) #81 RM-v3
Merge order recommendation given: fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels #78 + feat(cli): add --override flag for nested config key overrides #79 first, then feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix) #80 + feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) #81
Repo rename noted; remote URL still works via GitHub redirect

Recent Commands Analysis:
Last commands were: (1) parallel gh pr create for #80 and #81, (2) verification of PR list, (3) worktree cleanup (one needed --force due to validate-as-mcp containing modified/untracked files from prior validation run).

1. Conversation Overview: - Primary Objectives: //build 2026 demo prep for Chang Liu (PM DRI). User's most recent confirmation: "1a yes 3 use the latest repo version and yes 4 yes" — green light to fix the rubric bug, fix duplicate viewer columns, and open 4 PRs in sequence (viewer → CLI override → RM-v3 → AS-MCP). - Session Context: Continuation after compaction. Picked up mid-Wave-1 with most files already edited but uncommitted. This session: updated RM-v3 README, created cli-override-flag branch, smoke-tested override, committed + pushed all 4 branches, drafted PR bodies, got user approval ("Post all 4 as drafted"), opened all 4 PRs successfully, cleaned worktrees. - User Intent Evolution: Simple linear execution of pre-approved 4-PR plan. No new asks during this session.

Technical Foundation:
- Git worktrees: 4 parallel worktrees (p2m-wt-viewer, p2m-wt-cli, p2m-validate-rm-v3, p2m-validate-as-mcp) plus main checkout — all 4 worktrees now removed.
- GitHub CLI: gh pr create --base main --head <branch> --title ... --body-file .pr-body.md pattern
- PowerShell here-strings (@'...'@) for multi-line PR bodies
- Repo rename: remote URL https://github.com/microsoft/adaptive-eval.git still works; GitHub redirects to actual repo microsoft/ASSERT
- Boundary audit regex on each commit's --cached --name-only (passed all 4)
Codebase Status:
- examples/private_banking_rm_v3_langchain/README.md (branch private-banking-rm-v3-langchain):
  - Updated this session: replaced old result tables with n=400 numbers
  - Deterministic table: email_domain_viol/rm_book_viol/sanctions_viol/overrefusal, with policy_violation master 76→12→2% note
  - Legal/tax table: legal_advice_viol 88.8%→88.0%→0.0%, overrefusal 72.3→72.0→66.3
  - Committed in 1b4746b with 6 YAML files (policy_viol block removal)
- p2m/cli.py + p2m/runner.py (branch cli-override-flag):
  - Cherry-picked from RM-v3 commit 8620b14 via git checkout origin/private-banking-rm-v3-langchain -- p2m/cli.py p2m/runner.py
  - +44/-2 across 2 files; committed as 4a8a5b8
  - _apply_config_overrides(raw, overrides) at runner.py:70 — supports test_set.sample_size=N shortcut (splits N half-half: cei...

Created from Copilot CLI via the copilot delegate command.

changliu2

Review — Request changes

The two spec files (refund_authorization_bypass.md, refund_policy_reasoning_errors.md) are well-scoped: clean deterministic / semantic split, useful as starting points for a refund-agent demo. README pointer is fine.

What's missing relative to the asked scope (build a runnable 3-step eval-fix story mirroring the banking demo on the refund-agent at https://github.com/changliu2/refund-agent-a365):

No examples/refund_agent_*/ directory (no agent code, no MCP/tool plumbing, no target.callable)
No eval_config_*.yaml files (A/B/C variants)
No guardrails.yaml with the deterministic shield gates the authorization_bypass spec implies
No README with the 3-step story / headline table
No measured numbers — nothing to validate the specs against

Net delivery is ~10% of the asked scope. Recommend one of:

Close this PR and re-delegate with a stricter prompt requiring agent + configs + a runnable n=100 baseline
Re-scope this PR to "behavior specs only" in the title + description and merge as scaffolding for the demo work to follow
Extend this PR with the missing pieces

Happy with any of the three — flagging because as-is this doesn't deliver the demo it was opened to build.

changliu2 · 2026-05-26T03:25:25Z

Closing in favor of #88, the bank-manager 4-axis demo that's now landed n=100 numbers across four variants (unguarded → ACS gates → naïve DO-NOT prompt → ACS + GEPA prompt-optimized) and is the converged //build 2026 demo storyline.

The refund-agent eval-fix pair here was a useful early exploration of the deterministic-vs-semantic shield pattern, but PR #88 covers the same story arc more completely in the banking domain — 9 judge dims, a real trade-off chart, and a GEPA notebook — and was where we converged the demo scope.

Closing this PR. Thanks for the work — the framing influenced PR #88's 3-act structure.

Initial plan

e54f781

Copilot AI assigned Copilot and changliu2 May 22, 2026

Copilot started work on behalf of changliu2 May 22, 2026 15:33 View session

Add refund eval-fix behavior spec pair

f10cbb8

Copilot AI changed the title ~~[WIP] Add eval-fix loop demo for refund agent scenario~~ Add refund-agent eval-fix behavior pair (deterministic shield bypass + semantic policy reasoning) May 22, 2026

Copilot finished work on behalf of changliu2 May 22, 2026 15:42

Copilot AI requested a review from changliu2 May 22, 2026 15:43

changliu2 requested changes May 22, 2026

View reviewed changes

Copilot started work on behalf of changliu2 May 22, 2026 20:02 View session

Copilot finished work on behalf of changliu2 May 22, 2026 20:07

Copilot AI requested a review from changliu2 May 22, 2026 20:07

changliu2 closed this May 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add refund-agent eval-fix behavior pair (deterministic shield bypass + semantic policy reasoning)#82

Add refund-agent eval-fix behavior pair (deterministic shield bypass + semantic policy reasoning)#82
Copilot wants to merge 2 commits into
mainfrom
copilot/build-eval-fix-loop-demo

Copilot AI commented May 22, 2026 •

edited

Loading

Uh oh!

changliu2 left a comment

Uh oh!

changliu2 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

changliu2 left a comment

Choose a reason for hiding this comment

Review — Request changes

Uh oh!

changliu2 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented May 22, 2026 •

edited

Loading