Update PR #88 banking demo storyline for local-authoring flow by Copilot · Pull Request #118 · responsibleai/ASSERT

Copilot · 2026-05-27T23:24:56Z

The PR #88 banking demo narrative needed to reflect the intended local-authoring flow: baseline first, four axes with a fourth system-level emergent quality axis, prompt-only first fix, then CI regression and ACS mitigation. This also needed to make the “empty judge.dimensions => default dimensions” behavior explicit and documented.

Storyline source of truth
- Added docs/internal/lt-review/adr-demo-flow.md as the repo-side outline for the revised banking demo flow.
- Captures the intended sequence:
  - baseline run first
  - one behavior YAML per axis
  - four axes, including system-level emergent
  - local result inspection before mitigation
  - prompt-only first fix
  - CI regression against baseline
  - single ACS guardrails.yaml for the ACS-fit axes
  - no forced ACS mitigation for the fourth axis if it is not a good fit
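The "CI regression against baseline" step in the sequence above can be sketched as a per-axis comparison of violation rates. This is a minimal illustration only: the function name `find_regressions`, the `tolerance` parameter, and the dict-of-rates input shape are assumptions, not ASSERT's actual CI gate, and it assumes that "regression" means a violation rate that rises above the baseline rate.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the axes whose violation rate exceeds baseline by more than `tolerance`.

    Hypothetical helper: both inputs map an axis name to a violation rate in [0, 1].
    """
    regressions = []
    for axis, base_rate in baseline.items():
        cand_rate = candidate.get(axis)
        # An axis missing from the candidate run counts as a regression,
        # so a silently dropped behavior cannot pass the gate.
        if cand_rate is None or cand_rate > base_rate + tolerance:
            regressions.append(axis)
    return sorted(regressions)
```

A CI job built on this sketch would fail the check when the returned list is non-empty and pass once the list is empty for every axis in the baseline.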
PR #88 cross-reference cleanup
- Updated the incident-triage README wording so the existing reference to PR #88 (examples(bank-manager): ACS-vs-unguarded comparison demo) points to the bank-manager local-authoring storyline, not just a generic demo mention.
Config behavior coverage
- Added a focused regression test confirming that an empty pipeline.judge.dimensions mapping is accepted, which is the documented path for relying on ASSERT’s built-in default dimensions.

pipeline:
  judge:
    dimensions: {}

This keeps the docs aligned with the intended demo arc while tightening the contract around the default-dimensions path the storyline depends on.
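The contract described above (an empty `pipeline.judge.dimensions` mapping is accepted and means "use the defaults") could be exercised by a test along these lines. This is a sketch under stated assumptions: `resolve_dimensions` and `DEFAULT_DIMENSIONS` are hypothetical stand-ins, the placeholder dimension names are invented, and the real ASSERT loader and its built-in dimensions are not shown in this PR.

```python
from typing import Any, Dict, Mapping

# Placeholder names only; the source does not list ASSERT's built-in dimensions.
DEFAULT_DIMENSIONS: Dict[str, Dict[str, Any]] = {
    "default_dimension_a": {},
    "default_dimension_b": {},
}


def resolve_dimensions(config: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical resolver: fall back to defaults when dimensions is empty."""
    judge = config.get("pipeline", {}).get("judge", {})
    dimensions = judge.get("dimensions", {})
    if not isinstance(dimensions, Mapping):
        raise TypeError("pipeline.judge.dimensions must be a mapping")
    if not dimensions:
        # Empty mapping is valid input and selects the built-in defaults.
        return dict(DEFAULT_DIMENSIONS)
    return dict(dimensions)


def test_empty_dimensions_uses_defaults() -> None:
    # Mirrors the YAML fragment: pipeline -> judge -> dimensions: {}
    config = {"pipeline": {"judge": {"dimensions": {}}}}
    assert resolve_dimensions(config) == DEFAULT_DIMENSIONS
```

Returning a copy of the defaults keeps a caller from mutating the shared default table by accident.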

Original prompt

An agent to modify the PR88 storyline, @docs\internal\lt-review\adr-demo-flow.md, and the [📝 banking-demo-10min-flow.docx]. Start from local authoring:
- Run baseline: only show 1 per axis that has > 0 rates. The 4th axis will be called system-level emergent, which is a quality dimension: basically agent task adherence to its own instructions (system prompt), looking at task completion and the agent not over-refusing benign in-scope requests. One axis behavior per YAML, and leave judge.dimensions empty to trigger default dimensions.
- Local results: discover violation rates and validate with a few examples.
- Write a fix first with prompting.
- Raise a PR to trigger the CI test; assume the CI test is configured as a pipeline.
- CI test shows regressions to the baseline.
- Fix with ACS. The ACS guardrails.yaml should be only 1 YAML that contains a singular mitigation for all 4 axes; if the 4th dimension is not a good fit for ACS, do not mitigate it with ACS.
- Regression tests now pass against the baseline. Prompting does not close vulnerabilities, and the quality aspect is acceptable.

Chronological Review:

1. Session opened with a heavy prior context (carried in via summary) about //build 2026 demo prep. User had just confirmed "1a yes 3 use the latest repo version and yes 4 yes" with flagged viewer/data issues. The work-in-progress was: removing redundant `policy_viol` custom dim from 9 YAMLs, fixing viewer compare labels, adding agent-shield to pyproject, and preparing 4 PRs.
2. Agent picked up by updating the RM-v3 README with n=400 numbers (dropped the policy_viol column from results tables), then created a fresh worktree for the CLI override branch, cherry-picked p2m/cli.py + p2m/runner.py from the RM-v3 branch, and smoke-tested the --override flag (5/5 split confirmed).
3. Committed and pushed all 4 branches with boundary audits clean and Copilot co-author trailer. Drafted 4 PR bodies and asked user via ask_user. User chose "Post all 4 as drafted".
4. Opened 4 PRs successfully:
   - #78 (viewer): fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels
   - #79 (cli-override): feat(cli): add --override flag for nested config key overrides
   - #80 (AS-MCP): feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix)
   - #81 (RM-v3, with dependency note on #79): feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)

   Cleaned up worktrees (force-removed one with untracked files). Noted repo was renamed microsoft/adaptive-eval → microsoft/ASSERT.
5. Then transitioned to scheduled daily sentinel reports. Six consecutive days of reports (2026-05-22 through 2026-05-27), all checking the status of PR #41 (ci(science): regression-gate workflow + 6 efficacy metrics + Holm-Bonferroni gate), the Science Regression Gate workflow. Pattern: workflow not yet on main, PR #41 sits idle, sentinel reports stale state.
6. Between sentinel reports, a system_notification indicated the leftover viewer dev server (shellId: viewer-dev) had completed; agent acknowledged without action.

Intent Mapping:

Original demo-prep goal: ship 4 clean PRs satisfying Roni (deterministic controls) and Sandeep (underspecified is fine) — completed.
Daily sentinel goal: surface any regression on main; in absence of main runs, report movement on PR #41 (ci(science): regression-gate workflow + 6 efficacy metrics + Holm-Bonferroni gate) and recommend pause/escalation if static.

Technical Inventory:

Tools used: PowerShell (gh CLI, git, uv), report_intent, ask_user, list_powershell, view, edit, grep
Frameworks: git worktrees, GitHub CLI, P2M (rebranded to assert-eval by #108, refactor(package): rename p2m -> assert_eval (CLI is assert-eval)), LangChain, LangGraph, agent-shield, Phoenix tracing
Patterns: dependency-noted PR chains, boundary audits, Copilot co-author trailers

Code Archaeology:

README.md (RM-v3): tables updated to n=400 numbers, policy_viol column dropped
p2m/cli.py + p2m/runner.py: --override flag with _apply_config_overrides function (44 LOC)
All 9 eval YAMLs: policy_viol blocks removed
viewer compare/+page.svelte: runLabel split on : then /, per-run card truncate
pyproject.toml: agent_shield extra added
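To illustrate what a nested config key override like the `--override` flag listed above might do, here is a minimal sketch. It is not the `_apply_config_overrides` function from p2m/cli.py, which this page does not show: the dotted `key.path=value` syntax, the function name `apply_overrides`, and the choice to keep values as plain strings are all assumptions made for the example.

```python
import copy
from typing import Any, Dict, Iterable


def apply_overrides(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `config` with each 'a.b.c=value' override applied.

    Hypothetical helper; values are stored as strings without type coercion.
    """
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        key_path, sep, value = item.partition("=")
        if not sep or not key_path:
            raise ValueError(f"expected KEY.PATH=VALUE, got {item!r}")
        *parents, leaf = key_path.split(".")
        node = result
        for part in parents:
            # Create intermediate mappings on demand.
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"cannot descend into non-mapping key {part!r}")
            node = child
        node[leaf] = value
    return result
```

Working on a deep copy keeps the loaded config reusable across runs, which matters when the same base config is evaluated with several override sets.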

Progress Assessment:

DONE: All 4 PRs opened (#78 through #81), worktrees cleaned, science sentinel running daily.
Pending: PR #41 review/merge (out of agent's control), eventual rebase of #81 after #79 merges.
Daily sentinel state: 6 consecutive "stale" reports, with main now drifting (5 merges including the CLI rename in #108).

Context Validation:
All continuation context for both the demo PRs and the sentinel is captured. The CLI rename (#108) introduces new risk for the PR #41 rebase.

Recent Commands Analysis:
Last operations were two read-only powershell calls for the 2026-05-27 sentinel:

gh run list ... --branch main --event schedule → [] (no scheduled runs)
gh run list ... --limit 5 → 5 historical runs, latest still 2026-05-22 cancellations
gh pr view 41 → state OPEN, mergeable now UNKNOWN (changed from MERGEABLE)
Follow-up gh pr list --state merged --base main --limit 5 → revealed 5 recent merges including the CLI rename in #108

Agent surfaced these findings as the 2026-05-27 sentinel report with TL;DR recommending nudge + Friday EOD pause.

1. Conversation Overview: - Primary Objectives: - "Post all 4 as drafted" — ship 4 PRs (viewer 3-way compare, CLI override, RM-v3 LangChain demo, AS-MCP banking demo) for //build...

Created from Copilot CLI via the copilot delegate command.

changliu2 · 2026-05-31T04:11:16Z

Superseded by direct work on PR #88. Closing.

Initial plan

12cdd86

Copilot AI assigned Copilot and changliu2 May 27, 2026

Copilot started work on behalf of changliu2 May 27, 2026 23:25

Update PR88 banking demo storyline

4c76e0f

Copilot AI changed the title ~~[WIP] Modify PR88 storyline and update demo flow documentation~~ Update PR #88 banking demo storyline for local-authoring flow May 27, 2026

Copilot finished work on behalf of changliu2 May 27, 2026 23:33

Copilot AI requested a review from changliu2 May 27, 2026 23:33

changliu2 closed this May 31, 2026

changliu2 deleted the copilot/modify-pr88-storyline branch May 31, 2026 04:11

Copilot wants to merge 2 commits into main from copilot/modify-pr88-storyline
