Update PR #88 banking demo storyline for local-authoring flow #118

Closed
Copilot wants to merge 2 commits into main from copilot/modify-pr88-storyline

Copilot AI (Contributor) commented May 27, 2026

The PR #88 banking demo narrative needed to reflect the intended local-authoring flow: a baseline run first, four axes (the fourth being a system-level emergent quality axis), a prompt-only first fix, then CI regression and ACS mitigation. It also needed to make the “empty judge.dimensions => default dimensions” behavior explicit and to document it.

  • Storyline source of truth

    • Added docs/internal/lt-review/adr-demo-flow.md as the repo-side outline for the revised banking demo flow.
    • Captures the intended sequence:
      • baseline run first
      • one behavior YAML per axis
      • four axes, including system-level emergent
      • local result inspection before mitigation
      • prompt-only first fix
      • CI regression against baseline
      • single ACS guardrails.yaml for the ACS-fit axes
      • no forced ACS mitigation for the fourth axis if it is not a good fit
  • Cross-reference cleanup for PR #88 (examples(bank-manager): ACS-vs-unguarded comparison demo)

  • Config behavior coverage

    • Added a focused regression test confirming that an empty pipeline.judge.dimensions mapping is accepted, which is the documented path for relying on ASSERT’s built-in default dimensions.
pipeline:
  judge:
    dimensions: {}
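A regression test of this kind could look roughly like the sketch below. It is not the test added in this PR: `resolve_dimensions` and `DEFAULT_DIMENSIONS` are hypothetical stand-ins for however ASSERT actually loads its config and defines its built-in dimensions, and the placeholder dimension names are invented for illustration.

```python
from typing import Any, Mapping

# Hypothetical placeholder defaults; ASSERT's real built-in dimensions are not shown in this PR.
DEFAULT_DIMENSIONS: dict[str, dict[str, Any]] = {
    "default_dimension_a": {},
    "default_dimension_b": {},
}


def resolve_dimensions(config: Mapping[str, Any]) -> dict[str, dict[str, Any]]:
    """Return the configured judge dimensions, or the defaults when the mapping is empty."""
    judge = config.get("pipeline", {}).get("judge", {})
    dimensions = judge.get("dimensions", {})
    if not isinstance(dimensions, Mapping):
        raise TypeError("pipeline.judge.dimensions must be a mapping")
    # An empty mapping is accepted and means "use the built-in defaults".
    return dict(dimensions) if dimensions else dict(DEFAULT_DIMENSIONS)


def test_empty_dimensions_falls_back_to_defaults() -> None:
    config = {"pipeline": {"judge": {"dimensions": {}}}}
    assert resolve_dimensions(config) == DEFAULT_DIMENSIONS


def test_explicit_dimensions_are_kept() -> None:
    config = {"pipeline": {"judge": {"dimensions": {"custom": {}}}}}
    assert resolve_dimensions(config) == {"custom": {}}
```

The point of the sketch is the contract: an empty mapping must not be rejected as invalid, and it must not be confused with a non-mapping value.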

This keeps the docs aligned with the intended demo arc while tightening the contract around the default-dimensions path the storyline depends on.
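The "CI regression against baseline" step in the outline can be pictured as a per-axis comparison of violation rates. The sketch below assumes a regression means a violation rate rising above its baseline value; the function name, the `tolerance` parameter, and the rate-per-axis dictionary shape are all hypothetical and are not taken from the ASSERT CI pipeline.

```python
from typing import Mapping


def find_regressions(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return the axes whose violation rate exceeds baseline by more than `tolerance`."""
    regressions: dict[str, float] = {}
    for axis, base_rate in baseline.items():
        # A missing axis is an error, not a pass: the gate cannot vouch for it.
        if axis not in candidate:
            raise KeyError(f"candidate run has no result for axis {axis!r}")
        delta = candidate[axis] - base_rate
        if delta > tolerance:
            regressions[axis] = delta
    return regressions


if __name__ == "__main__":
    # Made-up rates for illustration only; these are not results from the demo.
    baseline = {"axis_a": 0.10, "system_level_emergent": 0.05}
    candidate = {"axis_a": 0.04, "system_level_emergent": 0.09}
    failed = find_regressions(baseline, candidate)
    print("regressed axes:", sorted(failed))
```

A gate like this fails the CI job whenever the returned dictionary is non-empty, which matches the storyline's arc of a prompt-only fix regressing on at least one axis before the ACS mitigation is applied.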

Original prompt

An agent to modify the PR88 storyline, @docs\internal\lt-review\adr-demo-flow.md, and the [📝 banking-demo-10min-flow.docx]. Start from local authoring:

  - Run baseline: only show 1 per axis that has > 0 rates. The 4th axis will be called system-level emergent, which is a quality dimension: basically the agent's task adherence to its own instructions (system prompt), looking at task completion and at the agent not over-refusing benign in-scope requests. One axis behavior per YAML, and leave judge.dimensions empty to trigger the default dimensions.
  - Local results: discover violation rates and validate with a few examples.
  - Write a fix first with prompting.
  - Raise a PR to trigger the CI test (assume the CI test is configured as a pipeline).
  - The CI test shows regressions against the baseline.
  - Fix with ACS. The ACS guardrails.yaml should be a single YAML that contains one mitigation per axis for all 4 axes; if the 4th dimension is not a good fit for ACS, do not mitigate it with ACS.
  - Regression tests now pass against the baseline. Prompting does not close the vulnerabilities, and the quality aspect is acceptable.

**Chronological Review:**

  1. Session opened with a heavy prior context (carried in via summary) about //build 2026 demo prep. User had just confirmed "1a yes 3 use the latest repo version and yes 4 yes" with flagged viewer/data issues. The work-in-progress was: removing redundant `policy_viol` custom dim from 9 YAMLs, fixing viewer compare labels, adding agent-shield to pyproject, and preparing 4 PRs.

  2. Agent picked up by updating RM-v3 README with n=400 numbers (dropped policy_viol column from results tables), then created a fresh worktree for the CLI override branch, cherry-picked p2m/cli.py + p2m/runner.py from RM-v3 branch, smoke-tested the --override flag (5/5 split confirmed).

  3. Committed and pushed all 4 branches with boundary audits clean and Copilot co-author trailer. Drafted 4 PR bodies and asked user via ask_user. User chose "Post all 4 as drafted".

  4. Opened 4 PRs successfully: fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels #78 (viewer), feat(cli): add --override flag for nested config key overrides #79 (cli-override), feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix) #80 (AS-MCP), feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix) #81 (RM-v3 with dependency note on feat(cli): add --override flag for nested config key overrides #79). Cleaned up worktrees (force-removed one with untracked files). Noted repo was renamed microsoft/adaptive-eval → microsoft/ASSERT.

  5. Then transitioned to scheduled daily sentinel reports. Six consecutive days of reports (2026-05-22 through 2026-05-27), all checking PR ci(science): regression-gate workflow + 6 efficacy metrics + Holm-Bonferroni gate #41 (Science Regression Gate workflow) status. Pattern: workflow not yet on main, PR ci(science): regression-gate workflow + 6 efficacy metrics + Holm-Bonferroni gate #41 sits idle, sentinel reports stale state.

  6. Between sentinel reports, a system_notification indicated the leftover viewer dev server (shellId: viewer-dev) had completed; agent acknowledged without action.

Intent Mapping:

Technical Inventory:

  • Tools used: PowerShell (gh CLI, git, uv), report_intent, ask_user, list_powershell, view, edit, grep
  • Frameworks: git worktrees, GitHub CLI, P2M (rebranded to assert-eval after refactor(package): rename p2m -> assert_eval (CLI is assert-eval) #108), LangChain, LangGraph, agent-shield, Phoenix tracing
  • Patterns: dependency-noted PR chains, boundary audits, Copilot co-author trailers

Code Archaeology:

  • README.md (RM-v3): tables updated to n=400 numbers, policy_viol column dropped
  • p2m/cli.py + p2m/runner.py: --override flag with _apply_config_overrides function (44 LOC)
  • All 9 eval YAMLs: policy_viol blocks removed
  • viewer compare/+page.svelte: runLabel split on : then /, per-run card truncate
  • pyproject.toml: agent_shield extra added
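For readers unfamiliar with nested config key overrides, the idea behind an `--override` style flag can be sketched as follows. This is not the `_apply_config_overrides` function from p2m/cli.py, whose signature and behavior are not shown here; the `dotted.key=value` syntax, the JSON value parsing, and the function name `apply_overrides` are assumptions made for illustration.

```python
import copy
import json
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Apply 'dotted.key=value' overrides to a nested dict and return a new dict."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        key_path, sep, raw_value = item.partition("=")
        if not sep or not key_path:
            raise ValueError(f"override must look like key.path=value, got {item!r}")
        try:
            # Parse numbers, booleans, lists, and so on when the value is valid JSON.
            value: Any = json.loads(raw_value)
        except json.JSONDecodeError:
            value = raw_value  # otherwise keep the raw string
        *parents, leaf = key_path.split(".")
        node = result
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"cannot descend into non-mapping key {part!r}")
            node = child
        node[leaf] = value
    return result


if __name__ == "__main__":
    base = {"pipeline": {"judge": {"model": "a"}}}
    print(apply_overrides(base, ["pipeline.judge.model=b", "pipeline.runs=5"]))
```

Working on a deep copy keeps the base config reusable across runs, which matters when the same YAML is evaluated several times with different overrides.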

Progress Assessment:

Context Validation:
All continuation context for both demo PRs and the sentinel is captured. The CLI rename (#108) introduces new risk for the PR #41 rebase.

Recent Commands Analysis:
Last operations were two read-only powershell calls for the 2026-05-27 sentinel:

  1. gh run list ... --branch main --event schedule → [] (no scheduled runs)
  2. gh run list ... --limit 5 → 5 historical runs, latest still 2026-05-22 cancellations
  3. gh pr view 41 → state OPEN, mergeable now UNKNOWN (changed from MERGEABLE)
  4. Follow-up gh pr list --state merged --base main --limit 5 → revealed 5 recent merges including refactor(package): rename p2m -> assert_eval (CLI is assert-eval) #108 CLI rename

Agent surfaced these findings as the 2026-05-27 sentinel report with TL;DR recommending nudge + Friday EOD pause.

1. Conversation Overview: - Primary Objectives: - "Post all 4 as drafted" — ship 4 PRs (viewer 3-way compare, CLI override, RM-v3 LangChain demo, AS-MCP banking demo) for //build...

Created from Copilot CLI via the copilot delegate command.

Copilot AI changed the title from "[WIP] Modify PR88 storyline and update demo flow documentation" to "Update PR #88 banking demo storyline for local-authoring flow" May 27, 2026
Copilot AI requested a review from changliu2 May 27, 2026 23:33
changliu2 (Collaborator) commented


Superseded by direct work on PR #88. Closing.

changliu2 closed this May 31, 2026
changliu2 deleted the copilot/modify-pr88-storyline branch May 31, 2026 04:11