Update PR #88 banking demo storyline for local-authoring flow #118
Closed
Copilot wants to merge 2 commits into
Copilot AI changed the title from "[WIP] Modify PR88 storyline and update demo flow documentation" to "Update PR #88 banking demo storyline for local-authoring flow" on May 27, 2026
Collaborator
Superseded by direct work on PR #88. Closing.
The PR #88 banking demo narrative needed to reflect the intended local-authoring flow: baseline first, four axes with a fourth system-level emergent quality axis, prompt-only first fix, then CI regression and ACS mitigation. This also needed to make the "empty `judge.dimensions` => default dimensions" behavior explicit and documented.
Storyline source of truth
- `docs/internal/lt-review/adr-demo-flow.md` as the repo-side outline for the revised banking demo flow.
- system-level emergent
- `guardrails.yaml` for the ACS-fit axes
PR #88 (examples(bank-manager): ACS-vs-unguarded comparison demo) cross-reference cleanup
Config behavior coverage
- An empty `pipeline.judge.dimensions` mapping is accepted, which is the documented path for relying on ASSERT's built-in default dimensions.
This keeps the docs aligned with the intended demo arc while tightening the contract around the default-dimensions path the storyline depends on.
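The default-dimensions contract described above can be illustrated with a minimal Python sketch. This is not ASSERT's actual implementation: the function name `resolve_judge_dimensions`, the placeholder entries in `DEFAULT_DIMENSIONS`, and the exact fallback rules are assumptions made for illustration. Only the `pipeline.judge.dimensions` key path comes from the PR description.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical placeholders: the real built-in default dimensions are not
# listed in this PR, so these names are illustrative only.
DEFAULT_DIMENSIONS: dict[str, Any] = {
    "default_dimension_a": {},
    "default_dimension_b": {},
}


def resolve_judge_dimensions(config: Mapping[str, Any]) -> dict[str, Any]:
    """Return the dimensions to score, using defaults when none are configured.

    Assumed behavior: a missing key, a null value, or an empty mapping under
    pipeline.judge.dimensions all select the built-in defaults.
    """
    pipeline = config.get("pipeline") or {}
    judge = pipeline.get("judge") or {}
    dimensions = judge.get("dimensions")

    if dimensions is None:
        return dict(DEFAULT_DIMENSIONS)
    if not isinstance(dimensions, Mapping):
        raise TypeError("pipeline.judge.dimensions must be a mapping")
    if len(dimensions) == 0:
        # The empty mapping is accepted and means "use the defaults".
        return dict(DEFAULT_DIMENSIONS)
    return dict(dimensions)
```

Under these assumptions, a per-axis YAML that leaves `judge.dimensions` empty resolves to the defaults, while a config that names its own dimensions keeps them unchanged.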
Original prompt
An agent to modify the PR #88 storyline, @docs\internal\lt-review\adr-demo-flow.md, and the [📝 banking-demo-10min-flow.docx]. Start from local authoring:
- Run baseline: only show 1 per axis that has > 0 rates. The 4th axis will be called system-level emergent, which is a quality dimension: basically agent task adherence to its own instruction (system prompt), looking at task completion and the agent not over-refusing benign in-scope requests. One axis behavior per YAML, and leave judge.dimensions empty to trigger default dimensions.
- Local results: discovered violation rates, validated by a few examples.
- Write a fix first with DO NOT prompting.
- Raise PR to trigger CI test; assume the CI test is configured as a pipeline.
- CI test shows regressions to the baseline.
- Fix with ACS. The ACS guardrails.yaml should be only 1 YAML that contains the singular mitigation for all 4 axes. If the 4th dimension is not good for ACS, do not mitigate it with ACS.
- Regression tests passed now with baseline. DO NOT prompting does not close vulnerabilities, and the quality aspect is acceptable.
**Chronological Review:**
1. Session opened with a heavy prior context (carried in via summary) about //build 2026 demo prep. User had just confirmed "1a yes 3 use the latest repo version and yes 4 yes" with flagged viewer/data issues. The work-in-progress was: removing redundant `policy_viol` custom dim from 9 YAMLs, fixing viewer compare labels, adding agent-shield to pyproject, and preparing 4 PRs.
2. Agent picked up by updating RM-v3 README with n=400 numbers (dropped policy_viol column from results tables), then created a fresh worktree for the CLI override branch, cherry-picked `p2m/cli.py` + `p2m/runner.py` from RM-v3 branch, and smoke-tested the `--override` flag (5/5 split confirmed).
3. Committed and pushed all 4 branches with boundary audits clean and Copilot co-author trailer. Drafted 4 PR bodies and asked user via `ask_user`. User chose "Post all 4 as drafted".
4. Opened 4 PRs successfully:
   - #78 (viewer): fix(viewer): support 3-way compare for >2 runs; truncate long callable target labels
   - #79 (cli-override): feat(cli): add --override flag for nested config key overrides
   - #80 (AS-MCP): feat(banking-mcp): agent-shield banking 3-step eval-fix demo (LangGraph + MCP + Phoenix)
   - #81 (RM-v3, with dependency note on #79): feat(banking-rm-v3): private-banking RM 3-step eval-fix demo (LangChain + Phoenix)
   Cleaned up worktrees (force-removed one with untracked files). Noted repo was renamed `microsoft/adaptive-eval` → `microsoft/ASSERT`.
5. Then transitioned to scheduled daily sentinel reports. Six consecutive days of reports (2026-05-22 through 2026-05-27), all checking the status of PR #41 (ci(science): regression-gate workflow + 6 efficacy metrics + Holm-Bonferroni gate; the Science Regression Gate workflow). Pattern: workflow not yet on `main`, PR #41 sits idle, sentinel reports stale state.
6. Between sentinel reports, a `system_notification` indicated the leftover viewer dev server (shellId: viewer-dev) had completed; agent acknowledged without action.
Intent Mapping:
Technical Inventory:
Code Archaeology:
- `--override` flag with `_apply_config_overrides` function (44 LOC)
- `policy_viol` blocks removed
- `runLabel` split on `:` then `/`, per-run card truncate
- `agent_shield` extra added
Progress Assessment:
Context Validation:
All continuation context for both demo PRs and the sentinel is captured. The CLI rename (#108) introduces new risk for the PR #41 rebase.
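For readers unfamiliar with nested config overrides of the kind the `--override` flag (PR #79) provides, the following is a minimal Python sketch. It is not the repository's `_apply_config_overrides` function, whose source is not shown here. The helper name `apply_config_overrides_sketch`, the dotted `key.path=value` syntax, and the JSON-based value parsing are all assumptions made for illustration.

```python
import copy
import json
from typing import Any


def _parse_value(raw: str) -> Any:
    """Parse a value as JSON when possible (5, true, [1, 2]); else keep the string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_config_overrides_sketch(
    config: dict[str, Any], overrides: list[str]
) -> dict[str, Any]:
    """Apply hypothetical 'dotted.key=value' overrides to a copy of config."""
    result = copy.deepcopy(config)  # never mutate the caller's config
    for item in overrides:
        key_path, sep, raw_value = item.partition("=")
        if not sep or not key_path:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        keys = key_path.split(".")
        node = result
        # Walk (and create) intermediate mappings down to the parent of the leaf.
        for key in keys[:-1]:
            child = node.setdefault(key, {})
            if not isinstance(child, dict):
                raise ValueError(f"{key!r} in {key_path!r} is not a mapping")
            node = child
        node[keys[-1]] = _parse_value(raw_value)
    return result
```

With this sketch, `apply_config_overrides_sketch(cfg, ["pipeline.judge.model=other"])` returns a new config with only that nested key changed and leaves `cfg` untouched. Copying first is a deliberate choice so that a smoke test can compare the overridden run against the original configuration.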
Recent Commands Analysis:
Last operations were two read-only powershell calls for the 2026-05-27 sentinel:
- `gh run list ... --branch main --event schedule` → `[]` (no scheduled runs)
- `gh run list ... --limit 5` → 5 historical runs, latest still 2026-05-22 cancellations
- `gh pr view 41` → state OPEN, mergeable now UNKNOWN (changed from MERGEABLE)
- `gh pr list --state merged --base main --limit 5` → revealed 5 recent merges including the #108 CLI rename (refactor(package): rename p2m -> assert_eval (CLI is assert-eval))
Agent surfaced these findings as the 2026-05-27 sentinel report with TL;DR recommending nudge + Friday EOD pause.
1. Conversation Overview: - Primary Objectives: - "Post all 4 as drafted" — ship 4 PRs (viewer 3-way compare, CLI override, RM-v3 LangChain demo, AS-MCP banking demo) for //build...
Created from Copilot CLI via the copilot delegate command.