Add evaluation cases for agent-factory, issue-ops-architect, and rule-distiller #3
weslleycapelari wants to merge 3 commits into
…-distiller; enhance MEMORY.md with sprint N4 Hardening records
Pull request overview
This PR expands the repository’s AI governance/eval hardening by adding new eval cases for core agents, recording sprint decisions, and introducing a PR governance audit step.
Changes:
- Added bad-input and high-quality-output eval fixtures for `agent-factory`, `issue-ops-architect`, and `rule-distiller`.
- Added sprint N4/N5 mission and memory records, plus roadmap updates for locale sync and L4 stabilization.
- Added a marketplace-integrity workflow step to report agent eval coverage on PRs.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `ROADMAP.md` | Updates sprint status, completed N4 work, and N5 backlog planning. |
| `library/github-baseline/MEMORY.md` | Adds sprint decision-memory entries for N4/N5. |
| `library/evals/rule-distiller/case-02-expected-high-quality-output.md` | Adds rule-distiller high-quality-output eval. |
| `library/evals/rule-distiller/case-01-bad-input.md` | Adds rule-distiller bad-input eval. |
| `library/evals/issue-ops-architect/case-02-expected-high-quality-output.md` | Adds issue-ops high-quality-output eval. |
| `library/evals/issue-ops-architect/case-01-bad-input.md` | Adds issue-ops bad-input eval. |
| `library/evals/agent-factory/case-02-expected-high-quality-output.md` | Adds agent-factory high-quality-output eval. |
| `library/evals/agent-factory/case-01-bad-input.md` | Adds agent-factory bad-input eval. |
| `.github/workflows/marketplace-integrity.yml` | Adds PR eval coverage audit reporting. |
| `.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md` | Adds N5 locale-sync/L4 stabilization mission plan. |
| `.github/MISSIONS/roadmap-2026-05-12-sprint-n4-hardening.md` | Adds N4 hardening mission record. |
Comments suppressed due to low confidence (3)
.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md:30
- The mission protocol is defined as one delegation hop per mission file, but this mission assigns Tasks 1/2 to `documentation-steward` and Task 3 to `project-auditor`. A valid `Child_Agent` field can contain only one child agent, so this should be split into separate mission files or represented as a single valid delegation hop before CI can pass.
## Parent Agent & Delegation
- **Parent Agent:** `@roadmap-steward`
- **Child Agents:**
- `@documentation-steward` — Tasks 1 & 2
- `@project-auditor` — Task 3
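The one-hop rule described in this comment could be checked mechanically. The following is a minimal sketch, assuming the delegation block is laid out as in the snippet above (a `Child Agent` or `Child Agents` label followed by bullet lines that start with a backticked `@handle`). The function name `child_agents` is hypothetical and is not part of the repository's CI.

```python
import re

# Matches a "- **Child Agent:**", "- **Child Agents:**" or "- **Child_Agent:**" label line.
CHILD_LABEL = re.compile(r"^-\s*\*\*Child[ _]Agents?:\*\*")
# Matches a backticked agent handle such as `@project-auditor`.
HANDLE = re.compile(r"`@([\w-]+)`")

def child_agents(mission_text):
    """Return the agent handles listed under the child-agent label (assumed layout)."""
    agents = []
    in_child_block = False
    for line in mission_text.splitlines():
        stripped = line.strip()
        if CHILD_LABEL.match(stripped):
            in_child_block = True
            # A single child may be written inline on the label line.
            agents += HANDLE.findall(stripped)
        elif in_child_block and stripped.startswith("- `@"):
            # Take only the leading handle of each bullet.
            agents.append(HANDLE.search(stripped).group(1))
        else:
            in_child_block = False
    return agents
```

Under this assumption, a mission file respects the one-hop rule when `len(child_agents(text)) == 1`; the snippet quoted above would yield two handles and be flagged.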
library/evals/rule-distiller/case-02-expected-high-quality-output.md:44
- This PASS criterion expects the section title `New Best Practices` for English output, but the current `rule-distiller` agent contract classifies best-practice findings under `Novas Boas Práticas`. Without updating the agent contract in the same PR, this eval will fail the repository's current rule-distiller behavior rather than validating it.
1. File is created at the exact path `library/instructions/laravel-11.md`.
2. Breaking Changes and New Best Practices are in separate sections (not mixed). When output language preference is English, all section titles must be in English — section must be titled "New Best Practices" (not "Novas Boas Práticas").
.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md:139
- This CI requirement checks for Spanish text in the EN locale, but the documented drift this sprint is meant to catch is Spanish wording in PT-BR files (for example, `Ejecute`). As written, the planned validation could pass while the known PT-BR issue remains undetected.
2. Step should validate:
- Files in `docs/en/` have PT-BR equivalents (and vice versa)
- No Spanish text in EN locale (sanity check)
- Files contain valid locale sync declaration block
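The first validation item (every file in `docs/en/` has a PT-BR equivalent, and vice versa) amounts to a set difference over relative paths. A minimal sketch follows, assuming the PT-BR tree mirrors the English one file for file. The PT-BR directory name and the function name `missing_counterparts` are assumptions; the mission file does not state them.

```python
from pathlib import Path

def missing_counterparts(en_dir, pt_dir, pattern="*.md"):
    """Compare two locale trees by relative path.

    Returns (files only in en_dir, files only in pt_dir), both sorted.
    """
    en_files = {p.relative_to(en_dir).as_posix() for p in Path(en_dir).rglob(pattern)}
    pt_files = {p.relative_to(pt_dir).as_posix() for p in Path(pt_dir).rglob(pattern)}
    return sorted(en_files - pt_files), sorted(pt_files - en_files)
```

A CI step could fail when either returned list is non-empty. Detecting Spanish wording such as `Ejecute` in PT-BR files would need a separate check and is not covered here.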
**Mission ID:** `MISSION-N5-2026-05-26`
**Sprint:** N5 — Locale Sync & L4 Stabilization
**Duration:** 2026-05-26 to 2026-06-09 (2 weeks)
**Status:** Approved & Active

> This file is maintained by the `@roadmap-steward` agent. Do not edit manually unless updating strategic direction. See [Manual Edit Policy](#manual-edit-policy) below.

## 🗓️ Current Sprint (MVP 1.9.x → 2.0.0)
## 🗓️ Current Sprint (N5 — Locale Sync & L4 Stabilization — 2026-05-26 to 2026-06-09)

- **agent**: `issue-ops-architect`
- **case_id**: `IOA-01`
- **status**: `draft`
- **scenario_type**: `bad_input`

- **agent**: `rule-distiller`
- **case_id**: `RD-01`
- **status**: `draft`
- **scenario_type**: `bad_input`

Migration checklist: yes
Output language preference: English
May I create/modify files after preview: yes

| Sobrecarga operacional em time small | Média | Critérios binários em CI; não exigir perfeccionismo em evals de primeira versão. |
| Cobertura de evals pode ser genérica inicialmente | Média | Review gate obrigatório; iteração rápida em ciclo seguinte com feedback real. |
| MEMORY.md requer disciplina contínua | Baixa | Atribuir ownership ao context-steward; review semanal enxuta (15 min max). |
| Auditoria pode gerar falsos positivos | Média | Usar regras simples na primeira versão; refinar baseado em experience. |

**Why This Matters:**

- Post-N4 audit identified PT-BR desincronizado (typos in Spanish, stale L4 references)

- **ID:** MEM-20260513-01
- **Date:** 2026-05-13
- **Owner:** roadmap-steward
- **Record:** Sprint N5 (Locale Sync & L4 Stabilization) approved. Post-N4 audit identified PT-BR desincronizado (stale L4 references, typos em espanhol). Sync automation never operationalized. N5 scope: 3 tasks — sync PT-BR, update docs, operationalize CI validation. Target: 2026-06-09. Mission file created.

cases = [
    f for f in os.listdir(agent_dir)
    if f.endswith(".md") and f != ".gitkeep"
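For context on the eval coverage audit this snippet comes from, the per-agent count can be sketched as below. This is an illustration under assumptions, not the workflow's actual code: it assumes one directory per agent under an evals root such as `library/evals/`, and it filters on the `case-` prefix seen in this PR's file names instead of the `.gitkeep` comparison, which is redundant once the `.md` suffix is required. The function name `count_eval_cases` is hypothetical.

```python
import os

def count_eval_cases(evals_root):
    """Return {agent_name: number_of_case_files} for each agent directory."""
    coverage = {}
    for agent in sorted(os.listdir(evals_root)):
        agent_dir = os.path.join(evals_root, agent)
        if not os.path.isdir(agent_dir):
            continue  # skip stray files at the root level
        cases = [
            f for f in os.listdir(agent_dir)
            if f.startswith("case-") and f.endswith(".md")
        ]
        coverage[agent] = len(cases)
    return coverage
```

An audit step could then report any agent whose count is zero, for example `[a for a, n in count_eval_cases("library/evals").items() if n == 0]`.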
@copilot apply changes based on the comments in this thread |
Agent-Logs-Url: https://github.com/weslleycapelari/github-patterns/sessions/577201a0-68d1-468c-b38c-e3f9468cf6e8
Co-authored-by: weslleycapelari <28955078+weslleycapelari@users.noreply.github.com>
Implemented the requested updates from the review thread in commit
Introduce evaluation cases that strengthen the agent testing framework by covering both bad-input handling and expected high-quality outputs. Update MEMORY.md with records from the sprint N4 Hardening work and set up the N5 sprint for locale synchronization and L4 stabilization.