Add evaluation cases for agent-factory, issue-ops-architect, and rule-distiller#3

Open
weslleycapelari wants to merge 3 commits into main from n5-locale-sync-setup

Conversation

@weslleycapelari
Owner

Introduce evaluation cases to enhance the testing framework for agents, ensuring proper handling of bad input scenarios and expected high-quality outputs. Update MEMORY.md with records from sprint N4 Hardening and set up for N5 sprint with locale synchronization and L4 stabilization.

Copilot AI review requested due to automatic review settings May 13, 2026 13:38
Contributor

Copilot AI left a comment


Pull request overview

This PR expands the repository’s AI governance/eval hardening by adding new eval cases for core agents, recording sprint decisions, and introducing a PR governance audit step.

Changes:

  • Added bad-input and high-quality-output eval fixtures for agent-factory, issue-ops-architect, and rule-distiller.
  • Added sprint N4/N5 mission and memory records, plus roadmap updates for locale sync and L4 stabilization.
  • Added a marketplace-integrity workflow step to report agent eval coverage on PRs.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| `ROADMAP.md` | Updates sprint status, completed N4 work, and N5 backlog planning. |
| `library/github-baseline/MEMORY.md` | Adds sprint decision-memory entries for N4/N5. |
| `library/evals/rule-distiller/case-02-expected-high-quality-output.md` | Adds rule-distiller high-quality-output eval. |
| `library/evals/rule-distiller/case-01-bad-input.md` | Adds rule-distiller bad-input eval. |
| `library/evals/issue-ops-architect/case-02-expected-high-quality-output.md` | Adds issue-ops high-quality-output eval. |
| `library/evals/issue-ops-architect/case-01-bad-input.md` | Adds issue-ops bad-input eval. |
| `library/evals/agent-factory/case-02-expected-high-quality-output.md` | Adds agent-factory high-quality-output eval. |
| `library/evals/agent-factory/case-01-bad-input.md` | Adds agent-factory bad-input eval. |
| `.github/workflows/marketplace-integrity.yml` | Adds PR eval coverage audit reporting. |
| `.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md` | Adds N5 locale-sync/L4 stabilization mission plan. |
| `.github/MISSIONS/roadmap-2026-05-12-sprint-n4-hardening.md` | Adds N4 hardening mission record. |

Comments suppressed due to low confidence (3)

.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md:30

  • The mission protocol is defined as one delegation hop per mission file, but this mission assigns Tasks 1/2 to documentation-steward and Task 3 to project-auditor. A valid Child_Agent field can contain only one child agent, so this should be split into separate mission files or represented as a single valid delegation hop before CI can pass.
## Parent Agent & Delegation

- **Parent Agent:** `@roadmap-steward`
- **Child Agents:**
  - `@documentation-steward` — Tasks 1 & 2
  - `@project-auditor` — Task 3
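
A minimal sketch of the single-hop check this comment describes, assuming child agents appear as backticked `@name` handles under a `Child Agent(s)` field, as in the snippet above. The function names and parsing rules are hypothetical and are not the repository's actual CI validator.

```python
import re

def child_agents(mission_text: str) -> list[str]:
    """Return the @agent handles listed under the child-agent field."""
    agents: list[str] = []
    in_child_block = False
    for line in mission_text.splitlines():
        stripped = line.strip()
        if re.match(r"^-\s*\*\*Child[ _]Agents?:\*\*", stripped):
            in_child_block = True
            # Single-line form: - **Child Agent:** `@name`
            agents += re.findall(r"`@([\w-]+)`", stripped)
            continue
        if in_child_block:
            # Nested bullets belong to the field only while the line is indented.
            if line.startswith((" ", "\t")) and stripped.startswith("-"):
                agents += re.findall(r"`@([\w-]+)`", stripped)
            else:
                in_child_block = False
    return agents

def is_single_hop(mission_text: str) -> bool:
    """A mission is one delegation hop only if it names exactly one child agent."""
    return len(child_agents(mission_text)) == 1

# Sample mirrors the delegation block quoted in this comment.
sample = """\
- **Parent Agent:** `@roadmap-steward`
- **Child Agents:**
  - `@documentation-steward` (Tasks 1 & 2)
  - `@project-auditor` (Task 3)
"""
print(child_agents(sample))
print(is_single_hop(sample))
```

With the delegation block from this mission file, the check reports two child agents, which is why the comment suggests splitting the mission into separate files.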

library/evals/rule-distiller/case-02-expected-high-quality-output.md:44

  • This PASS criterion expects the section title New Best Practices for English output, but the current rule-distiller agent contract classifies best-practice findings under Novas Boas Práticas. Without updating the agent contract in the same PR, this eval will fail the repository's current rule-distiller behavior rather than validating it.
1. File is created at the exact path `library/instructions/laravel-11.md`.
2. Breaking Changes and New Best Practices are in separate sections (not mixed). When output language preference is English, all section titles must be in English — section must be titled "New Best Practices" (not "Novas Boas Práticas").

.github/MISSIONS/roadmap-2026-05-26-sprint-n5-locale-sync.md:139

  • This CI requirement checks for Spanish text in the EN locale, but the documented drift this sprint is meant to catch is Spanish wording in PT-BR files (for example, Ejecute). As written, the planned validation could pass while the known PT-BR issue remains undetected.
2. Step should validate:
   - Files in `docs/en/` have PT-BR equivalents (and vice versa)
   - No Spanish text in EN locale (sanity check)
   - Files contain valid locale sync declaration block
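
A hedged sketch of how the validation could cover the case this comment raises, scanning PT-BR files for Spanish wording as well as checking file parity in both directions. The marker list and function names are hypothetical, and the inputs are relative paths so that no PT-BR directory name is assumed (the mission file only names `docs/en/`).

```python
import re

# Hypothetical marker list: Spanish forms that differ from their PT-BR spelling.
SPANISH_MARKERS = {"ejecute", "usted", "archivo", "también"}

def missing_counterparts(en_files: set[str], pt_files: set[str]) -> dict[str, list[str]]:
    """Compare relative paths under the two locale roots in both directions."""
    return {
        "missing_in_pt_br": sorted(en_files - pt_files),
        "missing_in_en": sorted(pt_files - en_files),
    }

def spanish_hits(text: str) -> list[str]:
    """Return Spanish marker words found in a document, in order of appearance."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w in SPANISH_MARKERS]

# Illustrative inputs only; these file names are not taken from the repository.
en = {"index.md", "agents/rule-distiller.md"}
pt = {"index.md", "guia.md"}
parity = missing_counterparts(en, pt)
hits = spanish_hits("Ejecute o comando e depois execute os testes.")
print(parity)
print(hits)
```

A word list like this only catches known slips such as `Ejecute`; it is a sanity check, not language detection.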

Comment on lines +3 to +6
**Mission ID:** `MISSION-N5-2026-05-26`
**Sprint:** N5 — Locale Sync & L4 Stabilization
**Duration:** 2026-05-26 to 2026-06-09 (2 weeks)
**Status:** Approved & Active
Comment thread ROADMAP.md
> This file is maintained by the `@roadmap-steward` agent. Do not edit manually unless updating strategic direction. See [Manual Edit Policy](#manual-edit-policy) below.

## 🗓️ Current Sprint (MVP 1.9.x → 2.0.0)
## 🗓️ Current Sprint (N5 — Locale Sync & L4 Stabilization — 2026-05-26 to 2026-06-09)
Comment on lines +5 to +8
- **agent**: `issue-ops-architect`
- **case_id**: `IOA-01`
- **status**: `draft`
- **scenario_type**: `bad_input`
Comment on lines +5 to +8
- **agent**: `rule-distiller`
- **case_id**: `RD-01`
- **status**: `draft`
- **scenario_type**: `bad_input`
Comment on lines +23 to +25
Migration checklist: yes
Output language preference: English
May I create/modify files after preview: yes
| Operational overhead for a small team | Medium | Binary criteria in CI; do not demand perfectionism in first-version evals. |
| Eval coverage may be generic at first | Medium | Mandatory review gate; fast iteration in the next cycle with real feedback. |
| MEMORY.md requires continuous discipline | Low | Assign ownership to context-steward; lean weekly review (15 min max). |
| The audit may produce false positives | Medium | Use simple rules in the first version; refine based on experience. |

**Why This Matters:**

- Post-N4 audit identified PT-BR as out of sync (Spanish typos, stale L4 references)
Comment thread library/github-baseline/MEMORY.md Outdated
- **ID:** MEM-20260513-01
- **Date:** 2026-05-13
- **Owner:** roadmap-steward
- **Record:** Sprint N5 (Locale Sync & L4 Stabilization) approved. Post-N4 audit identified PT-BR as out of sync (stale L4 references, Spanish typos). Sync automation was never operationalized. N5 scope: 3 tasks (sync PT-BR, update docs, operationalize CI validation). Target: 2026-06-09. Mission file created.

cases = [
    f for f in os.listdir(agent_dir)
    if f.endswith(".md") and f != ".gitkeep"
]
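
The filter above only sees `.md` files, which also makes the `.gitkeep` comparison redundant. A hedged sketch of one alternative is to count fixtures by the `case-` file-name prefix used in this PR, regardless of extension. The helper name and the `.json` fixture are hypothetical, and this is not the workflow's actual code.

```python
import os
import tempfile

def count_case_fixtures(agent_dir: str) -> int:
    """Count eval case files by name prefix instead of by extension."""
    return sum(
        1
        for name in os.listdir(agent_dir)
        if name.startswith("case-") and os.path.isfile(os.path.join(agent_dir, name))
    )

# Demo on a throwaway directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as agent_dir:
    for name in ("case-01-bad-input.md", "case-02-expected.json", ".gitkeep", "README.md"):
        open(os.path.join(agent_dir, name), "w").close()
    demo_count = count_case_fixtures(agent_dir)

print(demo_count)
```

Matching on the prefix keeps placeholder files and READMEs out of the coverage count without listing them one by one.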
Comment thread library/evals/agent-factory/case-02-expected-high-quality-output.md
@weslleycapelari
Owner Author

@copilot apply changes based on the comments in this thread

Contributor

Copilot AI commented May 13, 2026

> @copilot apply changes based on the comments in this thread

Implemented the requested updates from the review thread in commit 07e2925. This includes mission protocol key compliance, eval matrix/index fixes, IOA/RD eval contract alignment (explicit strategy/CONFIRM behavior), wording/grammar fixes, workflow audit counting for non-.md case fixtures, and PR locale sync declaration fields (en_status, pt_br_status, pending_sync_tasks).
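
A minimal sketch of checking the locale sync declaration fields named above, assuming they appear in the PR body as `key: value` lines. That layout and the function name are assumptions, since the thread does not show the block's actual format.

```python
REQUIRED_FIELDS = ("en_status", "pt_br_status", "pending_sync_tasks")

def missing_declaration_fields(pr_body: str) -> list[str]:
    """Return required fields that have no `key: value` line in the PR body."""
    present = set()
    for line in pr_body.splitlines():
        key, sep, _value = line.partition(":")
        if sep:
            # Tolerate list bullets, bold markers, and backticks around the key.
            present.add(key.strip(" -*`\t"))
    return [field for field in REQUIRED_FIELDS if field not in present]

# Illustrative body; the status values are hypothetical.
body = "en_status: synced\n- **pt_br_status**: pending\n"
missing = missing_declaration_fields(body)
print(missing)
```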

