[Ponytail] Add agentic benchmark proving MAP minimality is active and safe

## Source

Source repo: https://github.com/DietrichGebert/ponytail

Primary source files read:
- `README.md` from `DietrichGebert/ponytail@main`.
- `AGENTS.md` from `DietrichGebert/ponytail@main`.
- `benchmarks/results/2026-06-18-agentic.md` from `DietrichGebert/ponytail@main`.

Relevant source claims/ideas:
- Ponytail is not just a prose rule. It claims a measured agentic effect: same coding agent, real repo, baseline vs plugin arms, measuring `git diff` added LOC, token/cost/time, and adversarial safety checks.
- The benchmark explicitly corrected a false result where the plugin hook contaminated the baseline. That is directly relevant to MAP: prompt text and closed issues are insufficient; we need an isolated active-path eval that proves minimality is both enabled and effective.
- Ponytail's rule ladder is already conceptually integrated in MAP, but the source's strongest reusable idea is the benchmark design: prove smaller diffs without losing safety.

## Relevant source takeaways

- Benchmark unit should be a real agent run against a real fixture repo, not a single prompt completion.
- Baseline and treatment must be isolated so the minimality guidance cannot leak into the baseline.
- Metrics must include LOC/tokens/cost/time where available, but safety/correctness must be measured separately and must not regress.
- Tasks should include both over-build traps (where native/stdlib/reuse should win) and surgical safety tasks (where one-liner pressure can drop guards).
- The expected outcome is not universal LOC reduction: on irreducible tasks the arms may converge. The eval should accept "no bloat to cut" instead of forcing artificial line savings.
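
The takeaways above imply a per-task verdict that treats safety as a hard gate and accepts convergence on irreducible tasks. A minimal sketch of that decision follows. `ArmOutcome`, `classify_task`, the verdict strings, and the `tolerance` default are hypothetical names and values chosen for illustration; they come from neither Ponytail nor MAP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmOutcome:
    """Hypothetical summary of one arm's run on one task."""
    added_loc: int        # added lines reported by git diff
    checks_passed: bool   # task tests plus any adversarial safety checks


def classify_task(baseline: ArmOutcome, treatment: ArmOutcome, tolerance: int = 2) -> str:
    """Return a verdict string; correctness is checked before size is considered."""
    if not baseline.checks_passed:
        # A failing baseline gives no valid reference point for comparison.
        return "invalid_baseline"
    if not treatment.checks_passed:
        # A smaller diff never compensates for dropped guards.
        return "safety_regression"
    delta = baseline.added_loc - treatment.added_loc
    if abs(delta) <= tolerance:
        # Arms converged: acceptable on irreducible tasks.
        return "no_bloat_to_cut"
    return "smaller_diff" if delta > 0 else "larger_diff"
```

With this ordering, a treatment run that saves 35 lines but fails a safety check is reported as `safety_regression`, not as a win.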

## Repo evidence

The local implementation goes beyond prose; minimality is wired into the runtime path:
- `src/mapify_cli/config/project_config.py` sets `MapConfig.minimality: str = "lite"` by default and documents the Phase 3 flip.
- `src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja` has `_load_minimality_level(..., default="lite")`, so standalone generated runner behavior matches config default.
- `build_context_block()` calls `_minimality_doctrine_block(minimality)` and injects `<MAP_Minimality_Doctrine>` when minimality is not `off`.
- `_minimality_doctrine_block()` contains the runtime Actor ladder and `map:simplification:` marker guidance.
- `build_review_prompts()` adds a `complexity_lens` prompt when `minimality != "off"`.
- `validate_blueprint_contract()` rejects non-empty `deferred_yagni` when minimality is not `full`/`ultra`, so pruning is gated rather than silently active under `lite`.
- Tests cover active plumbing: `tests/test_map_step_runner.py` checks default `minimality == "lite"`, context-block doctrine injection, invalid minimality fallback, review complexity lens insertion, and deferred_yagni gating; `tests/test_decomposition.py` checks config defaults/valid values/YAML `off`; `tests/test_minimality_report.py` checks telemetry report decisions.

What is missing:
- No Ponytail-style end-to-end A/B benchmark proves that MAP minimality actually reduces generated diff size/tokens while preserving safety/correctness.
- Existing `minimality-report` telemetry compares completed local runs, but it is not a reproducible isolated benchmark and does not run baseline/treatment arms on the same task corpus.

## Existing issue search

Commands/searches used:
- `gh issue list --state all --limit 120 --search "Ponytail OR minimality OR YAGNI OR stdlib OR native OR one-liner OR pruneable OR deferred_yagni OR reuse"`
- `gh issue list --state all --limit 120 --search "minimality benchmark OR Ponytail benchmark OR agentic benchmark OR safety rate OR LOC tokens minimality eval"`
- `gh issue list --state all --limit 120 --search "minimality telemetry OR minimality-report OR field telemetry OR default flip"`

Related issues checked:
- #180 closed epic: integrates Ponytail concept across MAP pipeline.
- #181 closed: Actor doctrine/config/Evaluator/Monitor/retry filter.
- #182 closed: `/map-review` what-to-delete lens.
- #183 closed: global default `off -> lite` after telemetry.
- #184 closed: decomposer pruning via `deferred_yagni`.

Why this is not a duplicate:
Those issues implement and gate the minimality doctrine. None adds a reproducible A/B harness equivalent to Ponytail's benchmark that isolates baseline vs minimality and measures LOC/tokens/safety on a task corpus.

## Why this is not already covered

The code path is active, but activation is not impact evidence. A prompt can be injected and still produce no measurable behavioral change, or worse, reduce lines by dropping guards. Ponytail's own benchmark history shows why this matters: its maintainers found and fixed contamination in which the plugin hook also ran in the baseline arm, producing a false result. MAP should have an equivalent active-path proof before treating minimality claims as settled.

## Problem

MAP currently has implementation evidence and local telemetry surfaces, but not a deterministic or reproducible eval that answers: "Does MAP minimality actually make the agent produce smaller sufficient diffs without losing safety?" Without that, future changes to prompts, hooks, or config can leave the feature apparently enabled but behaviorally inert.

## Proposed slice

Add a `minimality-eval` / benchmark harness that runs isolated baseline and treatment arms on a small MAP-style task corpus.

Suggested first slice:
- Use fixture repos/tasks that do not require external services.
- Arms: `minimality: off` vs `minimality: lite` at minimum; optionally `full` as an opt-in treatment.
- Metrics: added LOC from git diff, token usage if transcript/meter data is available, duration as advisory, pass/fail of task-specific tests, and safety checks for trust-boundary tasks.
- Corpus split:
  - Over-build traps: native/stdlib/reuse should avoid extra code.
  - Irreducible tasks: expected near-zero LOC delta, to prevent benchmark gaming.
  - Safety tasks: smaller code must still pass adversarial checks.
- Ensure each arm gets a fresh workspace and no shared session/plugin contamination.
- Produce a persisted report under `.map/eval-runs/minimality/` or a similar existing eval-artifact namespace.
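
One way the size metric and the report shape could look, as a sketch only. It assumes the harness captures the text output of `git diff --numstat` in each arm's fresh workspace; `added_loc_from_numstat`, `build_report`, and the report keys are hypothetical and do not exist in MAP today.

```python
from __future__ import annotations


def added_loc_from_numstat(numstat: str) -> int:
    """Sum the 'added' column of `git diff --numstat` output.

    Each line is `<added>\t<deleted>\t<path>`; binary files report `-`
    in both count columns and are skipped here.
    """
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        total += int(parts[0])
    return total


def build_report(
    task_id: str,
    numstat_by_arm: dict[str, str],
    checks_by_arm: dict[str, bool],
) -> dict:
    """Keep size metrics and safety results in separate report sections."""
    return {
        "task": task_id,
        "size": {
            arm: {"added_loc": added_loc_from_numstat(text)}
            for arm, text in numstat_by_arm.items()
        },
        "safety": {
            arm: {"checks_passed": passed}
            for arm, passed in checks_by_arm.items()
        },
    }
```

The resulting dictionary could be serialized with `json.dumps` and written under the eval-artifact directory. Token and duration fields are left out because their availability depends on transcript or meter data.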

## Acceptance criteria

- A maintainer can run one command locally to compare `minimality: off` vs `lite` on a fixture corpus.
- The report separates code-size/cost metrics from correctness/safety metrics.
- The eval fails or warns when `lite` reduces LOC by dropping required safety/correctness behavior.
- The eval detects or prevents baseline contamination by explicitly asserting the generated context/config for each arm.
- Docs explain that this benchmark validates behavioral effect; it does not replace normal workflow gates.
- Implementation reuses existing eval/report patterns where practical and keeps generated templates single-source.
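
The contamination criterion could be enforced with an explicit assertion on each arm's generated context before the agent runs. The sketch below assumes the harness can obtain the rendered context text and the effective minimality level for each arm; `isolation_errors` is a hypothetical helper, and only the `<MAP_Minimality_Doctrine>` marker is taken from the repo evidence above.

```python
from __future__ import annotations

# Marker that the repo evidence says is injected when minimality is not `off`.
DOCTRINE_MARKER = "<MAP_Minimality_Doctrine>"


def isolation_errors(arm: str, configured_level: str, context_text: str) -> list[str]:
    """Return a list of isolation problems for one arm (empty means clean)."""
    errors: list[str] = []
    if configured_level != arm:
        errors.append(
            f"arm '{arm}' ran with minimality '{configured_level}'"
        )
    has_doctrine = DOCTRINE_MARKER in context_text
    if arm == "off" and has_doctrine:
        # Guidance leaked into the baseline: any measured delta is meaningless.
        errors.append("baseline contaminated: doctrine block present in context")
    if arm != "off" and not has_doctrine:
        # Feature appears enabled but the prompt never received the doctrine.
        errors.append("treatment inert: doctrine block missing from context")
    return errors
```

Failing the run on any non-empty result covers both failure modes named in this issue: a baseline that secretly receives the guidance, and a treatment arm that is enabled in config but behaviorally inert.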

## Guardrails

- Do not claim Ponytail-like percentages for MAP until MAP has its own benchmark data.
- Do not optimize for LOC alone; required behavior, security, accessibility, and data integrity are non-negotiable.
- Do not use shadow mode for rollout. Run explicit isolated eval arms.
- Do not require live production deploy or external services.
- Do not let `minimality: lite` silently prune explicit requirements; pruning remains `full`/`ultra` plus visible approval.

