[Ponytail] Add agentic benchmark proving MAP minimality is active and safe #312

Description

@azalio

Source

Source repo: https://github.com/DietrichGebert/ponytail

Primary source files read:

  • README.md from DietrichGebert/ponytail@main.
  • AGENTS.md from DietrichGebert/ponytail@main.
  • benchmarks/results/2026-06-18-agentic.md from DietrichGebert/ponytail@main.

Relevant source claims/ideas:

  • Ponytail is not just a prose rule. It claims a measured agentic effect: the same coding agent runs against a real repo in baseline and plugin arms, and the benchmark measures added LOC from git diff, tokens, cost, time, and adversarial safety checks.
  • The benchmark explicitly corrected a false result where the plugin hook contaminated the baseline. That is directly relevant to MAP: prompt text and closed issues are insufficient; we need an isolated active-path eval that proves minimality is both enabled and effective.
  • Ponytail's rule ladder is already conceptually integrated in MAP, but the source's strongest reusable idea is the benchmark design: prove smaller diffs without losing safety.

Relevant source takeaways

  • Benchmark unit should be a real agent run against a real fixture repo, not a single prompt completion.
  • Baseline and treatment must be isolated so the minimality guidance cannot leak into the baseline.
  • Metrics must include LOC/tokens/cost/time where available, but safety/correctness must be measured separately and must not regress.
  • Tasks should include both over-build traps (where native/stdlib/reuse should win) and surgical safety tasks (where one-liner pressure can drop guards).
  • The expected outcome is not universal LOC reduction: on irreducible tasks the arms may converge. The eval should accept "no bloat to cut" instead of forcing artificial line savings.
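
The takeaways above imply a per-task verdict that checks correctness and safety before it looks at size, and that treats convergence as a valid result. A minimal sketch of that logic follows. `ArmResult`, `Verdict`, `classify`, and the `tolerance` of 2 lines are hypothetical names and values chosen for illustration; none of them exist in MAP or Ponytail.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    REGRESSION = "regression"   # treatment lost tests or safety the baseline had
    SMALLER = "smaller"         # treatment added fewer lines, nothing regressed
    CONVERGED = "converged"     # "no bloat to cut": arms are within tolerance
    LARGER = "larger"           # treatment added more lines than baseline


@dataclass(frozen=True)
class ArmResult:
    added_loc: int        # added lines measured from the arm's git diff
    tests_passed: bool    # task-specific tests
    safety_passed: bool   # adversarial / trust-boundary checks


def classify(baseline: ArmResult, treatment: ArmResult, tolerance: int = 2) -> Verdict:
    """Compare one task's arms; safety and correctness are judged before size."""
    lost_tests = baseline.tests_passed and not treatment.tests_passed
    lost_safety = baseline.safety_passed and not treatment.safety_passed
    if lost_tests or lost_safety:
        return Verdict.REGRESSION
    delta = treatment.added_loc - baseline.added_loc
    if abs(delta) <= tolerance:
        # Irreducible task: do not force artificial line savings.
        return Verdict.CONVERGED
    return Verdict.SMALLER if delta < 0 else Verdict.LARGER
```

Because the regression check runs first, a treatment arm that is much smaller but drops a guard can never be reported as a win.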

Repo evidence

Local implementation is not merely prose-only:

  • src/mapify_cli/config/project_config.py sets MapConfig.minimality: str = "lite" by default and documents the Phase 3 flip.
  • src/mapify_cli/templates_src/map/scripts/map_step_runner.py.jinja has _load_minimality_level(..., default="lite"), so standalone generated runner behavior matches config default.
  • build_context_block() calls _minimality_doctrine_block(minimality) and injects <MAP_Minimality_Doctrine> when minimality is not off.
  • _minimality_doctrine_block() contains the runtime Actor ladder and map:simplification: marker guidance.
  • build_review_prompts() adds a complexity_lens prompt when minimality != "off".
  • validate_blueprint_contract() rejects non-empty deferred_yagni when minimality is not full/ultra, so pruning is gated rather than silently active under lite.
  • Tests cover active plumbing: tests/test_map_step_runner.py checks default minimality == "lite", context-block doctrine injection, invalid minimality fallback, review complexity lens insertion, and deferred_yagni gating; tests/test_decomposition.py checks config defaults/valid values/YAML off; tests/test_minimality_report.py checks telemetry report decisions.

What is missing:

  • No Ponytail-style end-to-end A/B benchmark proves that MAP minimality actually reduces generated diff size/tokens while preserving safety/correctness.
  • Existing minimality-report telemetry compares completed local runs, but it is not a reproducible isolated benchmark and does not run baseline/treatment arms on the same task corpus.

Existing issue search

Commands/searches used:

  • gh issue list --state all --limit 120 --search "Ponytail OR minimality OR YAGNI OR stdlib OR native OR one-liner OR pruneable OR deferred_yagni OR reuse"
  • gh issue list --state all --limit 120 --search "minimality benchmark OR Ponytail benchmark OR agentic benchmark OR safety rate OR LOC tokens minimality eval"
  • gh issue list --state all --limit 120 --search "minimality telemetry OR minimality-report OR field telemetry OR default flip"

Related issues checked:

Why this is not a duplicate:
Those issues implement and gate the minimality doctrine. None adds a reproducible A/B harness equivalent to Ponytail's benchmark that isolates baseline vs minimality and measures LOC/tokens/safety on a task corpus.

Why this is not already covered

The code path is active, but activation is not impact evidence. A prompt can be injected and still produce no measurable behavioral change, or worse, reduce lines by dropping guards. Ponytail's own benchmark history shows why this matters: its authors found and fixed contamination where the baseline secretly ran the plugin. MAP should have an equivalent active-path proof before treating minimality claims as settled.

Problem

MAP currently has implementation evidence and local telemetry surfaces, but not a deterministic or reproducible eval that answers: "Does MAP minimality actually make the agent produce smaller sufficient diffs without losing safety?" Without that, future changes to prompts, hooks, or config can leave the feature apparently enabled but behaviorally inert.

Proposed slice

Add a minimality-eval / benchmark harness that runs isolated baseline and treatment arms on a small MAP-style task corpus.

Suggested first slice:

  • Use fixture repos/tasks that do not require external services.
  • Arms: minimality: off vs minimality: lite at minimum; optionally full as an opt-in treatment.
  • Metrics: added LOC from git diff, token usage if transcript/meter data is available, duration (advisory only), pass/fail of task-specific tests, and safety checks for trust-boundary tasks.
  • Corpus split:
    • Over-build traps: native/stdlib/reuse should avoid extra code.
    • Irreducible tasks: expected near-zero LOC delta, to prevent benchmark gaming.
    • Safety tasks: smaller code must still pass adversarial checks.
  • Ensure each arm gets a fresh workspace and no shared session/plugin contamination.
  • Produce a persisted report under .map/eval-runs/minimality/ or a similar existing eval-artifact namespace.

Acceptance criteria

  • A maintainer can run one command locally to compare minimality: off vs lite on a fixture corpus.
  • The report separates code-size/cost metrics from correctness/safety metrics.
  • The eval fails or warns when lite reduces LOC by dropping required safety/correctness behavior.
  • The eval detects or prevents baseline contamination by explicitly asserting the generated context/config for each arm.
  • Docs explain that this benchmark validates behavioral effect; it does not replace normal workflow gates.
  • Implementation reuses existing eval/report patterns where practical and keeps generated templates single-source.
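
The contamination criterion above can be made concrete by asserting on the generated context of each arm. The sketch below assumes the harness can obtain the context text that `build_context_block()` produced for an arm, and relies only on the `<MAP_Minimality_Doctrine>` marker named in the repo evidence. `assert_arm_context` and `ArmContaminationError` are hypothetical names.

```python
DOCTRINE_MARKER = "<MAP_Minimality_Doctrine>"


class ArmContaminationError(AssertionError):
    """Raised when an arm's generated context does not match its configured level."""


def assert_arm_context(level: str, context: str) -> None:
    """Fail fast if the baseline leaked the doctrine or the treatment lacks it."""
    has_doctrine = DOCTRINE_MARKER in context
    if level == "off" and has_doctrine:
        raise ArmContaminationError(
            "baseline arm (minimality: off) contains the doctrine block"
        )
    if level != "off" and not has_doctrine:
        raise ArmContaminationError(
            f"treatment arm (minimality: {level}) is missing the doctrine block"
        )
```

Running this check before the agent starts means a contaminated or inert arm aborts the eval instead of producing a misleading comparison.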

Guardrails

  • Do not claim Ponytail-like percentages for MAP until MAP has its own benchmark data.
  • Do not optimize for LOC alone; required behavior, security, accessibility, and data integrity are non-negotiable.
  • Do not use shadow mode for rollout. Run explicit isolated eval arms.
  • Do not require live production deploy or external services.
  • Do not let minimality: lite silently prune explicit requirements; pruning remains full/ultra plus visible approval.

Metadata


Labels: enhancement
