[arXiv:2606.29742] Add boundary-quality eval for architectural decomposition plans #316

## Source

Paper: https://arxiv.org/pdf/2606.29742 — "MicroAgent: Context-Augmented Multi-Agent Framework for Automatic Microservice Decomposition" (arXiv:2606.29742v1, 2026-06-29).

Source-specific idea used here: MicroAgent should not be read as a MAP feature request to decompose monoliths into microservices. The transferable mechanism is a review/eval layer for architecture decomposition quality: split a broad refactor into focused domain/boundary decisions, use multi-granularity structural context, run dependency/common-code tools, and measure whether boundaries are cohesive, loosely coupled, and grounded in actual code.

## Relevant source takeaways

- The paper identifies three failure modes for naive LLM decomposition: overlong repo-level context causes hallucinated entities, shallow semantic cues miss concrete usage/call relationships, and agents overlook domain/microservice principles such as cohesion/coupling.
- MicroAgent handles this by splitting decomposition into specialized stages: domain identification, clustering, merging, common-class assignment, and final review.
- It customizes context by stage: application/domain summaries for domain discovery, class/code/dependency context for clustering and refinement, and focused relationship retrieval for ambiguous classes.
- Its common-class tooling is especially relevant to MAP: it computes dependency-distribution signals, retrieves directly related classes, and propagates assignment along dependency graphs so shared code is assigned consistently instead of being guessed from names.
- Evaluation is not just a qualitative review. The paper reports decomposition quality and common-class assignment metrics, plus ablations showing dependency context, specialized tools, and multi-agent staging all materially affect results.
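The dependency-distribution and propagation idea from the common-class takeaway can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `assign_common_classes`, the input shapes, and the `min_share` threshold are all assumptions made for illustration. A shared class is assigned to the cluster that holds a clear majority of its already-placed neighbours, and newly placed classes then count as evidence for classes further along the graph.

```python
from collections import Counter, deque


def assign_common_classes(deps, assigned, min_share=0.6):
    """Assign unplaced classes to the cluster that dominates their dependencies.

    deps:      class name -> set of directly related class names (hypothetical shape)
    assigned:  class name -> cluster name for classes that are already placed
    min_share: fraction of placed neighbours one cluster must hold (assumed
               threshold; keep it above 0.5 so ties never decide placement)
    Returns a new mapping; classes without a dominant cluster stay unassigned.
    """
    result = dict(assigned)

    # Undirected neighbour view, so both users and dependencies count as evidence.
    neighbours = {}
    for src, targets in deps.items():
        neighbours.setdefault(src, set())
        for dst in targets:
            neighbours[src].add(dst)
            neighbours.setdefault(dst, set()).add(src)

    queue = deque(sorted(c for c in neighbours if c not in result))
    stalled = 0  # consecutive classes that could not be placed
    while queue and stalled < len(queue):
        cls = queue.popleft()
        votes = Counter(result[n] for n in neighbours[cls] if n in result)
        total = sum(votes.values())
        if total:
            cluster, count = votes.most_common(1)[0]
            if count / total >= min_share:
                result[cls] = cluster  # placement propagates to later classes
                stalled = 0
                continue
        queue.append(cls)  # retry once more neighbours have been placed
        stalled += 1
    return result


if __name__ == "__main__":
    deps = {
        "OrderService": {"Money", "Logger"},
        "InvoiceService": {"Money", "Logger"},
        "PaymentGateway": {"Money"},
        "Money": {"Currency"},
    }
    assigned = {
        "OrderService": "orders",
        "InvoiceService": "billing",
        "PaymentGateway": "billing",
    }
    for name, cluster in sorted(assign_common_classes(deps, assigned).items()):
        print(name, "->", cluster)
```

In this toy input, `Money` is used mostly by billing classes, `Currency` is reachable only through `Money`, and `Logger` is split evenly, so it stays unassigned instead of being guessed from its name.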

## Repo evidence

- MAP already has generic decomposition structure. `src/mapify_cli/templates_src/agents/task-decomposer.md.jinja:250-254` requires `coverage_map` ownership for acceptance criteria, and `docs/ARCHITECTURE.md:2000-2007` documents `blueprint.json` plus `validate_blueprint_contract` as the planning-time gate for subtask metadata, validation criteria, and requirement coverage.
- MAP already has stage-consumption and evidence surfaces. `docs/ARCHITECTURE.md:642-646` describes review bundles and `validate_prior_stage_consumption`; `docs/ARCHITECTURE.md:967-969` describes acceptance coverage and prior-stage consumption reports.
- MAP already has a workflow dependency graph, but it models subtask execution rather than code-domain architecture. `src/mapify_cli/dependency_graph.py:550-560` lints self-loops, cycles, thin edges, same-file wave overlap, full serialization, and redundant workflow edges. `src/mapify_cli/dependency_graph.py:666-701` flags thin workflow dependencies using IO/file overlap; it does not score architectural boundary quality, cohesion, coupling, or shared/common-code assignment.
- MAP's decomposer prompt has some symbol grounding and circular import checks. `src/mapify_cli/templates_src/agents/task-decomposer.md.jinja:672-700` asks for no circular imports, precise `affected_files`, and grep-verified symbols. This helps avoid obvious hallucinated file/symbol references, but it does not evaluate whether a large architectural/refactor plan's proposed boundaries align with domain usage or dependency purity.
- Related backlog exists for structural discovery, but it is provider/eval infrastructure, not boundary-quality review. #310 adds an optional local structural code-map provider for MAP research. #311 adds structural-discovery ROI eval. Neither defines an architecture decomposition quality gate that consumes structural evidence to review planned boundaries.

## Existing issue search

Commands/searches used:

- `gh issue list --state all --limit 120 --search "MicroAgent OR 2606.29742 OR microservice decomposition OR bounded context OR DDD OR common class OR service boundary"` returned no matches.
- `gh issue list --state all --limit 120 --search "structural code-map OR dependency graph OR boundary quality OR coupling OR cohesion OR partition OR weighted similarity OR code graph"` returned no direct boundary-quality issue.
- `gh issue list --state all --limit 80 --json number,title,state,labels,url` was reviewed for nearby issues.

Close issues checked:

- #310 `[CodeGraph] Add optional local structural code-map provider for MAP research` covers how to locate symbol/caller/import evidence. It does not decide whether proposed architectural boundaries are good.
- #311 `[CodeGraph] Add structural-discovery ROI eval for MAP research` covers discovery cost/quality benchmarking. It does not score boundary cohesion/coupling or common/shared code placement.
- #312 `[Ponytail] Add agentic benchmark proving MAP minimality is active and safe` covers minimality behavior, unrelated to decomposition boundary quality.
- #168/#249/#257/#258 cover blueprint validity, reconciliation, convergence, and numeric thresholds in other contexts, but none introduces a domain/dependency-purity review for large architecture plans.

## Why this is not already covered

Existing MAP gates prove that each subtask has an owner, validation criteria, and dependency order, and they can catch malformed workflow DAGs or ungrounded symbols. They do not answer the MicroAgent-style question: "Given this proposed decomposition of a large refactor, are the boundaries cohesive, loosely coupled, grounded in actual code relationships, and is shared/common code assigned consistently?"

That gap matters for MAP because a plan can pass `coverage_map`, `affected_files`, and acyclic workflow checks while still slicing a refactor along bad architectural boundaries: shared models assigned to one subtask without accounting for their consumers, domain-specific logic split by name similarity, or cross-boundary changes that force hidden coupling in later subtasks.

## Problem

For large architectural/refactor goals, MAP's planning-time decomposition is validated structurally but not scored for architectural quality. A decomposer can produce an executable DAG with precise files and validation criteria, yet still choose poor boundaries that create coupling, misplaced shared code, or brittle cross-subtask dependencies. The current safeguards are generic; they do not use code relationship signals to review domain/boundary quality.

## Proposed slice

Add an optional boundary-quality eval/review for architecture-heavy plans.

Concrete first slice:

- Detect when a blueprint is architecture/refactor-heavy, e.g. many runtime files, cross-module changes, shared model/interface edits, explicit migration/refactor wording, or `concern_type`/risk metadata indicating architecture work.
- Build a `boundary_quality_report` from existing blueprint data plus available structural evidence. It can start with deterministic heuristics and later consume #310's structural code-map when available.
- Score/report at least:
  - grounded entities: every named class/function/module in AAG/validation criteria exists or is declared as newly created;
  - cross-boundary dependency pressure: files/symbols in different subtasks with import/call/reference links;
  - shared/common-code placement: shared interfaces/models/utilities have explicit owning subtask plus consumer validation;
  - cohesion/purity proxy: each subtask's files cluster by module/domain better than by arbitrary name similarity;
  - cycle/coupling risk: potential circular imports or bidirectional dependencies between planned boundaries.
- Surface the report in review bundle/run health for Monitor/Predictor, but keep the first implementation advisory unless a clear hard error exists (e.g. a symbol claimed to exist that does not, or an impossible file path already covered by existing gates).
- Add fixture tests where a deliberately bad architectural plan passes generic blueprint validation but receives boundary-quality findings.
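As a rough illustration of the deterministic starting point described above, the sketch below derives three of the listed signals from file ownership and import edges alone. The function name `boundary_quality_report` comes from this issue, but its signature, the input shapes, and the output keys are assumptions. In particular, using the parent directory as a stand-in for "module/domain" is a deliberately crude proxy, not a proposed final metric.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def boundary_quality_report(subtasks, imports):
    """Build an advisory report from planned file ownership and import edges.

    subtasks: subtask id -> list of affected files (hypothetical shape)
    imports:  file -> set of files it imports (hypothetical local evidence)
    """
    owner = {path: sid for sid, files in subtasks.items() for path in files}

    cross_edges = []
    consumers = defaultdict(set)  # imported file -> subtasks that import it
    for src in sorted(imports):
        for dst in sorted(imports[src]):
            src_owner, dst_owner = owner.get(src), owner.get(dst)
            if src_owner is None or dst_owner is None:
                continue  # file is outside the plan; not scored in this sketch
            consumers[dst].add(src_owner)
            if src_owner != dst_owner:
                cross_edges.append({
                    "from": src, "to": dst,
                    "from_subtask": src_owner, "to_subtask": dst_owner,
                })

    # Cycle/coupling risk: subtask pairs that depend on each other both ways.
    pairs = {(e["from_subtask"], e["to_subtask"]) for e in cross_edges}
    bidirectional = sorted({tuple(sorted(p)) for p in pairs if p[::-1] in pairs})

    # Cohesion proxy: share of a subtask's files that sit in its dominant directory.
    cohesion = {}
    for sid, files in subtasks.items():
        dirs = Counter(str(PurePosixPath(path).parent) for path in files)
        cohesion[sid] = max(dirs.values()) / len(files) if files else 1.0

    # Shared-code placement: files imported from more than one subtask.
    shared = {p: sorted(s) for p, s in sorted(consumers.items()) if len(s) > 1}

    return {
        "cross_boundary_edges": cross_edges,
        "bidirectional_pairs": bidirectional,
        "cohesion": cohesion,
        "shared_files": shared,
    }


if __name__ == "__main__":
    imports = {
        "billing/invoice.py": {"billing/tax.py"},
        "orders/checkout.py": {"orders/order_model.py"},
    }
    bad_split = {
        "ST-1": ["billing/invoice.py", "orders/order_model.py"],
        "ST-2": ["orders/checkout.py", "billing/tax.py"],
    }
    report = boundary_quality_report(bad_split, imports)
    print(len(report["cross_boundary_edges"]), report["bidirectional_pairs"])
```

Each finding keeps the concrete file pair behind it, so a reviewer can check the evidence without trusting a score. The same import data split by directory instead (billing files in one subtask, orders files in the other) yields no cross-boundary edges.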

## Acceptance criteria

- A deterministic command or helper can produce `boundary_quality_report` for a fixture blueprint without network access or external model calls.
- The report distinguishes hard grounding errors from advisory quality warnings.
- Tests include at least one good split and one bad split involving shared/common code or cross-boundary dependencies.
- Review bundle or run-health output includes the report summary when present, without replacing existing `coverage_map` or `validate_blueprint_contract` gates.
- The implementation works without #310's optional structural provider, using current `affected_files`, imports, and grep/AST-style local evidence; integration with #310 can be a follow-up enhancement.
- Docs explain that this is for architecture-heavy plans, not every small task.
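One way the hard-versus-advisory distinction could look, sketched with Python's standard `ast` module and in-memory sources so it needs no network or model calls. The helper names (`defined_names`, `grounding_findings`, `summarize`), the finding fields, and the `declared_new` convention are hypothetical and are not existing MAP APIs.

```python
import ast


def defined_names(source):
    """Return names of all functions and classes defined in a Python source string."""
    tree = ast.parse(source)
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {node.name for node in ast.walk(tree) if isinstance(node, kinds)}


def grounding_findings(named, sources, declared_new=frozenset()):
    """Emit hard findings for named symbols that neither exist nor are declared new.

    named:        iterable of (file path, symbol name) pairs cited by the plan
    sources:      file path -> source text (stand-in for reading the repo)
    declared_new: (file path, symbol name) pairs the plan says it will create
    """
    findings = []
    for path, symbol in named:
        if (path, symbol) in declared_new:
            continue  # newly created entities are allowed to be absent
        if path not in sources:
            kind = "missing_file"
        elif symbol not in defined_names(sources[path]):
            kind = "missing_symbol"
        else:
            continue
        findings.append(
            {"severity": "hard", "kind": kind, "path": path, "symbol": symbol}
        )
    return findings


def summarize(findings):
    """Summarize findings; only hard grounding errors are treated as blocking."""
    hard = sum(1 for f in findings if f["severity"] == "hard")
    advisory = sum(1 for f in findings if f["severity"] == "advisory")
    return {"hard_errors": hard, "advisory_warnings": advisory, "blocking": hard > 0}


if __name__ == "__main__":
    sources = {
        "billing/tax.py": "class TaxTable:\n    pass\n\n"
                          "def compute_tax(amount):\n    return amount\n",
    }
    named = [
        ("billing/tax.py", "compute_tax"),   # exists
        ("billing/tax.py", "TaxPolicy"),     # cited but not defined
        ("billing/rates.py", "RateCard"),    # declared as newly created
    ]
    findings = grounding_findings(
        named, sources, declared_new={("billing/rates.py", "RateCard")}
    )
    findings.append({"severity": "advisory", "kind": "cross_boundary_import"})
    print(summarize(findings))
```

Keeping severity on each finding lets the review bundle show a short summary while the existing `coverage_map` and `validate_blueprint_contract` gates stay untouched.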

## Guardrails

- Do not turn MAP into a microservice migration tool; keep the slice as a general architecture-decomposition quality gate.
- Do not block small/simple tasks with expensive analysis.
- Do not use LLM-only scoring as the source of truth; persist concrete file/symbol/dependency evidence behind findings.
- Do not duplicate #310/#311; consume structural evidence when available, but keep this issue focused on boundary-quality evaluation.
- Do not treat name similarity as sufficient evidence for domain placement; usage/dependency evidence must dominate.
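The last guardrail can be made concrete with a small sketch in which name similarity is only a tie-breaker, and a name-only placement is explicitly labelled as weak evidence. The function `choose_owner`, its inputs, and the `basis` labels are hypothetical.

```python
def choose_owner(usage_counts, name_similarity):
    """Pick an owning boundary for a shared file, letting usage evidence dominate.

    usage_counts:    boundary -> number of import/call references (hypothetical)
    name_similarity: boundary -> similarity score in [0, 1] (hypothetical)
    """
    if any(usage_counts.values()):
        # Usage count decides; name similarity only breaks exact ties.
        best = max(
            sorted(usage_counts),
            key=lambda b: (usage_counts[b], name_similarity.get(b, 0.0)),
        )
        return {"owner": best, "basis": "usage", "evidence": dict(usage_counts)}
    if name_similarity:
        # No usage evidence at all: report the guess, but mark it as name-only.
        best = max(sorted(name_similarity), key=name_similarity.get)
        return {"owner": best, "basis": "name_only", "evidence": {}}
    return {"owner": None, "basis": "none", "evidence": {}}


if __name__ == "__main__":
    # The file name resembles "orders", but billing code references it far more.
    print(choose_owner({"billing": 5, "orders": 1}, {"orders": 0.9, "billing": 0.2}))
```

A `name_only` result is a natural candidate for an advisory warning, since no file, symbol, or dependency evidence backs it.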

