A probe-backed scanner that scores how navigable a repository is for AI coding agents.
ANI turns a GitHub repository URL into a 0-100 Agent Navigability Index score, dimension scores, concrete evidence, prioritized recommendations, and machine-readable JSON. It is built for developers and AI tooling teams who believe a serious agent-readiness product should measure whether agents can find the right files, tests, configs, and boundaries with low context waste.
ANI PRD2 is a CLI-first developer product. The public product is simple:
GitHub repository URL -> ANI score + evidence-backed recommendations
Users do not need to connect issue trackers, run agent experiments, install graph services, or provide prior PRs. ANI may build temporary local indexes and probes internally, but it deletes temporary cloned repositories by default and does not require hosted graph or memory infrastructure.
AI coding agents fail less often because they cannot write code and more often because they cannot reliably answer repository-level questions:
- Where is the relevant code?
- What files are safe to change?
- Which tests or checks prove the change works?
- Which generated, vendored, deprecated, or noisy files should be ignored?
- What architecture boundaries and ownership rules matter?
ANI measures those navigability signals with static analysis plus repo-local navigation probes, then explains every recommendation in terms of the agent failure mode it should reduce.
ANI's narrow thesis is:
Better agent navigation in code leads to better agent performance.
This is grounded in repository-level agent and software-engineering research. ContextBench studies code-context recall, precision, retrieval efficiency, redundancy, and downstream task success. SWE-bench, SWE-agent, Agentless, AutoCodeRover, RepoBench, CrossCodeEval, and RepoCoder all reinforce the same practical point: realistic software tasks depend on localization, context selection, repository relationships, and verification behavior, not only code generation. Classic program-comprehension and modularity research adds the older lesson that boundaries, coupling, and information hiding affect how quickly a maintainer can find the right place to change.
ANI does not claim that its current formula is universally validated. It claims that navigability is a measurable component of agent performance, and it includes a paired before/after validation harness to test whether ANI recommendations move the metrics they target.
- Scores public GitHub repositories at exact refs or commits.
- Supports deterministic local fixture scans through `--local-path`.
- Performs safe static analysis: file inventory, repo classification, docs, CI/tests, ownership, generated/vendor detection, JS/TS and Python symbol extraction, best-effort dependency graph, and bounded git history metrics.
- Generates repo-local navigation probes from imports, symbols, tests, configs, routes, commands, and error strings.
- Runs a fixed lexical/path/symbol/graph retriever and measures target recall, context precision, noise in top-k, test discoverability, command confidence, and documentation utility.
- Computes a versioned ANI score, confidence, navigation dimension scores, static support dimensions, evidence cards, penalties, bonuses, and measurable recommendation contracts.
- Emits JSON and Markdown reports.
- Includes an internal recommendation-validation harness with positive/negative controls, headroom and adoption gates, metadata-only traces, and automatic deletion of cloned repositories after each case.
- Reads CSV or JSONL benchmark manifests and optional outcome files.
- Joins scans to outcomes by `outcome_id`.
- Produces grouped splits/folds, correlations, confidence intervals, quartile lift, baseline-vs-ANI comparison, dimension importance, and failure analysis.
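As a rough illustration of the join-and-correlate step, the sketch below joins scan rows to outcome rows on `outcome_id` and computes a plain Pearson correlation. It is not ANI's implementation: the `resolved_rate` field, the helper names, and the toy rows are all assumptions made up for this example, and the real harness may use different statistics.

```python
import math


def join_by_outcome_id(scans, outcomes):
    """Inner-join scan rows to outcome rows on the shared outcome_id key."""
    outcome_index = {row["outcome_id"]: row for row in outcomes}
    joined = []
    for scan in scans:
        outcome = outcome_index.get(scan["outcome_id"])
        if outcome is not None:  # scans without a recorded outcome are dropped
            joined.append({**scan, **outcome})
    return joined


def pearson(xs, ys):
    """Pearson correlation; returns None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Toy rows for illustration only; "resolved_rate" is a hypothetical outcome field.
scans = [
    {"outcome_id": "a", "ani_score": 40},
    {"outcome_id": "b", "ani_score": 60},
    {"outcome_id": "c", "ani_score": 80},
    {"outcome_id": "d", "ani_score": 90},  # no matching outcome row
]
outcomes = [
    {"outcome_id": "a", "resolved_rate": 0.2},
    {"outcome_id": "b", "resolved_rate": 0.4},
    {"outcome_id": "c", "resolved_rate": 0.6},
]

joined = join_by_outcome_id(scans, outcomes)
r = pearson(
    [row["ani_score"] for row in joined],
    [row["resolved_rate"] for row in joined],
)
print(len(joined), r)
```

An inner join is used here so that scans without outcomes cannot silently contribute to the correlation; with only a handful of rows, any coefficient should be read as inconclusive.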
```bash
git clone https://github.com/<your-org>/agent-navigability-index.git
cd agent-navigability-index
python -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev]"
python -m unittest discover -v
```

Score a public repository at an exact ref:
```bash
ani score https://github.com/org/repo \
  --ref 0123abcd4567ef \
  --format markdown \
  --output ani-report.md
```

Emit machine-readable JSON:
```bash
ani scan https://github.com/org/repo \
  --ref 0123abcd4567ef \
  --format json \
  --output ani-report.json
```

Run a benchmark validation manifest with public GitHub repos and exact refs:
```bash
ani validate path/to/benchmark_manifest.jsonl \
  --outcomes path/to/outcomes.jsonl \
  --out-dir ani_validation_artifacts
```

Run calibration before any public paired before/after recommendation validation:
```bash
ani validate-calibration validation/calibration_repos/manifest.jsonl \
  --out-dir validation/runs \
  --repeats 3 \
  --real-agent-backend codex
```

Run internal paired before/after recommendation validation only after calibration reports `calibration_ready_for_public_a_b`:
```bash
ani validate-recommendations validation_manifest.jsonl \
  --out-dir validation/runs \
  --workers 3 \
  --real-agent-backend codex \
  --calibration-report validation/runs/<calibration-run>/report.json
```

Recommendation validation is fail-closed. Deterministic navigation-agent runs are calibration controls; public before/after A/B mode requires a passing calibration report. ANI does not claim public recommendation efficacy unless a real LLM coding-agent trace adapter such as Codex CLI is explicitly enabled and the calibration, controls, headroom, adoption, and paired significance gates all pass.
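To make "fail-closed" concrete, here is a minimal sketch of how such a gate check could behave. The gate names mirror the prose above, but the function, the dictionary shape, and the `real_agent_enabled` flag are hypothetical and are not taken from ANI's code.

```python
# Gate names follow the prose above; this structure is an assumption, not ANI's schema.
REQUIRED_GATES = (
    "calibration",
    "controls",
    "headroom",
    "adoption",
    "paired_significance",
)


def may_claim_public_efficacy(gate_results, real_agent_enabled):
    """Fail closed: a missing, None, or False gate blocks the claim."""
    if not real_agent_enabled:
        return False
    return all(gate_results.get(gate) is True for gate in REQUIRED_GATES)


all_pass = {gate: True for gate in REQUIRED_GATES}
missing_one = {gate: True for gate in REQUIRED_GATES if gate != "headroom"}

print(may_claim_public_efficacy(all_pass, real_agent_enabled=True))     # True
print(may_claim_public_efficacy(missing_one, real_agent_enabled=True))  # False
print(may_claim_public_efficacy(all_pass, real_agent_enabled=False))    # False
```

The point of the `is True` comparison is that an absent or unevaluated gate counts as a failure rather than being skipped.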
Render or explain an existing scan:
```bash
ani report examples/sample_scan.json
ani explain examples/sample_scan.json
```

```json
{
  "schema_version": "2.0",
  "model_version": "ani-probe-v2",
  "repo": {
    "url": "https://github.com/example/ani-sample-repo",
    "ref": "62a9ec...",
    "commit_sha": "62a9ec..."
  },
  "scores": {
    "ani_score": 66,
    "distance_from_ideal": 34,
    "confidence": 96,
    "grade": "C",
    "dimension_scores": {
      "entry_point_clarity": 0.90,
      "verification_affordance": 0.85
    },
    "navigation_dimension_scores": {
      "target_recall": 0.82,
      "context_precision": 0.41,
      "verification_discoverability": 0.70
    },
    "navigation_metrics": {
      "file_recall_at_10": 0.83,
      "precision_at_10": 0.30,
      "noise_in_top_k": 0.12
    }
  },
  "probe_results": [
    {
      "family": "source_to_test",
      "query": "greet relevant tests",
      "target_artifacts": ["examples/sample_repo/tests/index.test.ts"],
      "file_recall_at_5": 1.0
    }
  ],
  "evidence": [
    {
      "dimension": "context_retrievability",
      "polarity": "negative",
      "claim": "Probe retrieval precision@10 is 30%, meaning agents would read substantial irrelevant context."
    }
  ],
  "recommendations": [
    {
      "title": "Add package-local source-to-test anchors for failed queries",
      "failure_mode": "For query `greet relevant tests`, the relevant test target ranks outside the top 5, so agents may skip the right regression check.",
      "observed_failures": [
        {
          "query": "greet relevant tests",
          "target_artifacts": ["examples/sample_repo/tests/index.test.ts"],
          "first_relevant_rank": 12,
          "top_irrelevant_results": [{"path": "examples/sample_repo/src/generated.ts", "rank": 2, "noise": true}]
        }
      ],
      "why_agents_waste_context": "The target test is buried behind generated and unrelated files, so an agent is likely to spend reads and tokens before reaching it.",
      "suggested_change": "Add `examples/sample_repo/tests/AGENTS.md` with the failed query terms and direct anchors to `examples/sample_repo/tests/index.test.ts`.",
      "expected_metric_movement": {
        "metric": "source_to_test_recall_at_5",
        "direction": "up",
        "target": "raise to at least 0.75 or improve by 20%"
      },
      "expected_real_agent_behavior_movement": {
        "target_file_recall": "up >= 10 percentage points",
        "files_read": "down >= 10%"
      },
      "validation_status": "probe_backed",
      "evidence_ids": ["ev_governance_safety_missing_codeowners_..."]
    }
  ]
}
```

See examples/sample_scan.json for a complete generated fixture output. The checked-in fixture artifacts use `--local-path`; see examples/README.md to regenerate them.
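For readers who want an intuition for fields such as `file_recall_at_10`, `precision_at_10`, `noise_in_top_k`, and `first_relevant_rank`, the sketch below computes comparable quantities from a single ranked retrieval list. The definitions are common information-retrieval conventions and are assumptions here; ANI's exact formulas may differ, and the function name and toy paths are made up.

```python
def navigation_metrics(ranked_paths, targets, noise_paths, k=10):
    """Compute simple top-k retrieval metrics for one probe.

    ranked_paths: retrieved file paths, best first (assumed unique).
    targets: set of paths the probe expects to find.
    noise_paths: set of generated/vendored paths treated as noise.
    """
    top_k = ranked_paths[:k]
    hits = [path for path in top_k if path in targets]
    recall = len(set(hits)) / len(targets) if targets else 0.0
    precision = len(hits) / len(top_k) if top_k else 0.0
    noise = (
        sum(1 for path in top_k if path in noise_paths) / len(top_k)
        if top_k
        else 0.0
    )
    # 1-based rank of the first relevant result anywhere in the list, or None.
    first_rank = next(
        (rank for rank, path in enumerate(ranked_paths, start=1) if path in targets),
        None,
    )
    return {
        f"file_recall_at_{k}": recall,
        f"precision_at_{k}": precision,
        "noise_in_top_k": noise,
        "first_relevant_rank": first_rank,
    }


# Toy ranking for illustration only.
ranked = ["src/index.ts", "src/generated.ts", "docs/README.md", "tests/index.test.ts"]
targets = {"tests/index.test.ts"}
noise = {"src/generated.ts"}

at_3 = navigation_metrics(ranked, targets, noise, k=3)
at_4 = navigation_metrics(ranked, targets, noise, k=4)
print(at_3)
print(at_4)
```

In the toy ranking the test file sits at rank 4, so it is missed at `k=3` and found at `k=4`, while the generated file counts as noise in both cases.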
```markdown
# Agent Navigability Index: example/ani-sample-repo

## Executive Summary

ANI scanned `example/ani-sample-repo` at commit `...` and produced an evidence-backed score of **66/100**.

## Score Card

| Metric | Value |
|---|---:|
| ANI score | 66 |
| Distance from ideal | 34 |
| Confidence | 96 |

## What To Fix First

Start with **Expose runtime and configuration contracts**...
```

See examples/sample_report.md for the complete sample report.
ANI includes two validation paths:
- Benchmark correlation validation asks whether ANI is associated with external agent outcomes. The checked-in 10-row pilot is small and inconclusive.
- Recommendation validation asks whether applying one ANI recommendation improves before/after agent behavior. It first uses static probes to select a causal candidate, then runs the same instrumented navigation agent on the original and patched checkout under fixed budgets.
The current checked-in public pilots support harness reproducibility and score sensitivity. They do not prove statistically significant external LLM-agent performance improvement. Until a paired recommendation run passes the real-agent significance gate, ANI should claim probe-backed navigability diagnosis, not validated agent outcome improvement.
See Validation Methodology, Recommendation Validation, and Benchmark Validation Pilot.
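A recommendation's `expected_metric_movement` target, such as "raise to at least 0.75 or improve by 20%" in the sample scan, can be read as a simple before/after check. The sketch below is one possible reading, assuming "improve by 20%" means a relative gain over the before value and that the metric should move up; the function name and parameters are hypothetical, and this check says nothing about statistical significance.

```python
def meets_movement_target(before, after, floor=0.75, relative_gain=0.20):
    """Return True if an upward-moving metric meets the contract target.

    Passes when the after value reaches the absolute floor, or when it
    improves on the before value by at least the relative gain.
    """
    if after >= floor:
        return True
    if before > 0:
        return after >= before * (1 + relative_gain)
    # With a zero baseline, relative gain is undefined, so only the floor applies.
    return False


print(meets_movement_target(before=0.40, after=0.80))  # True: reaches the floor
print(meets_movement_target(before=0.40, after=0.50))  # True: 25% relative gain
print(meets_movement_target(before=0.50, after=0.55))  # False: 10% relative gain
print(meets_movement_target(before=0.00, after=0.30))  # False: below floor, no baseline
```

A check like this only confirms that a probe metric moved as the contract predicted on one repository; the paired validation described above is what tests whether real agent behavior changes.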
ANI PRD2 scores eight navigation-first dimensions, each on a 0-1 scale:
- Target Recall
- Context Precision
- Relationship Recoverability
- Verification Discoverability
- Boundary Clarity
- Noise Resistance
- Command Confidence
- Documentation Utility
The scanner also keeps ten static support dimensions for explainability and confidence. The headline score is probe-first when enough probes exist, then blended with static support signals. Weights are defined in ani/scoring_weights.json.
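The "probe-first, then blended" idea can be pictured with a small sketch. Everything numeric here is an assumption: the unweighted means, the `min_probes` threshold, and the `probe_weight` value are placeholders invented for illustration, whereas ANI's actual weights live in ani/scoring_weights.json and its formula may be structured differently.

```python
def headline_score(nav_scores, static_scores, probe_count,
                   min_probes=5, probe_weight=0.7):
    """Blend 0-1 navigation and static dimension scores into a 0-100 score.

    Hypothetical rule: with too few probes, fall back to static signals only;
    otherwise weight the probe-derived mean more heavily than the static mean.
    """
    static_mean = sum(static_scores.values()) / len(static_scores)
    if probe_count < min_probes or not nav_scores:
        blended = static_mean
    else:
        nav_mean = sum(nav_scores.values()) / len(nav_scores)
        blended = probe_weight * nav_mean + (1 - probe_weight) * static_mean
    return round(100 * blended)


# Toy dimension values, not taken from a real scan.
nav = {"target_recall": 0.8, "context_precision": 0.4}
static = {"entry_point_clarity": 0.9}

print(headline_score(nav, static, probe_count=12))  # 69
print(headline_score(nav, static, probe_count=2))   # 90
```

The second call shows why probe count matters in this sketch: with too few probes the score rests entirely on static signals, which is one reason to read it alongside the confidence value.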
ANI separates score from confidence. Confidence answers: "How much of the repository did the scanner understand well enough for this scan to be usable?"
Inputs include clone/ref success, file coverage, parser coverage, dependency graph completeness, probe count/diversity, test/CI detection, generated/vendor exclusion, git history availability, and framework classifier confidence. If confidence drops below 50, ANI suppresses the grade and marks the scan as directional only.
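A consumer of the JSON report could mirror the confidence rule like this. The `scores.confidence`, `scores.grade`, and `scores.ani_score` fields come from the sample scan; the `directional_only` key and the function itself are hypothetical names used only for this sketch.

```python
CONFIDENCE_FLOOR = 50  # threshold stated in the text above


def present_scores(scan):
    """Suppress the grade and flag the scan when confidence is below the floor."""
    scores = scan["scores"]
    low_confidence = scores["confidence"] < CONFIDENCE_FLOOR
    return {
        "ani_score": scores["ani_score"],
        "grade": None if low_confidence else scores["grade"],
        "directional_only": low_confidence,  # hypothetical flag name
    }


confident = {"scores": {"ani_score": 66, "confidence": 96, "grade": "C"}}
uncertain = {"scores": {"ani_score": 58, "confidence": 42, "grade": "D"}}

print(present_scores(confident))
print(present_scores(uncertain))
```

The numeric score is kept in both cases so that a low-confidence scan can still be shown as directional, just without a grade attached.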
ANI is safe by default:
- It clones/fetches repositories and reads files.
- It parses static metadata, source text, manifests, docs, and git history.
- It creates and runs local navigation probes over a temporary index.
- Internal recommendation validation runs an instrumented navigation agent that searches and reads files under fixed budgets.
- Real-agent recommendation validation, when explicitly enabled, runs Codex CLI in temporary checkouts and stores metadata-only traces.
- It does not install dependencies.
- It does not run package scripts.
- Public scoring and deterministic calibration do not run tests from the target repository; explicit real-agent validation may capture test commands in temporary checkouts.
- It does not import target Python modules.
- It does not execute target binaries.
- It does not call external LLM APIs with repository source.
- It deletes temporary cloned repositories by default.
- Validation artifacts store metadata, metrics, changed paths, and summarized tool/file-read traces, not source snapshots.
- ANI supports public GitHub repositories and local fixture/debug scans. Private repo auth is not implemented.
- JS/TS and Python have the best static symbol extraction today. Other languages reduce confidence and fall back to file/package-level signals.
- Probe quality depends on the repository exposing enough self-supervised signals. Sparse repositories may receive directional scores.
- The validation harness is included, but ANI is not globally benchmark-validated until representative paired validation passes the declared significance gate.
- GitHub stars and some hosted metadata are represented as nullable baseline fields, not fetched by default.
- The model is deterministic and transparent, but still heuristic until calibrated against substantial outcomes.
- No web UI, SSO, RBAC, audit logs, retention controls, remediation PRs, or hosted scan history are included in v0.
- Architecture
- Scoring Model
- Recommendation Contracts
- Validation Methodology
- Benchmark Validation Pilot
- Recommendation Validation Pilot
- Enterprise Readiness
- Runtime Configuration
- Self Assessment
- Larger benchmark runs across public corpora.
- Weight calibration from held-out benchmark outcomes.
- Richer language support and dependency graph coverage.
- GitHub metadata enrichment with rate-limit-aware caching.
- Hosted report viewer and comparison workflow.
- Private repo support with explicit auth, retention, audit, and enterprise controls.
- Remediation planning and optional PR workflows after validation.
MIT. See LICENSE.