
feat(audit): AI-SDLC measurement engine — weighted-category scoring + deterministic engine + 3-tab report (full re-architecture)#139

Open
AlexanderMakarov wants to merge 238 commits into main from feat/ai-sdlc-metrics


AlexanderMakarov (Contributor) commented Jun 24, 2026

Re-architects /awos:ai-readiness-audit from a fixed-ceiling A–F/0–100 grade into an additive, weighted, file-defined capability model with a deterministic measurement engine, and adds a single-page drill-down HTML report. (Supersedes the original Phase 0+A scope of this PR — now the full re-architecture.)

What changed

  • Weighted-category scoring (Phase A): references/standards.toml defines every capability category (code/weight/definition/applies_when/source). Score = Σ awarded weights (uncapped) + a coverage % "relative to today's standard". No grade, no 0–100.
  • Determinism — the headline fix. Every category declares a method: computed/detected are evaluated by a deterministic TypeScript engine (the auditor uses the verdict verbatim); only genuinely-semantic judgment checks use an LLM against a fixed rubric. This eliminates the ~40-point run-to-run variance the old LLM-judged audit produced.
  • TypeScript engine bundled by esbuild to a single committed dist/cli.js (verbs: collect/detect/metric/standards/render/rollup/audit-core/aggregate/enrich/progress) — runs with plain node, no install at audit time. Layers: 4 collectors (git/ci/tracker/docs) · per-dimension detectors · ADP metrics (DORA-style: lead time, deploy freq, change-fail, MTTR git-proxy, AI-attribution, work-mix, …) · complexity/scale via bundled web-tree-sitter (multi-language).
  • JSON is the source of truth. audit-core writes per-dimension JSON + audit.json in one pass; report.md + the self-contained single-page report.html (hover hint on every number, hash-routed drill-down per dimension) are rendered deterministically from JSON — nothing dropped.
  • Org mode (multi-repo discover → portfolio metrics + rollup), monthly history (value_series), progress/ETA (interactive + headless), headless-first (auto-generates HTML).
  • Plugin + marketplace bumped to 2.3.0.
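
A minimal sketch of the additive score and coverage arithmetic described above, for readers who want to see it concretely. The `CheckResult` shape, the check weights, and the `summarize` name are assumptions made for illustration; the engine's real types are not shown in this description. It assumes a check awards `weight × score` and that SKIP checks drop out of the coverage denominator.

```typescript
// Hypothetical shapes: the engine's real types are not shown in this PR description.
type Status = "PASS" | "WARN" | "FAIL" | "SKIP";

interface CheckResult {
  id: string;
  weight: number; // category weight, as declared in standards.toml
  score: number; // fraction of the weight awarded, in [0, 1]
  status: Status;
}

interface ScoreSummary {
  awarded: number; // sum of awarded weights (no upper cap)
  applicable: number; // sum of weights of checks that were not skipped
  coverage: number; // awarded / applicable, or 0 when nothing applies
}

function summarize(checks: CheckResult[]): ScoreSummary {
  let awarded = 0;
  let applicable = 0;
  for (const check of checks) {
    if (check.status === "SKIP") continue; // excluded from the coverage denominator
    applicable += check.weight;
    awarded += check.weight * check.score;
  }
  const coverage = applicable === 0 ? 0 : awarded / applicable;
  return { awarded, applicable, coverage };
}

// Illustrative weights only; they are not taken from standards.toml.
const summary = summarize([
  { id: "QA-05", weight: 4, score: 1, status: "PASS" },
  { id: "ARCH-02", weight: 6, score: 0, status: "SKIP" },
  { id: "DOC-06", weight: 4, score: 0.5, status: "WARN" },
]);
console.log(summary); // { awarded: 6, applicable: 8, coverage: 0.75 }
```

Because the total is a plain sum, adding a new category raises the achievable score without rescaling existing ones, while coverage stays comparable across repos with different applicable sets.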

Metric correctness & taxonomy restructure

Min/max fixture testing surfaced three classes of metric defects (evidence dossier: docs/audit-metric-issues/). All fixed in four phases, with the guiding acceptance criterion that every scored metric must reach 0 on a worst-case project and its max on a best-case one:

  1. Explicit bugs — vitest/supertest no longer classify unit suites as e2e (QA-05); test files excluded from the ARCH-05 naming check; the audit no longer scans its own context/audits/ output (self-pollution inflated every run by +12 pts); orchestrator score patches are clamped to [0,1] and reconciled with status; DOC-06 carries its own evidence line; duplicate SBP-06 check id split; object values render as k=v, never [object Object].
  2. SKIP-on-absence — ARCH-02, SDD-03/05/06, SEC-05, SBP-08 emit SKIP (excluded from coverage) when their precondition is absent, so an empty repo no longer reads as compliant.
  3. Taxonomy restructure — new Delivery Flow dimension (the DORA family: DF-01..07, out of the AI-SDLC Adoption grab-bag) and a new unscored Descriptors dimension (contributors/churn/complexity/scale/deps, INFO badge, weight 0 — size describes a repo, it doesn't grade it). The broad Security dimension dissolved into Application Security (AS-12..14) and AI Security (renamed from Prompt & Agent Integrity, AIS-01..07); End-to-End Delivery dissolved into SBP/Documentation/Code Architecture (SBP-09/10, DOC-07, ARCH-07). Dimension order is data ([meta].dimension_order): industry-standard engineering first, AI-frontier last, descriptors at the end; each dimension's description renders as a hover tooltip on the report summary. Weight spread across scored dimensions went from 16–139 to 27–86.
  4. Squash-merge awareness — squash/rebase-merged PRs (GitHub/Azure DevOps/Bitbucket/GitLab formats) now count as merge events attributed to the PR author, with the workflow detected as merge-commit/squash/mixed; merge-record proxies (lead time, PR cycle, review rework) SKIP with a connector-pointing reason on squash repos instead of reporting confident wrong numbers. On the real evidence repo this turned "19 merges, all one maintainer" into 114 merge events across the 4 actual PR authors.

Validation

Full gate green: engine 1002 / lint 83 / installer 42 / fixtures 5, prettier + build clean. Run end-to-end on a real repo (onex-discovery-api): produces the full JSON artifact set + report.html; SBP-08 caught 2 real except A,B: syntax bugs; squash-merge detection validated against the same repo's Azure DevOps history.

Notes

  • Committed dist/ is ~24 MB (broad web-tree-sitter grammar set, by request) — trimmable to a core set later.
  • Engine is validated on Node and SKILL.md preflights/invokes node as the supported runtime; the bundled dist/cli.js also smoke-runs under Bun (incl. the tree-sitter wasm path), but the dev toolchain (node:test/tsx/esbuild) is Node-only. ${CLAUDE_SKILL_DIR} resolves the bundled CLI at audit time.
  • Follow-up: refresh the awos-qa min/max fixtures for the new check ids (DF-*, DESC-*, AIS-*, AS-12..14, SBP-08..10, DOC-07, ARCH-07) to prove the 0→max criterion end-to-end.

See https://provectus.slack.com/archives/C09GCR80NC8/p1782392880789469 for more details.

Aleksandr Makarov added 16 commits June 23, 2026 13:22
Self-contained, zero-context implementation spec capturing the full
approved design (decision log, standards.toml schema, collector/metric
contracts, phased task list) plus the CEO/CTO exec-deliverable mock.
Supersedes the pre-pivot draft under docs/superpowers/ (git-ignored).
coderabbitai Bot commented Jun 24, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

AWOS audit metadata is bumped to 2.5.0, and the audit engine is expanded with deterministic collectors, registries, detectors, metrics, rollup logic, and HTML/Markdown rendering. Documentation, standards, and tests are updated to match the new weighted-category scoring and provenance flow.

Changes

AWOS audit engine refresh

Layer / File(s) Summary
Release framing, scoring model, and build wiring
.claude-plugin/marketplace.json, plugins/awos/.claude-plugin/plugin.json, package.json, .prettierignore, .gitattributes, .gitignore, .github/workflows/quality-check.yml, docs/design/*, plugins/awos/README.md, CLAUDE.md, plugins/awos/skills/ai-readiness-audit/{SKILL.md,scoring.md,output-format.md,report-template.md}, docs/design/2026-06-25-audit-hardening-design.md, docs/design/2026-06-26-audit-hardening-plan.md, docs/design/2026-06-26-audit-fairness-and-report-v2-design.md, docs/design/2026-06-26-audit-fairness-and-report-v2-plan.md, docs/design/2026-06-26-report-honesty-and-provenance-design.md, docs/design/2026-06-26-report-honesty-and-provenance-plan.md
version is updated to 2.5.0, build and workflow scripts are adjusted, and the design and product docs are rewritten around the deterministic engine, weighted scoring, and provenance-oriented reporting.
Engine foundations and shared registries
plugins/awos/skills/ai-readiness-audit/{generated.ts,generated.test.ts,frameworks.ts,frameworks.test.ts,ci_platforms.ts,ci_platforms.test.ts,languages.ts,languages.test.ts,progress.ts,agent_tools.ts,agent_tools.test.ts}, plugins/awos/skills/ai-readiness-audit/metrics/_ast.ts, plugins/awos/skills/ai-readiness-audit/detectors/_base.ts, plugins/awos/skills/ai-readiness-audit/collectors/README.md, plugins/awos/skills/ai-readiness-audit/tests/collector-base.test.ts
Adds shared generated-path, framework-auth, CI, language, progress, agent-tool, and Tree-sitter helpers plus the base detector and collector contracts used across the engine.
Collectors, CLI wiring, and audit orchestration
plugins/awos/skills/ai-readiness-audit/{collectors/_base.ts,collectors/ci.ts,collectors/git.ts,collectors/docs.ts,collectors/tracker.ts,cli.ts,audit_core.ts}, plugins/awos/skills/ai-readiness-audit/tests/{ci-collector.test.ts,cli.test.ts}, plugins/awos/skills/ai-readiness-audit/ci_platforms.test.ts
Adds collector implementations, CLI subcommands, audit-core aggregation, and command/collector smoke coverage.
AI tooling, architecture, documentation, and delivery detectors
plugins/awos/skills/ai-readiness-audit/detectors/{ai_development_tooling.ts,code_architecture.ts,documentation.ts,end_to_end_delivery.ts}, plugins/awos/skills/ai-readiness-audit/detectors/{ai_development_tooling_ai04.test.ts,code_architecture_arch06.test.ts,documentation_doc04.test.ts,det-end-to-end-delivery.test.ts}, plugins/awos/skills/ai-readiness-audit/dimensions/{ai-development-tooling.md,code-architecture.md,documentation.md,end-to-end-delivery.md}
Adds the AI tooling, architecture, documentation, and delivery detectors, their tests, and the matching category metadata in the dimension specs.
Prompt integrity, repository security, and application security detectors
plugins/awos/skills/ai-readiness-audit/detectors/{prompt_agent_integrity.ts,security.ts,application_security.ts}, plugins/awos/skills/ai-readiness-audit/detectors/{det-prompt-agent-integrity.test.ts,prompt_agent_integrity_local.test.ts,security_sec05.test.ts,application_security_as03.test.ts,application_security_as06.test.ts,det-application-security.test.ts}, plugins/awos/skills/ai-readiness-audit/dimensions/{prompt-agent-integrity.md,security.md,application-security.md}
Adds the prompt integrity, repository security, and application security detectors with matching dimension metadata and coverage.
QA, best-practices, SDD, and supply-chain detectors
plugins/awos/skills/ai-readiness-audit/detectors/{quality_assurance.ts,software_best_practices.ts,spec_driven_development.ts,supply_chain_security.ts}, plugins/awos/skills/ai-readiness-audit/detectors/{det-code-architecture.test.ts,det-documentation.test.ts,det-prompt-agent-integrity.test.ts,spec_driven_development_sdd05.test.ts}, plugins/awos/skills/ai-readiness-audit/dimensions/{quality-assurance.md,software-best-practices.md,spec-driven-development.md,supply-chain-security.md}
Adds the QA, software-practice, spec-driven, and supply-chain detectors and updates the related dimension category metadata and tests.
Metrics, rollup, and rendering
plugins/awos/skills/ai-readiness-audit/{metrics/*,render.ts}, plugins/awos/skills/ai-readiness-audit/tests/{adp_g13_doc_coverage.test.ts}, plugins/awos/skills/ai-readiness-audit/tests/cli.test.ts
Adds metric helpers, ADP metrics, org rollup, the doc-comment coverage metric test, and deterministic Markdown/HTML rendering.
Validation and engine tests
tests/lint-prompts.test.js, .claude-plugin/marketplace.json, plugins/awos/skills/ai-readiness-audit/tests/*.test.ts
Exercises the prompt contracts, version alignment, and detector/collector/metric wiring across the new audit dimensions.

Sequence Diagram(s)

sequenceDiagram
  participant cli_ts
  participant audit_core_ts
  participant collectors_ts
  participant metrics_ts
  participant render_ts
  cli_ts->>audit_core_ts: auditCore(repoPath, outDir, DETECTORS, METRICS, standardsPath)
  audit_core_ts->>collectors_ts: write collected/*.json
  audit_core_ts->>metrics_ts: compute per-dimension JSON
  audit_core_ts->>cli_ts: return AuditCoreSummary
  cli_ts->>render_ts: renderMarkdown(audit.json) or renderHtml(audit.json)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

  • provectus/awos#115: Extends the same ai-readiness-audit dimension metadata and category tagging that this PR continues to refine.
  • provectus/awos#116: Also changes the AWOS audit orchestration and prompt contracts that this PR updates.

Suggested labels

enhancement

Suggested reviewers

  • kmakarychev-dev
  • workshur

Poem

I thumped through versions, bright and new,
🐰 With weighted scores and render glue.
I tucked the metrics in my burrow,
Then hopped through tests from dawn till morrow.
The audit garden blooms in code.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the PR’s main theme: weighted scoring, deterministic engine work, and report redesign. |


Comment @coderabbitai help to get the list of available commands.

coderabbitai Bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (1)
tests/lint-prompts.test.js (1)

1238-1241: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Parse only check headings when building dimension blocks.

Line 1239 splits on every ### heading, but the contract is about ### CODE-NN check blocks. This is brittle if non-check subheadings are added later.

Suggested fix
-    // Split into check blocks by the "### CODE-NN:" headings.
-    const blocks = body.split(/^### /m).slice(1);
+    // Split into check blocks by the "### CODE-NN:" headings.
+    const blocks = body.split(/^###\s+[A-Z]+-\d+:/m).slice(1);
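
One detail worth checking before applying the suggestion: `String.prototype.split` consumes the separator, so the narrower pattern also removes the check id from the start of each block. Whether that matters depends on how the loop that follows reads each block, which is not shown here. The sketch below uses a made-up `body` and hypothetical check ids to compare the two patterns, plus one possible alternative (a capturing group) that keeps the id.

```typescript
// Made-up dimension body with one non-check subheading ("Notes").
const body = [
  "## Checks",
  "### QA-01: Unit tests exist",
  "details a",
  "### Notes",
  "free-form note",
  "### QA-02: E2E tests exist",
  "details b",
].join("\n");

// Current pattern: every "### " heading starts a block, so "Notes" is counted too.
const broad = body.split(/^### /m).slice(1);

// Suggested pattern: only check headings split, but the id is consumed with the separator.
const narrow = body.split(/^###\s+[A-Z]+-\d+:/m).slice(1);

// Alternative (an assumption, not part of the review): a capturing group makes
// split() return the ids between the text parts, so they can be paired back up.
const parts = body.split(/^###\s+([A-Z]+-\d+):/m);
const blocks: { id: string; text: string }[] = [];
for (let i = 1; i + 1 < parts.length; i += 2) {
  blocks.push({ id: parts[i] ?? "", text: parts[i + 1] ?? "" });
}

console.log(broad.length); // 3
console.log(narrow.length); // 2
console.log(blocks.map((b) => b.id)); // [ 'QA-01', 'QA-02' ]
```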
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/lint-prompts.test.js` around lines 1238 - 1241, The block parsing in
the lint prompt test is too broad because it splits on every “###” heading
instead of only the check-block headings. Update the logic around body.split and
the subsequent block-processing loop in the test to match only “### CODE-NN”
headings, so extra non-check subheadings are ignored. Keep the rest of the
dimension-block extraction flow the same, but ensure the heading filter is
specific to the contract enforced by this test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/ai-sdlc-exec-deliverable.md`:
- Line 16: The Markdown ASCII diagram fences in the document are unlabeled,
which breaks lint/tooling consistency; update the fenced blocks in the affected
sections to use a language tag such as text. Make this change for each ASCII
block by editing the relevant fenced blocks in the document so the existing
content stays the same but the fences are explicitly labeled, matching the style
used by other Markdown examples.

In `@docs/design/ai-sdlc-measurement-and-scoring-plan.md`:
- Line 40: The implementation contract in the metrics section uses a
machine-local absolute path, which should be removed for portability. Update the
reference in the metrics/ description to point generically to the existing
complexity scanner using a repo-relative or tool-agnostic identifier instead of
/Users/aleksandrmakarov/code/scripts/complexity-scan.py, while keeping the rest
of the metrics contract unchanged.

In
`@plugins/awos/skills/ai-readiness-audit/references/ai-sdlc-metrics-catalog.md`:
- Line 31: The ADP-G7 revert pattern is too narrow because the `^Revert"` match
misses standard revert subjects like `Revert "..."`, which can undercount
failures in the metrics guidance. Update the revert/rollback pattern in the
ADP-G7 entry of the metrics catalog so it matches the common spaced form used by
revert commits, while keeping the existing hotfix and rollback terms intact.

In `@plugins/awos/skills/ai-readiness-audit/references/data-sources.md`:
- Around line 98-101: The global SKIP rule in data-sources.md conflicts with
metric-specific requirements such as MTTR’s need for a real incident source.
Update the wording around the partial source and SKIP rule sections to state
that the generic fallback applies only unless a metric defines stricter
required-source contracts, and explicitly note that metric-specific rules like
MTTR override the default behavior. Keep the guidance aligned with the existing
metrics/ layer language so readers can tell when a metric should still skip
despite partial source availability.

In `@plugins/awos/skills/ai-readiness-audit/scoring.md`:
- Around line 20-22: Add a language tag to each unlabeled fenced code block in
the scoring markdown so it passes MD040. Update the fenced examples around the
dimension_score, coverage_ratio, and audit_total formulas to use a consistent
annotation such as text, keeping the existing content unchanged. Use the
markdown sections containing those formulas as the target for the fix.

In `@tests/lint-prompts.test.js`:
- Line 1118: The lint prompt test is using an overly broad regex that can match
unrelated words like “degraded” or “upgrade.” Update the assertion in the test
around assert.doesNotMatch in tests/lint-prompts.test.js to use a bounded
“grade” pattern that only matches the standalone word, keeping the intent of
dimension-auditor must not emit a grade while avoiding false failures.
- Around line 1008-1030: The current check in lint-prompts.test.js only verifies
that required keys exist somewhere in standards.toml, so malformed [category.*]
tables can still pass. Update the assertion logic around the existing
[category.*] scan to validate each category table block individually, using the
same symbols/loop in the test, and ensure every table contains all required keys
within its own section rather than relying on file-wide matches.

---

Nitpick comments:
In `@tests/lint-prompts.test.js`:
- Around line 1238-1241: The block parsing in the lint prompt test is too broad
because it splits on every “###” heading instead of only the check-block
headings. Update the logic around body.split and the subsequent block-processing
loop in the test to match only “### CODE-NN” headings, so extra non-check
subheadings are ignored. Keep the rest of the dimension-block extraction flow
the same, but ensure the heading filter is specific to the contract enforced by
this test.
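
For the bounded "grade" finding on line 1118 above, a quick illustration of the difference. The exact regex in the test is not shown in this review, so both patterns below are assumptions about its shape rather than the test's actual code.

```typescript
// Assumed shapes of the patterns; the real assertion is not shown in the review.
const unbounded = /grade/i;
const bounded = /\bgrade\b/i;

const samples = [
  "confidence is degraded on squash repos",
  "upgrade the lockfile first",
  "the auditor must not emit a grade",
];

for (const text of samples) {
  console.log(`${unbounded.test(text)} ${bounded.test(text)} ${text}`);
}
// true false confidence is degraded on squash repos
// true false upgrade the lockfile first
// true true the auditor must not emit a grade
```

Note that `\bgrade\b` also stops matching plural or inflected forms such as "grades" or "graded", so the bounded pattern would need extending if those should be rejected too.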

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 63f87513-b53d-4915-871a-9d0780bf7a55

📥 Commits

Reviewing files that changed from the base of the PR and between 7df798b and a453584.

📒 Files selected for processing (25)
  • .claude-plugin/marketplace.json
  • docs/design/ai-sdlc-exec-deliverable.md
  • docs/design/ai-sdlc-measurement-and-scoring-plan.md
  • plugins/awos/.claude-plugin/plugin.json
  • plugins/awos/agents/dimension-auditor.md
  • plugins/awos/skills/ai-readiness-audit/SKILL.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/ai-development-tooling.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/code-architecture.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/documentation.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/end-to-end-delivery.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/project-topology.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/prompt-agent-integrity.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/quality-assurance.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/security.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/software-best-practices.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/spec-driven-development.md
  • plugins/awos/skills/ai-readiness-audit/dimensions/supply-chain-security.md
  • plugins/awos/skills/ai-readiness-audit/output-format.md
  • plugins/awos/skills/ai-readiness-audit/references/ai-sdlc-metrics-catalog.md
  • plugins/awos/skills/ai-readiness-audit/references/data-sources.md
  • plugins/awos/skills/ai-readiness-audit/references/standards.md
  • plugins/awos/skills/ai-readiness-audit/references/standards.toml
  • plugins/awos/skills/ai-readiness-audit/report-template.md
  • plugins/awos/skills/ai-readiness-audit/scoring.md
  • tests/lint-prompts.test.js

Comment thread docs/design/ai-sdlc-exec-deliverable.md Outdated
Comment thread docs/design/ai-sdlc-measurement-and-scoring-plan.md Outdated
Comment thread plugins/awos/skills/ai-readiness-audit/references/ai-sdlc-metrics-catalog.md Outdated
Comment thread plugins/awos/skills/ai-readiness-audit/references/data-sources.md
Comment thread plugins/awos/skills/ai-readiness-audit/scoring.md Outdated
Comment thread tests/lint-prompts.test.js
Comment thread tests/lint-prompts.test.js Outdated
Aleksandr Makarov added 11 commits June 24, 2026 14:32
… bundle + CI job)

- Install devDeps: typescript, tsx, esbuild, @types/node; runtime: smol-toml
- Add tsconfig.json (NodeNext, strict, allowImportingTsExtensions, noEmit)
- Add tests/helpers.ts: loadStandards() via smol-toml, writeCollected() fixture helper
- Add tests/smoke.test.ts: asserts meta.monthly_bucket_days===30, max_lookback_days===730
- Add scripts/build-engine.mjs: esbuild driver bundling collectors/detectors/metrics entrypoints
- Create collectors/, detectors/, metrics/, dist/ scaffold dirs with .gitkeep
- Add test:engine and build:engine scripts; fold test:engine into npm test
- Add node-engine CI job to quality-check.yml
- Add TS scaffold presence lint check to tests/lint-prompts.test.js (56 tests, all pass)
…mplify build-engine

- Add `npm ci` step to the `test` job in quality-check.yml so tsx is available
  when `npm test` chains into `test:engine` (fixes MODULE_NOT_FOUND on CI).
- Remove `allowScripts` from package.json (Bun-only convention; npm ignores it).
- Replace dynamic `import('node:fs')` in build-engine.mjs with a direct
  `writeFileSync` call, adding it to the existing static import.
…categories + lint + schema test

Classifies every [category.*] table in standards.toml with a method
field: computed (numeric result), detected (deterministic boolean from
regex/glob/AST/config), or judgment (semantic sampling required).
The 7 judgment categories additionally carry rubric and evidence_required.

Both the JS regex lint (tests/lint-prompts.test.js) and a new TypeScript
engine schema test (standards-schema.test.ts) guard the vocabulary and
the judgment-requires-rubric contract. standards.md documents the Method
section. Prettier clean.
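
A rough sketch of the contract those two guards enforce, assuming the category tables have already been parsed into plain objects. The `Category` shape, the `validateCategory` name, the string type of `evidence_required`, and the `doc_example_judgment` entry are illustrative assumptions; only `sbp_except_clause_syntax` (code 2706, method=detected) comes from this PR.

```typescript
// Hypothetical parsed shape of a [category.*] table.
interface Category {
  code: number;
  method: string;
  rubric?: string;
  evidence_required?: string; // assumed to be a string; the real type is not shown
}

const METHODS: readonly string[] = ["computed", "detected", "judgment"];

function validateCategory(name: string, cat: Category): string[] {
  const errors: string[] = [];
  if (!METHODS.includes(cat.method)) {
    errors.push(`${name}: unknown method "${cat.method}"`);
  }
  if (cat.method === "judgment") {
    // Judgment categories must carry the fixed rubric the LLM is held to.
    if (!cat.rubric) errors.push(`${name}: judgment category needs a rubric`);
    if (!cat.evidence_required) errors.push(`${name}: judgment category needs evidence_required`);
  }
  return errors;
}

const categories: Record<string, Category> = {
  sbp_except_clause_syntax: { code: 2706, method: "detected" },
  // Hypothetical entry showing a contract violation.
  doc_example_judgment: { code: 9999, method: "judgment", rubric: "Sample three files and compare." },
};

const problems = Object.entries(categories).flatMap(([name, cat]) => validateCategory(name, cat));
console.log(problems); // [ 'doc_example_judgment: judgment category needs evidence_required' ]
```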
Adds collectors/git.ts (always-available Tier-G collector) that shells to
git via execFileSync to gather default_branch, monthly_buckets,
merge_records, revert_merges, total_merges, ai_marked_commits,
total_commits, tooling_paths, and numstat_totals. Hermetic node:test
suite builds a throwaway git repo with pinned GIT_AUTHOR_DATE /
GIT_COMMITTER_DATE for fully deterministic assertions. Also adds
plugins/awos/skills/ai-readiness-audit/dist/ to .prettierignore so
generated bundles are not flagged by prettier.
… (drop Date.now), remove dead code

- buildMonthlyBuckets: window end is now the latest committer date from git
  (git log --all --format=%cI --max-count=1); since = latestCommitDate − lookback_days.
  Date.now() is gone — buckets are a pure function of git history + period params.
- Removed dead rangeOut call in getMergeRecords (fired nonsensical sha^2..sha^2 range,
  result was discarded immediately).
- Removed unused countLines helper.
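
To illustrate the determinism point, here is a minimal sketch of bucket construction as a pure function of the latest commit date and the period parameters. The `buildBuckets` name, the `Bucket` shape, and the choice to anchor buckets at the window start are assumptions; the actual `buildMonthlyBuckets` may align its windows differently.

```typescript
interface Bucket {
  start: Date;
  end: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// No Date.now(): the same history and parameters always give the same buckets.
function buildBuckets(latestCommit: Date, lookbackDays: number, bucketDays: number): Bucket[] {
  const endMs = latestCommit.getTime();
  const sinceMs = endMs - lookbackDays * DAY_MS;
  const buckets: Bucket[] = [];
  for (let start = sinceMs; start < endMs; start += bucketDays * DAY_MS) {
    // The last bucket is clipped when lookbackDays is not a multiple of bucketDays.
    const end = Math.min(start + bucketDays * DAY_MS, endMs);
    buckets.push({ start: new Date(start), end: new Date(end) });
  }
  return buckets;
}

// The latest commit date would come from `git log --all --format=%cI --max-count=1`.
const latest = new Date("2026-06-24T12:00:00Z");
const buckets = buildBuckets(latest, 90, 30);
console.log(buckets.length); // 3
```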
…ause syntax fix

- Add detectors/software_best_practices.ts with three detectors:
  detectExceptClauseDefect (2706): FAILs on Python-2 `except A, B:` syntax
  detectErrorHandling (2704): heuristic over catch/except blocks — FAIL/WARN/PASS
  detectLockfiles (2705): PASS if any recognised lockfile present
- Add DETECTORS map: { 2704, 2705, 2706 } → detect functions
- Add [category.sbp_except_clause_syntax] (code 2706, method=detected) to standards.toml
- Update SBP-06 in software-best-practices.md: Category 2704, 2706
- 14 hermetic unit tests; all gates green (39 engine, 56 lint-prompts, build:engine, prettier)
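
For reference, a line-based sketch of what flagging the Python-2 `except A, B:` form can look like. This is an illustration only: the regex and the `findExceptClauseDefects` name are assumptions, it ignores comments and string literals, and the real detectExceptClauseDefect may work differently.

```typescript
// Matches `except Name, Name:` but not the valid tuple form `except (A, B):`
// or `except A as e:`.
const PY2_EXCEPT = /^\s*except\s+[A-Za-z_][\w.]*\s*,\s*[A-Za-z_][\w.]*\s*:/;

// Returns the 1-based line numbers that use the Python-2 form.
function findExceptClauseDefects(source: string): number[] {
  const hits: number[] = [];
  source.split("\n").forEach((line, index) => {
    if (PY2_EXCEPT.test(line)) hits.push(index + 1);
  });
  return hits;
}

const sample = [
  "try:",
  "    run()",
  "except ValueError, TypeError:",
  "    pass",
  "except (KeyError, IndexError):",
  "    pass",
  "except OSError as e:",
  "    pass",
].join("\n");

console.log(findExceptClauseDefects(sample)); // [ 3 ]
```

In Python 3 the flagged form is a syntax error, and in Python 2 it meant "catch A and bind it to B", so it was rarely what the author intended either way.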
…llect/detect/metric)

Replace per-module multi-entrypoint bundling with a single cli.ts that esbuild
inlines into dist/cli.js. The dispatcher handles `collect <source> <repoPath>`,
`detect <code> <repoPath>`, and `metric <id>` (stubbed, exits non-zero until
metric modules land). build-engine.mjs cleans dist/ before building so all
stale flat + nested files are removed. Adds hermetic smoke tests in cli.test.ts.
Aleksandr Makarov added 30 commits July 2, 2026 10:22
A check whose precondition is absent used to award a vacuous PASS,
letting an empty or low-maturity repo read as compliant (dossier
02-metric-range-and-interference.md §A3). Absence now emits SKIP,
which excludes the check from the coverage denominator:

- ARCH-02: no source files, or no files under recognised layer dirs
- SDD-03: no architecture document to match against
- SDD-05/SDD-06: no spec directories
- SEC-05: no sensitive file types relevant to the stack
- SBP-08: no Python source (belt-and-braces with topology.has_python)

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Restructures the dimension taxonomy so each dimension is a coherent
capability area and the report reads industry-standard engineering
first, AI-frontier last (scores change; audit is unreleased):

- New Delivery Flow dimension: the DORA family moves out of the
  ai-sdlc-adoption grab-bag — DF-01 deploy frequency, DF-02 lead time,
  DF-03 PR cycle time, DF-04 change-failure rate, DF-05 review rework,
  DF-06 rework rate, DF-07 MTTR.
- New Descriptors dimension (unscored, rendered last): DESC-01
  contributors, DESC-02 churn, DESC-03 complexity, DESC-04 scale,
  DESC-05 dependency counts. All weight 0 with a neutral INFO badge —
  size/activity describe a repo, they don't grade it — and the headline
  Merges/LOC throughput echo moves onto this page. Fixes the
  cannot-reach-zero saturators (dossier §A2/C2).
- Security dimension dissolved: SEC-01/03/05 → Application Security
  (AS-12 env gitignored, AS-13 env template, AS-14 sensitive-file ignore
  coverage); SEC-02 → AI Security (AIS-07 agent guardrail hooks);
  SEC-04 dropped — it duplicated AS-05 (no hardcoded secrets).
- End-to-End Delivery dimension dissolved: E2E-01/04 → Software Best
  Practices (SBP-09 vertical delivery, SBP-10 no orphaned artifacts),
  E2E-03 → Documentation (DOC-07 spec traceability), E2E-05 → Code
  Architecture (ARCH-07 cross-layer tooling). Mis-seeded source fields
  corrected (SBP-09 is a git signal, not a DORA citation).
- prompt-agent-integrity renamed to ai-security (PAI-NN → AIS-NN).
- Dimension order is data: standards.toml [meta].dimension_order;
  audit-core stamps order/title/description onto every artifact,
  aggregate preserves it, and the renderer shows each dimension's
  description as a hover tooltip on the summary row and dim page.
- standards.toml physically regrouped by the new order.

Weight spread across scored dimensions is now 27-86 (was 16-139).

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Squash/rebase-merging a PR leaves no merge commit, so every metric built
on `git log --merges` silently read 0 on healthy repos, and per-author
merge counts credited only whoever clicked the merge button (dossier
03-squash-merge-blind-spot.md).

- The git collector now counts squash-merged PRs as merge events:
  first-parent non-merge trunk commits carrying a forge PR ref — GitHub
  "Title (#123)", Azure DevOps "Merged PR 123: …", Bitbucket
  "(pull request #12)", GitLab "See merge request …!45" in the body —
  attributed to the commit author, which for a squash merge IS the PR
  author. window_stats gains merge_commits / squash_merges /
  merge_strategy (merge-commit | squash | mixed | unknown).
- Deploy frequency, change-failure rate, and rework rate now measure
  squash repos correctly (their revert/fix keyword filters cover the
  squashed subjects too).
- Lead time, PR cycle time, and review rework are merge-record proxies
  that cannot exist without real merge commits: on a squash-strategy
  repo they SKIP with a connector-pointing reason instead of reporting
  a confident number from unrepresentative residue. The MTTR git proxy
  stays included (per its contract) but degrades confidence with an
  explanatory note.
- activeContributors: merge-share now includes squash events, restoring
  the safety valve on squash repos; when a repo has no merge events at
  all, commit-share replaces merge-share so the rule can't degenerate
  to LOC-share-only (the '1 active of 9' collapse).

Validated on the dossier's evidence repo: 19 merges credited to one
maintainer → 114 merge events across the 4 real PR authors,
strategy=squash.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
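
A hedged sketch of the PR-reference matching this commit describes, using the four subject/body formats listed above. The regexes, the `findPrRef` name, and the `PrRef` shape are assumptions for illustration; the collector's actual patterns are not shown here and may be stricter, for example about first-parent position.

```typescript
type Forge = "github" | "azure-devops" | "bitbucket" | "gitlab";

interface PrRef {
  forge: Forge;
  number: number;
}

// Returns the PR reference carried by a non-merge trunk commit, or null.
function findPrRef(subject: string, body: string = ""): PrRef | null {
  let m = /^Merged PR (\d+):/.exec(subject); // Azure DevOps: "Merged PR 123: ..."
  if (m) return { forge: "azure-devops", number: Number(m[1]) };

  m = /\(pull request #(\d+)\)/.exec(subject); // Bitbucket: "... (pull request #12)"
  if (m) return { forge: "bitbucket", number: Number(m[1]) };

  m = /\(#(\d+)\)\s*$/.exec(subject); // GitHub: "Title (#123)"
  if (m) return { forge: "github", number: Number(m[1]) };

  m = /^See merge request \S*!(\d+)\s*$/m.exec(body); // GitLab: in the commit body
  if (m) return { forge: "gitlab", number: Number(m[1]) };

  return null;
}

console.log(findPrRef("Fix login redirect (#123)")); // { forge: 'github', number: 123 }
console.log(findPrRef("Merged PR 123: Fix login redirect")); // { forge: 'azure-devops', number: 123 }
console.log(findPrRef("Tidy README")); // null
```

A commit that matches is then counted as a merge event and credited to its author, which for a squash merge is the PR author.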
Report fixes from the barley validation run:

- Headline VALUES now carry tooltips: the underlying check's evidence
  (how the number was derived) with the check id + status as meta; an
  absent value ('—') explains itself with the check's skip reason
  instead of standing bare.
- Metric-routed SKIPs surface the metric's own reliability note (e.g.
  'squash-merge workflow: no branch merge records…') instead of the
  generic 'required data was not available'.
- Active Contributors tooltip states the real share-based rule (active
  unless BOTH merge-share and LOC-share fall below the 5% threshold),
  not an invented ≥2-commits heuristic.
- Change-failure definition describes what is actually computed (window
  keyword proxy), dropping the unresolved 'within N days' placeholder;
  dependency-count definition drops its SKIP boilerplate.
- Spec coverage tooltip is plain language (no 'check SDD-04' jargon).
- SDD-04 denominator is now MERGED feature work (first-parent merge
  commits + squash-merged PRs, 90d window, first-parent diffs) — on
  repos whose CI deletes branches after merge, live refs undercounted
  badly (barley: 11 surviving branches vs ~280 merged PRs). Live-branch
  evaluation remains only as a fallback for merge-less workflows.

Orchestration cost (profiled: Step 6 hand-patching was ~35 of 47 serial
model turns):

- New engine verb 'patch-judgment <outDir> <patches.json|->' applies
  ALL judgment verdicts in one call — validates ids, refuses
  non-judgment checks, clamps scores, derives weight_awarded, and
  re-aggregates itself. SKILL.md and the repo-auditor agent now mandate
  it (no hand-edited dimension JSONs, no separate aggregate).
- SKILL.md hardens the engine-call budget (one enrich, one
  patch-judgment, one render --format both) and forbids interleaving
  shell processing between Jira pages.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
'sonnet' resolves to the best available Sonnet at run time, so the
harness doesn't silently pin audits to an outdated version id.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…eup polling

A measured run spent most of its 19 minutes in ScheduleWakeup wait
loops polling background Jira/Confluence fetch agents instead of just
making the calls. SKILL.md and the repo-auditor agent now state that
connector fetches are inline parallel tool calls, the judgment
subagent is a single foreground Agent call, and ScheduleWakeup /
background tasks / fallback polling are never used in this skill.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
The old definition — fraction of repos with ≥1 reachable data-source
collector — always read 100% because git is always reachable, so the
portfolio card carried no information. It is now the mean of the
per-repo coverage ratios (awarded ÷ applicable weight): how much of the
current standard the reachable sources could actually score. On the
sample-org run this reads 69% instead of a vacuous 100%.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…esult segments

- SKILL.md org mode: the wait-for-subagents barrier is satisfied by the
  foreground Agent calls returning — never background tasks polled via
  ScheduleWakeup or filesystem checks. A measured org run doubled its
  wall time AND cost that way (9 resume segments, each reloading full
  context uncached).
- Harness: a session split into resume segments emits one stream-json
  result event per segment, and reading only the last one reported a
  94-turn / 18m47s hops run as '9 turns / 78s'. stream_run now
  aggregates across ALL result events (turns/durations summed, cost and
  usage from the cumulative maximum, is_error OR-ed), measures true
  start-to-finish wall_ms itself, and records result_segments so a
  split session is visible in run-meta.json.
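
The aggregation rule above (sum turns and durations, take the cumulative maximum for cost, OR the error flags) might be sketched like this. The field names on `ResultEvent` are assumptions about the stream-json result event, and usage aggregation is left out for brevity.

```typescript
// Assumed subset of one stream-json "result" event.
interface ResultEvent {
  num_turns: number;
  duration_ms: number;
  total_cost_usd: number; // assumed to be cumulative across segments
  is_error: boolean;
}

interface RunSummary {
  turns: number;
  duration_ms: number;
  cost_usd: number;
  is_error: boolean;
  result_segments: number;
}

function aggregateResultEvents(events: ResultEvent[]): RunSummary {
  const empty: RunSummary = {
    turns: 0,
    duration_ms: 0,
    cost_usd: 0,
    is_error: false,
    result_segments: 0,
  };
  return events.reduce(
    (acc, e) => ({
      turns: acc.turns + e.num_turns,                     // summed
      duration_ms: acc.duration_ms + e.duration_ms,       // summed
      cost_usd: Math.max(acc.cost_usd, e.total_cost_usd), // cumulative maximum
      is_error: acc.is_error || e.is_error,               // OR-ed
      result_segments: acc.result_segments + 1,
    }),
    empty,
  );
}
```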

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Mindset shift: Coverage (share of the current standard in place) is what
a reader should judge a project by — Points is the raw material.

- The report headline is now '##.#% Coverage' with Points secondary, in
  both HTML and Markdown; the tooltip cites the standard's provenance:
  'Average software project score among all applicable metrics by
  industry standards on <date>', where <date> is the max last_verified
  across standards.toml categories, stamped into audit.json as
  standards_meta by audit-core.
- Dimensions and Repositories tables swap the Coverage/Points columns
  (coverage leads); per-check Points cells show the ratio too:
  '1.3/8 (16.3%)'; weight-0 descriptor rows show '—'.
- Active Contributors tooltip interpolates the real threshold from
  standards.toml ({threshold} placeholder resolved from standards_meta)
  instead of hardcoding '5% by default' prose.
- Org report parity: per-repo reports under per-repo/ get an automatic
  '← Back to org report' link (detected from the render out-dir); the
  org rollup passes merged source_windows + standards_meta through so
  the org header shows the measurement window; Connections & Sources
  uses the same Connected/Missed template as a per-repo report with
  (n/N) repo counts ('CI runs (3/8)'), and Tech Stack lists org items
  the same way; Repositories column headers all carry tooltips.
- SKILL.md: org audit.json carries source_windows/standards_meta
  through, and 'project' is the portfolio name only — never the inlined
  repo list (the Repositories table already lists them).
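
For illustration, the per-check Points cell format quoted above ('1.3/8 (16.3%)', with a dash for weight-0 rows) could be produced by a helper like this one. The function name is hypothetical; only the output format comes from the commit message.

```typescript
// Formats "awarded/weight (ratio%)" with one decimal; weight-0 descriptor
// rows have no ratio and render as an em dash placeholder.
function pointsCell(awarded: number, weight: number): string {
  if (weight === 0) return "\u2014";
  const pct = ((awarded / weight) * 100).toFixed(1);
  return `${awarded}/${weight} (${pct}%)`;
}

console.log(pointsCell(1.3, 8)); // 1.3/8 (16.3%)
```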

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…ime fix

Org portfolio cards — one weighting, clear names, coverage first:
- All three cards now use the SAME weighting: by active contributors
  per repo (a 40-person repo moves the portfolio average more than a
  2-person one), falling back to equal weights when counts are missing;
  every card's description states which was used.
- Reordered and renamed: 1) 'Standards coverage' (was 'Measurement
  coverage') — the contributor-weighted mean of the per-repo coverage
  headlines, so the org card and the per-repo headline are explicitly
  the same concept; 2) 'Capability score' (now contributor-weighted);
  3) 'Repos with AI tooling' with the plain 'X of Y repositories' count
  in its tooltip. The per-repo headline is renamed to
  '##.#% Standards coverage' too (table columns keep the short
  'Coverage').
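
A minimal sketch of the contributor-weighted portfolio mean with its equal-weight fallback. The `RepoRow` shape is hypothetical, and treating the fallback as all-or-nothing (any missing count switches every repo to weight 1) is an assumption of this sketch.

```typescript
// Hypothetical per-repo row: coverage ratio plus an optional headcount.
interface RepoRow {
  coverage: number;
  activeContributors?: number;
}

interface PortfolioValue {
  value: number | null;
  weighting: "contributors" | "equal";
}

function portfolioCoverage(repos: RepoRow[]): PortfolioValue {
  if (repos.length === 0) return { value: null, weighting: "equal" };
  // Assumed fallback: if any repo lacks a usable count, weight all equally.
  const haveCounts = repos.every(
    (r) => typeof r.activeContributors === "number" && r.activeContributors > 0,
  );
  let totalWeight = 0;
  let weightedSum = 0;
  for (const r of repos) {
    const w = haveCounts ? (r.activeContributors as number) : 1;
    totalWeight += w;
    weightedSum += r.coverage * w;
  }
  return {
    value: weightedSum / totalWeight,
    weighting: haveCounts ? "contributors" : "equal",
  };
}
```

Returning the weighting alongside the value mirrors the requirement that every card's description states which one was used.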

Tooltip quality — written for a non-technical reader:
- Org headline matrix labels ('Merges / active contributor',
  'LOC / active contributor') had no dictionary entry and echoed the
  label as its own tooltip; they and the Repositories column headers
  now carry 2-3 plain sentences each.
- Every metric-label tooltip ends with 'Standard last verified <date>'
  — per-check dates from the category's last_verified (now carried on
  CheckRecord), the overall standards date elsewhere.
- Every Coverage tooltip cites 'industry standards on <date>',
  including the org card and both report headlines.

Misleading skip reasons (QA-session finding): buildSkipReason let the
'connect a <source>' template win over applicability because the
pseudo-source 'audit' counted as connectable — a TypeScript library
showed ~16 checks demanding 'connect a audit source' when the truth was
'this repo has no web app'. Applicability now wins, only real connector
sources (tracker/docs/ci/incident) prompt for a connector, and the
article agrees ('an incident source').
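
The corrected precedence can be sketched roughly as below. This is not the real buildSkipReason: the signature and the `skipReason` name are hypothetical. The connector list and the generic fallback string are taken from the commit messages.

```typescript
// Only real connector sources may prompt for a connector.
const CONNECTOR_SOURCES: readonly string[] = ["tracker", "docs", "ci", "incident"];

const article = (noun: string): string => (/^[aeiou]/i.test(noun) ? "an" : "a");

function skipReason(
  applicable: boolean,
  notApplicableReason: string,
  missingSources: string[],
): string {
  // 1. Applicability wins over any connector prompt.
  if (!applicable) return notApplicableReason;
  // 2. Pseudo-sources such as "audit" are filtered out here.
  const source = missingSources.find((s) => CONNECTOR_SOURCES.includes(s));
  if (source !== undefined) return `connect ${article(source)} ${source} source`;
  // 3. Generic fallback.
  return "required data was not available";
}
```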

Org-mode cycle time (investigated in a separate session): the
repo-auditor agent's 'tools:' frontmatter restricted it to file tools,
so per-repo audits structurally could not fetch Jira/Confluence via MCP
— every org repo showed 'no tracker connector provided' even when the
orchestrator's own probe reached Jira. The restriction is removed (the
agent inherits the full toolset), the connector shape gains an optional
in_progress_at (Jira: expand=changelog) so the cycle-time headline can
actually compute median in-progress→done, and SKILL.md explains the
computation. MTTR behavior confirmed correct (incident connector only).
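
As a sketch of the median in-progress→done computation: the field names `in_progress_at` and `resolved_at` come from the connector shape described above, while the `Ticket` type, the function name, and the choice to report days are assumptions.

```typescript
interface Ticket {
  in_progress_at?: string; // ISO timestamp, optional on the connector shape
  resolved_at?: string;
}

const MS_PER_DAY = 86_400_000;

// Median of in_progress_at -> resolved_at, in days; null when no ticket
// carries both timestamps (the row should then be gated, not scored).
function medianCycleDays(tickets: Ticket[]): number | null {
  const days: number[] = [];
  for (const t of tickets) {
    if (!t.in_progress_at || !t.resolved_at) continue;
    const ms = Date.parse(t.resolved_at) - Date.parse(t.in_progress_at);
    if (Number.isFinite(ms) && ms >= 0) days.push(ms / MS_PER_DAY);
  }
  if (days.length === 0) return null;
  days.sort((a, b) => a - b);
  const mid = Math.floor(days.length / 2);
  return days.length % 2 === 1 ? days[mid] : (days[mid - 1] + days[mid]) / 2;
}
```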

SKILL authoring hygiene: org reach 'ai_tooling' summarises with counts
only (never repo-name enumeration); recommendations must cite the check
they actually remediate; every authored ratio renders as a percentage
with one decimal, never a raw float.

Harness: claude -p sessions see the operator's user-scope MCP servers
even with --setting-sources project, so a test audit could pull live
Jira data. The harness now passes --strict-mcp-config by default
(--allow-user-mcp opts out).

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…s --target

The harness previously died on a non-git --target, so org-mode audits
could only be launched by hand — bypassing the blank-slate phase prep
and engine-compliance guard (which is how a same-day rerun silently
re-presented a previous engine's artifacts). Org folders are now
first-class: children with .git are enumerated, and provenance records
the repo count instead of a target commit.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…confidently

A broken measurement must read as "couldn't measure", never as a measured
verdict. audit_core: a metric that throws or an unknown metric= id now routes
its categories to SKIP with a metric-error note and confidence 0 (was: silent
FAIL at confidence 1 with empty evidence); aggregate() treats any finite score
as explicit so a patched score of 0 is no longer re-inflated to the status
default; patchJudgments rejects statuses outside PASS/WARN/FAIL/SKIP;
unparseable dimension JSON and unreadable prior audit.json are logged instead
of silently dropped; corrupted collector artifacts report "unreadable", not
"not found"; coverage is null (not 0) when no weight is applicable; CI platform
naming defers to ciPlatformName (single source of truth); reliability
confidence unified on HIGH/MED/LOW.
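
The "couldn't measure" routing might look like the following sketch. The `MetricResult` shape, the registry type, and the exact note wording are hypothetical. The behaviour shown (SKIP at confidence 0 on a throw or unknown id, null coverage with no applicable weight) follows the description above.

```typescript
type Status = "PASS" | "WARN" | "FAIL" | "SKIP";

interface MetricResult {
  status: Status;
  score: number | null;
  confidence: number;
  note: string;
}

function skipWithError(detail: string): MetricResult {
  return { status: "SKIP", score: null, confidence: 0, note: `metric-error: ${detail}` };
}

// A throwing metric or an unknown id becomes SKIP, never a confident FAIL.
function runMetric(id: string, registry: Map<string, () => MetricResult>): MetricResult {
  const metric = registry.get(id);
  if (!metric) return skipWithError(`unknown metric id "${id}"`);
  try {
    return metric();
  } catch (err) {
    return skipWithError(err instanceof Error ? err.message : String(err));
  }
}

// Coverage is null (not 0) when no weight is applicable.
function coverage(awarded: number, applicable: number): number | null {
  return applicable > 0 ? awarded / applicable : null;
}
```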

git collector: collect() probes `git rev-parse --git-dir` first and emits
available:false with the real error on broken environments (was: available:true
with all-zero stats scored at full confidence); run() classifies expected-empty
exit 1 (silent) vs fatal 128/ENOENT/EACCES/ENOBUFS (logged + artifact flipped
unavailable); commit-less repos no longer spam breadcrumbs. AI-attribution
grep runs under --extended-regexp so the (Windsurf|Cascade) alternation
matches. ci collector: connector-only path no longer claims a CI config was
detected. languages: evidence labels derive from the extensions actually
matched. New Bitbucket "(pull request #N)" squash fixture.
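
A small sketch of the exit classification described for run(). The `ProcResult` shape and the function name are hypothetical, and treating every other non-zero exit code as fatal is an assumption; the commit message only names exit 1, exit 128, and the three errno codes.

```typescript
type Outcome = "ok" | "empty" | "fatal";

// Hypothetical summary of a finished child process.
interface ProcResult {
  exitCode: number | null;
  errorCode?: string; // errno-style code from a failed spawn, if any
}

const FATAL_ERRNO = new Set(["ENOENT", "EACCES", "ENOBUFS"]);

function classifyGitExit(r: ProcResult): Outcome {
  if (r.errorCode !== undefined && FATAL_ERRNO.has(r.errorCode)) return "fatal";
  if (r.exitCode === 0) return "ok";
  if (r.exitCode === 1) return "empty"; // expected-empty: stay silent
  return "fatal"; // 128 and, by assumption, anything else: log and mark unavailable
}
```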

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Empirically verified false signals, each now pinned by a regression test:

- .env pattern no longer matches process.env/os.environ (AIS-07 false PASS).
- gitignore check accepts .env* / .env*.local (Next.js/CRA defaults).
- Dead \b-before-@ regexes fixed so
  @app.route/@RestController/express()/FastAPI() register (DOC-03 wrong
  SKIP, validation/rate-limit detectors).
- catch-opener matches K&R `} catch (`.
- PEP 508 env markers no longer read as the literal "$1" (unpinned).
- RAW_SQL requires SQL continuations so UI copy like "Delete item" stops
  failing ARCH-04.
- Mutation routes require a router-ish receiver (cache.delete/axios.post
  excluded).
- RSpec/Jasmine spec/ dirs no longer earn SDD-04 spec-driven credit.
- Prose "go"/"node" no longer count as tech mentions (SDD-03).
- Import-graph layering WARNs on 1 violation and FAILs on 2+ as documented.
- Single-token filenames count as compatible with every lowercase
  convention.
- Single-service repos SKIP the per-service-README check.
- Detached-HEAD pseudo-branch entries filtered.
- Schema/namespace hosts exempt from the TLS origin count.
- md5/sha1 flag requires password context, not any "hash".
- pandas/numpy alone no longer classify a repo as an ML project.

iterFiles: 512 MB maxBuffer (ENOBUFS on big monorepos silently flipped
topology flags off), path-containing globs now work via find -path, and the
promised JS-walk fallback exists. makeResult clamps score/confidence to [0,1].
Message-less assert.ok calls in the PAI/QA test files now name their contracts.
Unused RANGED_RX deleted.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…e/MTTR

Three metrics hardcoded score 1.0 and awarded full weight regardless of the
measured value: review rework (DF-05) now bands avg commits/PR, issue
throughput (ADP-18) scores 0 at zero resolved and bands resolved/week,
pipeline duration (ADP-16) SKIPs on null and bands the duration. CI pass rate
guards empty runs (0/0 NaN poisoned audit_total). MTTR keeps the git proxy's
minimal reliability even when an incident source exists — the value never came
from incident data. Sub-task split averages over all parent-eligible tickets
so the best-case anchor is reachable and one epic can't dominate. Doc coverage
loses its 0.8/0.6 award cliff (continuous score modulates the weight). Tooling
depth path matching is boundary-safe (.awos-legacy no longer matches .awos).
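
Boundary-safe path matching can be illustrated with a segment comparison instead of a prefix or substring test. The helper name is hypothetical, and the engine may implement the boundary check differently.

```typescript
// Compares whole path segments, so ".awos-legacy" cannot match ".awos".
function pathHasSegment(filePath: string, segment: string): boolean {
  return filePath.split(/[\\/]/).includes(segment);
}

console.log(pathHasSegment(".awos/commands/spec.md", ".awos"));        // true
console.log(pathHasSegment(".awos-legacy/commands/spec.md", ".awos")); // false
```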

Shared readArtifact() in metrics/_base.ts: every metric now degrades to SKIP
with the parse error in its note when a collector artifact is corrupt, instead
of throwing into a confident FAIL. AST init failures carry the real error into
the SKIP note. categories_awarded typed number[].

Org report: rollup carries each repo's cycle_time and mttr display values into
the portfolio per_repo rows (the Repositories table rendered these columns
100% blank because the fields were never populated). PR cycle time prefers
tracker tickets with changelog-derived in_progress_at -> resolved_at intervals
and falls back to the git branch-lifetime proxy — which also rescues
squash-merge repos; connector-shapes.md documents the field.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…isibility

Markdown delivery table no longer prints the literal "undefined" for a
valueless ungated row; the evidence tooltip drops its double HTML-escape
(tip() already escapes); a new mdCell() helper escapes pipes/newlines in every
untrusted Markdown table cell so an LLM-authored title can't corrupt the
table; the org Repositories table renders the newly-carried cycle_time/mttr
values; Merges/LOC tooltips say "per week" to match the displayed rate; the
header schema comment documents every top-level field the orchestrator must
preserve and points at the design doc that exists.
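
A minimal sketch of an mdCell()-style helper, assuming it escapes pipes and flattens newlines to a single space (the commit message does not say what newlines become) and renders missing values as an empty string rather than the literal "undefined".

```typescript
// Makes an untrusted value safe to place inside a Markdown table cell.
function mdCell(value: unknown): string {
  if (value === undefined || value === null) return "";
  return String(value)
    .replace(/\|/g, "\\|")   // a raw pipe would start a new column
    .replace(/\r?\n/g, " "); // a raw newline would end the table row
}
```

Usage: `| ${mdCell(title)} | ${mdCell(value)} |` keeps a hostile or LLM-authored title inside its own cell.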

PENDING_JUDGMENT is counted and surfaced explicitly in both report formats
(amber chip in HTML, suffix in MD) instead of masquerading as SKIP — an
unpatched headless run is now visibly unfinished. Unknown statuses warn on
stderr. cli.ts: local AuditJson duplicate renamed to a parse-boundary
ParsedAudit; usage strings list all 11 verbs.

New adversarial escaping suite (hostile strings through evidence, insights,
recommendations, repo names — pins single-escaping and intact MD tables) and
patch-judgment/render CLI dispatch tests (stdin "-", invalid JSON, bad
--format, missing --out-dir).

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Nothing typechecked the engine (tsx and esbuild both strip types), and the
JSON-artifact shapes were declared independently in audit_core, render, cli,
and org_rollup — with live drift: render's status union lacked INFO and
PENDING_JUDGMENT, cli redeclared AuditJson, render read undeclared fields
through index-signature casts. `npm run typecheck` (tsc --noEmit) now runs in
CI next to the dist gate, and a new artifact_types.ts is the single source for
CheckStatus, reliability vocabularies, Check/DimensionArtifact/AuditJson/
PerRepoSummary/OrgConnections (type-only imports, zero bundle cost).

The typecheck immediately caught a real bug: `cli.js collect tracker <repo>`
passed the git tunables object as the connector payload, fabricating an
available:true tracker artifact with zero tickets; the collector registry is
now typed so only git receives its options.

CI dist gate stages the directory first (git add -A + diff --cached) so an
untracked build output — a newly bundled grammar — can no longer slip past
the sync check. dist/ rebuilt from the fixed sources.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
plugins/awos/README.md described the removed architecture — per-dimension
context windows, a PASS/WARN/FAIL deduction table, A-F grades, an
8-dimension list; rewritten from the engine model (13 dimensions, additive
weighted scoring, JSON artifacts, both reports always rendered). CLAUDE.md
corrected: patch-judgment added to the verb list, dist gate described as
currently non-blocking, injection described truthfully (present but does not
fire in plugin skills — Step 5 is the mechanism), four test layers, no more
"3-tab"/per-dimension-phase/history annotations.

ai-sdlc-adoption.md check ids now match what the engine emits (ADP-01..06,
ADP-14..25 — the doc used a private ADP-G*/C*/I* scheme a report reader could
never find). SKILL.md: contiguous step numbering, "no per-dimension fan-out"
qualified (org repo-auditors and the judgment pass are sanctioned), one bold
rule kept, org headline figures copied from the rollup so the header can't
disagree with the portfolio cards. Judgment arrays are written inside the
audit output dir — repo-auditor.md previously had every concurrent org
auditor share /tmp/judgments.json, letting one repo's verdicts patch another.
Dissolved end-to-end-delivery/security references replaced; detectors/ and
metrics/ READMEs match the real registry and result shapes; ORG-02/ORG-03
standards definitions match the contributor-weighted rollup; harness README
documents the per-repo/<repo>/ layout, org-folder --target, --model and
--allow-user-mcp.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
A claude run that exited non-zero or errored printed "run complete" and
exited 0 while the fallback render polished its partial artifacts; it now
prints an INCOMPLETE banner, records partial:true and judgments_patched in
run-meta, and exits 1. Engine compliance counts actual audit-core tool_use
events (plus the executed-injection marker) instead of substring hits the
prompt text itself satisfies. Marketplace repoint failures die with the
captured stderr and restore failures warn loudly; restore writes each config
file's own original value instead of km_source into both. Org repo count
globs per-repo/*/audit.json (was per-repo/*.json — always 0).

standards-linkcheck classifies a deep link that redirects to the site root as
DEAD (this, the most common way doc links die, previously passed). build-engine
exits 1 when the core wasm or any bundled grammar is missing instead of
warning and shipping a broken bundle. lint-prompts assertions that could not
fail now pin real content: the cli render invocation, the no-individual-
attribution privacy sentence, additive weighted points, org-portfolio.json.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
The pr-review/pr-comments-address skills write their drafts and plans to
review/; ignore the folder so a local review trail can't ride into a PR.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
… honest gated notes, org-parent guard

Four org-report defects from the provectus-barhopping runs, root-caused against
the live Atlassian MCP:

Gated rows lied about the cause: "— (needs ticketing connector)" rendered even
when Jira was connected but tickets lacked status-transition history.
DeliveryMetric gains an optional note rendered as "— (<note>)"; SKILL.md
instructs authoring the real state (history not fetched / zero resolved in
window / partial fetch) and forbids the needs-connector default when a tracker
was reachable. Cycle time and MTTR now carry reader-grade tooltips on the
headline rows and the org Repositories column headers, explaining exactly what
data the metric needs and why it can be gated with a tracker connected. The g5
squash-repo SKIP reason names the changelog gap precisely.

The Jira MCP provides everything the metrics need — the prompts just never
asked for it: searchJiraIssuesUsingJql caps at 100/page (the "exactly 100
tickets" and the run-to-run count drift) and paginates via nextPageToken;
changelogs come only from per-issue getJiraIssue(expand: "changelog").
connector-shapes.md now carries the concrete recipe: paginate to completion
with stable ordering + computeIssueCount, then a parallel changelog pass over
the ~50 most recently resolved tickets, mapping in_progress_at to the first
transition into an In-Progress-CATEGORY status (verified: tickets go
Backlog -> To Do -> In Review -> Done without a literal "In Progress").
Tracker artifacts must record a fetch_meta completeness block; every
tracker-consuming metric surfaces "partial tracker fetch: X of Y tickets" in
its reliability note. repo-auditor.md states both requirements inline so org
runs stop shipping single-page fetches.
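
The in_progress_at mapping and the partial-fetch note could be sketched as follows. The `Transition` record is a hypothetical, already-flattened view of a ticket changelog (it is not the Jira response shape), and both function names are illustrative.

```typescript
// Hypothetical flattened changelog entry: one status transition.
interface Transition {
  at: string; // ISO timestamp
  toStatus: string;
  toCategory: "todo" | "in_progress" | "done";
}

// First transition into any In-Progress-CATEGORY status, whatever its name.
function inProgressAt(transitions: Transition[]): string | undefined {
  const sorted = [...transitions].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  return sorted.find((t) => t.toCategory === "in_progress")?.at;
}

// Reliability note for an incomplete fetch; undefined when complete.
function partialFetchNote(fetched: number, total: number): string | undefined {
  return fetched < total
    ? `partial tracker fetch: ${fetched} of ${total} tickets`
    : undefined;
}
```

Matching on the category rather than the literal status name is what lets a Backlog -> To Do -> In Review -> Done workflow still yield a start time.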

judgments_patched=false root cause: the pre-scope audit-core pass audited the
non-git org PARENT folder, leaving an unpatchable stray audit at the output
root. audit-core now detects an org parent (not a work tree, >=2 immediate git
children), prints why, and exits 0 without writing artifacts.
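
A simplified sketch of the org-parent test. It approximates "not a work tree" by checking for a `.git` entry, which is an assumption made for brevity (a directory nested inside a work tree would be misjudged, and the engine may well ask git itself). The function name is hypothetical.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// True when `dir` has no .git entry of its own but at least two immediate
// child directories that do.
function isOrgParent(dir: string): boolean {
  if (fs.existsSync(path.join(dir, ".git"))) return false;
  const gitChildren = fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((e) => e.isDirectory() && fs.existsSync(path.join(dir, e.name, ".git")));
  return gitChildren.length >= 2;
}
```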

Org headline reach.contributors is pinned to the single-repo shape
"<active> active (of <total> in window, 90d)" (sums across repos) instead of
the improvised "39 active across 8 repos".

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
What the skill does and the engine/orchestrator split, standards.toml as the
scoring model in data, headless testing via the audit-test-harness, and the
maintenance paths (standards-refresh skill, adding dimensions/metrics, prompt
edits, versioning).

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
Root scripts/ is in the installer copy table (scripts/ -> .awos/scripts/,
overwritten on update), so build-engine.mjs and the standards-linkcheck pair
were being copied into every user's project despite being maintainer-only
tooling — users run the prebuilt dist/ and never lint standards.toml. Moved
all three to tools/ai-readiness-audit/ (dev-only, never copied), fixed their
internal repo-root/default-path resolution for the new depth, and wired the
linkcheck tests into npm test as test:audit-tools (previously they matched no
test glob and only ran by hand). create-spec-directory.sh stays — it is
genuine framework product.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
run_audit_test.py and compare_audit_runs.py become run_audit_test.ts /
compare_audit_runs.ts (plus harness_lib.ts for the pure helpers), run via the
repo's existing tsx toolchain — same language as the engine, shared shapes,
one less runtime prerequisite. All CLI flags, the marketplace
repoint/finally-restore, phase seeding/stashing, the engine-compliance guard
with retry/salvage, partial-run detection, token accounting across result
segments, and the archive/run-meta layout are ported unchanged so old and new
runs stay comparable. One deliberate fix over a literal port: die() throws
instead of process.exit, which in Node would skip the finally block and leave
the marketplace pointed at a worktree.
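
The die()-throws fix can be illustrated in miniature. The names `FatalError` and `runWithRestore` are hypothetical; the point is only that a thrown error unwinds through `finally`, whereas `process.exit` would terminate before the restore runs.

```typescript
class FatalError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FatalError";
  }
}

// Throws instead of calling process.exit, so enclosing finally blocks run.
function die(message: string): never {
  throw new FatalError(message);
}

// Runs `work`, always runs `restore`, and returns a process exit code.
function runWithRestore(work: () => void, restore: () => void): number {
  try {
    work();
    return 0;
  } catch (err) {
    if (err instanceof Error && err.name === "FatalError") {
      console.error(err.message);
      return 1;
    }
    throw err; // unexpected errors propagate (restore still runs first)
  } finally {
    restore();
  }
}
```

A caller would set `process.exitCode` from the return value at the very top level, after all cleanup has finished.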

New: a live progress log ([MmSSs]-stamped Bash/Agent tool events, the skill's
progress emissions, a 60s heartbeat; --quiet to silence), and a final summary
block with wall time (NmSSs), token counts, cost in dollars, turns,
compliance/judgments verdicts, and the absolute archived report.html path(s)
(org + per-repo). run-meta.json gains report_html. Invoke via npm run
audit:test / audit:compare; pure helpers covered by tools/audit-test-harness/
harness.test.ts wired in as test:harness.

Claude-Session: https://claude.ai/code/session_014UpLfgFPaACkrEUuk7CGiF
…anaged artifacts

The headless orchestrator repeatedly skipped the deterministic audit-core
pass and hand-computed the audit (barley 2026-07-03: 3 attempts, ~45 min,
~3x cost), citing SKILL.md's dead load-time pre-run narrative to justify it.
Fix the prompt at the root and make the skip impossible to complete:

- SKILL.md: delete the never-executing load-time !`...` injection and every
  "run audit-core only if audit.json is missing" conditional; Step 5 is now
  the unconditional first scoring action (pre-existing audit.json is stale
  output to overwrite), with the prohibition stated at the decision point.
  Frontmatter disallows Edit/NotebookEdit/ScheduleWakeup while the skill is
  active (verified enforced for plugin skills on Claude Code 2.1.199).
- Engine provenance: audit-core stamps audit.json and every dimension JSON
  with engine.generated_by; patch-judgment, patch-report, and render refuse
  an unstamped single-repo audit, and rollup skips unstamped per-repo audits
  — a hand-assembled audit cannot become a report.
- Artifacts are engine-managed end to end: new report-context verb emits the
  flattened authoring context (check values/hints, git window stats, tracker
  fetch meta) and new patch-report verb merges the authored headline/
  insights/recommendations into audit.json and emits recommendations.md from
  the same array; the audit-core/enrich summary lists pending_judgment_checks.
  The orchestrator authors only judgments.json and report-blocks.json — no
  inline-script inspection or direct artifact edits remain in the flow.
- Tests: provenance regression suite, patch-report/report-context CLI tests,
  lint contracts pinning the no-pre-run/stale/provenance/disallowed-tools
  wording; dist/ rebuilt.
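
The provenance guard might reduce to a check like this sketch. Only the `engine.generated_by` field name comes from the commit message. The value it holds, the error wording, and the function name are assumptions.

```typescript
// Minimal view of audit.json: only the provenance stamp matters here.
interface StampedAudit {
  engine?: { generated_by?: string };
}

// Downstream verbs call this first; an unstamped audit never reaches a report.
function assertEngineStamped(audit: StampedAudit, verb: string): void {
  if (!audit.engine?.generated_by) {
    throw new Error(`${verb}: refusing unstamped audit.json (run audit-core first)`);
  }
}
```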

Verified: compliance smoke 3/3 PASS (audit-core called, provenance intact,
judgments patched, reports rendered, zero hand-compute/hand-writes/fan-out).

Claude-Session: https://claude.ai/code/session_019HGet8jGBoU9gT5wrYZ2Fp
…readiness-audit/qa

- Move the QA harness from tools/audit-test-harness/ to
  tools/ai-readiness-audit/qa/ so one folder holds both the engine build
  tooling and the QA tooling for the one audit command.
- New compliance_smoke.ts (npm run audit:smoke): N headless claude -p runs
  against a tiny generated fixture repo, verdict per run from hard signals
  only — audit-core invoked, provenance stamp present, judgments patched,
  reports rendered (not hand-written), no python/node inline compute, no
  scoring-JSON hand writes (judgments.json/report-blocks.json exempt), no
  per-dimension fan-out, no stall-on-question. Fail-fast by default: stop at
  the first failing run instead of paying for the same failure again
  (--keep-going for deliberate rate measurement).
- Full harness: engine-skip retries now append a corrective system prompt
  (a bare relaunch re-confabulates the same skip from leftover artifacts);
  the spoofable injected_audit_core compliance signal is removed — only an
  actual audit-core Bash invocation counts.
- Marketplace repoint helpers extracted to harness_lib.ts (shared by the
  harness and the smoke tool); unit tests for the new transcript signals.

Claude-Session: https://claude.ai/code/session_019HGet8jGBoU9gT5wrYZ2Fp
…urce-probe transparency

Root cause of the missing Cycle time: harness isolation (--strict-mcp-config)
stripped ALL MCP servers, and a separate 2026-07-02 run fetched 994 Jira
tickets without changelogs yet still rendered the default "needs ticketing
connector" next to "Connected: Jira via Atlassian MCP". Three fixes, one
principle — the audit assesses the project, not the auditor's environment:

- Project-scope MCP discovery (harness): the target's own declared servers
  (.mcp.json / mcp.json / .vscode/mcp.json / .cursor/mcp.json, org folder +
  every repo subdir, VS Code {servers} shape normalized, collisions suffixed)
  are merged and passed back explicitly via --mcp-config, which
  --strict-mcp-config honors. User-scope servers stay excluded by design.
- Engine-derived gated rows: audit-core/enrich/aggregate compute
  audit.derived_delivery (cycle-time median from tracker tickets'
  in_progress_at→resolved_at, plus the honest gated note like "Jira connected
  — per-ticket status history not fetched") and the renderer appends those
  rows, ignoring authored ones — the headline can never contradict the
  Connections & Sources section again.
- Source-probe transparency: a new source_probes report block (patch-report)
  records what was searched per unreachable source (mcp configs, CLIs, auth
  state) and renders into "Missed / limited", replacing the bare "supply a
  connector" hint.
- CLI channels (skill): acli/gh/glab are sanctioned measurement channels —
  gh run list fills collected/ci.json (barley now scores real pipeline
  metrics), code-host issues can serve as a minimal tracker, and the acli
  Jira recipe derives the project key from commit-message ticket prefixes;
  recipes in connector-shapes.md → "CLI channels".
- Harness robustness: marketplace repoint snapshots/writes the whole source
  object (a github-shaped entry mutated only via .path produced a
  "corrupted installLocation" rejection); smoke exempts sanctioned
  collected/*.json connector writes.
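
The project-scope MCP merge could be sketched as below, assuming each discovered file has already been parsed. The `-2`, `-3` collision suffix and the `{ mcpServers }` output shape are assumptions of this sketch, not confirmed harness behaviour.

```typescript
type ServerDefs = Record<string, unknown>;

// A parsed config file: either the `mcpServers` shape or VS Code's `servers`.
interface McpConfigFile {
  mcpServers?: ServerDefs;
  servers?: ServerDefs;
}

function mergeMcpConfigs(files: McpConfigFile[]): { mcpServers: ServerDefs } {
  const merged = new Map<string, unknown>();
  for (const file of files) {
    const defs = file.mcpServers ?? file.servers ?? {}; // normalize the shape
    for (const [name, def] of Object.entries(defs)) {
      let key = name;
      // Assumed scheme: suffix a counter so colliding names both survive.
      for (let n = 2; merged.has(key); n++) key = `${name}-${n}`;
      merged.set(key, def);
    }
  }
  return { mcpServers: Object.fromEntries(merged) };
}
```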

Verified on barley: CI connected via gh (200 runs), tracker/docs missed with
full probe trails in the report, derived_delivery consistent with sources,
engine-compliant run end to end.

Claude-Session: https://claude.ai/code/session_019HGet8jGBoU9gT5wrYZ2Fp
…harness

Net -835 lines with zero behavior change (1288 tests green). Applies the
findings of the four-angle quality review (reuse, simplification,
efficiency, altitude):

Shared infrastructure
- metrics/_base: skipMetric() replaces ~45 copy-pasted SKIP stanzas;
  makeMetricResult tail folded into an options object; squash-merge
  circuit-breaker note/reliability shared; readArtifact memoized
  (mtime+size) so git/tracker JSON parses once per pass, not 11x/6x;
  evaluateAppliesWhen() is the single applies_when interpreter.
- metrics/_score: shared median/mean/round1; clamp01 reused everywhere.
- detectors/_base: per-(repo, ignore-set) cached file listing replaces
  ~200-300 find spawns per pass; readTextSafe() replaces ~50 hand-rolled
  try/readFileSync blocks; hasMatch() gives boolean greps an early exit;
  DetectorResult.status typed as the PASS/WARN/FAIL/SKIP union.
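
A rough sketch of an (mtime+size)-memoized artifact reader. It illustrates the caching idea only: the real readArtifact also degrades to SKIP on parse errors, which is omitted here, and the cache layout is hypothetical.

```typescript
import * as fs from "node:fs";

interface CacheEntry {
  mtimeMs: number;
  size: number;
  value: unknown;
}

const artifactCache = new Map<string, CacheEntry>();

// Parses a JSON artifact once per (path, mtime, size); later calls in the
// same pass reuse the parsed value instead of re-reading the file.
function readArtifact(file: string): unknown {
  const st = fs.statSync(file);
  const hit = artifactCache.get(file);
  if (hit && hit.mtimeMs === st.mtimeMs && hit.size === st.size) return hit.value;
  const value: unknown = JSON.parse(fs.readFileSync(file, "utf8"));
  artifactCache.set(file, { mtimeMs: st.mtimeMs, size: st.size, value });
  return value;
}
```

Keying on both mtime and size is a cheap staleness check; a rewrite that keeps both identical would be served from cache, which is acceptable within a single audit pass.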

audit_core / cli
- sources/source_windows/topology derive from one parsed-artifact map
  shared by auditCore and aggregate (the two copies had already drifted);
  patch verbs (aggregate, patch-judgment, patch-report, report-context)
  split into audit_patch.ts; dimensionFiles() iterator shared.
- Registries move to detectors/index.ts and metrics/index.ts; cli.ts is
  dispatch-only (fail()/readJsonArg()/standardsTomlPath() helpers, merged
  audit-core/enrich cases, static fs imports).
- metric verb derives its collectors from standards.toml sources instead
  of hardcoded id prefixes; org rollup reader moves to
  metrics/rollup_input.ts with the delivery check-id table derived from
  DELIVERY_SPECS and AI-tooling codes derived from standards.

check_id single source of truth
- Every standards.toml category now carries check_id (62 injected from
  the dimension .md headings); runtime md parsing (parseCheckIds) is
  deleted; Layer-1 lint enforces presence and md<->toml agreement.

Perf
- enrich reuses repo-derived checks (detectors, AST metrics) from the
  per-dimension artifacts and re-scores only connector-affected
  categories: an enrich pass drops from a full re-audit to ~50ms on a
  small repo.
- git collector: latestCommitDate computed once per collect (was 3
  spawns); one all-history squash-merge scan folded in memory for both
  the unbounded and windowed consumers (was 2 full log passes).
- topology: duplicate flag expressions computed once; boolean code
  greps early-exit.

render
- renderHtml's ~940-line closure split into top-level section functions;
  md/html micro-duplications share helpers; canonical COLLECTOR_SOURCES
  lives in artifact_types.ts.

Tests / QA harness
- Shared fixture factories in tests/helpers.ts (gitRaw, trackerArtifact,
  makeCheck/makeDim/makeAudit, makeCheckRecord, tmpDir, gitAs); the
  ~59 repeated git-raw literals collapse; the two real-repo auditCore
  shape tests share one run.
- QA harness: single stream-json transcript walker, shared claude-spawn
  wrapper, one-spawn salvage render (--format both), performRun()
  replaces 8 mutable outer variables, dead exports dropped.

CLAUDE.md updated: adding a dimension is four touch points (md, toml,
detector module, registry index), and enrich's reuse semantics are
documented.

Claude-Session: https://claude.ai/code/session_01Ld2EFkQ3DuoFGfvXXqLKZF
…r, short gated cycle-time note

Findings from the 2026-07-03 provectus-barhopping org QA run:

- Org report rendered a 96-row Dimensions table (12 dims x 8 repos). Root
  cause was SKILL.md itself: the org-assembly step instructed the model to
  include "dimensions (aggregated dimension data from all per-repo audits)"
  in org-portfolio.json, and the renderer rendered whatever it was given.
  SKILL.md now forbids the key, AuditJson.dimensions is optional (absent on
  org portfolio JSON), and both renderers ignore top-level dimensions in
  org mode - an injected concatenation can no longer become a report table.

- Same run audited repos 2-3x: three repos got two repo-auditor subagents
  each, and the orchestrator additionally ran audit-core/enrich for five
  repos in its own context. SKILL.md org branch now states each repo is
  audited exactly once by exactly one subagent, the orchestrator never runs
  the engine itself in org mode, and re-dispatch is allowed only for a repo
  whose audit.json is missing after its subagent returned.

- Per-repo HTML back-link now targets ../../report.html#repos (the org
  Repositories heading carries id="repos"), returning the reader to the
  table they navigated from instead of the org report top.

- A gated tracker headline row with a connected-but-unmeasurable tracker
  now shows the short "- (no tickets data)" placeholder; the full
  explanation moved to the value tooltip (HTML) and the Connections &
  Sources tracker line (both formats).

Tests: org-mode no-Dimensions contract (injected dims ignored; renders
without the key), #repos anchor + per-repo back-link, short-placeholder +
full-note placement, hostile-escaping fixture split into single-repo and
org variants. dist/ rebuilt.

Claude-Session: https://claude.ai/code/session_01Ld2EFkQ3DuoFGfvXXqLKZF
…pecheck

The Missed/limited probe-log fixture omitted history_available_days, which
SourceSummary requires; local runs passed because tsx does not typecheck,
but CI's tsc --noEmit gate does.

Claude-Session: https://claude.ai/code/session_01Ld2EFkQ3DuoFGfvXXqLKZF