Skip to content

feat(provider): expose --kv-bits / --max-context / --max-batch serve knobs for autoresearch#105

Merged
Augustas11 merged 1 commit into
mainfrom
feat/provider-serving-knobs
Jun 18, 2026
Merged

feat(provider): expose --kv-bits / --max-context / --max-batch serve knobs for autoresearch#105
Augustas11 merged 1 commit into
mainfrom
feat/provider-serving-knobs

Conversation

@Augustas11

Copy link
Copy Markdown
Owner

Summary

Adds three new serve flags so beta/autotune.py (PR #103) can
hill-climb a real per-machine recipe space, not just --model:

Flag Default mlx-swift wire site
--kv-bits {4,8} unset (no quantization) GenerateParameters.kvBits — both complete() and stream()
--max-context <N> per-tier (8GB:20000, 16GB:50000, 32GB:120000, 64GB+:200000) GenerateParameters.maxKVSize + existing validatePromptTokenCount 413 gate
--max-batch <N> 1 (preserves prior single-slot behavior) AsyncSemaphore(value: maxBatch) in ModelRuntime

All three are triple-exposed (CLI > env > YAML > default), matching
the house convention used by every other flag in ServeCommand. A
new runServingKnobsPreflight fails loud at serve start on invalid
values (--kv-bits 7, --max-context 0, etc.) instead of leaking to
mid-inference.

Why

The reference autoresearch repo's "different recipe per machine"
insight depends on tuning config per machine, not just the model.
Today's autotune loop can only flip --model; this PR widens its
search to KV-cache precision, served context window, and concurrency.

Bug fix folded in

ServeCommand.run() was hardcoding maxConcurrencyOverride: 1 when
building ProviderCapacity, silently ignoring the resolved config.
Capacity now reflects the actual --max-batch value.

Flags shipped vs dropped

All three target knobs shipped — verified against the checked-out
mlx-swift-examples 2.29.1 in .build/checkouts/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:54-105,
which exposes GenerateParameters.kvBits, maxKVSize, and accepts
the existing semaphore-gated batch model. Nothing was dropped.

mlx-swift wiring approach

  • --kv-bits: passed as GenerateParameters.kvBits at both the
    non-streaming (ModelRuntime.complete()) and streaming
    (ModelRuntime.stream()) generation sites. nil keeps the mlx-swift
    default of unquantized KV cache. Group size and quantizedKVStart
    use mlx-swift defaults (64 / 0) — exposed as knobs would widen the
    search prematurely without operator evidence.
  • --max-context: continues to enforce the existing 413
    context_length_exceeded gate (ModelRuntime.validatePromptTokenCount).
    Additionally passes maxKVSize=maxContextTokens to
    GenerateParameters so the RotatingKVCache bounds the allocation —
    this is what gives KV-cache memory pressure relief on small-RAM
    nodes.
  • --max-batch: changes inferenceGate = AsyncSemaphore(value: 1) to
    AsyncSemaphore(value: max(1, maxBatch)). The existing
    acquire/withPermit pattern is unchanged; the actor already handles
    multi-permit safely.

Test additions

Tests/macprovider-cliTests/ServingKnobsConfigTests.swift adds 21
test cases:

  • Defaults preserved when all knobs omitted (regression).
  • CLI > env > YAML precedence for each of the 3 flags.
  • YAML-only resolution for each (the most common ops path).
  • Preflight rejects --kv-bits 5, accepts 4 and 8 and nil.
  • Preflight rejects --max-context 0 and --max-batch 0.
  • Runtime threading: each knob reaches ModelRuntime via test
    accessors (kvBitsOverrideForTest, maxBatchForTest,
    maxContextTokensForTest).
  • --max-context boundary: oversize prompts still throw the
    documented 413 context_length_exceeded APIError; at-boundary
    prompts pass.

Full suite: 219 → 240 tests, all green.

Out of scope

Per-request --kv-group-size / --quantized-kv-start knobs are
intentionally not exposed; they sit below the autotune's noise
floor today and would widen the search without operator evidence.
Adding them later is mechanical (same call site, same plumbing
pattern).

Test plan

  • swift build -c release --product macprovider-cli clean
  • swift test — 240 / 240 passing
  • macprovider-cli help serve lists --kv-bits, --max-context, --max-batch
  • MACPROVIDER_KV_BITS=7 macprovider-cli serve ... rejects with documented error
  • No coordinator/gateway/Pearl-side changes
  • No live deploy — build + tests only on the 8GB Air

Related

🤖 Generated with Claude Code

…knobs for autoresearch

Widens the autoresearch search space for beta/autotune.py (PR #103)
beyond just --model. All three knobs are real downstream wiring into
mlx-swift 2.29.1, not just CLI cosmetics.

- --kv-bits {4,8}: forwarded to MLXLMCommon.GenerateParameters.kvBits
  (both complete + stream call sites); preflight rejects anything but 4/8.
- --max-context <N>: extends the existing per-tier maxContextTokens
  cap; tokens are still rejected at the existing context_length_exceeded
  413 boundary, and we additionally pass maxKVSize=maxContextTokens to
  GenerateParameters so the KV cache (RotatingKVCache) honors the cap.
- --max-batch <N> (default 1, prior single-slot behavior preserved):
  lifts the previously-hardcoded AsyncSemaphore(value: 1) inside
  ModelRuntime to be configurable. Reuses the existing
  maxConcurrencyOverride config field that was already plumbed from
  YAML/env but never wired to the CLI or runtime.

All knobs are triple-exposed (CLI > env > YAML > default), matching the
house convention. Preflight (runServingKnobsPreflight) fails loud at
serve start instead of mid-inference on invalid values.

A bug fix is folded in: ServeCommand.run() was hardcoding
maxConcurrencyOverride: 1 when building ProviderCapacity, silently
ignoring the resolved config. The capacity now reflects --max-batch.

Tests: ServingKnobsConfigTests.swift adds 21 cases covering
config-resolution precedence (CLI > env > YAML), defaults preserved,
preflight rejection of invalid values, runtime threading of all three
knobs, and a regression on the context_length_exceeded gate. Total
suite: 219 -> 240, all passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11 Augustas11 marked this pull request as ready for review June 18, 2026 06:51
@Augustas11 Augustas11 merged commit 46d2f1e into main Jun 18, 2026
5 checks passed
Augustas11 added a commit that referenced this pull request Jun 18, 2026
…ch axes

PR #105 exposed --kv-bits / --max-context / --max-batch as macprovider-cli
serve flags. Wire them into autotune.py as three OPTIONAL search axes
(--kv-bits-options / --max-context-options / --max-batch-options). Each
defaults to [None], so omitting all three preserves the original
model x context candidate space exactly (the original 6-trial 8GB Air run
still produces an identical 6-trial plan). When set, candidates are the
full cartesian, with the chosen knobs passed through to start_provider
and recorded in three additively-migrated tune_trials columns (kv_bits,
max_context_cap, max_batch). Legacy rows keep NULL in the new columns;
existing reports remain readable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11 added a commit that referenced this pull request Jun 18, 2026
* spec(cli): SPEC-013 v0.1 — macprovider-cli autotune subcommand

Initial draft of the autotune subcommand spec + the round-1 codex
audit prompt. NOT for merge — this commit lives on the feature
branch only and the PR is held until the codex audit loop converges.

SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context,
--max-batch) in a two-stage pipeline that encodes the "biggest-fit,
not max-tps" product strategy. Stage 1 iterates a curated
largest-first candidate list and STOPS on the first model that
passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the
chosen model. This is the load-bearing departure from the PR #103
Python prototype (whose cartesian max-tps loop would push every
capable Mac to serve the smallest model).

Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates,
stage2_replicates, kv-bits axis-vs-default) are flagged as Open
Questions pending the in-flight air5 n=3 replication run; v0.2
either confirms placeholders or sends a narrow PR adjusting them.

Files:
- specs/SPEC-013-cli-autotune.md (new, v0.1 draft)
- specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt)
- specs/README.md (+1 row in the index table)

Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings
in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.2 — round-1 codex audit response

Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL
/ 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready
to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs,
and both QUESTIONs. The product framing (biggest-fit, not max-tps)
and two-stage architecture are unchanged — round 1 explicitly
preserved both.

MAJORs closed:
- A.1 fallback contradiction: replaced metrics-bearing `fallbacks`
  with NAME-ONLY `alternates` (the STOP-on-first-feasible rule
  meant smaller candidates were never probed; v0.1's fallback
  metrics were structurally impossible).
- D.1 `models pull` precondition was bigger than admitted: FR-D
  reframed as "weights cache-warm before probe; load-fetch
  latency excluded from gate-ttft-ms" with Shape A (explicit
  pull) vs Shape B (rely on runtime online-fallback + isolate
  measurement) implementation choice. No longer depends on a
  not-yet-existing subcommand.
- E.1 launchd label wrong: `com.macprovider.cli` →
  `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5,
  install.sh, plist template). Drain sequence bound to
  `launchctl bootout/bootstrap gui/$UID/...`.
- F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` /
  `max_batch` were the CLI flag names; actual YAML keys per
  Config.swift:239-241 are `max_context_override` /
  `max_concurrency_override`. JSON `knobs` object now uses YAML
  key names for round-trip into config.yaml; `serve_command`
  retains CLI flag names for shell paste.
- F.2 recipe_hash not deterministic: pinned to
  `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization +
  explicit hash input domain enumeration (machine + inputs +
  recommendation.model + recommendation.knobs; excludes run_id,
  timestamps, observed metrics).
- G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD
  COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new
  inserts MUST set stage=1 or stage=2 explicitly.
- J.1 no AC for operator-supplied order: added AC-17 with
  `--candidate-models 1B,32B` on a Mac where both fit — must
  pick 1B because operator order is the contract.

MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI
summary kv-bits default), F.3 (backup naming collision-safe),
G.2 (transactional retention), H.1 (--resume removed from §7),
J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum),
K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration
note), M.1 (cross-spec renumber to SPEC-014).

QUESTIONs resolved (2): D.2 (signature vs network failure now
asymmetric — integrity aborts whole run, transient advances),
K.2 (added OQ-E flagging thermal/order bias with quantitative
threshold).

Deferred to post-lock: M.2 documentation checklist (decision-log
entry, SPEC-003 install note, PR #103 disposition) — captured as
a §11 checklist but not in the binding contract.

Files:
- specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines)
- specs/SPEC-013-audit.md (NEW, codex round-1 output)
- specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt)

Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the
round-2 closure check, address any new findings, repeat until
LOCK READY, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 — round-2 audit response (LOCK candidate)

Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned
LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the
round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex
recommended a narrow v0.3 closing the 4 new findings before
implementation. v0.3 closes all 4. No architecture change.

Round-2 closures:
- N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D
  rewrite permitted Shape B (rely on runtime online-fallback +
  measurement isolation) but NFR-4's egress exception and AC-8
  still spoke only of `models pull`. v0.3 reworords NFR-4 to
  admit both Shape A and Shape B HuggingFace pre-warm paths;
  AC-8 is now shape-neutral with explicit Shape A (mocked pull
  exit non-zero) and Shape B (block egress + runtime fallback
  fails during load) variants. A new sub-variant explicitly
  tests the FR-D.2 integrity-class abort path.
- Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2
  put the parse rules in non-normative §7. v0.3 lifts them into
  FR-B.1 as a normative paragraph (absolute caps, sorted
  ascending after parse, each cell >= --target-context,
  flag-parse-time rejection with exit_reason='config_error',
  duplicate rejection, empty-axis = single-cell). The §7 /
  §5 conflict-resolution rule is now stated explicitly.
- N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling
  protocol: v0.3 adds a 10-paired-runs forward/reverse protocol
  with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger
  threshold. Operators can close OQ-E without relitigating
  methodology.
- O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four
  discrete sites — `tune_runs.spec_version` SQL comment,
  FR-H.2 "v0.1 normative contract" prose, NFR-3 stale
  `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2"
  disclaimer.

Files:
- specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate)
- specs/SPEC-013-audit.md (codex round-2 output landed)
- specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3
  closure-confirmation audit prompt)

Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the
round-3 LOCK-confirmation check. Expected outcome: LOCK with
0 new findings or ≤1 MINOR. If LOCK, push the branch and open
the DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 LOCK — round-3 codex confirmation + O-V03.1 fold-in

Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned
LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus
0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new.

The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON
example still showed "SPEC-013 v0.2" inside a v0.3 document.
Codex explicitly said this does not block LOCK (the adjacent SQL
comment already stated writers emit their own producing version),
but recommended folding the fix in before implementation. This
commit folds it: the JSON example now uses
"SPEC-013 v<producing-version>" as a placeholder, and the
spec_version bullet teaches the rule.

Round-3 closures (from specs/SPEC-013-audit.md § Round 3):
- N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or
  equivalent) and Shape B (runtime online fallback during model
  load) HuggingFace pre-warm paths; carve-out scoped to autotune
  runs and weight fetches; AC-8 shape-neutral with explicit
  Shape A + Shape B + integrity-class variants.
- Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from
  non-normative §7 into binding FR-B.1 (absolute caps, sorted
  ascending, ≥ target-context, flag-parse-time rejection with
  exit_reason='config_error', duplicate rejection, empty-axis
  = single-cell); §7 vs §5 conflict-resolution rule explicit.
- N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable
  10-paired-runs forward/reverse sampling protocol on air5 with
  60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger.
- O.1 CLOSED: all 4 named drift sites updated (tune_runs
  SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7
  disclaimer).

Specs index updated: specs/README.md row for SPEC-013 now reads
v0.3.

Files:
- specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in)
- specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed)
- specs/README.md (SPEC-013 row → v0.3)

Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR
+ 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed
from round 2) → LOCK. Next step: push branch + open DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 BUILD prompt — Option A Swift-native impl plan

Adds the operator-paste BUILD prompt that a fresh Codex CLI
session uses to implement SPEC-013 v0.3 against the existing
phase3-binary/ Swift package.

The BUILD prompt picks Option A (Swift-native subcommand inside
macprovider-cli) per SPEC-013 §10, rationale captured inline:
single-binary install consistency with SPEC-003, drain semantics
match existing patterns (UninstallCommand / SelfUpdate), future
SPEC-011 warm-swap integration needs Swift-native. Shape A vs
Shape B for FR-D pre-warm is left as the implementer's call —
the binding contract is FR-D.1's measurement-isolation
requirement, not the mechanism.

The 11-step build sequence:
1. AutotuneCommand subcommand scaffolding + --dry-run
2. --no-join flag on ServeCommand (FR-E.2 precondition)
3. SQLite schema + DB layer (tune_trials + tune_runs +
   migration + transactional retention)
4. Provider lifecycle (start/stop/wait-ready, single-provider
   invariant)
5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap)
6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient
   classification per FR-D.2)
7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible,
   AC-17 biggest-fit guard)
8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from
   prototype)
9. Recommendation surface (FR-F: terminal block + JSON schema +
   RFC 8785 JCS recipe_hash + --apply atomic write)
10. Failure modes + signal handling (FR-H, exit_reason enum)
11. Acceptance test suite (AC-1 through AC-19)

Branch strategy: build work happens on feat/cli-autotune-impl
stacked off spec/cli-autotune-v1, so the implementing PR rebases
cleanly onto main after the SPEC PR (#108) merges.

Hard rules: do not modify any file under specs/ from the build
branch (it's downstream of the SPEC PR); do not touch
beta/coordinator/gateway; do not pivot to Option B without an
Open Question; preserve the biggest-fit objective (AC-17 catches
this if forgotten).

The prompt is self-contained: severity definitions, required
reading, step sequence, acceptance gate, hard rules, anti-rules,
operator checkpoint cadence, and an Open Question template.

Files:
- specs/BUILD_SPEC_013_PROMPT.md (NEW)

Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1
and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected
wall-clock: 1-2 weeks of session work for one new subcommand
inside an existing binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11 added a commit that referenced this pull request Jun 18, 2026
…BUILD steps, 21 audit rounds) (#109)

* feat(autotune) Step 1: subcommand scaffolding

Add the Swift-native autotune parser surface so subsequent SPEC-013 steps can build on a verified CLI entrypoint.

Constraint: SPEC-013 v0.3 is locked; build work is limited to phase3-binary and Option A Swift-native.

Rejected: Implementing provider lifecycle or SQLite in this checkpoint | Step 1 only requires parser registration and dry-run proof.

Confidence: high

Scope-risk: narrow

Directive: Keep Stage 1 model selection biggest-fit and operator-order preserving; do not convert it into cross-model max-tps scoring.

Tested: swift build --package-path phase3-binary; swift test --package-path phase3-binary --filter AutotuneCommandTests; swift test --package-path phase3-binary; macprovider-cli autotune --help; macprovider-cli autotune --dry-run; git diff --check

Not-tested: Real autotune execution, SQLite persistence, provider lifecycle, serve --no-join, and apply/restart behavior are intentionally deferred to later steps.

* audit(autotune): Step 1 implementation audit prompt

Adversarial code review prompt for the facbaef Step 1 commit
(AutotuneCommand parser scaffolding + --dry-run). Read-only review;
codex will write findings to specs/SPEC-013-impl-audit.md.

Per [[feedback-build-audit-loop]]: BUILD work follows the same
audit-loop discipline as SPEC work. This is round 1 of the Step 1
audit cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 1 round-1 audit findings (A.4 MAJOR + B.2/B.5/C.2 MINORs)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md)
returned 0 CRITICAL / 1 MAJOR / 3 MINOR / 0 QUESTION on the
facbaef Step 1 commit, verdict FIX REQUIRED. This commit closes
all 4 findings.

A.4 (MAJOR) fix: `parseCSV` used `split(separator: ",")` + filter-
empty, which silently dropped empty cells in
`--max-context-axis 4000,,8000` instead of throwing — violating
the SPEC-013 v0.3 round-2 Z-B.1 closure that requires invalid
cells to fail at flag-parse time. Added `parseCSVStrict(_:flag:)`
that preserves empty subsequences and throws ValidationError with
the offending flag name. The empty-default case
(`--max-context-axis ""` → `[--target-context]`) is preserved
by checking the raw string for emptiness before strict parse.

B.5 (MINOR) fix: same parseCSV-drops-empty bug class affected
`--candidate-models`; an operator typo `one,,two` parsed as 2
candidates instead of throwing. The explicit-candidates path now
uses parseCSVStrict too. parseKvBitsAxis and parsePositiveIntAxis
also moved to the strict helper for consistency.

B.2 (MINOR) fix: removed the redundant
`try validateBasicInputs()` call from `run()`. ArgumentParser
already invokes the `validate()` protocol method at parse time.
Added `_ = try candidatePlan()` to `validate()` so explicit-list
parse errors surface at flag-parse time, not deferred.

C.2 (MINOR) fix: added 8 new tests to AutotuneCommandTests
pinning the Step 1 parser contracts that should have been frozen
in the Step 1 commit:
- testMaxContextAxisRejectsEmptyCell (A.4 regression lock)
- testMaxContextAxisSortsAscending (FR-B.1 sort rule)
- testMaxContextAxisRejectsDuplicates (FR-B.1 dedup rule)
- testMaxContextAxisEmptyDefaultMapsToTargetContext
- testCandidateModelsRejectsEmptyCell (B.5 regression lock)
- testExplicitCandidateOrderPreservesSmallFirst (AC-17 prep;
  fails under any future parser-level re-rank)
- testDryRunLinesContainCandidatePlanInOrder (dry-run output
  contract; uses the new testable dryRunLines helper)
- testRestartForegroundFlagParses

Refactor: extracted `dryRunLines(_:)` from `printDryRun(_:)` so
tests can assert on candidate order without capturing stdout.
Used by tests and by future JSON/report serializers in Steps 9+.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 253 tests, with 0 failures (up from 245 baseline)
  swift build --package-path phase3-binary
  Build complete (12.34s)

Smoke checks (the failures codex flagged):
  $ macprovider-cli autotune --max-context-axis 4000,,8000 --dry-run
  Error: --max-context-axis contains an empty cell; check for stray commas
  exit 64
  $ macprovider-cli autotune --candidate-models one,,two --dry-run
  Error: --candidate-models contains an empty cell; check for stray commas
  exit 64

Files:
- phase3-binary/Sources/macprovider-cli/AutotuneCommand.swift
- phase3-binary/Tests/macprovider-cliTests/AutotuneCommandTests.swift
- phase3-binary/implementation-notes.html
- specs/SPEC-013-impl-audit.md (codex round-1 output landed)

Next step: fire round-2 audit at codex to verify closures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 1 round-2 closure-verification prompt

Round-1 audit (codex on facbaef) returned 1 MAJOR + 3 MINOR,
verdict FIX REQUIRED. Audit-response landed as 02b038d closing
all 4. Round 2 verifies the closures and spot-checks the
fix-pass for new gaps. Output appends to specs/SPEC-013-impl-audit.md.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 1 round-2 LOCK — 4 CLOSED / 0 new findings

Codex round-2 closure audit verified all 4 round-1 fixes landed
clean and the fix-pass (02b038d) introduced no new gaps. Step 1
is locked at commit 02b038d; ready to proceed to Step 2.

Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 2):
- 4 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED
- 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new
- 253 tests passing (245 baseline + 8 added in fix-pass)
- Smoke checks confirm:
  - --max-context-axis 4000,,8000 → exit 64 (was: silently accepted)
  - --candidate-models one,,two → exit 64 (was: silently accepted)
  - --max-context-axis '' → exit 0, maps to [target-context]

Step 1 audit cycle complete after 2 rounds:
- R1 (on facbaef): 1 MAJOR + 3 MINOR, FIX REQUIRED
- Fix-pass (02b038d): closed all 4
- R2 (on 02b038d): 0 findings, READY TO PROCEED TO STEP 2

Next step: fire BUILD_SPEC_013_PROMPT.md Step 2 (--no-join on
ServeCommand) at codex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 2: add serve --no-join

Add a ServeCommand no-join guard so autotune candidate providers can expose local HTTP without creating a coordinator WebSocket client.

Constraint: Step 2 only; no specs, beta, coordinator, or gateway files changed.
Rejected: Mutating resolved coordinator_url | no-join is runtime join suppression and should not rewrite config visibility.
Confidence: high
Scope-risk: narrow
Directive: Keep the default serve path on the coordinator factory branch so absent --no-join behavior remains unchanged.
Tested: swift test --package-path phase3-binary --filter ServeCommandTests; swift test --package-path phase3-binary; swift run --package-path phase3-binary macprovider-cli serve --help | rg -- "--no-join|USAGE|OPTIONS"; git diff --check
Not-tested: Live model-serving smoke with real weights, because package tests cover the coordinator suppression path without loading a model.

* audit(autotune): Step 2 implementation audit prompt

Adversarial code review prompt for the ffb00fb Step 2 commit
(--no-join flag on ServeCommand + testable factory guard +
optional-propagation cascade). Read-only review; codex will append
findings to specs/SPEC-013-impl-audit.md as round 3.

Step 2 anti-regression invariant: with --no-join ABSENT (the
default), every existing serve behavior MUST be byte-identical to
pre-Step-2. The optional-propagation discipline (every former
coordinatorClient.X call now ?.X) is the highest-risk surface;
audit explicitly checks for any surviving direct call.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 2 LOCK — clean pass, 0 findings

Codex round-1 audit on the Step 2 commit (ffb00fb) returned a
clean pass: 0 CRITICAL / 0 MAJOR / 0 MINOR / 0 QUESTION. Step 2
is locked at commit ffb00fb without a fix-pass.

Audit verdict (specs/SPEC-013-impl-audit.md § Round 3):
- 0 findings across all categories (A through E + O)
- 256 tests passing (253 baseline + 3 added by Step 2)
- Optional-propagation grep confirmed clean: 0 direct
  coordinatorClient.X calls survived the diff
- serve --help still exposes PR #105 flags + SPEC-011 flags +
  the new --no-join
- serve --no-join --enable-warm-swap parses (no conflict)

Step 2 audit cycle complete in 1 round (vs Step 1's 2 rounds).
The factory-closure DI pattern is a cleaner shape than
sequential mutation; the tests prove the no-instantiation
guarantee at the type-system level.

Next step: fire BUILD_SPEC_013_PROMPT.md Step 3 (SQLite schema +
DB layer) at codex. Step 3 is the heaviest implementation step
so far (FR-G.1 tune_trials schema + FR-G.2 tune_runs schema +
ALTER TABLE migration + transactional retention sweep).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 3: add SQLite DB layer

Implements the additive AutotuneDB writer surface for SPEC-013 run/trial persistence while keeping AutotuneCommand runtime unwired until later steps.

Constraint: Step 3 only; no specs, beta, phase4-coordinator, or phase5-gateway changes; no new top-level dependency.

Rejected: Adding a third-party SQLite wrapper | system SQLite3 is available via Swift C interop and preserves the single-binary dependency posture.

Confidence: high

Scope-risk: narrow

Directive: Later autotune runtime steps should call insertRun before applyRetentionInTransaction and must set tune_trials.stage explicitly for every trial.

Tested: swift test --filter AutotuneDBTests; swift test (260 tests, 0 failures).

Not-tested: Real autotune runtime writes, because Step 3 intentionally does not wire AutotuneDB into AutotuneCommand.run().

* audit(autotune): Step 3 implementation audit prompt

Adversarial code review prompt for the d0029e9 Step 3 commit
(AutotuneDB SQLite layer + AutotuneExitReason enum +
AutotuneTrialRow/AutotuneRunRow value types + ALTER TABLE
migration + transactional retention sweep + 4 unit tests).
Read-only review; codex will append findings to
specs/SPEC-013-impl-audit.md as round 4.

Step 3 is the heaviest implementation step so far. High-risk
surfaces flagged in the audit prompt:
- Schema column-by-column match against SPEC-013 §5.7 FR-G.1/G.2
- Application-layer enum match against the 9 normative
  exit_reason values
- Transactional retention atomicity (BEGIN IMMEDIATE → DELETE
  trials → DELETE runs → COMMIT; ROLLBACK on error)
- C-interop discipline (statement finalize, handle close on all
  paths including error)
- SQL injection guards (interpolated values must be Swift Int)
- Anti-regression on the 256 baseline (commit claims 260 with 4
  new tests)

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 3 LOCK — clean pass, 0 findings

Codex round-1 audit on the Step 3 commit (d0029e9) returned a
clean pass across all categories: 0 CRITICAL / 0 MAJOR / 0 MINOR
/ 0 QUESTION. Step 3 is locked at commit d0029e9 without a
fix-pass — the second consecutive step to clear audit in one
round (Step 2 was the first).

Audit verdict (specs/SPEC-013-impl-audit.md § Round 4):
- 0 findings across all 6 categories (A through E + O)
- 260 tests passing (256 baseline + 4 added by Step 3)
- All tune_trials + tune_runs schema columns match SPEC-013
  §5.7 FR-G.1/G.2 byte-for-byte
- All 9 normative exit_reason enum values present with correct
  snake_case raw values
- Transactional retention discipline verified: BEGIN IMMEDIATE
  → DELETE trials → DELETE runs → COMMIT, ROLLBACK preserves
  original error
- C-interop resource lifetimes clean: defer finalizes every
  prepared statement, close() runs on every init error path
- No third-party SQLite dependency added (system SQLite3)
- AutotuneTrialRow + AutotuneRunRow value types expose every
  field Steps 7/9/10/11 need

Next step: fire BUILD_SPEC_013_PROMPT.md Step 4 (provider
lifecycle — start/stop/wait-ready) at codex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 4: add provider lifecycle runner

Constraint: SPEC-013 Step 4 only; no specs, beta, phase4-coordinator, or phase5-gateway changes; do not wire runner into AutotuneCommand.run yet.

Rejected: --provider-bin CLI override | Step 4 requires resolving the current macprovider-cli binary without adding a new operator flag.

Confidence: high

Scope-risk: narrow

Directive: Keep CandidateProviderRunner as the lifecycle primitive for Step 7; preserve SIGTERM-only stop unless Step 5 foreground-restart logic owns an explicit escalation path.

Tested: swift test --package-path phase3-binary --filter CandidateProviderRunnerTests; swift test --package-path phase3-binary; git diff --cached --check.

Not-tested: Real macprovider-cli serve lifecycle with a cached small model; integration test is gated behind MACPROVIDER_INTEGRATION_TEST=1 plus provider binary/model env vars.

* audit(autotune): Step 4 implementation audit prompt

Adversarial code review prompt for the 994c7ee Step 4 commit
(CandidateProviderRunner + ReadyStatus enum + 6 unit tests + 1
integration-gated test). Read-only review; codex will append
findings to specs/SPEC-013-impl-audit.md as round 5.

Step 4 is the most concurrency-heavy step so far. High-risk
surfaces flagged in the audit prompt:
- Single-provider invariant under concurrent start/stop races
- Process lifecycle correctness (zombie risk, signal delivery
  timing, isRunning semantics)
- Pipe/FD/log-handle lifetime across normal + error paths
- waitForReady process-exit race detection (checked twice per
  iteration is good)
- No SIGKILL escalation in v1 (Step 4 forbids; Step 5 will
  own launchd-specific paths)
- Async/await cancellation correctness
- Argv assembly matches PR #105 + FR-E.2

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 4 round-1 audit findings (2 MAJOR + 2 MINOR + 1 QUESTION)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
§ Round 5) returned 0 CRITICAL / 2 MAJOR / 2 MINOR / 1 QUESTION
on the 994c7ee Step 4 commit, verdict FIX REQUIRED. This commit
closes all 5 findings.

B.1 (MAJOR) fix: process.run() failure cleanup hung draining
never-launched pipes. The catch path called finishLogging() which
ran readDataToEndOfFile() on stdout/stderr pipes whose write ends
were still open in the parent (no child existed to close them on
exec). Codex reproduced a 2-second hang locally with a missing
binary path. Added RunningProvider.discardLogging(reason:) that
clears readability handlers and closes the log file with a brief
"process spawn failed" note WITHOUT draining pipes; pipe FDs close
on Pipe deinit when RunningProvider is released. New test
testStartFailsPromptlyWithMissingBinary asserts start() throws in
< 1 second (fix returns in ~10ms vs the prior 2-second hang).

C.1 (MAJOR) fix: waitForReady() returned .ready immediately on
HTTP 200 without re-checking process state. If the provider 200d
/v1/models and then crashed before waitForReady returned, Step 7
would begin measurement against a dead process. waitForReady now
re-checks provider.process.isRunning after receiving HTTP 200; if
the process exited in the race, returns .processExited(rc:stderrTail:)
instead. New test testWaitForReadyHandlesImmediateExitAfterFirstResponse
uses a Python stub that responds 200 and immediately exits; the
runner produces .ready or .processExited but NEVER .timeout (the
pre-fix silent-hang regression class).

A.1 (QUESTION) resolution: stop() now returns StopResult enum with
.stopped and .stuck(pid:) cases (@discardableResult so existing
callers compile unchanged). The .stuck variant lets Step 7 record
a clear failure instead of discovering it via the next start()
failing with alreadyRunning. NO SIGKILL escalation in v1 per the
BUILD prompt; current intentionally stays set so the single-
provider invariant holds. New test
testStopReturnsStoppedForNeverStartedRunner covers the happy path;
the .stuck path is hard to reach deterministically without an
unkillable subprocess and is left to Step 7 integration coverage.

E.1 (MINOR) fix: argv validation tests covered only invalidKvBits.
Added testServeArgumentsRejectInvalidPort,
testServeArgumentsRejectInvalidMaxContext, and
testServeArgumentsRejectInvalidMaxBatch against
CandidateProviderRunner.serveArguments(...).

I.1 (MINOR) fix: log filename collision within one second. The
prior filename used only Int(timeIntervalSince1970), so two starts
of the same model+port within one second collided and the second
.atomic write truncated the first log. Filenames now append the
first 8 chars of a UUID:
<model>-<port>-<unix-ts>-<uuid8>.log
New test testLogFileURLsDoNotCollideWithinOneSecond verifies two
consecutive builds for the same (model, port) produce distinct
paths.

API addition: exposed logFileURL(...) as
logFileURLForTesting(model:port:) static helper for I.1 test
visibility; named explicitly to flag it as a test accessor.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 273 tests, with 1 test skipped and 0 failures
  (266 baseline + 7 new from this fix-pass)

  swift build --package-path phase3-binary
  Build complete (7.84s)

Smoke check for B.1:
  start() with /nonexistent/macprovider-cli-stub-<uuid> throws in
  ~10ms (was: 2-second hang).

Files:
- phase3-binary/Sources/macprovider-cli/CandidateProviderRunner.swift
- phase3-binary/Tests/macprovider-cliTests/CandidateProviderRunnerTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 audit at codex to verify closures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 4 round-2 closure-verification prompt

Round-1 audit (codex on 994c7ee) returned 2 MAJOR + 2 MINOR + 1
QUESTION, verdict FIX REQUIRED. Audit-response landed as 4bcef89
closing all 5. Round 2 verifies the closures and spot-checks the
fix-pass for new gaps.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 4 round-2 LOCK — 5 CLOSED / 0 new findings

Codex round-2 closure audit verified all 5 round-1 fixes landed
clean and the fix-pass (4bcef89) introduced no new gaps. Step 4
is locked at commit 4bcef89; ready to proceed to Step 5.

Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 6):
- 5 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED
- 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new
- 273 tests passing (1 skipped integration-gated)
- Smoke checks confirm:
  - testStartFailsPromptlyWithMissingBinary passes in 7ms
    (was 2-second hang pre-fix)
  - waitForReady post-200 isRunning check correctly returns
    .processExited under the test stub race
  - StopResult.stopped/.stuck distinction surfaces stuck-
    provider failure cleanly
- No SIGKILL escalation introduced; SIGTERM-only contract held

Step 4 audit cycle complete after 2 rounds:
- R1 (on 994c7ee): 2 MAJOR + 2 MINOR + 1 QUESTION, FIX REQUIRED
- Fix-pass (4bcef89): closed all 5
- R2 (on 4bcef89): 0 findings, READY TO PROCEED TO STEP 5

Next step: fire BUILD_SPEC_013_PROMPT.md Step 5 (provider-conflict
pre-flight + launchd drain) at codex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 5: add provider conflict preflight

Add launchd and foreground serve conflict detection plus drain/restore primitives without wiring them into autotune execution yet. Keep launchctl, process-list, signal, port, and restart effects behind injected closures so unit tests do not mutate live services.

Constraint: SPEC-013 Step 5 only; do not modify specs, beta, coordinator, gateway, or AutotuneCommand.run wiring.

Rejected: substring argv matching for serve | it can match autotune/helper processes and violate FR-E.1 self-exclusion.

Rejected: SIGKILL escalation | v1 drain policy is SIGTERM-only for foreground and launchctl-only for launchd.

Confidence: high

Scope-risk: narrow

Directive: Step 7 should call ProviderConflictDetector/ProviderDrainer for pre-flight; Step 10 should call restore after tune/apply decisions.

Tested: swift test --package-path phase3-binary (283 tests, 2 skipped, 0 failures); git diff --check

Not-tested: real bootout/bootstrap drain against an installed launchd provider; launchctl list integration remains gated behind MACPROVIDER_INTEGRATION_TEST=1

* audit(autotune): Step 5 implementation audit prompt

Adversarial code review prompt for the d40a6f7 Step 5 commit
(ProviderConflictDetector + ProviderDrainer + extracted PortProbe
+ 10 unit tests, 1 integration-gated). Read-only review; codex
will append findings to specs/SPEC-013-impl-audit.md as round 7.

Step 5 has the most external-effect surface so far. High-risk
checks flagged in the audit prompt:
- Launchd label byte-match vs SPEC-003 v0.9.2 §FR-C5 / install.sh
- launchctl list parser handles PID '-' and label-as-substring
- isForegroundServe whole-word match (NOT substring); self-
  exclusion via argv.contains('autotune')
- gui/<uid>/<label> bootout vs gui/<uid> bootstrap distinction
- SIGKILL discipline (v1 forbids; warningWriter emits when stuck)
- KERN_PROCARGS2 sysctl argv parse buffer safety
- DI surface completeness: every effect must be closure-injected
  so tests don't mutate real launchd state
- Anti-regression on Step 4's runner refactor (isPortOpen
  extraction to MacProviderPortProbe must be byte-equivalent)

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 6.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 5 round-1 audit MINORs (G.1 + G.2 + G.3)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
§ Round 7) returned 0 CRITICAL / 0 MAJOR / 4 MINOR / 0 QUESTION
on the d40a6f7 Step 5 commit, verdict READY TO PROCEED. Codex's
verdict already allowed progress to Step 6, but 3 of the 4
MINORs are cheap test-coverage locks and worth closing now
(matching the Step 1 discipline). The 4th MINOR (H.1) is a
Step 10 forward-compat note and is tracked there.

G.1 (MINOR) fix: parseLaunchdManagedPID's inactive-PID case
(launchctl list line beginning with `-`) had no direct unit
test. Added testParseLaunchdManagedInactivePIDReturnsNil in
ProviderConflictDetectorTests that exercises the parser
directly with `-\t-\tlive.streamvc.macprovider` and asserts
(found: true, pid: nil).

G.2 (MINOR) fix: the helper-binary false-positive class (e.g.
/usr/local/bin/macprovider-cli-helper serve) had no test.
Added testHelperBinaryDoesNotMatchForegroundServe with a
process list containing that argv; the
lastPathComponent == "macprovider-cli" check correctly rejects.
Complements testServeSubstringDoesNotMatchForegroundServe
(which covered the subcommand-like name class only).

G.3 (MINOR) fix: the SIGKILL-disabled warning path had no unit
test. Added
testForegroundDrainEmitsNoSIGKILLWarningWhenProcessRemainsAfterGrace
in ProviderDrainerTests with processIsRunning: {_ in true},
portIsOpen: {_ in false}, and a recording warning writer;
asserts the warning fires, includes the stuck PID, and names
the v1 no-SIGKILL policy.

H.1 (MINOR) NOT FIXED: restore idempotency. The drainer's
restore(.launchdManaged) calls launchctl bootstrap
unconditionally; a second restore after a successful bootstrap
will fail because the service is already loaded. Codex flagged
this as a Step 10 concern (Step 10 owns lifecycle cleanup);
captured as a forward-compat note in implementation-notes
rather than fixed in Step 5 code.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 286 tests, with 2 tests skipped and 0 failures
  (283 baseline + 3 new from this fix-pass)

Files:
- phase3-binary/Tests/macprovider-cliTests/ProviderConflictDetectorTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 audit at codex to verify closures, then
proceed to Step 6 (pre-warm — Shape A vs Shape B).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 5 round-2 closure-verification prompt

Round-1 audit (codex on d40a6f7) returned 0 CRITICAL / 0 MAJOR /
4 MINOR / 0 QUESTION, verdict READY TO PROCEED. Fix-pass
(3adddbf) closed G.1 + G.2 + G.3 as pure test additions and
deferred H.1 to Step 10 with a forward-compat note. Round 2
verifies the test closures + the deferral rationale.

Production code was not touched in 3adddbf — round 2 also
asserts that empty production-diff invariant.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 6 (already met; this round verifies
the additional MINOR closures are sound).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 5 round-2 LOCK — 3 CLOSED + 1 DEFERRED / 0 new findings

Codex round-2 closure audit verified all 3 MINOR closures landed
clean and the H.1 Step-10 deferral is well-justified. Step 5 is
locked at commit 3adddbf; ready to proceed to Step 6.

Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 8):
- 3 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED / 1 DEFERRED
- 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new
- 286 tests passing (2 skipped integration-gated)
- git diff d40a6f7 3adddbf -- phase3-binary/Sources/ is EMPTY
  (test-only fix-pass)
- Closures verified meaningful (not tautological):
  - G.1: testParseLaunchdManagedInactivePIDReturnsNil pins
    parser inactive-PID handling
  - G.2: testHelperBinaryDoesNotMatchForegroundServe would fail
    under substring matching regression
  - G.3: testForegroundDrainEmitsNoSIGKILLWarningWhenProcessRemainsAfterGrace
    pins the visible no-SIGKILL policy with all DI seams
- H.1 deferral: implementation-notes explicitly assigns
  call-once discipline to Step 10's lifecycle wiring

Step 5 audit cycle complete after 2 rounds:
- R1 (on d40a6f7): 4 MINOR, READY TO PROCEED
- Fix-pass (3adddbf): closed G.1+G.2+G.3, deferred H.1 to Step 10
- R2 (on 3adddbf): 0 findings, READY TO PROCEED TO STEP 6

Next step: fire BUILD_SPEC_013_PROMPT.md Step 6 (pre-warm —
Shape B Swift-native).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 6: add Shape B pre-warm wrapper

Implement the dormant Swift-native pre-warm load phase so Step 7 can wire measurement after provider readiness without adding a models pull subcommand.

Constraint: SPEC-013 Step 6 only; no specs, beta, coordinator, gateway, or AutotuneCommand.run() wiring changes.

Rejected: Shape A models pull subcommand | out of scope for SPEC-013 v1 and not present in the binary.

Confidence: high

Scope-risk: narrow

Directive: Keep gate-ttft measurement after ProviderPreWarmer returns .warmed; this wrapper measures load/readiness only.

Tested: swift test --package-path phase3-binary (293 tests, 2 skipped, 0 failures).

Not-tested: real HuggingFace online fallback and real model cache freshness beyond local snapshot-file heuristic.

* audit(autotune): Step 6 implementation audit prompt

Adversarial code review prompt for the e7bfab5 Step 6 commit
(HuggingFaceCacheChecker + ProviderPreWarmer + PreWarmResult +
7 unit tests). Read-only review; codex will append findings to
specs/SPEC-013-impl-audit.md as round 9.

Step 6 has interesting risk surface in the FR-D.2 classifier:
- Integrity-vs-transient classification is ASYMMETRIC. A missed
  integrity (integrity-as-transient) = CRITICAL silently
  advancing past a security failure. A misclassified transient
  (transient-as-integrity) = MAJOR aborting recoverable run.
- 8 integrity markers (substring search, case-insensitive).
  Audit checks for missing markers vs SPEC-named examples and
  false-positive risk on short markers like 'hash mismatch'.
- Cache check uses HF snapshot layout heuristic. Partial-download
  false positive is acceptable (downstream load failure
  classifies correctly).

Architecturally consequential: the
pattern means the provider is STOPPED before Step 7 sees
.warmed. Audit flags this as the load-bearing G.2 question —
is Step 7 expected to restart for the trial (load happens
twice) or should Step 6 leave the provider alive?

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 6 round-1 audit findings (1 CRITICAL + 1 MAJOR + 2 MINOR)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
§ Round 9) returned 1 CRITICAL / 1 MAJOR / 2 MINOR / 0 QUESTION
on the e7bfab5 Step 6 commit, verdict FIX REQUIRED. This commit
closes all 4 findings.

B.1 (CRITICAL) fix: the missing-tokenizer integrity marker did
not match the actual swift-transformers Hub error string. The
prior marker "missing tokenizer.json" did NOT catch the real
error "Required configuration file missing: tokenizer.json"
(lowercased: "required configuration file missing: tokenizer.json")
because of the colon — the SPEC-named integrity failure was
silently classifying as transient and Step 7 would have advanced
instead of aborting with exit_reason='pre_warm_integrity_failure'.

This is the worst-case asymmetric FR-D.2 failure mode the audit
prompt explicitly named: integrity-as-transient = CRITICAL
(silently advancing past a security failure). Added the colon-
bearing variants:
- "configuration file missing: tokenizer.json"
- "missing: tokenizer.json"
- "missing required tokenizer files"
Kept the original "missing tokenizer.json" as fallback for
variant phrasings. New regression-lock test
testPreWarmerClassifiesConfigurationFileMissingTokenizerAsIntegrity
feeds the exact Hub error string to the stub provider and asserts
.integrity classification.

B.2 (MAJOR) fix: malformed safetensors / weight loader errors
had no integrity markers. Added markers from the mlx-swift
safetensors reader path (clean-room safe: mlx-swift is MIT-
licensed, not Darkbloom):
- "invalid json header" (catches "[load_safetensors] Invalid json header length ...")
- "invalid json metadata" (catches "[load_safetensors] Invalid json metadata ...")
- "incomplete metadata"
- "invalid or corrupted"
These signal corrupted or tampered repository content. Asymmetric
FR-D.2 risk model biases toward integrity here. New test
testPreWarmerClassifiesInvalidJsonHeaderAsIntegrity locks the
behavior.

B.3 (MINOR) fix: classification tests covered only the lowercased
"signature mismatch" marker. Added
testPreWarmerClassifiesMixedCaseIntegrityCorrectly that feeds
"SIGNATURE Mismatch on Model Weights" and pins the
.lowercased() matcher semantics. Combined with B.1 and B.2's new
tests, the integrity-classification suite now covers signature,
tokenizer-missing, and safetensors-corruption marker classes.

E.1 (MINOR) fix: now injection was unused in duration assertions.
Added testPreWarmerLoadDurationReflectsInjectedClock that
injects a deterministic now() closure advancing by 7s between
the start and end calls, then asserts loadDurationSec == 7.0
(accuracy 0.001). A regression that records the end time before
waitForReady returns or always returns 0 would fail this test.

G.2 resolved (defer-stop after .warmed was intentional). Round-1
confirmed the disposable-load-phase design: Step 6 stops the
provider before returning .warmed, Step 7 will start a fresh
provider against the now-warm cache for trial measurement. The
cold fetch is excluded because Step 6 has already warmed the HF
cache; the second readiness wait is still outside gate-ttft-ms.
Not a bug.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 297 tests, with 2 tests skipped and 0 failures
  (293 baseline + 4 new from this fix-pass)

Marker list now totals 15 patterns (vs 8 before) covering:
- signature/hash/checksum class (7)
- missing-tokenizer class (4)
- safetensors corruption class (4)

Files:
- phase3-binary/Sources/macprovider-cli/ProviderPreWarmer.swift
- phase3-binary/Tests/macprovider-cliTests/ProviderPreWarmerTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 audit at codex to verify closures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 6 round-2 closure-verification prompt

Round-1 audit returned 1 CRITICAL + 1 MAJOR + 2 MINOR, verdict
FIX REQUIRED. Fix-pass (682abe8) closes all 4: B.1 added 3 new
tokenizer-missing markers including the colon-bearing
swift-transformers Hub error variant; B.2 added 4 mlx-swift
safetensors corruption markers; B.3 added mixed-case integrity
test; E.1 added deterministic-clock load-duration test.

Round 2 verifies closures + the expanded 15-marker list for
new false-positive risk. Two of the new markers
('incomplete metadata', 'invalid or corrupted') are vague
enough that codex should specifically assess FP risk against
benign log lines.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 7.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 6 round-2 audit N.1 MAJOR (vague integrity markers)

Codex round-2 closure audit (specs/SPEC-013-impl-audit.md
§ Round 10) verified all 4 round-1 fixes closed cleanly but
flagged 1 MAJOR NEW finding (N.1): the round-1 fix-pass added
two unanchored integrity markers ("incomplete metadata" and
"invalid or corrupted") that could match plausible benign
transient lines, over-aborting recoverable runs.

Concrete examples codex identified:
- "Download failed: incomplete metadata in Hugging Face API
  response; retry later" would match "incomplete metadata" and
  classify as integrity (silently aborting an HF API hiccup
  that the next candidate could survive).
- "cache index invalid or corrupted; rebuild cache and retry"
  would match "invalid or corrupted" and classify as integrity
  (silently aborting a local-state recoverable case).

Under the asymmetric FR-D.2 risk model
(integrity-as-transient = CRITICAL silent advance past security
failure; transient-as-integrity = MAJOR over-abort recoverable),
the vague markers exceeded the recoverable-tolerance threshold.

N.1 fix: removed both vague markers from the integrity list.
The remaining safetensors-context markers "invalid json header"
and "invalid json metadata" stay because they're anchored to
the loader's actual output (`[load_safetensors] Invalid json
header ...`) and unlikely to appear in transient/network/cache
contexts. SPEC-013 §5.4 FR-D.2 still has full coverage:
- signature/hash/checksum class (7 markers)
- missing-tokenizer class (4 markers)
- safetensors header/metadata corruption (2 markers)
= 13 total markers (was 15 after round 1).

Negative-lock tests added (regression-locks against any future
broad-marker reintroduction):
- testPreWarmerClassifiesIncompleteMetadataDownloadErrorAsTransient
  feeds the HF API metadata line and asserts .transient
- testPreWarmerClassifiesCacheRebuildHintAsTransient
  feeds the cache rebuild hint and asserts .transient

These tests would FAIL if either marker is reintroduced.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 299 tests, with 2 tests skipped and 0 failures
  (297 baseline + 2 new negative locks; the 4 round-1 closure
  tests still pass under the narrower marker list)

Files:
- phase3-binary/Sources/macprovider-cli/ProviderPreWarmer.swift
- phase3-binary/Tests/macprovider-cliTests/ProviderPreWarmerTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-3 audit at codex to verify the N.1
closure didn't introduce further gaps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 6 round-3 closure-verification prompt

Round-2 audit (codex on 682abe8) returned 4 round-1 CLOSED but 1
MAJOR NEW (N.1) — vague markers 'incomplete metadata' and
'invalid or corrupted' could over-classify benign transients.
Round-2 fix-pass (dcd7f63) removed both vague markers and added
2 negative-lock tests pinning the benign-line behavior as
.transient.

Round 3 verifies N.1 closure + confirms the narrower marker
list still covers all SPEC-013 §5.4 FR-D.2 named integrity
examples.

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 7. This is the 11th audit round in
the SPEC-013 implementation cycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 6 round-3 LOCK — N.1 CLOSED / 0 new findings

Codex round-3 closure audit verified the N.1 fix landed clean
without losing SPEC-named integrity coverage. Step 6 is locked
at commit dcd7f63; ready to proceed to Step 7.

Round-3 verdict (specs/SPEC-013-impl-audit.md § Round 11):
- N.1 CLOSED
- 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new
- 299 tests passing (2 skipped integration-gated)
- 4 round-1 closure tests still pass under narrower list:
  - testPreWarmerClassifiesConfigurationFileMissingTokenizerAsIntegrity
  - testPreWarmerClassifiesInvalidJsonHeaderAsIntegrity
  - testPreWarmerClassifiesMixedCaseIntegrityCorrectly
  - testPreWarmerLoadDurationReflectsInjectedClock
- 2 round-2 negative-lock tests would fail under marker
  reintroduction:
  - testPreWarmerClassifiesIncompleteMetadataDownloadErrorAsTransient
  - testPreWarmerClassifiesCacheRebuildHintAsTransient

Codex's cross-checks against mlx-swift / mlx-swift-examples /
swift-transformers verified:
- Active MLX provider load path uses [load_safetensors] strings
  covered by the 2 remaining safetensors markers.
- swift-transformers WeightsError.invalidFile string 'The weights
  file is invalid or corrupted' exists but is NOT reached by the
  audited MLX provider load path — so the removal of the
  'invalid or corrupted' marker does not open a concrete gap.

Final integrity marker list (13 patterns):
- signature/hash/checksum class (7)
- missing-tokenizer class (4 including colon-bearing
  swift-transformers Hub variant)
- safetensors header/metadata anchored to [load_safetensors] (2)

Step 6 audit cycle complete after 3 rounds:
- R1 (e7bfab5): 1 CRITICAL + 1 MAJOR + 2 MINOR, FIX REQUIRED
- Fix1 (682abe8): closed all 4 (markers 8 → 15)
- R2 (682abe8): closures verified, 1 MAJOR NEW (N.1: vague
  markers FP risk)
- Fix2 (dcd7f63): removed vague markers (15 → 13) + 2 negative
  locks
- R3 (dcd7f63): N.1 CLOSED, READY TO PROCEED TO STEP 7

Next step: fire BUILD_SPEC_013_PROMPT.md Step 7 (Stage 1
feasibility iteration — the heaviest BUILD step, composes
Steps 1-6).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 7: add Stage 1 feasibility iterator

Build the dormant Stage 1 primitive so SPEC-013 can compose pre-warm, candidate probing, and trial persistence before runtime wiring in Step 10.

Constraint: Step 7 only; do not wire AutotuneCommand.run yet or touch specs, beta, coordinator, or gateway paths.

Rejected: internal size-based re-rank | violates FR-A.1 and AC-17 operator-order guarantee.

Confidence: high

Scope-risk: moderate

Directive: Keep Stage1Iterator candidate traversal in caller-supplied order; Step 10 should own tune_runs persistence when full runtime metadata exists.

Tested: swift test --package-path phase3-binary --filter Stage1IteratorTests; swift test --package-path phase3-binary

Not-tested: real cached MLX model Stage 1 probe outside integration-gated local environment

* audit(autotune): Step 7 implementation audit prompt

Adversarial code review prompt for the 98d7079 Step 7 commit
(Stage1Iterator + Stage1Prober + 3 DI protocols + 7 unit tests,
529 + 440 lines). Read-only review; codex will append findings
to specs/SPEC-013-impl-audit.md as round 12.

Step 7 is the heaviest BUILD step — composes Steps 1-6 into
the actual largest-first candidate iteration with SSE/TTFT
measurement. High-risk audit surfaces flagged:
- AC-17 regression: any internal sort/re-rank of candidates =
  CRITICAL (biggest-fit product strategy depends on this)
- FR-A.2 STOP-on-first-feasible: must not continue past
  feasible
- failureReasons.last semantics for FR-H.4 smallest-first:
  silently coupled to largest-first iteration order; could
  flip under operator-supplied smallest-first list
- SSE parsing: OpenAI-compat data: line handling, [DONE]
  terminator, multi-line events (uncommon)
- TTFT measurement: from request start to first content delta
- The two waitForReady(timeout: 0.05) mid-probe peeks for
  process-exit detection during the probe
- DI protocol cast in ProviderPreWarmer extension is leaky

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 7 round-1 audit findings (1 CRITICAL + 1 MAJOR + 4 MINOR)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
§ Round 12) returned 1 CRITICAL / 1 MAJOR / 4 MINOR / 0 QUESTION
on the 98d7079 Step 7 commit, verdict FIX REQUIRED. This commit
closes all 6 findings.

E.1 (CRITICAL) fix: failureReasons.last only surfaced the smallest
candidate under largest-first iteration. With operator override
["1b", "32b"] (the AC-17 shape) and both failing, .last returned
32b's reason — silent FR-A.4 / FR-H.4 violation. This was the
subtle design risk explicitly flagged in the audit prompt's
Category E.

Fix: added an orthogonal `candidatesBySize: [String]?` init
parameter (default: caller list reversed, preserving prior
behavior for largest-first default lists). The all-infeasible
throw now walks candidatesBySize smallest-first to find the first
failed candidate and surfaces its reason. The error's trials
list is also size-ordered (smallest-first) so the caller can
print reasons in FR-H.4's "even <smallest>: <reason>" format
followed by each larger candidate in size order.

New regression-lock test
testStage1IteratorAllInfeasibleWithOperatorOverrideSurfacesSmallestBySize
passes ["1b", "32b"] as both iteration AND size order, both fail,
asserts surfaced reason is "1b: 1b leaked stop token" (not
"32b OOM during probe"). Existing test
testStage1IteratorAllInfeasibleSurfacesSmallestFirstReason updated
to match the new contract.

F.1 (MAJOR) fix: integrity-abort path lost per-candidate trial
evidence. SPEC-013 §5.7 FR-G.1 requires "every trial is recorded"
including the security-relevant integrity failure that aborts the
whole run. The prior code threw before insertTrial.

Fix: pre-warm .failed(.integrity) branch now writes the trial row
with notes="pre-warm integrity: <reason>" BEFORE throwing
preWarmIntegrityFailure. Test testStage1IteratorAbortsOnIntegrity
flipped from asserting trialModels == [] to trialModels == ["model-a"].

H.1 (MINOR) fix: Stage 1 row kept=true conflicted with SPEC
FR-G.1 schema comment "Stage 2 only; 0 in Stage 1".
Stage1Iterator now always passes kept: false for Stage 1 rows;
Step 9 derives the chosen model from Stage1IteratorResult.selectedModel
rather than tune_trials.kept.

D.1 (MINOR) fix: SSE parser edge cases untested. Added 2 new
tests:
- testStage1ProberClassifiesHTTPNon2xxAsInfeasible: HTTP 503
  response → infeasible with "HTTP 503" in reason
- testStage1ProberAcceptsCompletionsStyleTextSSE: SSE stream
  using choices[0].text (completions API) instead of
  choices[0].delta.content (chat API) → feasible

The completions-style fixture also exercises SSE comment lines
(": keep-alive") and malformed JSON data: payloads ("not-valid-json"),
locking the parser's graceful-skip behavior.

K.1 (MINOR) fix: Stage 1 persistence fields partly asserted.
Added testStage1IteratorPersistsFullStage1FieldSet with a new
assertSingleTrialRow helper that reads every Stage 1-relevant
column and asserts: stage=1, runID, fits, nErr, kept=false,
replicatesN=stage1Replicates, maxContextCap=targetContext,
kvBits=nil, maxBatch=nil. A future regression writing stage=0 or
losing replicates_n would fail this test.

G.1 (MINOR) fix: ProviderPreWarmer DI cast leakiness. Added a
documentation block on extension ProviderPreWarmer:
Stage1PreWarming explaining the production-only cast (real
prewarmer requires CandidateProviderRunner for Step 6's
defer-stop pattern) and steering test authors to inject a fake
Stage1PreWarming directly rather than combining a real prewarmer
with a fake runner.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 310 tests, with 2 tests skipped and 0 failures
  (306 baseline + 4 new from this fix-pass)

Files:
- phase3-binary/Sources/macprovider-cli/Stage1Iterator.swift
- phase3-binary/Tests/macprovider-cliTests/Stage1IteratorTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 audit at codex to verify closures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 7 round-2 closure-verification prompt

Round-1 audit returned 1 CRITICAL + 1 MAJOR + 4 MINOR, verdict
FIX REQUIRED. Fix-pass (a9da9e5) closes all 6 with substantial
changes to the iterator core: new candidatesBySize parameter for
FR-H.4 size-ordered failure surface, integrity row recorded
before throw, Stage 1 kept=false always, 2 new SSE tests, 1 new
persistence-field test, 1 new doc block.

Round 2 verifies closures + spot-checks the candidatesBySize API
for new precision gaps (default fallback + operator override
interaction is the highest-risk new surface).

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 7 round-2 LOCK — 6 CLOSED / 1 MINOR forward-compat note

Codex round-2 closure audit verified all 6 round-1 fixes landed
clean. Step 7 is locked at commit a9da9e5; ready to proceed to
Step 8.

Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 13):
- 6 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED
- 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new (N.1)
- 310 tests passing (2 skipped integration-gated)
- AC-17 guard still meaningful: testStage1IteratorHonorsOperatorOrderForACSeventeen
  passes [1b, 32b] both feasible, asserts only 1b is selected

N.1 (MINOR) DEFERRED to Step 10: candidatesBySize is brittle
under operator override. The nil fallback (candidates.reversed())
is correct for largest-first defaults but WRONG if operator
passes smallest-first via --candidate-models without explicit
candidatesBySize. Codex correctly notes:
- Not a Step 7 blocker (Stage 1 is dormant; AutotuneCommand.run
  doesn't wire Stage1Iterator yet)
- Doc comment + regression-lock test mitigate
- Step 10 owns the wiring layer that must derive size order
  explicitly using AutotuneCommand.defaultCandidates' sizeB metadata
- Recommendation: Step 10 integration test with
  '--candidate-models 1b,32b' both failing, assert all-infeasible
  surface leads with 1b

Step 7 audit cycle complete after 2 rounds:
- R1 (98d7079): 1 CRITICAL (E.1) + 1 MAJOR (F.1) + 4 MINOR, FIX REQUIRED
- Fix (a9da9e5): closed all 6 + 4 new tests
- R2 (a9da9e5): 6 CLOSED, 1 MINOR forward-compat to Step 10,
  READY TO PROCEED TO STEP 8

Next step: fire BUILD_SPEC_013_PROMPT.md Step 8 (Stage 2 knob
hill-climb).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 8: add Stage 2 knob hill-climb

Implement the dormant SPEC-013 Stage 2 hill-climb so Step 10 can compose Stage 1 model selection with per-model knob selection.

Constraint: Step 8 only; do not wire Stage2HillClimb into AutotuneCommand.run yet and do not modify specs or sensitive service paths.

Rejected: wiring Stage 2 into the CLI runtime now | Step 10 owns Stage 1 + Stage 2 integration.

Confidence: high

Scope-risk: moderate

Directive: Preserve deterministic cell order and strict-all-feasible cell semantics when wiring this into the autotune runtime.

Tested: swift test --package-path phase3-binary --filter Stage2HillClimbTests; swift test --package-path phase3-binary

Not-tested: real provider Stage 2 hill-climb against mlx-swift hardware; integration tests remain gated by MACPROVIDER_INTEGRATION_TEST.

* audit(autotune): Step 8 implementation audit prompt

Adversarial code review prompt for the 118599e Step 8 commit
(Stage2HillClimb + Stage2Prober + WinningKnobs +
Stage2HillClimbResult + isNewBest + 7 unit tests). Read-only
review; codex will append findings to specs/SPEC-013-impl-audit.md
as round 14.

Step 8 is the second-heaviest BUILD step (520 + 334 lines).
Highest-risk audit surfaces:
- isNewBest port (Category A): must match prototype Python
  verbatim. Any subtle deviation in the tie-band logic could
  silently pick the wrong knob combo.
- Strict-all-feasible per cell (Category B, FR-B.2): if any
  replicate fails, the cell MUST be infeasible.
- Cartesian iteration determinism (Category C): operators must
  be able to replay runs.
- AutotuneTrialRow stage=2 + kvBits/maxBatch/maxContextCap
  populated correctly (Category D).
- kept= semantics — SPEC §5.7 schema comment said 'Stage 2
  only' so the winner cell may set kept=true (audit checks
  Step 8's interpretation).

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 9.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 8 round-1 audit findings (1 MAJOR + 1 MINOR)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
§ Round 14) returned 0 CRITICAL / 1 MAJOR / 1 MINOR / 0 QUESTION
on the 118599e Step 8 commit, verdict FIX REQUIRED. This commit
closes both findings. The Category A isNewBest port was verified
faithful to the prototype Python — no change there.

D.1 (MAJOR) fix: multiple Stage 2 rows had kept=true. The prior
code computed `kept` at insertion time based on isNewBest, which
left every cell that became "running best" with kept=1 in the DB.
SPEC-013 §5.7 FR-G.1 requires only the FINAL winning Stage 2 cell
to carry kept=1.

Implementation:
- Added AutotuneDB.markStage2WinnerCell(runID:kvBits:maxBatch:maxContextCap:)
  — an idempotent SQL UPDATE matching the (run_id, stage=2, knobs)
  compound. The kv_bits NULL/non-NULL is handled at SQL-clause
  level since SQLite's `= NULL` doesn't match. Uses the existing
  withStatement/bind helpers, no new C-interop.
- Stage2HillClimb.run() now inserts every row with kept=false
  during the loop, tracks the winning row's array index via
  bestRowIndex, and after the loop calls markStage2WinnerCell
  with the winner's knobs + mutates trialRows[bestRowIndex].kept
  in memory so cellTrials is consistent.

Test updates:
- testStage2HillClimbAppliesIsNewBestThroughputPrimary: flipped
  [true, true] → [false, true]
- testStage2HillClimbAppliesIsNewBestTTFTTiebreak: same flip
- testStage2HillClimbPersistsAllCellTrialsWithStageTwo: now
  explicitly asserts rows.map(\.kept) == [false, true, false,
  false] AND rows.filter(\.kept).count == 1 as a regression-lock
  on the exactly-one-winner invariant

J.1 (MINOR) fix: direct isNewBest edge-branch tests missing.
Added 5 new tests targeting the prototype's documented but
untested branches:
- testIsNewBestAcceptsPositiveTPSWhenBestTPSIsZero — bestTPS=0
  edge with positive new TPS → true
- testIsNewBestRejectsZeroTPSWhenBestTPSIsZero — zero-vs-zero
  edge → false (catches a potential div-by-zero crash if the
  bestTPS<=0 guard ever drifts)
- testIsNewBestWinsTieBandWhenBestTTFTIsNil — tie-band with
  bestTTFT=nil but new ttft measurable → true (matches Python
  `ttft is not None and (best_ttft is None or ttft < best_ttft)`)
- testIsNewBestHoldsWhenBothTTFTsAreNilInTieBand — symmetric
  unmeasurable → false (incumbent holds)
- testIsNewBestHoldsWhenNewTTFTIsNilInTieBand — asymmetric (new
  unmeasurable, best measurable) → false (incumbent holds)

AutotuneDB additive method note: Step 3 is locked at d0029e9,
but adding markStage2WinnerCell is purely additive — same
pattern Step 5 used to extract PortProbe.swift from Step 4. The
new method follows the existing insertTrial / insertRun /
applyRetentionInTransaction idioms (withStatement + bind, throws
on SQLite errors).

Anti-regression:
  swift test --package-path phase3-binary
  Executed 322 tests, with 2 tests skipped and 0 failures
  (317 baseline + 5 new from this fix-pass)

Files:
- phase3-binary/Sources/macprovider-cli/AutotuneDB.swift (+44 lines)
- phase3-binary/Sources/macprovider-cli/Stage2HillClimb.swift
- phase3-binary/Tests/macprovider-cliTests/Stage2HillClimbTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 audit at codex to verify closures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 8 round-2 closure-verification prompt

Round-1 audit returned 1 MAJOR (D.1) + 1 MINOR (J.1), verdict
FIX REQUIRED. Fix-pass (022e8a3) closes both: D.1 via new
AutotuneDB.markStage2WinnerCell + post-loop UPDATE pattern; J.1
via 5 new direct isNewBest edge-branch tests.

Round 2 verifies closures + spot-checks the additive
markStage2WinnerCell SQL (highest-risk new surface: NULL handling
in the kv_bits clause, idempotency).

Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR
before proceeding to Step 9 (the recommendation surface).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 8 round-2 LOCK — 2 CLOSED / 0 NEW

Codex round-2 closure verification on commit 022e8a3 returned
0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new across
the 2 round-1 findings (D.1 + J.1), verdict READY TO PROCEED
TO STEP 9.

Round 14 → Round 15 closure summary:

D.1 (MAJOR, multiple kept=true Stage 2 rows) — CLOSED. Verified
that Stage2HillClimb.run() now inserts every cellTrial row with
kept=false and only after the Cartesian product completes calls
markStage2WinnerCell on the final winner. The no-feasible-cell
edge throws .noFeasibleCell before reaching the marker call
(Stage2HillClimb.swift:165-179), so no winner row is fabricated.

The new AutotuneDB.markStage2WinnerCell SQL was the highest-risk
new surface — verified:
- kv_bits NULL handling uses IS NULL (not = NULL, which matches
  nothing in SQLite)
- Bind arithmetic correct in both branches: nil binds run_id at
  1, skips kv_bits, max_batch at 2, max_context_cap at 3;
  non-nil binds run_id at 1, kv_bits at 2, max_batch at 3,
  max_context_cap at 4
- No SQL injection — only constant string interpolation is the
  kv_bits clause choice; all operator data bound via the
  existing withStatement/bind helpers
- Idempotent by SQL semantics

J.1 (MINOR, isNewBest edge-branch tests missing) — CLOSED. The
five new direct tests target the bestTPS<=0 boundary and the
nil-TTFT tie-band branches at Stage2HillClimb.isNewBest:196-225.
Non-tautological — each would fail if its named branch broke.

Step 3 (AutotuneDB) extension is verified additive: same
pattern as the Step 5 PortProbe extraction. git diff
022e8a3^ 022e8a3 shows only Step 8 files + audit report; no
existing AutotuneDB method signature changed.

Anti-regression: swift test --package-path phase3-binary
executed 322 tests, skipped 2 integration-gated, 0 failures.

Loop state:

  Step | Build   | R1            | Fix     | R2 | Status
  -----|---------|---------------|---------|----|---------
   8   | 118599e | 1M (D.1)      | 022e8a3 | 0  | LOCKED
                | + 1m (J.1)    |         |    |

Next: fire Step 9 BUILD (recommendation surface + --json
schema + recipe_hash + --apply config write).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(autotune) Step 9: add recommendation surface

Build the standalone SPEC-013 recommendation, JSON, recipe-hash, and config-apply surfaces so Step 10 can wire runtime orchestration without inventing output or write semantics.

Constraint: Steps 1-8 are locked; Step 9 must not wire AutotuneCommand.run() or edit specs/beta/coordinator/gateway paths.

Rejected: JSONEncoder-only recipe hashing | RFC 8785 JCS requires canonical key ordering and whitespace-free bytes independent of pretty JSON output.

Rejected: Yams emit for config writes | Yams 5.1 drops comments/order, so the applier validates with Yams then rewrites only SPEC-013 top-level owned keys.

Confidence: high

Scope-risk: moderate

Directive: Step 10 should call these components but keep recipe_hash persistence and run wiring separate from the pure emitter contract.

Tested: swift test --package-path phase3-binary --filter 'RecommendationEmitterTests|ConfigApplierTests' (22/22 passed)

Tested: swift test --package-path phase3-binary (344 tests, 2 skipped, 0 failures)

Not-tested: Real ~/.config/macprovider/config.yaml writes; ConfigApplier tests use temp directories only.

* audit(autotune): Step 9 implementation audit prompt

Round 1 of Step 9 audit covering the 292b2f9 commit
(RecommendationEmitter + RFC8785JCS + ConfigApplier).

The 14 audit categories (A-N) focus on:
- JCS encoder correctness — 81-LOC vendor implementation
  must produce canonical bytes matching an independent
  reference (AC-12 property 2 cross-impl determinism). The
  prompt includes an exact reference JSON vector + asks
  the auditor to cross-check via printf|shasum.
- recipe_hash domain isolation — exact 7 fields, no leakage
  of observation fields. Two CRITICAL-grade tripwires
  documented (B.2).
- ConfigApplier custom YAML rewriter — Yams emit was
  REJECTED in the build commit; the custom rewriter is
  load-bearing for AC-9 "non-owned keys byte-identical
  pre/post" assertion.
- kv_bits = nil propagation — 6 surfaces that MUST agree
  (terminal "unset", serve omits flag, JSON null, JCS
  null, YAML key omitted, summary "unset").
- Atomic write — POSIX rename in same directory, NOT
  FileManager.replaceItem.
- Backup counter — 0 through 65535, no overwrite, no TOCTOU race.

Forward-compat seams (Category M) verify Step 10 can wire
the emitter without touching the LOCKED surface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(autotune): close Step 9 round-1 audit findings (4 MINOR + 1 QUESTION)

Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md
Round 16) returned 0 CRITICAL / 0 MAJOR / 4 MINOR / 1 QUESTION on
the 292b2f9 Step 9 commit, verdict READY TO PROCEED TO STEP 10.
The independent printf|shasum cross-check on the reference vector
matched the baked test literal, locking JCS cross-implementation
determinism. This commit closes all 5 findings to satisfy the
"until 0 findings" loop discipline.

B.1 (QUESTION) — nil-recommendation recipe_hash policy. The
audit asked Step 10 to decide whether the degenerate
model=null+knobs=null hash gets persisted or NULL is stored.
Pre-committed the cleaner contract here: EmittedRecommendation.recipeHash
is now String?; when inputs.recommendation == nil the emitter
returns nil and the JSON output emits literal null. This matches
the tune_runs.recipe_hash schema comment "NULL if no
recommendation" and avoids a JSON/DB asymmetry. New test
testRecipeHashIsNilWhenRecommendationIsNil locks both invariants.

E.1 (MINOR) — JSON schema test only spot-checks nested objects.
Rewrote testJSONOutputMatchesSpec013Schema to walk every
documented nested key, assert primitive types, and use
Set(dict.keys) == <expected> as an additive-only
regression-lock. Now covers machine.{ram_gb, chip, os_version,
binary_version}, inputs.{target_context, candidate_models,
stage1_replicates, stage2_replicates, gate_ttft_ms,
tps_tie_epsilon}, recommendation.{model, target_context, knobs,
tps_median, ttft_p95_ms, replicates, serve_command},
recommendation.knobs.{kv_bits, max_concurrency_override,
max_context_override}, alternates string array, and
infeasible[].{model, rank, reason}. ISO-8601 timestamp shape is
asserted via regex.

G.1 (MINOR) — TOCTOU race between fileExists and rename. The
prior firstAvailableBackupPath + atomicWrite pattern could
overwrite a concurrently-created backup. Replaced with
writeBackupExclusively which opens each candidate counter with
O_CREAT | O_EXCL | O_WRONLY; on EEXIST retries the next counter,
on success writes the original bytes directly to the
exclusively-opened FD. The config write still uses temp+rename
per FR-F.3. New error case backupWriteFailed reports non-EEXIST
open or write failures. New test
testApplyBackupUsesExclusiveCreateAgainstTOCTOURace pre-creates
backups at counters 0-3, applies, and asserts (a) new backup
lands at counter 4, (b) all 4 pre-existing files retain their
byte content unchanged.

I.1 (MINOR) — AC-9 preservation tests too weak. Added two
stronger tests:
- testApplyPreservesNonOwnedLinesByteIdentically uses a fixture
  with leading/trailing/block/inline comments, blank lines, and
  ordering sentinels; splits pre and post into lines, filters
  out the 4 owned keys, and asserts the remaining line arrays
  are byte-identical (not just substring-contains).
- testApplyResultIsParseableByConfigLoader runs the post-apply
  file through the binary's MacProviderCore.ConfigLoader.load
  and asserts loaded.{model, kvBitsOverride, maxContextOverride,
  maxConcurrencyOverride} resolve to the recommendation values.
  Catches future parser/key-name drift the spot-check tests
  would miss.

K.1 (MINOR) — launchd hint test only checks bootout/bootstrap.
Renamed testLaunchdRestartHintIncludesBootoutAndBootstrap to
testLaunchdRestartHintIncludesAllRequiredSubstrings and added
assertions for the plist path
(~/Library/LaunchAgents/live.streamvc.macprovider.plist) and
the service identifier (gui/$UID/live.streamvc.macprovider).

testApplyWriteIsAtomic adjustment: the test now asserts
tempPaths.count == 1 (was 2) since the backup write no longer
uses the temp+rename path. The config write still uses
temp+rename and the spy tempFileNamer exercises it.

Anti-regression:
  swift test --package-path phase3-binary
  Executed 348 tests, with 2 tests skipped and 0 failures
  (344 baseline + 4 new from this fix-pass)

Files:
- phase3-binary/Sources/macprovider-cli/RecommendationEmitter.swift
- phase3-binary/Sources/macprovider-cli/ConfigApplier.swift
- phase3-binary/Tests/macprovider-cliTests/RecommendationEmitterTests.swift
- phase3-binary/Tests/macprovider-cliTests/ConfigApplierTests.swift
- phase3-binary/implementation-notes.html

Next step: fire round-2 closure verification audit at codex.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 9 round-2 closure-verification prompt

Round 2 verifies the 4 MINOR + 1 QUESTION closures from Round 16, with two highest-risk surfaces: writeBackupExclusively (POSIX O_EXCL + manual write loop) and Optional recipeHash JSON encoding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* audit(autotune): Step 9 round-2 LOCK — 5 CLOSED / 0 NEW

Codex round-2 closure verification on commit d6c634c returned
0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new across
the 5 round-1 findings (1 QUESTION + 4 MINOR), verdict READY TO
PROCEED TO STEP 10.

Round 16 -> Round 17 closure summary:

B.1 (QUESTION, nil-recommendation recipe_hash policy) — CLOSED.
Verified that EmittedRecommendation.recipeHash is now String?,
the emitter's recipeHash(_:) returns nil before computing the
JCS hash when no recommendation exists, and JSONRoot.encode(to:)
calls encodeNil for the recipe_hash key (producing literal JSON
null, not the string "null"). The Step 10 tune_runs persistence
contract is now unambiguous: store NULL when recipeHash is nil.

E.1 (MINOR, JSON schema test too weak) — CLOSED. The rewritten
testJSONOutputMatchesSpec013Schema now asserts every documented
nested key with a primitive type assertion AND
Set(dict.keys) == <expected> for machine/inputs/recommendation/
recommendation.knobs/infeasible[0]. ISO-8601 regex is asserted
on both timestamps. Catches future undocumented field additions.

G.1 (MINOR, TOCTOU race) — CLOSED. writeBackupExclusively
correctly uses open(O_CREAT | O_EXCL | O_WRONLY, 0o644), retries
EEXIST, throws backupWriteFailed on other errno. writeAll's
pointer arithmetic (base.advanced(by: written), data.count -
written) is correct for partial writes; EINTR retry is in
place. The n==0 edge case is acknowledged as benign for a
blocking regular-file fd (no v1 risk). The
testApplyBackupUsesExclusiveCreateAgainstTOCTOURace test
pre-creates counters 0-3, verifies the new backup lands at
counter 4, and asserts all pre-existing files retain byte
content unchanged.

I.1 (MINOR, AC-9 preservation too weak) — CLOSED.
testApplyPreservesNonOwnedLinesByteIdentically uses a fixture
with comments/blanks/inline-comment/SPEC-013 marker, and
asserts non-owned line arrays are byte-identical pre/post.
testApplyResultIsParseableByConfigLoader runs the post-apply
file through MacProviderCore.ConfigLoader.load and asserts
the 4 owned values resolve to recommendation values — catches
future parser/key-name drift.

K.1 (MINOR, launchd hint too narrow) — CLOSED. Renamed
testLaunchdRestartHintIncludesAllRequiredSubstrings asserts
launchctl bootout + bootstrap + plist path + service id.

Anti-regression: swift test --package-path phase3-binary
executed 348 tests, skipped 2 integration-gated, 0 failures.

Loop state:

  Step | Build   | R1                | Fix     | R2 | Status
  -----|---------|-------------------|---------|----|---------
   9   | 292b2f9 | 4m + 1 QUESTION   | d6c634c | 0  | LOCKED

Next: fire Step 10 BUILD (failure modes + signal handling +
AutotuneCommand.run() wiring + tune_runs persistence +
size-ordered candidatesBySize for Stage 1)…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant