feat(provider): expose --kv-bits / --max-context / --max-batch serve knobs for autoresearch#105
Merged
Merged
Conversation
…knobs for autoresearch Widens the autoresearch search space for beta/autotune.py (PR #103) beyond just --model. All three knobs are real downstream wiring into mlx-swift 2.29.1, not just CLI cosmetics. - --kv-bits {4,8}: forwarded to MLXLMCommon.GenerateParameters.kvBits (both complete + stream call sites); preflight rejects anything but 4/8. - --max-context <N>: extends the existing per-tier maxContextTokens cap; tokens are still rejected at the existing context_length_exceeded 413 boundary, and we additionally pass maxKVSize=maxContextTokens to GenerateParameters so the KV cache (RotatingKVCache) honors the cap. - --max-batch <N> (default 1, prior single-slot behavior preserved): lifts the previously-hardcoded AsyncSemaphore(value: 1) inside ModelRuntime to be configurable. Reuses the existing maxConcurrencyOverride config field that was already plumbed from YAML/env but never wired to the CLI or runtime. All knobs are triple-exposed (CLI > env > YAML > default), matching the house convention. Preflight (runServingKnobsPreflight) fails loud at serve start instead of mid-inference on invalid values. A bug fix is folded in: ServeCommand.run() was hardcoding maxConcurrencyOverride: 1 when building ProviderCapacity, silently ignoring the resolved config. The capacity now reflects --max-batch. Tests: ServingKnobsConfigTests.swift adds 21 cases covering config-resolution precedence (CLI > env > YAML), defaults preserved, preflight rejection of invalid values, runtime threading of all three knobs, and a regression on the context_length_exceeded gate. Total suite: 219 -> 240, all passing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11
added a commit
that referenced
this pull request
Jun 18, 2026
…ch axes PR #105 exposed --kv-bits / --max-context / --max-batch as macprovider-cli serve flags. Wire them into autotune.py as three OPTIONAL search axes (--kv-bits-options / --max-context-options / --max-batch-options). Each defaults to [None], so omitting all three preserves the original model x context candidate space exactly (the original 6-trial 8GB Air run still produces an identical 6-trial plan). When set, candidates are the full cartesian, with the chosen knobs passed through to start_provider and recorded in three additively-migrated tune_trials columns (kv_bits, max_context_cap, max_batch). Legacy rows keep NULL in the new columns; existing reports remain readable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6 tasks
Augustas11
added a commit
that referenced
this pull request
Jun 18, 2026
* spec(cli): SPEC-013 v0.1 — macprovider-cli autotune subcommand Initial draft of the autotune subcommand spec + the round-1 codex audit prompt. NOT for merge — this commit lives on the feature branch only and the PR is held until the codex audit loop converges. SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context, --max-batch) in a two-stage pipeline that encodes the "biggest-fit, not max-tps" product strategy. Stage 1 iterates a curated largest-first candidate list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the chosen model. This is the load-bearing departure from the PR #103 Python prototype (whose cartesian max-tps loop would push every capable Mac to serve the smallest model). Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates, stage2_replicates, kv-bits axis-vs-default) are flagged as Open Questions pending the in-flight air5 n=3 replication run; v0.2 either confirms placeholders or sends a narrow PR adjusting them. Files: - specs/SPEC-013-cli-autotune.md (new, v0.1 draft) - specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt) - specs/README.md (+1 row in the index table) Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * spec(cli): SPEC-013 v0.2 — round-1 codex audit response Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs, and both QUESTIONs. The product framing (biggest-fit, not max-tps) and two-stage architecture are unchanged — round 1 explicitly preserved both. MAJORs closed: - A.1 fallback contradiction: replaced metrics-bearing `fallbacks` with NAME-ONLY `alternates` (the STOP-on-first-feasible rule meant smaller candidates were never probed; v0.1's fallback metrics were structurally impossible). - D.1 `models pull` precondition was bigger than admitted: FR-D reframed as "weights cache-warm before probe; load-fetch latency excluded from gate-ttft-ms" with Shape A (explicit pull) vs Shape B (rely on runtime online-fallback + isolate measurement) implementation choice. No longer depends on a not-yet-existing subcommand. - E.1 launchd label wrong: `com.macprovider.cli` → `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5, install.sh, plist template). Drain sequence bound to `launchctl bootout/bootstrap gui/$UID/...`. - F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` / `max_batch` were the CLI flag names; actual YAML keys per Config.swift:239-241 are `max_context_override` / `max_concurrency_override`. JSON `knobs` object now uses YAML key names for round-trip into config.yaml; `serve_command` retains CLI flag names for shell paste. - F.2 recipe_hash not deterministic: pinned to `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization + explicit hash input domain enumeration (machine + inputs + recommendation.model + recommendation.knobs; excludes run_id, timestamps, observed metrics). - G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new inserts MUST set stage=1 or stage=2 explicitly. - J.1 no AC for operator-supplied order: added AC-17 with `--candidate-models 1B,32B` on a Mac where both fit — must pick 1B because operator order is the contract. MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI summary kv-bits default), F.3 (backup naming collision-safe), G.2 (transactional retention), H.1 (--resume removed from §7), J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum), K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration note), M.1 (cross-spec renumber to SPEC-014). QUESTIONs resolved (2): D.2 (signature vs network failure now asymmetric — integrity aborts whole run, transient advances), K.2 (added OQ-E flagging thermal/order bias with quantitative threshold). Deferred to post-lock: M.2 documentation checklist (decision-log entry, SPEC-003 install note, PR #103 disposition) — captured as a §11 checklist but not in the binding contract. Files: - specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines) - specs/SPEC-013-audit.md (NEW, codex round-1 output) - specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt) Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the round-2 closure check, address any new findings, repeat until LOCK READY, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * spec(cli): SPEC-013 v0.3 — round-2 audit response (LOCK candidate) Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex recommended a narrow v0.3 closing the 4 new findings before implementation. v0.3 closes all 4. No architecture change. Round-2 closures: - N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D rewrite permitted Shape B (rely on runtime online-fallback + measurement isolation) but NFR-4's egress exception and AC-8 still spoke only of `models pull`. v0.3 reworords NFR-4 to admit both Shape A and Shape B HuggingFace pre-warm paths; AC-8 is now shape-neutral with explicit Shape A (mocked pull exit non-zero) and Shape B (block egress + runtime fallback fails during load) variants. A new sub-variant explicitly tests the FR-D.2 integrity-class abort path. - Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2 put the parse rules in non-normative §7. v0.3 lifts them into FR-B.1 as a normative paragraph (absolute caps, sorted ascending after parse, each cell >= --target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell). The §7 / §5 conflict-resolution rule is now stated explicitly. - N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling protocol: v0.3 adds a 10-paired-runs forward/reverse protocol with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger threshold. Operators can close OQ-E without relitigating methodology. - O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four discrete sites — `tune_runs.spec_version` SQL comment, FR-H.2 "v0.1 normative contract" prose, NFR-3 stale `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2" disclaimer. Files: - specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate) - specs/SPEC-013-audit.md (codex round-2 output landed) - specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3 closure-confirmation audit prompt) Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the round-3 LOCK-confirmation check. Expected outcome: LOCK with 0 new findings or ≤1 MINOR. If LOCK, push the branch and open the DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * spec(cli): SPEC-013 v0.3 LOCK — round-3 codex confirmation + O-V03.1 fold-in Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new. The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON example still showed "SPEC-013 v0.2" inside a v0.3 document. Codex explicitly said this does not block LOCK (the adjacent SQL comment already stated writers emit their own producing version), but recommended folding the fix in before implementation. This commit folds it: the JSON example now uses "SPEC-013 v<producing-version>" as a placeholder, and the spec_version bullet teaches the rule. Round-3 closures (from specs/SPEC-013-audit.md § Round 3): - N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or equivalent) and Shape B (runtime online fallback during model load) HuggingFace pre-warm paths; carve-out scoped to autotune runs and weight fetches; AC-8 shape-neutral with explicit Shape A + Shape B + integrity-class variants. - Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from non-normative §7 into binding FR-B.1 (absolute caps, sorted ascending, ≥ target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell); §7 vs §5 conflict-resolution rule explicit. - N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable 10-paired-runs forward/reverse sampling protocol on air5 with 60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger. - O.1 CLOSED: all 4 named drift sites updated (tune_runs SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7 disclaimer). Specs index updated: specs/README.md row for SPEC-013 now reads v0.3. Files: - specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in) - specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed) - specs/README.md (SPEC-013 row → v0.3) Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR + 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed from round 2) → LOCK. Next step: push branch + open DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * spec(cli): SPEC-013 BUILD prompt — Option A Swift-native impl plan Adds the operator-paste BUILD prompt that a fresh Codex CLI session uses to implement SPEC-013 v0.3 against the existing phase3-binary/ Swift package. The BUILD prompt picks Option A (Swift-native subcommand inside macprovider-cli) per SPEC-013 §10, rationale captured inline: single-binary install consistency with SPEC-003, drain semantics match existing patterns (UninstallCommand / SelfUpdate), future SPEC-011 warm-swap integration needs Swift-native. Shape A vs Shape B for FR-D pre-warm is left as the implementer's call — the binding contract is FR-D.1's measurement-isolation requirement, not the mechanism. The 11-step build sequence: 1. AutotuneCommand subcommand scaffolding + --dry-run 2. --no-join flag on ServeCommand (FR-E.2 precondition) 3. SQLite schema + DB layer (tune_trials + tune_runs + migration + transactional retention) 4. Provider lifecycle (start/stop/wait-ready, single-provider invariant) 5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap) 6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient classification per FR-D.2) 7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible, AC-17 biggest-fit guard) 8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from prototype) 9. Recommendation surface (FR-F: terminal block + JSON schema + RFC 8785 JCS recipe_hash + --apply atomic write) 10. Failure modes + signal handling (FR-H, exit_reason enum) 11. Acceptance test suite (AC-1 through AC-19) Branch strategy: build work happens on feat/cli-autotune-impl stacked off spec/cli-autotune-v1, so the implementing PR rebases cleanly onto main after the SPEC PR (#108) merges. Hard rules: do not modify any file under specs/ from the build branch (it's downstream of the SPEC PR); do not touch beta/coordinator/gateway; do not pivot to Option B without an Open Question; preserve the biggest-fit objective (AC-17 catches this if forgotten). The prompt is self-contained: severity definitions, required reading, step sequence, acceptance gate, hard rules, anti-rules, operator checkpoint cadence, and an Open Question template. Files: - specs/BUILD_SPEC_013_PROMPT.md (NEW) Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1 and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected wall-clock: 1-2 weeks of session work for one new subcommand inside an existing binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11
added a commit
that referenced
this pull request
Jun 18, 2026
…BUILD steps, 21 audit rounds) (#109) * feat(autotune) Step 1: subcommand scaffolding Add the Swift-native autotune parser surface so subsequent SPEC-013 steps can build on a verified CLI entrypoint. Constraint: SPEC-013 v0.3 is locked; build work is limited to phase3-binary and Option A Swift-native. Rejected: Implementing provider lifecycle or SQLite in this checkpoint | Step 1 only requires parser registration and dry-run proof. Confidence: high Scope-risk: narrow Directive: Keep Stage 1 model selection biggest-fit and operator-order preserving; do not convert it into cross-model max-tps scoring. Tested: swift build --package-path phase3-binary; swift test --package-path phase3-binary --filter AutotuneCommandTests; swift test --package-path phase3-binary; macprovider-cli autotune --help; macprovider-cli autotune --dry-run; git diff --check Not-tested: Real autotune execution, SQLite persistence, provider lifecycle, serve --no-join, and apply/restart behavior are intentionally deferred to later steps. * audit(autotune): Step 1 implementation audit prompt Adversarial code review prompt for the facbaef Step 1 commit (AutotuneCommand parser scaffolding + --dry-run). Read-only review; codex will write findings to specs/SPEC-013-impl-audit.md. Per [[feedback-build-audit-loop]]: BUILD work follows the same audit-loop discipline as SPEC work. This is round 1 of the Step 1 audit cycle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 1 round-1 audit findings (A.4 MAJOR + B.2/B.5/C.2 MINORs) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md) returned 0 CRITICAL / 1 MAJOR / 3 MINOR / 0 QUESTION on the facbaef Step 1 commit, verdict FIX REQUIRED. This commit closes all 4 findings. A.4 (MAJOR) fix: `parseCSV` used `split(separator: ",")` + filter- empty, which silently dropped empty cells in `--max-context-axis 4000,,8000` instead of throwing — violating the SPEC-013 v0.3 round-2 Z-B.1 closure that requires invalid cells to fail at flag-parse time. Added `parseCSVStrict(_:flag:)` that preserves empty subsequences and throws ValidationError with the offending flag name. The empty-default case (`--max-context-axis ""` → `[--target-context]`) is preserved by checking the raw string for emptiness before strict parse. B.5 (MINOR) fix: same parseCSV-drops-empty bug class affected `--candidate-models`; an operator typo `one,,two` parsed as 2 candidates instead of throwing. The explicit-candidates path now uses parseCSVStrict too. parseKvBitsAxis and parsePositiveIntAxis also moved to the strict helper for consistency. B.2 (MINOR) fix: removed the redundant `try validateBasicInputs()` call from `run()`. ArgumentParser already invokes the `validate()` protocol method at parse time. Added `_ = try candidatePlan()` to `validate()` so explicit-list parse errors surface at flag-parse time, not deferred. C.2 (MINOR) fix: added 8 new tests to AutotuneCommandTests pinning the Step 1 parser contracts that should have been frozen in the Step 1 commit: - testMaxContextAxisRejectsEmptyCell (A.4 regression lock) - testMaxContextAxisSortsAscending (FR-B.1 sort rule) - testMaxContextAxisRejectsDuplicates (FR-B.1 dedup rule) - testMaxContextAxisEmptyDefaultMapsToTargetContext - testCandidateModelsRejectsEmptyCell (B.5 regression lock) - testExplicitCandidateOrderPreservesSmallFirst (AC-17 prep; fails under any future parser-level re-rank) - testDryRunLinesContainCandidatePlanInOrder (dry-run output contract; uses the new testable dryRunLines helper) - testRestartForegroundFlagParses Refactor: extracted `dryRunLines(_:)` from `printDryRun(_:)` so tests can assert on candidate order without capturing stdout. Used by tests and by future JSON/report serializers in Steps 9+. Anti-regression: swift test --package-path phase3-binary Executed 253 tests, with 0 failures (up from 245 baseline) swift build --package-path phase3-binary Build complete (12.34s) Smoke checks (the failures codex flagged): $ macprovider-cli autotune --max-context-axis 4000,,8000 --dry-run Error: --max-context-axis contains an empty cell; check for stray commas exit 64 $ macprovider-cli autotune --candidate-models one,,two --dry-run Error: --candidate-models contains an empty cell; check for stray commas exit 64 Files: - phase3-binary/Sources/macprovider-cli/AutotuneCommand.swift - phase3-binary/Tests/macprovider-cliTests/AutotuneCommandTests.swift - phase3-binary/implementation-notes.html - specs/SPEC-013-impl-audit.md (codex round-1 output landed) Next step: fire round-2 audit at codex to verify closures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 1 round-2 closure-verification prompt Round-1 audit (codex on facbaef) returned 1 MAJOR + 3 MINOR, verdict FIX REQUIRED. Audit-response landed as 02b038d closing all 4. Round 2 verifies the closures and spot-checks the fix-pass for new gaps. Output appends to specs/SPEC-013-impl-audit.md. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 2. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 1 round-2 LOCK — 4 CLOSED / 0 new findings Codex round-2 closure audit verified all 4 round-1 fixes landed clean and the fix-pass (02b038d) introduced no new gaps. Step 1 is locked at commit 02b038d; ready to proceed to Step 2. Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 2): - 4 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED - 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new - 253 tests passing (245 baseline + 8 added in fix-pass) - Smoke checks confirm: - --max-context-axis 4000,,8000 → exit 64 (was: silently accepted) - --candidate-models one,,two → exit 64 (was: silently accepted) - --max-context-axis '' → exit 0, maps to [target-context] Step 1 audit cycle complete after 2 rounds: - R1 (on facbaef): 1 MAJOR + 3 MINOR, FIX REQUIRED - Fix-pass (02b038d): closed all 4 - R2 (on 02b038d): 0 findings, READY TO PROCEED TO STEP 2 Next step: fire BUILD_SPEC_013_PROMPT.md Step 2 (--no-join on ServeCommand) at codex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 2: add serve --no-join Add a ServeCommand no-join guard so autotune candidate providers can expose local HTTP without creating a coordinator WebSocket client. Constraint: Step 2 only; no specs, beta, coordinator, or gateway files changed. Rejected: Mutating resolved coordinator_url | no-join is runtime join suppression and should not rewrite config visibility. Confidence: high Scope-risk: narrow Directive: Keep the default serve path on the coordinator factory branch so absent --no-join behavior remains unchanged. Tested: swift test --package-path phase3-binary --filter ServeCommandTests; swift test --package-path phase3-binary; swift run --package-path phase3-binary macprovider-cli serve --help | rg -- "--no-join|USAGE|OPTIONS"; git diff --check Not-tested: Live model-serving smoke with real weights, because package tests cover the coordinator suppression path without loading a model. * audit(autotune): Step 2 implementation audit prompt Adversarial code review prompt for the ffb00fb Step 2 commit (--no-join flag on ServeCommand + testable factory guard + optional-propagation cascade). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 3. Step 2 anti-regression invariant: with --no-join ABSENT (the default), every existing serve behavior MUST be byte-identical to pre-Step-2. The optional-propagation discipline (every former coordinatorClient.X call now ?.X) is the highest-risk surface; audit explicitly checks for any surviving direct call. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 3. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 2 LOCK — clean pass, 0 findings Codex round-1 audit on the Step 2 commit (ffb00fb) returned a clean pass: 0 CRITICAL / 0 MAJOR / 0 MINOR / 0 QUESTION. Step 2 is locked at commit ffb00fb without a fix-pass. Audit verdict (specs/SPEC-013-impl-audit.md § Round 3): - 0 findings across all categories (A through E + O) - 256 tests passing (253 baseline + 3 added by Step 2) - Optional-propagation grep confirmed clean: 0 direct coordinatorClient.X calls survived the diff - serve --help still exposes PR #105 flags + SPEC-011 flags + the new --no-join - serve --no-join --enable-warm-swap parses (no conflict) Step 2 audit cycle complete in 1 round (vs Step 1's 2 rounds). The factory-closure DI pattern is a cleaner shape than sequential mutation; the tests prove the no-instantiation guarantee at the type-system level. Next step: fire BUILD_SPEC_013_PROMPT.md Step 3 (SQLite schema + DB layer) at codex. Step 3 is the heaviest implementation step so far (FR-G.1 tune_trials schema + FR-G.2 tune_runs schema + ALTER TABLE migration + transactional retention sweep). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 3: add SQLite DB layer Implements the additive AutotuneDB writer surface for SPEC-013 run/trial persistence while keeping AutotuneCommand runtime unwired until later steps. Constraint: Step 3 only; no specs, beta, phase4-coordinator, or phase5-gateway changes; no new top-level dependency. Rejected: Adding a third-party SQLite wrapper | system SQLite3 is available via Swift C interop and preserves the single-binary dependency posture. Confidence: high Scope-risk: narrow Directive: Later autotune runtime steps should call insertRun before applyRetentionInTransaction and must set tune_trials.stage explicitly for every trial. Tested: swift test --filter AutotuneDBTests; swift test (260 tests, 0 failures). Not-tested: Real autotune runtime writes, because Step 3 intentionally does not wire AutotuneDB into AutotuneCommand.run(). * audit(autotune): Step 3 implementation audit prompt Adversarial code review prompt for the d0029e9 Step 3 commit (AutotuneDB SQLite layer + AutotuneExitReason enum + AutotuneTrialRow/AutotuneRunRow value types + ALTER TABLE migration + transactional retention sweep + 4 unit tests). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 4. Step 3 is the heaviest implementation step so far. High-risk surfaces flagged in the audit prompt: - Schema column-by-column match against SPEC-013 §5.7 FR-G.1/G.2 - Application-layer enum match against the 9 normative exit_reason values - Transactional retention atomicity (BEGIN IMMEDIATE → DELETE trials → DELETE runs → COMMIT; ROLLBACK on error) - C-interop discipline (statement finalize, handle close on all paths including error) - SQL injection guards (interpolated values must be Swift Int) - Anti-regression on the 256 baseline (commit claims 260 with 4 new tests) Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 4. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 3 LOCK — clean pass, 0 findings Codex round-1 audit on the Step 3 commit (d0029e9) returned a clean pass across all categories: 0 CRITICAL / 0 MAJOR / 0 MINOR / 0 QUESTION. Step 3 is locked at commit d0029e9 without a fix-pass — the second consecutive step to clear audit in one round (Step 2 was the first). Audit verdict (specs/SPEC-013-impl-audit.md § Round 4): - 0 findings across all 6 categories (A through E + O) - 260 tests passing (256 baseline + 4 added by Step 3) - All tune_trials + tune_runs schema columns match SPEC-013 §5.7 FR-G.1/G.2 byte-for-byte - All 9 normative exit_reason enum values present with correct snake_case raw values - Transactional retention discipline verified: BEGIN IMMEDIATE → DELETE trials → DELETE runs → COMMIT, ROLLBACK preserves original error - C-interop resource lifetimes clean: defer finalizes every prepared statement, close() runs on every init error path - No third-party SQLite dependency added (system SQLite3) - AutotuneTrialRow + AutotuneRunRow value types expose every field Steps 7/9/10/11 need Next step: fire BUILD_SPEC_013_PROMPT.md Step 4 (provider lifecycle — start/stop/wait-ready) at codex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 4: add provider lifecycle runner Constraint: SPEC-013 Step 4 only; no specs, beta, phase4-coordinator, or phase5-gateway changes; do not wire runner into AutotuneCommand.run yet. Rejected: --provider-bin CLI override | Step 4 requires resolving the current macprovider-cli binary without adding a new operator flag. Confidence: high Scope-risk: narrow Directive: Keep CandidateProviderRunner as the lifecycle primitive for Step 7; preserve SIGTERM-only stop unless Step 5 foreground-restart logic owns an explicit escalation path. Tested: swift test --package-path phase3-binary --filter CandidateProviderRunnerTests; swift test --package-path phase3-binary; git diff --cached --check. Not-tested: Real macprovider-cli serve lifecycle with a cached small model; integration test is gated behind MACPROVIDER_INTEGRATION_TEST=1 plus provider binary/model env vars. * audit(autotune): Step 4 implementation audit prompt Adversarial code review prompt for the 994c7ee Step 4 commit (CandidateProviderRunner + ReadyStatus enum + 6 unit tests + 1 integration-gated test). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 5. Step 4 is the most concurrency-heavy step so far. High-risk surfaces flagged in the audit prompt: - Single-provider invariant under concurrent start/stop races - Process lifecycle correctness (zombie risk, signal delivery timing, isRunning semantics) - Pipe/FD/log-handle lifetime across normal + error paths - waitForReady process-exit race detection (checked twice per iteration is good) - No SIGKILL escalation in v1 (Step 4 forbids; Step 5 will own launchd-specific paths) - Async/await cancellation correctness - Argv assembly matches PR #105 + FR-E.2 Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 4 round-1 audit findings (2 MAJOR + 2 MINOR + 1 QUESTION) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md § Round 5) returned 0 CRITICAL / 2 MAJOR / 2 MINOR / 1 QUESTION on the 994c7ee Step 4 commit, verdict FIX REQUIRED. This commit closes all 5 findings. B.1 (MAJOR) fix: process.run() failure cleanup hung draining never-launched pipes. The catch path called finishLogging() which ran readDataToEndOfFile() on stdout/stderr pipes whose write ends were still open in the parent (no child existed to close them on exec). Codex reproduced a 2-second hang locally with a missing binary path. Added RunningProvider.discardLogging(reason:) that clears readability handlers and closes the log file with a brief "process spawn failed" note WITHOUT draining pipes; pipe FDs close on Pipe deinit when RunningProvider is released. New test testStartFailsPromptlyWithMissingBinary asserts start() throws in < 1 second (fix returns in ~10ms vs the prior 2-second hang). C.1 (MAJOR) fix: waitForReady() returned .ready immediately on HTTP 200 without re-checking process state. If the provider 200d /v1/models and then crashed before waitForReady returned, Step 7 would begin measurement against a dead process. waitForReady now re-checks provider.process.isRunning after receiving HTTP 200; if the process exited in the race, returns .processExited(rc:stderrTail:) instead. New test testWaitForReadyHandlesImmediateExitAfterFirstResponse uses a Python stub that responds 200 and immediately exits; the runner produces .ready or .processExited but NEVER .timeout (the pre-fix silent-hang regression class). A.1 (QUESTION) resolution: stop() now returns StopResult enum with .stopped and .stuck(pid:) cases (@discardableResult so existing callers compile unchanged). The .stuck variant lets Step 7 record a clear failure instead of discovering it via the next start() failing with alreadyRunning. NO SIGKILL escalation in v1 per the BUILD prompt; current intentionally stays set so the single- provider invariant holds. New test testStopReturnsStoppedForNeverStartedRunner covers the happy path; the .stuck path is hard to reach deterministically without an unkillable subprocess and is left to Step 7 integration coverage. E.1 (MINOR) fix: argv validation tests covered only invalidKvBits. Added testServeArgumentsRejectInvalidPort, testServeArgumentsRejectInvalidMaxContext, and testServeArgumentsRejectInvalidMaxBatch against CandidateProviderRunner.serveArguments(...). I.1 (MINOR) fix: log filename collision within one second. The prior filename used only Int(timeIntervalSince1970), so two starts of the same model+port within one second collided and the second .atomic write truncated the first log. Filenames now append the first 8 chars of a UUID: <model>-<port>-<unix-ts>-<uuid8>.log New test testLogFileURLsDoNotCollideWithinOneSecond verifies two consecutive builds for the same (model, port) produce distinct paths. API addition: exposed logFileURL(...) as logFileURLForTesting(model:port:) static helper for I.1 test visibility; named explicitly to flag it as a test accessor. Anti-regression: swift test --package-path phase3-binary Executed 273 tests, with 1 test skipped and 0 failures (266 baseline + 7 new from this fix-pass) swift build --package-path phase3-binary Build complete (7.84s) Smoke check for B.1: start() with /nonexistent/macprovider-cli-stub-<uuid> throws in ~10ms (was: 2-second hang). Files: - phase3-binary/Sources/macprovider-cli/CandidateProviderRunner.swift - phase3-binary/Tests/macprovider-cliTests/CandidateProviderRunnerTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 audit at codex to verify closures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 4 round-2 closure-verification prompt Round-1 audit (codex on 994c7ee) returned 2 MAJOR + 2 MINOR + 1 QUESTION, verdict FIX REQUIRED. Audit-response landed as 4bcef89 closing all 5. Round 2 verifies the closures and spot-checks the fix-pass for new gaps. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 5. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 4 round-2 LOCK — 5 CLOSED / 0 new findings Codex round-2 closure audit verified all 5 round-1 fixes landed clean and the fix-pass (4bcef89) introduced no new gaps. Step 4 is locked at commit 4bcef89; ready to proceed to Step 5. Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 6): - 5 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED - 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new - 273 tests passing (1 skipped integration-gated) - Smoke checks confirm: - testStartFailsPromptlyWithMissingBinary passes in 7ms (was 2-second hang pre-fix) - waitForReady post-200 isRunning check correctly returns .processExited under the test stub race - StopResult.stopped/.stuck distinction surfaces stuck- provider failure cleanly - No SIGKILL escalation introduced; SIGTERM-only contract held Step 4 audit cycle complete after 2 rounds: - R1 (on 994c7ee): 2 MAJOR + 2 MINOR + 1 QUESTION, FIX REQUIRED - Fix-pass (4bcef89): closed all 5 - R2 (on 4bcef89): 0 findings, READY TO PROCEED TO STEP 5 Next step: fire BUILD_SPEC_013_PROMPT.md Step 5 (provider-conflict pre-flight + launchd drain) at codex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 5: add provider conflict preflight Add launchd and foreground serve conflict detection plus drain/restore primitives without wiring them into autotune execution yet. Keep launchctl, process-list, signal, port, and restart effects behind injected closures so unit tests do not mutate live services. Constraint: SPEC-013 Step 5 only; do not modify specs, beta, coordinator, gateway, or AutotuneCommand.run wiring. Rejected: substring argv matching for serve | it can match autotune/helper processes and violate FR-E.1 self-exclusion. Rejected: SIGKILL escalation | v1 drain policy is SIGTERM-only for foreground and launchctl-only for launchd. Confidence: high Scope-risk: narrow Directive: Step 7 should call ProviderConflictDetector/ProviderDrainer for pre-flight; Step 10 should call restore after tune/apply decisions. Tested: swift test --package-path phase3-binary (283 tests, 2 skipped, 0 failures); git diff --check Not-tested: real bootout/bootstrap drain against an installed launchd provider; launchctl list integration remains gated behind MACPROVIDER_INTEGRATION_TEST=1 * audit(autotune): Step 5 implementation audit prompt Adversarial code review prompt for the d40a6f7 Step 5 commit (ProviderConflictDetector + ProviderDrainer + extracted PortProbe + 10 unit tests, 1 integration-gated). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 7. Step 5 has the most external-effect surface so far. High-risk checks flagged in the audit prompt: - Launchd label byte-match vs SPEC-003 v0.9.2 §FR-C5 / install.sh - launchctl list parser handles PID '-' and label-as-substring - isForegroundServe whole-word match (NOT substring); self- exclusion via argv.contains('autotune') - gui/<uid>/<label> bootout vs gui/<uid> bootstrap distinction - SIGKILL discipline (v1 forbids; warningWriter emits when stuck) - KERN_PROCARGS2 sysctl argv parse buffer safety - DI surface completeness: every effect must be closure-injected so tests don't mutate real launchd state - Anti-regression on Step 4's runner refactor (isPortOpen extraction to MacProviderPortProbe must be byte-equivalent) Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 6. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 5 round-1 audit MINORs (G.1 + G.2 + G.3) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md § Round 7) returned 0 CRITICAL / 0 MAJOR / 4 MINOR / 0 QUESTION on the d40a6f7 Step 5 commit, verdict READY TO PROCEED. Codex's verdict already allowed progress to Step 6, but 3 of the 4 MINORs are cheap test-coverage locks and worth closing now (matching the Step 1 discipline). The 4th MINOR (H.1) is a Step 10 forward-compat note and is tracked there. G.1 (MINOR) fix: parseLaunchdManagedPID's inactive-PID case (launchctl list line beginning with `-`) had no direct unit test. Added testParseLaunchdManagedInactivePIDReturnsNil in ProviderConflictDetectorTests that exercises the parser directly with `-\t-\tlive.streamvc.macprovider` and asserts (found: true, pid: nil). G.2 (MINOR) fix: the helper-binary false-positive class (e.g. /usr/local/bin/macprovider-cli-helper serve) had no test. Added testHelperBinaryDoesNotMatchForegroundServe with a process list containing that argv; the lastPathComponent == "macprovider-cli" check correctly rejects. Complements testServeSubstringDoesNotMatchForegroundServe (which covered the subcommand-like name class only). G.3 (MINOR) fix: the SIGKILL-disabled warning path had no unit test. Added testForegroundDrainEmitsNoSIGKILLWarningWhenProcessRemainsAfterGrace in ProviderDrainerTests with processIsRunning: {_ in true}, portIsOpen: {_ in false}, and a recording warning writer; asserts the warning fires, includes the stuck PID, and names the v1 no-SIGKILL policy. H.1 (MINOR) NOT FIXED: restore idempotency. The drainer's restore(.launchdManaged) calls launchctl bootstrap unconditionally; a second restore after a successful bootstrap will fail because the service is already loaded. Codex flagged this as a Step 10 concern (Step 10 owns lifecycle cleanup); captured as a forward-compat note in implementation-notes rather than fixed in Step 5 code. Anti-regression: swift test --package-path phase3-binary Executed 286 tests, with 2 tests skipped and 0 failures (283 baseline + 3 new from this fix-pass) Files: - phase3-binary/Tests/macprovider-cliTests/ProviderConflictDetectorTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 audit at codex to verify closures, then proceed to Step 6 (pre-warm — Shape A vs Shape B). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 5 round-2 closure-verification prompt Round-1 audit (codex on d40a6f7) returned 0 CRITICAL / 0 MAJOR / 4 MINOR / 0 QUESTION, verdict READY TO PROCEED. Fix-pass (3adddbf) closed G.1 + G.2 + G.3 as pure test additions and deferred H.1 to Step 10 with a forward-compat note. Round 2 verifies the test closures + the deferral rationale. Production code was not touched in 3adddbf — round 2 also asserts that empty production-diff invariant. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 6 (already met; this round verifies the additional MINOR closures are sound). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 5 round-2 LOCK — 3 CLOSED + 1 DEFERRED / 0 new findings Codex round-2 closure audit verified all 3 MINOR closures landed clean and the H.1 Step-10 deferral is well-justified. Step 5 is locked at commit 3adddbf; ready to proceed to Step 6. Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 8): - 3 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED / 1 DEFERRED - 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new - 286 tests passing (2 skipped integration-gated) - git diff d40a6f7 3adddbf -- phase3-binary/Sources/ is EMPTY (test-only fix-pass) - Closures verified meaningful (not tautological): - G.1: testParseLaunchdManagedInactivePIDReturnsNil pins parser inactive-PID handling - G.2: testHelperBinaryDoesNotMatchForegroundServe would fail under substring matching regression - G.3: testForegroundDrainEmitsNoSIGKILLWarningWhenProcessRemainsAfterGrace pins the visible no-SIGKILL policy with all DI seams - H.1 deferral: implementation-notes explicitly assigns call-once discipline to Step 10's lifecycle wiring Step 5 audit cycle complete after 2 rounds: - R1 (on d40a6f7): 4 MINOR, READY TO PROCEED - Fix-pass (3adddbf): closed G.1+G.2+G.3, deferred H.1 to Step 10 - R2 (on 3adddbf): 0 findings, READY TO PROCEED TO STEP 6 Next step: fire BUILD_SPEC_013_PROMPT.md Step 6 (pre-warm — Shape B Swift-native). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 6: add Shape B pre-warm wrapper Implement the dormant Swift-native pre-warm load phase so Step 7 can wire measurement after provider readiness without adding a models pull subcommand. Constraint: SPEC-013 Step 6 only; no specs, beta, coordinator, gateway, or AutotuneCommand.run() wiring changes. Rejected: Shape A models pull subcommand | out of scope for SPEC-013 v1 and not present in the binary. Confidence: high Scope-risk: narrow Directive: Keep gate-ttft measurement after ProviderPreWarmer returns .warmed; this wrapper measures load/readiness only. Tested: swift test --package-path phase3-binary (293 tests, 2 skipped, 0 failures). Not-tested: real HuggingFace online fallback and real model cache freshness beyond local snapshot-file heuristic. * audit(autotune): Step 6 implementation audit prompt Adversarial code review prompt for the e7bfab5 Step 6 commit (HuggingFaceCacheChecker + ProviderPreWarmer + PreWarmResult + 7 unit tests). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 9. Step 6 has interesting risk surface in the FR-D.2 classifier: - Integrity-vs-transient classification is ASYMMETRIC. A missed integrity (integrity-as-transient) = CRITICAL silently advancing past a security failure. A misclassified transient (transient-as-integrity) = MAJOR aborting recoverable run. - 8 integrity markers (substring search, case-insensitive). Audit checks for missing markers vs SPEC-named examples and false-positive risk on short markers like 'hash mismatch'. - Cache check uses HF snapshot layout heuristic. Partial-download false positive is acceptable (downstream load failure classifies correctly). Architecturally consequential: the pattern means the provider is STOPPED before Step 7 sees .warmed. Audit flags this as the load-bearing G.2 question — is Step 7 expected to restart for the trial (load happens twice) or should Step 6 leave the provider alive? Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 6 round-1 audit findings (1 CRITICAL + 1 MAJOR + 2 MINOR) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md § Round 9) returned 1 CRITICAL / 1 MAJOR / 2 MINOR / 0 QUESTION on the e7bfab5 Step 6 commit, verdict FIX REQUIRED. This commit closes all 4 findings. B.1 (CRITICAL) fix: the missing-tokenizer integrity marker did not match the actual swift-transformers Hub error string. The prior marker "missing tokenizer.json" did NOT catch the real error "Required configuration file missing: tokenizer.json" (lowercased: "required configuration file missing: tokenizer.json") because of the colon — the SPEC-named integrity failure was silently classifying as transient and Step 7 would have advanced instead of aborting with exit_reason='pre_warm_integrity_failure'. This is the worst-case asymmetric FR-D.2 failure mode the audit prompt explicitly named: integrity-as-transient = CRITICAL (silently advancing past a security failure). Added the colon- bearing variants: - "configuration file missing: tokenizer.json" - "missing: tokenizer.json" - "missing required tokenizer files" Kept the original "missing tokenizer.json" as fallback for variant phrasings. New regression-lock test testPreWarmerClassifiesConfigurationFileMissingTokenizerAsIntegrity feeds the exact Hub error string to the stub provider and asserts .integrity classification. B.2 (MAJOR) fix: malformed safetensors / weight loader errors had no integrity markers. Added markers from the mlx-swift safetensors reader path (clean-room safe: mlx-swift is MIT- licensed, not Darkbloom): - "invalid json header" (catches "[load_safetensors] Invalid json header length ...") - "invalid json metadata" (catches "[load_safetensors] Invalid json metadata ...") - "incomplete metadata" - "invalid or corrupted" These signal corrupted or tampered repository content. Asymmetric FR-D.2 risk model biases toward integrity here. New test testPreWarmerClassifiesInvalidJsonHeaderAsIntegrity locks the behavior. B.3 (MINOR) fix: classification tests covered only the lowercased "signature mismatch" marker. Added testPreWarmerClassifiesMixedCaseIntegrityCorrectly that feeds "SIGNATURE Mismatch on Model Weights" and pins the .lowercased() matcher semantics. Combined with B.1 and B.2's new tests, the integrity-classification suite now covers signature, tokenizer-missing, and safetensors-corruption marker classes. E.1 (MINOR) fix: now injection was unused in duration assertions. Added testPreWarmerLoadDurationReflectsInjectedClock that injects a deterministic now() closure advancing by 7s between the start and end calls, then asserts loadDurationSec == 7.0 (accuracy 0.001). A regression that records the end time before waitForReady returns or always returns 0 would fail this test. G.2 resolved (defer-stop after .warmed was intentional). Round-1 confirmed the disposable-load-phase design: Step 6 stops the provider before returning .warmed, Step 7 will start a fresh provider against the now-warm cache for trial measurement. The cold fetch is excluded because Step 6 has already warmed the HF cache; the second readiness wait is still outside gate-ttft-ms. Not a bug. Anti-regression: swift test --package-path phase3-binary Executed 297 tests, with 2 tests skipped and 0 failures (293 baseline + 4 new from this fix-pass) Marker list now totals 15 patterns (vs 8 before) covering: - signature/hash/checksum class (7) - missing-tokenizer class (4) - safetensors corruption class (4) Files: - phase3-binary/Sources/macprovider-cli/ProviderPreWarmer.swift - phase3-binary/Tests/macprovider-cliTests/ProviderPreWarmerTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 audit at codex to verify closures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 6 round-2 closure-verification prompt Round-1 audit returned 1 CRITICAL + 1 MAJOR + 2 MINOR, verdict FIX REQUIRED. Fix-pass (682abe8) closes all 4: B.1 added 3 new tokenizer-missing markers including the colon-bearing swift-transformers Hub error variant; B.2 added 4 mlx-swift safetensors corruption markers; B.3 added mixed-case integrity test; E.1 added deterministic-clock load-duration test. Round 2 verifies closures + the expanded 15-marker list for new false-positive risk. Two of the new markers ('incomplete metadata', 'invalid or corrupted') are vague enough that codex should specifically assess FP risk against benign log lines. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 7. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 6 round-2 audit N.1 MAJOR (vague integrity markers) Codex round-2 closure audit (specs/SPEC-013-impl-audit.md § Round 10) verified all 4 round-1 fixes closed cleanly but flagged 1 MAJOR NEW finding (N.1): the round-1 fix-pass added two unanchored integrity markers ("incomplete metadata" and "invalid or corrupted") that could match plausible benign transient lines, over-aborting recoverable runs. Concrete examples codex identified: - "Download failed: incomplete metadata in Hugging Face API response; retry later" would match "incomplete metadata" and classify as integrity (silently aborting an HF API hiccup that the next candidate could survive). - "cache index invalid or corrupted; rebuild cache and retry" would match "invalid or corrupted" and classify as integrity (silently aborting a local-state recoverable case). Under the asymmetric FR-D.2 risk model (integrity-as-transient = CRITICAL silent advance past security failure; transient-as-integrity = MAJOR over-abort recoverable), the vague markers exceeded the recoverable-tolerance threshold. N.1 fix: removed both vague markers from the integrity list. The remaining safetensors-context markers "invalid json header" and "invalid json metadata" stay because they're anchored to the loader's actual output (`[load_safetensors] Invalid json header ...`) and unlikely to appear in transient/network/cache contexts. SPEC-013 §5.4 FR-D.2 still has full coverage: - signature/hash/checksum class (7 markers) - missing-tokenizer class (4 markers) - safetensors header/metadata corruption (2 markers) = 13 total markers (was 15 after round 1). Negative-lock tests added (regression-locks against any future broad-marker reintroduction): - testPreWarmerClassifiesIncompleteMetadataDownloadErrorAsTransient feeds the HF API metadata line and asserts .transient - testPreWarmerClassifiesCacheRebuildHintAsTransient feeds the cache rebuild hint and asserts .transient These tests would FAIL if either marker is reintroduced. Anti-regression: swift test --package-path phase3-binary Executed 299 tests, with 2 tests skipped and 0 failures (297 baseline + 2 new negative locks; the 4 round-1 closure tests still pass under the narrower marker list) Files: - phase3-binary/Sources/macprovider-cli/ProviderPreWarmer.swift - phase3-binary/Tests/macprovider-cliTests/ProviderPreWarmerTests.swift - phase3-binary/implementation-notes.html Next step: fire round-3 audit at codex to verify the N.1 closure didn't introduce further gaps. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 6 round-3 closure-verification prompt Round-2 audit (codex on 682abe8) returned 4 round-1 CLOSED but 1 MAJOR NEW (N.1) — vague markers 'incomplete metadata' and 'invalid or corrupted' could over-classify benign transients. Round-2 fix-pass (dcd7f63) removed both vague markers and added 2 negative-lock tests pinning the benign-line behavior as .transient. Round 3 verifies N.1 closure + confirms the narrower marker list still covers all SPEC-013 §5.4 FR-D.2 named integrity examples. Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 7. This is the 11th audit round in the SPEC-013 implementation cycle. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 6 round-3 LOCK — N.1 CLOSED / 0 new findings Codex round-3 closure audit verified the N.1 fix landed clean without losing SPEC-named integrity coverage. Step 6 is locked at commit dcd7f63; ready to proceed to Step 7. Round-3 verdict (specs/SPEC-013-impl-audit.md § Round 11): - N.1 CLOSED - 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new - 299 tests passing (2 skipped integration-gated) - 4 round-1 closure tests still pass under narrower list: - testPreWarmerClassifiesConfigurationFileMissingTokenizerAsIntegrity - testPreWarmerClassifiesInvalidJsonHeaderAsIntegrity - testPreWarmerClassifiesMixedCaseIntegrityCorrectly - testPreWarmerLoadDurationReflectsInjectedClock - 2 round-2 negative-lock tests would fail under marker reintroduction: - testPreWarmerClassifiesIncompleteMetadataDownloadErrorAsTransient - testPreWarmerClassifiesCacheRebuildHintAsTransient Codex's cross-checks against mlx-swift / mlx-swift-examples / swift-transformers verified: - Active MLX provider load path uses [load_safetensors] strings covered by the 2 remaining safetensors markers. - swift-transformers WeightsError.invalidFile string 'The weights file is invalid or corrupted' exists but is NOT reached by the audited MLX provider load path — so the removal of the 'invalid or corrupted' marker does not open a concrete gap. Final integrity marker list (13 patterns): - signature/hash/checksum class (7) - missing-tokenizer class (4 including colon-bearing swift-transformers Hub variant) - safetensors header/metadata anchored to [load_safetensors] (2) Step 6 audit cycle complete after 3 rounds: - R1 (e7bfab5): 1 CRITICAL + 1 MAJOR + 2 MINOR, FIX REQUIRED - Fix1 (682abe8): closed all 4 (markers 8 → 15) - R2 (682abe8): closures verified, 1 MAJOR NEW (N.1: vague markers FP risk) - Fix2 (dcd7f63): removed vague markers (15 → 13) + 2 negative locks - R3 (dcd7f63): N.1 CLOSED, READY TO PROCEED TO STEP 7 Next step: fire BUILD_SPEC_013_PROMPT.md Step 7 (Stage 1 feasibility iteration — the heaviest BUILD step, composes Steps 1-6). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 7: add Stage 1 feasibility iterator Build the dormant Stage 1 primitive so SPEC-013 can compose pre-warm, candidate probing, and trial persistence before runtime wiring in Step 10. Constraint: Step 7 only; do not wire AutotuneCommand.run yet or touch specs, beta, coordinator, or gateway paths. Rejected: internal size-based re-rank | violates FR-A.1 and AC-17 operator-order guarantee. Confidence: high Scope-risk: moderate Directive: Keep Stage1Iterator candidate traversal in caller-supplied order; Step 10 should own tune_runs persistence when full runtime metadata exists. Tested: swift test --package-path phase3-binary --filter Stage1IteratorTests; swift test --package-path phase3-binary Not-tested: real cached MLX model Stage 1 probe outside integration-gated local environment * audit(autotune): Step 7 implementation audit prompt Adversarial code review prompt for the 98d7079 Step 7 commit (Stage1Iterator + Stage1Prober + 3 DI protocols + 7 unit tests, 529 + 440 lines). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 12. Step 7 is the heaviest BUILD step — composes Steps 1-6 into the actual largest-first candidate iteration with SSE/TTFT measurement. High-risk audit surfaces flagged: - AC-17 regression: any internal sort/re-rank of candidates = CRITICAL (biggest-fit product strategy depends on this) - FR-A.2 STOP-on-first-feasible: must not continue past feasible - failureReasons.last semantics for FR-H.4 smallest-first: silently coupled to largest-first iteration order; could flip under operator-supplied smallest-first list - SSE parsing: OpenAI-compat data: line handling, [DONE] terminator, multi-line events (uncommon) - TTFT measurement: from request start to first content delta - The two waitForReady(timeout: 0.05) mid-probe peeks for process-exit detection during the probe - DI protocol cast in ProviderPreWarmer extension is leaky Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 8. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 7 round-1 audit findings (1 CRITICAL + 1 MAJOR + 4 MINOR) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md § Round 12) returned 1 CRITICAL / 1 MAJOR / 4 MINOR / 0 QUESTION on the 98d7079 Step 7 commit, verdict FIX REQUIRED. This commit closes all 6 findings. E.1 (CRITICAL) fix: failureReasons.last only surfaced the smallest candidate under largest-first iteration. With operator override ["1b", "32b"] (the AC-17 shape) and both failing, .last returned 32b's reason — silent FR-A.4 / FR-H.4 violation. This was the subtle design risk explicitly flagged in the audit prompt's Category E. Fix: added an orthogonal `candidatesBySize: [String]?` init parameter (default: caller list reversed, preserving prior behavior for largest-first default lists). The all-infeasible throw now walks candidatesBySize smallest-first to find the first failed candidate and surfaces its reason. The error's trials list is also size-ordered (smallest-first) so the caller can print reasons in FR-H.4's "even <smallest>: <reason>" format followed by each larger candidate in size order. New regression-lock test testStage1IteratorAllInfeasibleWithOperatorOverrideSurfacesSmallestBySize passes ["1b", "32b"] as both iteration AND size order, both fail, asserts surfaced reason is "1b: 1b leaked stop token" (not "32b OOM during probe"). Existing test testStage1IteratorAllInfeasibleSurfacesSmallestFirstReason updated to match the new contract. F.1 (MAJOR) fix: integrity-abort path lost per-candidate trial evidence. SPEC-013 §5.7 FR-G.1 requires "every trial is recorded" including the security-relevant integrity failure that aborts the whole run. The prior code threw before insertTrial. Fix: pre-warm .failed(.integrity) branch now writes the trial row with notes="pre-warm integrity: <reason>" BEFORE throwing preWarmIntegrityFailure. Test testStage1IteratorAbortsOnIntegrity flipped from asserting trialModels == [] to trialModels == ["model-a"]. H.1 (MINOR) fix: Stage 1 row kept=true conflicted with SPEC FR-G.1 schema comment "Stage 2 only; 0 in Stage 1". Stage1Iterator now always passes kept: false for Stage 1 rows; Step 9 derives the chosen model from Stage1IteratorResult.selectedModel rather than tune_trials.kept. D.1 (MINOR) fix: SSE parser edge cases untested. Added 2 new tests: - testStage1ProberClassifiesHTTPNon2xxAsInfeasible: HTTP 503 response → infeasible with "HTTP 503" in reason - testStage1ProberAcceptsCompletionsStyleTextSSE: SSE stream using choices[0].text (completions API) instead of choices[0].delta.content (chat API) → feasible The completions-style fixture also exercises SSE comment lines (": keep-alive") and malformed JSON data: payloads ("not-valid-json"), locking the parser's graceful-skip behavior. K.1 (MINOR) fix: Stage 1 persistence fields partly asserted. Added testStage1IteratorPersistsFullStage1FieldSet with a new assertSingleTrialRow helper that reads every Stage 1-relevant column and asserts: stage=1, runID, fits, nErr, kept=false, replicatesN=stage1Replicates, maxContextCap=targetContext, kvBits=nil, maxBatch=nil. A future regression writing stage=0 or losing replicates_n would fail this test. G.1 (MINOR) fix: ProviderPreWarmer DI cast leakiness. Added a documentation block on extension ProviderPreWarmer: Stage1PreWarming explaining the production-only cast (real prewarmer requires CandidateProviderRunner for Step 6's defer-stop pattern) and steering test authors to inject a fake Stage1PreWarming directly rather than combining a real prewarmer with a fake runner. Anti-regression: swift test --package-path phase3-binary Executed 310 tests, with 2 tests skipped and 0 failures (306 baseline + 4 new from this fix-pass) Files: - phase3-binary/Sources/macprovider-cli/Stage1Iterator.swift - phase3-binary/Tests/macprovider-cliTests/Stage1IteratorTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 audit at codex to verify closures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 7 round-2 closure-verification prompt Round-1 audit returned 1 CRITICAL + 1 MAJOR + 4 MINOR, verdict FIX REQUIRED. Fix-pass (a9da9e5) closes all 6 with substantial changes to the iterator core: new candidatesBySize parameter for FR-H.4 size-ordered failure surface, integrity row recorded before throw, Stage 1 kept=false always, 2 new SSE tests, 1 new persistence-field test, 1 new doc block. Round 2 verifies closures + spot-checks the candidatesBySize API for new precision gaps (default fallback + operator override interaction is the highest-risk new surface). Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 8. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 7 round-2 LOCK — 6 CLOSED / 1 MINOR forward-compat note Codex round-2 closure audit verified all 6 round-1 fixes landed clean. Step 7 is locked at commit a9da9e5; ready to proceed to Step 8. Round-2 verdict (specs/SPEC-013-impl-audit.md § Round 13): - 6 CLOSED / 0 PARTIAL / 0 NOT CLOSED / 0 OVER-CLOSED - 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new (N.1) - 310 tests passing (2 skipped integration-gated) - AC-17 guard still meaningful: testStage1IteratorHonorsOperatorOrderForACSeventeen passes [1b, 32b] both feasible, asserts only 1b is selected N.1 (MINOR) DEFERRED to Step 10: candidatesBySize is brittle under operator override. The nil fallback (candidates.reversed()) is correct for largest-first defaults but WRONG if operator passes smallest-first via --candidate-models without explicit candidatesBySize. Codex correctly notes: - Not a Step 7 blocker (Stage 1 is dormant; AutotuneCommand.run doesn't wire Stage1Iterator yet) - Doc comment + regression-lock test mitigate - Step 10 owns the wiring layer that must derive size order explicitly using AutotuneCommand.defaultCandidates' sizeB metadata - Recommendation: Step 10 integration test with '--candidate-models 1b,32b' both failing, assert all-infeasible surface leads with 1b Step 7 audit cycle complete after 2 rounds: - R1 (98d7079): 1 CRITICAL (E.1) + 1 MAJOR (F.1) + 4 MINOR, FIX REQUIRED - Fix (a9da9e5): closed all 6 + 4 new tests - R2 (a9da9e5): 6 CLOSED, 1 MINOR forward-compat to Step 10, READY TO PROCEED TO STEP 8 Next step: fire BUILD_SPEC_013_PROMPT.md Step 8 (Stage 2 knob hill-climb). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 8: add Stage 2 knob hill-climb Implement the dormant SPEC-013 Stage 2 hill-climb so Step 10 can compose Stage 1 model selection with per-model knob selection. Constraint: Step 8 only; do not wire Stage2HillClimb into AutotuneCommand.run yet and do not modify specs or sensitive service paths. Rejected: wiring Stage 2 into the CLI runtime now | Step 10 owns Stage 1 + Stage 2 integration. Confidence: high Scope-risk: moderate Directive: Preserve deterministic cell order and strict-all-feasible cell semantics when wiring this into the autotune runtime. Tested: swift test --package-path phase3-binary --filter Stage2HillClimbTests; swift test --package-path phase3-binary Not-tested: real provider Stage 2 hill-climb against mlx-swift hardware; integration tests remain gated by MACPROVIDER_INTEGRATION_TEST. * audit(autotune): Step 8 implementation audit prompt Adversarial code review prompt for the 118599e Step 8 commit (Stage2HillClimb + Stage2Prober + WinningKnobs + Stage2HillClimbResult + isNewBest + 7 unit tests). Read-only review; codex will append findings to specs/SPEC-013-impl-audit.md as round 14. Step 8 is the second-heaviest BUILD step (520 + 334 lines). Highest-risk audit surfaces: - isNewBest port (Category A): must match prototype Python verbatim. Any subtle deviation in the tie-band logic could silently pick the wrong knob combo. - Strict-all-feasible per cell (Category B, FR-B.2): if any replicate fails, the cell MUST be infeasible. - Cartesian iteration determinism (Category C): operators must be able to replay runs. - AutotuneTrialRow stage=2 + kvBits/maxBatch/maxContextCap populated correctly (Category D). - kept= semantics — SPEC §5.7 schema comment said 'Stage 2 only' so the winner cell may set kept=true (audit checks Step 8's interpretation). Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 9. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 8 round-1 audit findings (1 MAJOR + 1 MINOR) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md § Round 14) returned 0 CRITICAL / 1 MAJOR / 1 MINOR / 0 QUESTION on the 118599e Step 8 commit, verdict FIX REQUIRED. This commit closes both findings. The Category A isNewBest port was verified faithful to the prototype Python — no change there. D.1 (MAJOR) fix: multiple Stage 2 rows had kept=true. The prior code computed `kept` at insertion time based on isNewBest, which left every cell that became "running best" with kept=1 in the DB. SPEC-013 §5.7 FR-G.1 requires only the FINAL winning Stage 2 cell to carry kept=1. Implementation: - Added AutotuneDB.markStage2WinnerCell(runID:kvBits:maxBatch:maxContextCap:) — an idempotent SQL UPDATE matching the (run_id, stage=2, knobs) compound. The kv_bits NULL/non-NULL is handled at SQL-clause level since SQLite's `= NULL` doesn't match. Uses the existing withStatement/bind helpers, no new C-interop. - Stage2HillClimb.run() now inserts every row with kept=false during the loop, tracks the winning row's array index via bestRowIndex, and after the loop calls markStage2WinnerCell with the winner's knobs + mutates trialRows[bestRowIndex].kept in memory so cellTrials is consistent. Test updates: - testStage2HillClimbAppliesIsNewBestThroughputPrimary: flipped [true, true] → [false, true] - testStage2HillClimbAppliesIsNewBestTTFTTiebreak: same flip - testStage2HillClimbPersistsAllCellTrialsWithStageTwo: now explicitly asserts rows.map(\.kept) == [false, true, false, false] AND rows.filter(\.kept).count == 1 as a regression-lock on the exactly-one-winner invariant J.1 (MINOR) fix: direct isNewBest edge-branch tests missing. Added 5 new tests targeting the prototype's documented but untested branches: - testIsNewBestAcceptsPositiveTPSWhenBestTPSIsZero — bestTPS=0 edge with positive new TPS → true - testIsNewBestRejectsZeroTPSWhenBestTPSIsZero — zero-vs-zero edge → false (catches a potential div-by-zero crash if the bestTPS<=0 guard ever drifts) - testIsNewBestWinsTieBandWhenBestTTFTIsNil — tie-band with bestTTFT=nil but new ttft measurable → true (matches Python `ttft is not None and (best_ttft is None or ttft < best_ttft)`) - testIsNewBestHoldsWhenBothTTFTsAreNilInTieBand — symmetric unmeasurable → false (incumbent holds) - testIsNewBestHoldsWhenNewTTFTIsNilInTieBand — asymmetric (new unmeasurable, best measurable) → false (incumbent holds) AutotuneDB additive method note: Step 3 is locked at d0029e9, but adding markStage2WinnerCell is purely additive — same pattern Step 5 used to extract PortProbe.swift from Step 4. The new method follows the existing insertTrial / insertRun / applyRetentionInTransaction idioms (withStatement + bind, throws on SQLite errors). Anti-regression: swift test --package-path phase3-binary Executed 322 tests, with 2 tests skipped and 0 failures (317 baseline + 5 new from this fix-pass) Files: - phase3-binary/Sources/macprovider-cli/AutotuneDB.swift (+44 lines) - phase3-binary/Sources/macprovider-cli/Stage2HillClimb.swift - phase3-binary/Tests/macprovider-cliTests/Stage2HillClimbTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 audit at codex to verify closures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 8 round-2 closure-verification prompt Round-1 audit returned 1 MAJOR (D.1) + 1 MINOR (J.1), verdict FIX REQUIRED. Fix-pass (022e8a3) closes both: D.1 via new AutotuneDB.markStage2WinnerCell + post-loop UPDATE pattern; J.1 via 5 new direct isNewBest edge-branch tests. Round 2 verifies closures + spot-checks the additive markStage2WinnerCell SQL (highest-risk new surface: NULL handling in the kv_bits clause, idempotency). Per [[feedback-build-audit-loop]]: loop until 0 CRITICAL/MAJOR before proceeding to Step 9 (the recommendation surface). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 8 round-2 LOCK — 2 CLOSED / 0 NEW Codex round-2 closure verification on commit 022e8a3 returned 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new across the 2 round-1 findings (D.1 + J.1), verdict READY TO PROCEED TO STEP 9. Round 14 → Round 15 closure summary: D.1 (MAJOR, multiple kept=true Stage 2 rows) — CLOSED. Verified that Stage2HillClimb.run() now inserts every cellTrial row with kept=false and only after the Cartesian product completes calls markStage2WinnerCell on the final winner. The no-feasible-cell edge throws .noFeasibleCell before reaching the marker call (Stage2HillClimb.swift:165-179), so no winner row is fabricated. The new AutotuneDB.markStage2WinnerCell SQL was the highest-risk new surface — verified: - kv_bits NULL handling uses IS NULL (not = NULL, which matches nothing in SQLite) - Bind arithmetic correct in both branches: nil binds run_id at 1, skips kv_bits, max_batch at 2, max_context_cap at 3; non-nil binds run_id at 1, kv_bits at 2, max_batch at 3, max_context_cap at 4 - No SQL injection — only constant string interpolation is the kv_bits clause choice; all operator data bound via the existing withStatement/bind helpers - Idempotent by SQL semantics J.1 (MINOR, isNewBest edge-branch tests missing) — CLOSED. The five new direct tests target the bestTPS<=0 boundary and the nil-TTFT tie-band branches at Stage2HillClimb.isNewBest:196-225. Non-tautological — each would fail if its named branch broke. Step 3 (AutotuneDB) extension is verified additive: same pattern as the Step 5 PortProbe extraction. git diff 022e8a3^ 022e8a3 shows only Step 8 files + audit report; no existing AutotuneDB method signature changed. Anti-regression: swift test --package-path phase3-binary executed 322 tests, skipped 2 integration-gated, 0 failures. Loop state: Step | Build | R1 | Fix | R2 | Status -----|---------|---------------|---------|----|--------- 8 | 118599e | 1M (D.1) | 022e8a3 | 0 | LOCKED | + 1m (J.1) | | | Next: fire Step 9 BUILD (recommendation surface + --json schema + recipe_hash + --apply config write). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(autotune) Step 9: add recommendation surface Build the standalone SPEC-013 recommendation, JSON, recipe-hash, and config-apply surfaces so Step 10 can wire runtime orchestration without inventing output or write semantics. Constraint: Steps 1-8 are locked; Step 9 must not wire AutotuneCommand.run() or edit specs/beta/coordinator/gateway paths. Rejected: JSONEncoder-only recipe hashing | RFC 8785 JCS requires canonical key ordering and whitespace-free bytes independent of pretty JSON output. Rejected: Yams emit for config writes | Yams 5.1 drops comments/order, so the applier validates with Yams then rewrites only SPEC-013 top-level owned keys. Confidence: high Scope-risk: moderate Directive: Step 10 should call these components but keep recipe_hash persistence and run wiring separate from the pure emitter contract. Tested: swift test --package-path phase3-binary --filter 'RecommendationEmitterTests|ConfigApplierTests' (22/22 passed) Tested: swift test --package-path phase3-binary (344 tests, 2 skipped, 0 failures) Not-tested: Real ~/.config/macprovider/config.yaml writes; ConfigApplier tests use temp directories only. * audit(autotune): Step 9 implementation audit prompt Round 1 of Step 9 audit covering the 292b2f9 commit (RecommendationEmitter + RFC8785JCS + ConfigApplier). The 14 audit categories (A-N) focus on: - JCS encoder correctness — 81-LOC vendor implementation must produce canonical bytes matching an independent reference (AC-12 property 2 cross-impl determinism). The prompt includes an exact reference JSON vector + asks the auditor to cross-check via printf|shasum. - recipe_hash domain isolation — exact 7 fields, no leakage of observation fields. Two CRITICAL-grade tripwires documented (B.2). - ConfigApplier custom YAML rewriter — Yams emit was REJECTED in the build commit; the custom rewriter is load-bearing for AC-9 "non-owned keys byte-identical pre/post" assertion. - kv_bits = nil propagation — 6 surfaces that MUST agree (terminal "unset", serve omits flag, JSON null, JCS null, YAML key omitted, summary "unset"). - Atomic write — POSIX rename in same directory, NOT FileManager.replaceItem. - Backup counter — 0 through 65535, no overwrite, no TOCTOU race. Forward-compat seams (Category M) verify Step 10 can wire the emitter without touching the LOCKED surface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(autotune): close Step 9 round-1 audit findings (4 MINOR + 1 QUESTION) Codex round-1 implementation audit (specs/SPEC-013-impl-audit.md Round 16) returned 0 CRITICAL / 0 MAJOR / 4 MINOR / 1 QUESTION on the 292b2f9 Step 9 commit, verdict READY TO PROCEED TO STEP 10. The independent printf|shasum cross-check on the reference vector matched the baked test literal, locking JCS cross-implementation determinism. This commit closes all 5 findings to satisfy the "until 0 findings" loop discipline. B.1 (QUESTION) — nil-recommendation recipe_hash policy. The audit asked Step 10 to decide whether the degenerate model=null+knobs=null hash gets persisted or NULL is stored. Pre-committed the cleaner contract here: EmittedRecommendation.recipeHash is now String?; when inputs.recommendation == nil the emitter returns nil and the JSON output emits literal null. This matches the tune_runs.recipe_hash schema comment "NULL if no recommendation" and avoids a JSON/DB asymmetry. New test testRecipeHashIsNilWhenRecommendationIsNil locks both invariants. E.1 (MINOR) — JSON schema test only spot-checks nested objects. Rewrote testJSONOutputMatchesSpec013Schema to walk every documented nested key, assert primitive types, and use Set(dict.keys) == <expected> as an additive-only regression-lock. Now covers machine.{ram_gb, chip, os_version, binary_version}, inputs.{target_context, candidate_models, stage1_replicates, stage2_replicates, gate_ttft_ms, tps_tie_epsilon}, recommendation.{model, target_context, knobs, tps_median, ttft_p95_ms, replicates, serve_command}, recommendation.knobs.{kv_bits, max_concurrency_override, max_context_override}, alternates string array, and infeasible[].{model, rank, reason}. ISO-8601 timestamp shape is asserted via regex. G.1 (MINOR) — TOCTOU race between fileExists and rename. The prior firstAvailableBackupPath + atomicWrite pattern could overwrite a concurrently-created backup. Replaced with writeBackupExclusively which opens each candidate counter with O_CREAT | O_EXCL | O_WRONLY; on EEXIST retries the next counter, on success writes the original bytes directly to the exclusively-opened FD. The config write still uses temp+rename per FR-F.3. New error case backupWriteFailed reports non-EEXIST open or write failures. New test testApplyBackupUsesExclusiveCreateAgainstTOCTOURace pre-creates backups at counters 0-3, applies, and asserts (a) new backup lands at counter 4, (b) all 4 pre-existing files retain their byte content unchanged. I.1 (MINOR) — AC-9 preservation tests too weak. Added two stronger tests: - testApplyPreservesNonOwnedLinesByteIdentically uses a fixture with leading/trailing/block/inline comments, blank lines, and ordering sentinels; splits pre and post into lines, filters out the 4 owned keys, and asserts the remaining line arrays are byte-identical (not just substring-contains). - testApplyResultIsParseableByConfigLoader runs the post-apply file through the binary's MacProviderCore.ConfigLoader.load and asserts loaded.{model, kvBitsOverride, maxContextOverride, maxConcurrencyOverride} resolve to the recommendation values. Catches future parser/key-name drift the spot-check tests would miss. K.1 (MINOR) — launchd hint test only checks bootout/bootstrap. Renamed testLaunchdRestartHintIncludesBootoutAndBootstrap to testLaunchdRestartHintIncludesAllRequiredSubstrings and added assertions for the plist path (~/Library/LaunchAgents/live.streamvc.macprovider.plist) and the service identifier (gui/$UID/live.streamvc.macprovider). testApplyWriteIsAtomic adjustment: the test now asserts tempPaths.count == 1 (was 2) since the backup write no longer uses the temp+rename path. The config write still uses temp+rename and the spy tempFileNamer exercises it. Anti-regression: swift test --package-path phase3-binary Executed 348 tests, with 2 tests skipped and 0 failures (344 baseline + 4 new from this fix-pass) Files: - phase3-binary/Sources/macprovider-cli/RecommendationEmitter.swift - phase3-binary/Sources/macprovider-cli/ConfigApplier.swift - phase3-binary/Tests/macprovider-cliTests/RecommendationEmitterTests.swift - phase3-binary/Tests/macprovider-cliTests/ConfigApplierTests.swift - phase3-binary/implementation-notes.html Next step: fire round-2 closure verification audit at codex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 9 round-2 closure-verification prompt Round 2 verifies the 4 MINOR + 1 QUESTION closures from Round 16, with two highest-risk surfaces: writeBackupExclusively (POSIX O_EXCL + manual write loop) and Optional recipeHash JSON encoding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * audit(autotune): Step 9 round-2 LOCK — 5 CLOSED / 0 NEW Codex round-2 closure verification on commit d6c634c returned 0 CRITICAL anti-regression / 0 MAJOR new / 0 MINOR new across the 5 round-1 findings (1 QUESTION + 4 MINOR), verdict READY TO PROCEED TO STEP 10. Round 16 -> Round 17 closure summary: B.1 (QUESTION, nil-recommendation recipe_hash policy) — CLOSED. Verified that EmittedRecommendation.recipeHash is now String?, the emitter's recipeHash(_:) returns nil before computing the JCS hash when no recommendation exists, and JSONRoot.encode(to:) calls encodeNil for the recipe_hash key (producing literal JSON null, not the string "null"). The Step 10 tune_runs persistence contract is now unambiguous: store NULL when recipeHash is nil. E.1 (MINOR, JSON schema test too weak) — CLOSED. The rewritten testJSONOutputMatchesSpec013Schema now asserts every documented nested key with a primitive type assertion AND Set(dict.keys) == <expected> for machine/inputs/recommendation/ recommendation.knobs/infeasible[0]. ISO-8601 regex is asserted on both timestamps. Catches future undocumented field additions. G.1 (MINOR, TOCTOU race) — CLOSED. writeBackupExclusively correctly uses open(O_CREAT | O_EXCL | O_WRONLY, 0o644), retries EEXIST, throws backupWriteFailed on other errno. writeAll's pointer arithmetic (base.advanced(by: written), data.count - written) is correct for partial writes; EINTR retry is in place. The n==0 edge case is acknowledged as benign for a blocking regular-file fd (no v1 risk). The testApplyBackupUsesExclusiveCreateAgainstTOCTOURace test pre-creates counters 0-3, verifies the new backup lands at counter 4, and asserts all pre-existing files retain byte content unchanged. I.1 (MINOR, AC-9 preservation too weak) — CLOSED. testApplyPreservesNonOwnedLinesByteIdentically uses a fixture with comments/blanks/inline-comment/SPEC-013 marker, and asserts non-owned line arrays are byte-identical pre/post. testApplyResultIsParseableByConfigLoader runs the post-apply file through MacProviderCore.ConfigLoader.load and asserts the 4 owned values resolve to recommendation values — catches future parser/key-name drift. K.1 (MINOR, launchd hint too narrow) — CLOSED. Renamed testLaunchdRestartHintIncludesAllRequiredSubstrings asserts launchctl bootout + bootstrap + plist path + service id. Anti-regression: swift test --package-path phase3-binary executed 348 tests, skipped 2 integration-gated, 0 failures. Loop state: Step | Build | R1 | Fix | R2 | Status -----|---------|-------------------|---------|----|--------- 9 | 292b2f9 | 4m + 1 QUESTION | d6c634c | 0 | LOCKED Next: fire Step 10 BUILD (failure modes + signal handling + AutotuneCommand.run() wiring + tune_runs persistence + size-ordered candidatesBySize for Stage 1)…
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds three new
serveflags sobeta/autotune.py(PR #103) canhill-climb a real per-machine recipe space, not just
--model:--kv-bits {4,8}GenerateParameters.kvBits— bothcomplete()andstream()--max-context <N>GenerateParameters.maxKVSize+ existingvalidatePromptTokenCount413 gate--max-batch <N>AsyncSemaphore(value: maxBatch)inModelRuntimeAll three are triple-exposed (CLI > env > YAML > default), matching
the house convention used by every other flag in
ServeCommand. Anew
runServingKnobsPreflightfails loud at serve start on invalidvalues (
--kv-bits 7,--max-context 0, etc.) instead of leaking tomid-inference.
Why
The reference autoresearch repo's "different recipe per machine"
insight depends on tuning config per machine, not just the model.
Today's autotune loop can only flip
--model; this PR widens itssearch to KV-cache precision, served context window, and concurrency.
Bug fix folded in
ServeCommand.run()was hardcodingmaxConcurrencyOverride: 1whenbuilding
ProviderCapacity, silently ignoring the resolved config.Capacity now reflects the actual
--max-batchvalue.Flags shipped vs dropped
All three target knobs shipped — verified against the checked-out
mlx-swift-examples 2.29.1 in
.build/checkouts/mlx-swift-examples/Libraries/MLXLMCommon/Evaluate.swift:54-105,which exposes
GenerateParameters.kvBits,maxKVSize, and acceptsthe existing semaphore-gated batch model. Nothing was dropped.
mlx-swift wiring approach
--kv-bits: passed asGenerateParameters.kvBitsat both thenon-streaming (
ModelRuntime.complete()) and streaming(
ModelRuntime.stream()) generation sites. nil keeps the mlx-swiftdefault of unquantized KV cache. Group size and quantizedKVStart
use mlx-swift defaults (64 / 0) — exposed as knobs would widen the
search prematurely without operator evidence.
--max-context: continues to enforce the existing 413context_length_exceededgate (ModelRuntime.validatePromptTokenCount).Additionally passes
maxKVSize=maxContextTokenstoGenerateParametersso the RotatingKVCache bounds the allocation —this is what gives KV-cache memory pressure relief on small-RAM
nodes.
--max-batch: changesinferenceGate = AsyncSemaphore(value: 1)toAsyncSemaphore(value: max(1, maxBatch)). The existingacquire/withPermitpattern is unchanged; the actor already handlesmulti-permit safely.
Test additions
Tests/macprovider-cliTests/ServingKnobsConfigTests.swiftadds 21test cases:
--kv-bits 5, accepts 4 and 8 and nil.--max-context 0and--max-batch 0.ModelRuntimevia testaccessors (
kvBitsOverrideForTest,maxBatchForTest,maxContextTokensForTest).--max-contextboundary: oversize prompts still throw thedocumented 413
context_length_exceededAPIError; at-boundaryprompts pass.
Full suite: 219 → 240 tests, all green.
Out of scope
Per-request
--kv-group-size/--quantized-kv-startknobs areintentionally not exposed; they sit below the autotune's noise
floor today and would widen the search without operator evidence.
Adding them later is mechanical (same call site, same plumbing
pattern).
Test plan
swift build -c release --product macprovider-clicleanswift test— 240 / 240 passingmacprovider-cli help servelists--kv-bits,--max-context,--max-batchMACPROVIDER_KV_BITS=7 macprovider-cli serve ...rejects with documented errorRelated
(
beta/autotune.py— provider-side keep/revert hill-climb).🤖 Generated with Claude Code