feat(search): migrate to antfly v0.2 (zig) — direct-download install, private instance, brew removal#457
markhayden wants to merge 24 commits into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lation modules
Pure move, zero behavior change. defaults.ts owns AntflySettings + DEFAULT_SETTINGS + mergeSettings; query-translation.ts owns the pure Bakin⇄Antfly shape translation (query request building, consolidated from the previously duplicated query()/multiQuery() blocks; table config; response mapping). search.ts keeps the adapter class only. Prepares the v0.2 protocol migration to land as small reviewable diffs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Inert: dependency unchanged until the protocol migration lands. Built from antflydb/antfly@1d9e8c040 (tag v0.2.0-rc.2), ts/packages/sdk, via npm install/build/pack. Temporary until upstream publishes to npm under the next dist-tag; provenance + swap-out steps in vendor/README.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rovider, field-driven query strategy)
- @antfly/sdk -> vendored v0.2.0-rc.2 build (file: dep); the openapi-fetch CJS interop patch is deleted because upstream now bundles it inline.
- Provider rename termite -> antfly in embedder + reranker defaults (adapter + core settings).
- Default URL -> http://localhost:3738 (Bakin's private instance; the SDK owns the /db/v1 prefix, so no path suffix).
- QueryRequest: vector `indexes` only sent alongside semantic_search; RRF is implied by both search modes being present (v0.2 has no strategy field). Bakin's strategy setting maps to which fields are populated.
- Chunking now nests as ChunkerConfig (provider antfly, model fixed, text.target_tokens/overlap_tokens) instead of flat chunk_size/overlap.
- reranker.threshold removed from settings types and never sent: the v0.2 RerankerConfig has no such field (old default 0 was a no-op); regression test guards legacy settings.json files carrying one.
Checkpoint α: protocol switched; without the new binary installed the adapter degrades to file-only mode. Reverting commits 1-3 restores the old world losslessly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
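The field-driven strategy described above could be sketched roughly as follows. This is an illustration under assumptions, not the adapter's code: only `semantic_search` and `indexes` are named in the commit message, while `full_text_search`, the `Strategy` values, and `buildQueryRequest` are hypothetical names.

```typescript
// Hypothetical sketch: the strategy setting decides which request fields exist.
type Strategy = "keyword" | "semantic" | "hybrid"; // assumed setting values

interface QueryRequest {
  table: string;
  limit: number;
  full_text_search?: { query: string }; // assumed name for the keyword side
  semantic_search?: string;
  indexes?: string[]; // vector indexes, only sent with semantic_search
}

function buildQueryRequest(
  table: string,
  text: string,
  strategy: Strategy,
  vectorIndexes: string[],
  limit = 10,
): QueryRequest {
  const request: QueryRequest = { table, limit };
  if (strategy !== "semantic") {
    request.full_text_search = { query: text };
  }
  if (strategy !== "keyword") {
    request.semantic_search = text;
    request.indexes = vectorIndexes;
  }
  // No strategy field is written: with both modes present, fusion is implied.
  return request;
}
```

A keyword-only request built this way carries neither `semantic_search` nor `indexes`, which matches the rule that vector indexes travel only alongside a semantic query.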
…fly/readyz, richer availability health)
Bakin now runs its own antfly: swarm spawns bound to 127.0.0.1:3738 (health 3739) with --data-dir ~/.bakin/antfly (new getBakinPaths().antfly entry) and --models-dir under the antfly home. The old --metadata-api flag and ANTFLY_DATA_DIR env are gone in v0.2.
- Guest mode is now an explicit branch: a non-default settings.url means an externally managed server (connect only, never spawn, never touch its disk, no takeover restarts).
- Readiness via GET /antfly/readyz (replaces /api/v1/status); a readyz-404 + legacy-status-200 signature identifies pre-0.2 servers and the availability health check says so explicitly.
- Binary discovery: ANTFLY_PATH -> ~/.antfly/bin/antfly (brew candidate paths removed); new paths.ts owns antfly-home resolution (ANTFLY_HOME honored, doubling as the test seam).
- Log machinery extracted to server-logs.ts; the optional-model-registry demotion no longer requires Go-era termite/ callers (zig server). Final re-baseline against live output happens in the smoke task.
- antfly.availability health check now reports mode (private/external), URL, and a targeted remediation per failure shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
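As a rough illustration of the guest-mode branch and the pre-0.2 signature, the decision logic might reduce to two pure functions over probe results. The function names, the `ServerKind` labels, and the use of `null` for "no response" are assumptions made for this sketch.

```typescript
// Hypothetical sketch: classify a server from the HTTP status of two probes.
// null means the probe got no response at all.
type ServerKind = "v0.2" | "pre-0.2" | "unreachable" | "unknown";

function classifyServer(
  readyzStatus: number | null,
  legacyStatus: number | null,
): ServerKind {
  if (readyzStatus === 200) return "v0.2";
  // readyz missing but the old status endpoint answering: a pre-0.2 server.
  if (readyzStatus === 404 && legacyStatus === 200) return "pre-0.2";
  if (readyzStatus === null && legacyStatus === null) return "unreachable";
  return "unknown";
}

// Guest mode: any non-default URL is treated as externally managed,
// so the adapter connects only and never spawns its own process.
function shouldSpawn(settingsUrl: string, defaultUrl: string): boolean {
  return settingsUrl === defaultUrl;
}
```

Keeping the classification separate from the HTTP calls makes each failure shape easy to map to its own remediation message.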
…download (rip out brew)
The brew/xcode install path is gone. `bakin install search` now: download the pinned release tarball from releases.antfly.io -> verify SHA256 against the in-repo pin -> extract -> atomic rename into ~/.antfly/bin/antfly -> verify `antfly --version` matches the pin. No brew, no xcode CLT, no node, no python, no sudo.
- pin.ts: single source of truth (version + per-platform checksums + URL/archive naming). Upgrades = bump the pin, re-run install.
- installer.ts: check() now version-checks against the pin (a stale binary is `error` with remediation, not silently accepted); install() handles wrong-version replacement, refuses to swap under a running server (`bakin stop` first), refuses when a wrong-version ANTFLY_PATH override would mask the managed install, fails loudly on checksum mismatch without writing anything, and reports unsupported platforms (darwin-x64 has no upstream zig build, so Bakin doesn't ship it either).
- setup.ts: brew machinery deleted (findBrew, brew spawn, xcode error matching, compact brew status renderer); now composes installer + models. Models flow (termite pull) unchanged here; it is migrated next.
- Tests rebuilt high-fidelity: real tar extraction, real spawn of a scripted fake binary for --version, synthetic tarballs with real checksums; only fetch is mocked. No network, no child_process mocks.
Checkpoint β: fresh-machine onboarding is brew-free end to end.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
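A pin-driven check might look something like the sketch below. The `Pin` shape, the platform keys, the placeholder checksums, and the assumption that `antfly --version` prints the version as a whitespace-separated token are all illustrative guesses, not the contents of pin.ts or installer.ts.

```typescript
// Hypothetical pin shape: one version plus a checksum per supported platform.
interface Pin {
  version: string;
  sha256: Record<string, string>; // keyed by platform, e.g. "darwin-arm64" (assumed key format)
}

const EXAMPLE_PIN: Pin = {
  version: "0.2.0-rc.2",
  sha256: {
    "darwin-arm64": "0".repeat(64), // placeholder, not a real checksum
    "linux-x64": "1".repeat(64), // placeholder, not a real checksum
  },
};

type CheckResult =
  | { status: "ok" }
  | { status: "missing"; remediation: string } // "missing" is an assumed status
  | { status: "error"; remediation: string };

// versionOutput is the captured stdout of `antfly --version`, or null if no binary exists.
function checkInstalled(versionOutput: string | null, pin: Pin): CheckResult {
  if (versionOutput === null) {
    return { status: "missing", remediation: "run `bakin install search`" };
  }
  const tokens = versionOutput.trim().split(/\s+/).map((t) => t.replace(/^v/, ""));
  if (tokens.includes(pin.version)) return { status: "ok" };
  // A stale binary is an error, never silently accepted.
  return { status: "error", remediation: `expected ${pin.version}; re-run \`bakin install search\`` };
}

function verifyChecksum(actualHex: string, pin: Pin, platform: string): boolean {
  const expected = pin.sha256[platform];
  if (expected === undefined) throw new Error(`unsupported platform: ${platform}`);
  return actualHex.toLowerCase() === expected.toLowerCase();
}
```

Comparing whole tokens rather than substrings avoids accepting a longer version string that merely starts with the pinned one.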
…eat missing models as degraded
`antfly termite pull` is gone in v0.2 — models now come from
`antfly inference pull <owner/name>` into the consolidated
~/.antfly/inference/models/{owner}/{name}/ layout (no per-kind buckets,
~/.termite is dead). Same three models.
- New models.ts owns the prefetch component; termiteModelsRoot() is
replaced by inferenceModelsRoot() (paths.ts) through the factory.
- Quick win: the v0.2 runtime lazy-downloads missing models on first
use, so check() now reads "search still works; lazy-downloads on
first use" with prefetch as the remediation — degraded, not broken.
Skip/decline messages carry the same framing.
- Verification matches the new registry output: model_manifest.json
presence + a real weight file (.onnx/.gguf/.safetensors), replacing
the old files[]-with-sizes manifest contract that no longer exists.
- setup.ts is now a pure composition layer (installer + models).
- Models tests rebuilt real-fs style: the scripted fake binary's
`inference pull` actually creates model dirs, so spawn -> verify runs
end to end with zero module mocks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
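The verification rule in this commit (manifest present plus a real weight file) can be sketched as a pure check over a directory listing. The `FileEntry` shape, the function names, and the reading of "real" as "non-empty" are assumptions for illustration.

```typescript
// Hypothetical sketch of the model verification rule described above.
interface FileEntry {
  name: string;
  size: number; // bytes
}

const WEIGHT_EXTENSIONS = [".onnx", ".gguf", ".safetensors"];

function isModelInstalled(files: FileEntry[]): boolean {
  const hasManifest = files.some((f) => f.name === "model_manifest.json");
  // Assumption: a "real" weight file means a non-empty file with a known extension.
  const hasWeights = files.some(
    (f) => f.size > 0 && WEIGHT_EXTENSIONS.some((ext) => f.name.toLowerCase().endsWith(ext)),
  );
  return hasManifest && hasWeights;
}

// Consolidated layout: {root}/{owner}/{name}, with no per-kind buckets.
function modelDir(modelsRoot: string, ref: string): string {
  const parts = ref.split("/");
  const owner = parts[0];
  const name = parts[1];
  if (parts.length !== 2 || !owner || !name) {
    throw new Error(`expected owner/name, got: ${ref}`);
  }
  return `${modelsRoot}/${owner}/${name}`;
}
```

Because the check takes a listing rather than touching disk, it can be exercised with real directories in tests and with literals in isolation.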
Legacy state is now pure housekeeping: the private instance means the new world boots healthy whether or not the old dirs exist.
- legacy-cleanup.ts: detects ~/.termite, the pre-0.2 shared ~/.antfly/data, and brew-installed binaries. Deletions are per-item, interactive-only (a blanket --yes only reports), DEFAULT NO, and the shared-data prompt warns it removes ALL antfly data on this machine with a pointer at upstream backup/restore. The brew binary is only ever a printed `brew uninstall antfly` suggestion; Bakin never runs brew, even to clean up. Wired into install() after verify (both installed and noop paths) with outcomes in the install summary.
- search onboarding component auto-corrects settings.json URLs that exactly match a known pre-0.2 default (localhost/0.0.0.0/127.0.0.1 :8080/api/v1 -> localhost:3738); deliberate non-default URLs are never rewritten.
- MEMORY_SCHEMA_VERSION 3 -> 4: the index moved to a fresh data dir, and without the bump the offset-based memory indexers would silently skip already-read bytes forever. The existing migration mechanism resets the table + clears offsets on first healthy boot.
- setup.ts is down to a 34-line composition layer (was 520 pre-B1).
Checkpoint γ: code-complete. Live smoke + docs remain.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
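The exact-match URL correction might be sketched like this. The `http://` scheme, the constant names, and the function name are assumptions; the host and port combinations and the `localhost:3738` target come from the commit message above.

```typescript
// Hypothetical sketch: rewrite only URLs that exactly equal a known pre-0.2 default.
const LEGACY_DEFAULT_URLS: ReadonlySet<string> = new Set([
  "http://localhost:8080/api/v1",
  "http://0.0.0.0:8080/api/v1",
  "http://127.0.0.1:8080/api/v1",
]);

const NEW_DEFAULT_URL = "http://localhost:3738"; // target named in this commit

// Returns the corrected URL, or null when the value must be left alone.
function correctLegacyUrl(url: string): string | null {
  return LEGACY_DEFAULT_URLS.has(url) ? NEW_DEFAULT_URL : null;
}
```

Returning `null` for everything else keeps deliberate non-default URLs untouched, including ones that only resemble an old default.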
Live-smoke drift fixes (every item verified against the real server on real hardware; upstream findings filed as #456):
Installer/supervision:
- Release tarball layout: binary at archive root + share/ (antfarm assets installed alongside bin/), not antfly/antfly.
- Readiness endpoint is /readyz at root (both ports), not /antfly/readyz.
- Spawn env pins TERMITE_PREFERRED_BACKEND=onnx: the Metal inference backend SIGABRTs the server (concurrent reranks AND single embeds).
- Log parser handles the zig server's JSON log lines (level mapping, field extraction) with key=value fallback; empty-table EmptySegment text-merge noise demoted to debug.
Protocol/config (each tagged with #456 in code):
- Dense embeddings indexes carry explicit `dimension` (server has no auto-probe at this RC): embedder settings gain a dimension field (BGE=384, CLIP=512), included in the embedder rebuild hash.
- No `schema` at table create: a create-time schema permanently breaks query parsing on the table. Semantic/filters/facets verified working schemaless; revisit via PUT /schema when fixed upstream.
- multiQuery fans out as SEQUENTIAL single global queries: the NDJSON multiquery endpoint rejects its own framing, and concurrency makes inference crashes more likely. Per-table failure isolation included.
- reranker disabled by default: invoking mxbai SIGABRTs the server on both Metal and onnx backends. Plumbing intact for one-flag re-enable.
- Visual embedder ref: openai/clip-vit-base-patch32 has no ONNX exports (pull fails NoModelFilesFound) -> Xenova/clip-vit-base-patch32.
- Models messaging reframed: no index-time lazy download at this RC and failed backfills do not self-heal, so prefetch is strongly recommended, with `bakin reindex --rebuild` as the late-install remediation.
Migrations & data safety:
- search SCHEMA_VERSION 2->4: recreate tables with dimensions (3) and schemaless (4).
- memory migration no longer bumps its marker when search is down: bumping there recorded the migration as done without clearing offsets, permanently hiding rows once search returned.
- Settings URL auto-correction writes ONLY the url key: it previously persisted fully-merged settings, freezing every then-current default (reranker on, stale CLIP ref, dimension-less embedders) into settings.json as explicit overrides.
- Legacy detection widened: bakin-managed.yaml (written by the pre-0.2 adapter) marks the antfly home as Bakin-managed, unlocking cleanup of the full old footprint (data/, store/, metadata/, the yaml); without the marker only data/ is offered.
CLI:
- `bakin search` renderer reads the /api/search shape ({id, table, fields}); it previously expected {key, _table, document} and printed "undefined" for every hit.
End-to-end verified live: install -> boot (127.0.0.1:3738, CPU inference) -> reindex -> semantic search returns correct results ("rename beacon to bakin" top-hits the beacon cleanup task) -> facets -> doctor green -> server survives sustained querying.
Closes the C1 live-smoke task. Upstream issue tracker: #456.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
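The difference between the old and new URL-correction writes can be shown with a small contrast. Both function names and the `Settings` alias are hypothetical; the sketch only illustrates why persisting merged settings freezes defaults into the file.

```typescript
// Hypothetical sketch: settings.json holds only explicit user overrides.
type Settings = Record<string, unknown>;

// Old behavior (the bug): merged defaults get written back as if the user chose them.
function correctUrlPersistingMerged(onDisk: Settings, defaults: Settings, newUrl: string): Settings {
  return { ...defaults, ...onDisk, url: newUrl };
}

// Fixed behavior: touch the url key and nothing else.
function correctUrlOnly(onDisk: Settings, newUrl: string): Settings {
  return { ...onDisk, url: newUrl };
}
```

With the second form, a later change to a default (for example turning the reranker off) still reaches users whose settings.json was auto-corrected.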
… architecture
- .claude/knowledge/search-system.md: install/runtime section (direct download, private instance on 3738, CPU inference pin, guest mode), v0.2 protocol notes (no strategy field, no create-time schema, dims required, sequential multiquery), reranker default-off callout, new model table (antfly provider, Xenova CLIP, dims, inference models root), upstream-limitation notes refined per the deep root-cause analysis (#456): _all-population (not term extraction) with the field-qualified interim option, any-write-heals model recovery, override-hygiene rule for updateSettings, SCHEMA_VERSION 4 history.
- .claude/knowledge/multimodal-search.md: inference-runtime refs, v0.2 vs v0.1-era upstream issue split (#456 vs #72), embedder example updated (provider/dimension/ONNX-bearing repos).
- docs start/operation.md: new "Upgrading search from a pre-0.2 install" section (guided path, manual path, shared-data warning, skip-models repair).
- Generated settings reference re-synced (bun run docs:generate).
- CLAUDE.md: runtime-dir map gains ~/.bakin/antfly; search bullet reflects v0.2 private instance + #456.
- models.ts messaging refined per deep analysis: any write heals a failed embeddings backfill once models land, so remediation is plain `bakin reindex` (not --rebuild); CLIP ref note re: upstream-blessed antflydb/clipclap pending their answer.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- New memory-migration tests: the defer-when-search-unavailable fix now has a regression guard (marker must NOT bump while the index is unreachable; retry on next boot actually migrates), plus the happy path (reset + offsets cleared + marker bumped) and the no-op path.
- Installer test asserts the exact release artifact URL (uname-style naming) and that share/ (antfarm assets) installs alongside bin/.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
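The three paths those tests cover (defer, migrate, no-op) can be modeled as a small state transition. The `MigrationState` shape and both function names are assumptions for this sketch; the real migration lives in the repository and may be structured differently.

```typescript
// Hypothetical sketch of the defer-when-search-unavailable rule.
type Plan = "noop" | "defer" | "migrate";

interface MigrationState {
  marker: number; // last schema version recorded as migrated
  offsets: Record<string, number>; // per-file read offsets used by the indexers
  tableReset: boolean;
}

function planMemoryMigration(marker: number, target: number, searchAvailable: boolean): Plan {
  if (marker >= target) return "noop";
  // Never record the migration as done while the index is unreachable.
  if (!searchAvailable) return "defer";
  return "migrate";
}

function runMemoryMigration(state: MigrationState, target: number, searchAvailable: boolean): MigrationState {
  if (planMemoryMigration(state.marker, target, searchAvailable) !== "migrate") {
    return state; // noop and defer both leave marker and offsets untouched
  }
  return { marker: target, offsets: {}, tableReset: true };
}
```

Because a deferred run returns the state unchanged, the next boot sees the same stale marker and retries.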
…dder override dims, tmp cleanup
From the pre-merge five-axis review:
- installer: --version verification now runs against the extracted binary in the temp dir BEFORE share/ and the binary are moved into place, so a verification failure leaves the existing install untouched (true verify-then-commit, per spec 4.1).
- installer: the empty ~/.antfly/tmp parent is removed when the install workdir was its only occupant.
- mergeSettings: per-embedder entries deep-merge over their defaults so a partial override (legacy provider+model-only settings.json) keeps the default `dimension`; dropping it would 400 every table create on the v0.2 server. Regression test included.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
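A per-embedder deep merge of the kind described might look like the sketch below. The `EmbedderConfig` shape, the embedder keys, and the text model name are placeholders invented for illustration; only the dimensions (384 and 512) and the Xenova CLIP ref come from the commits on this page.

```typescript
// Hypothetical sketch: overrides merge per embedder, field by field.
interface EmbedderConfig {
  provider: string;
  model: string;
  dimension: number;
}

const DEFAULT_EMBEDDERS: Record<string, EmbedderConfig> = {
  text: { provider: "antfly", model: "example/bge-text-model", dimension: 384 }, // placeholder model name
  visual: { provider: "antfly", model: "Xenova/clip-vit-base-patch32", dimension: 512 },
};

function mergeEmbedders(
  defaults: Record<string, EmbedderConfig>,
  overrides: Record<string, Partial<EmbedderConfig>>,
): Record<string, EmbedderConfig> {
  const merged: Record<string, EmbedderConfig> = {};
  for (const [name, base] of Object.entries(defaults)) {
    merged[name] = { ...base };
  }
  for (const [name, override] of Object.entries(overrides)) {
    const target = merged[name];
    if (target === undefined) continue; // assumption: unknown embedder names are ignored here
    if (override.provider !== undefined) target.provider = override.provider;
    if (override.model !== undefined) target.model = override.model;
    if (override.dimension !== undefined) target.dimension = override.dimension;
  }
  return merged;
}
```

A shallow `{ ...defaults, ...overrides }` would replace the whole `text` entry and lose `dimension`, which is the failure the regression test guards against.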
…tion
# Conflicts:
# src/core/search-migration.ts
…lock
The resourceSummary filter keeps entries with `responseEnd >= bootStart-5`, and under bun --isolate the performance.now() epoch spans the whole suite run. The seeded absolute `responseEnd: 100_000` fell out of the window once the suite had been running >100s. That is exactly the full-suite-on-CI condition, which is why this test failed CI deterministically while passing locally and in isolation (and why the earlier waitFor-headroom bump didn't help). Relative timestamps make the seed epoch-invariant.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…what-we-bind URL
Field finding from the second machine: an orphaned antfly held 127.0.0.1:3738 in a wedged state. It answered /readyz with 200 (so boot logged "started and healthy" while our own spawn silently failed to bind behind it) but served garbage on /db/v1/*, and the client fell to file-only with an inscrutable "Failed to parse JSON".
- Adoption is now gated on a strict probe: a pre-existing listener must return parseable JSON with a `health` field from /db/v1/status before we treat it as our server. Failing that, boot logs an actionable error (lsof one-liner + kill + restart) instead of silently degrading.
- Default URL moves to http://127.0.0.1:3738 to dial exactly what the server binds. `localhost` resolution is machine-dependent (::1 listeners, proxy env vars); not this incident's cause, but the same failure class. mergeSettings normalizes the localhost spelling, isLocalDefaultUrl accepts both, and the onboarding URL corrector rewrites the briefly-current localhost:3738 value.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
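The strict adoption probe can be reduced to a check on the response body. The function name is hypothetical, and treating any JSON object with a `health` key as acceptable (without inspecting its value) is an assumption of this sketch.

```typescript
// Hypothetical sketch: a listener is adoptable only if its status body is
// parseable JSON with a `health` field.
function isAdoptableStatusBody(body: string): boolean {
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      !Array.isArray(parsed) &&
      "health" in parsed
    );
  } catch {
    return false; // not JSON at all: the wedged-listener case from the field report
  }
}
```

A bare 200 from /readyz never reaches this function, so a wedged process that answers readiness but serves garbage fails adoption instead of being mistaken for a healthy server.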
…ad of failing cryptically
A checkout that skips bun install still loads the old npm @antfly/sdk, whose relative paths 404 into 'Failed to parse JSON' against a healthy server (field-verified on machine 2). The adapter now probes for the v0.2-only tables.scanAll before spawning anything and fails fast with the actual remediation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Machine-2 field lesson: setup steps get skipped mid-firefight, and the resulting state (indexing works, every semantic query dead) was only diagnosable via `bakin check search-models`, which nobody runs while confused. Two guardrails:
- The antfly.availability health check stays WARN while models are missing, with the install+reindex remediation inline, so the health dashboard and doctor surface it persistently instead of reporting a green "connected".
- `bakin reindex` pre-flights the models check and warns up front that the reindex will populate tables while semantic search stays dead. Advisory only; never blocks the reindex.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
~/.bakin/antfly (the migration's --data-dir, spec decision 10) sits inside the chokidar-watched content root. The antfly server churns WAL/segment files there constantly; the resulting event storm deadlocked Bun natively within seconds of boot: main thread, all Bun Pool workers, and the File Watcher thread parked on os_unfair_lock, every HTTP request hanging forever. This was the "UI lockup": the whole server was wedged, and /api/search was just where it got noticed. Diagnosed live via thread sampling (sample(1) on the wedged process); boot logs show the watcher ingesting antfly's full_text_index_v0 segment deletions immediately before the logs stop dead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dder warmup
Two resilience guards for the cross-table search path:
- multiQuery races each sequential table query against a hard 15s ceiling (BAKIN_ANTFLY_QUERY_TIMEOUT_MS overrides). The fan-out is sequential by design (bakin#456), so one wedged query stalled every remaining table and hung /api/search indefinitely. A timed-out table now returns empty with a warn; the rest still answer.
- After connecting, the adapter fires one throwaway /ai/v1 embed per unique antfly-provider embedder model (fire-and-forget, debug-logged failures), so cold ONNX model loads are paid at boot instead of inside a user's first semantic query. Warmup failures are expected for models the current pin cannot text-embed (CLIP: InputArityMismatch, see bakin#456) and stay debug-level.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
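A sequential fan-out with a per-table ceiling might be sketched as below. `withTimeout`, the `queryOne` callback, and the result shape are hypothetical; the 15s default mirrors the ceiling named in the commit, and the warn log is left out for brevity.

```typescript
// Hypothetical sketch: resolve with a fallback if the work outlives the ceiling.
async function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // do not keep the process alive
  }
}

// Tables are queried one at a time; a slow or failing table yields [] and
// the loop moves on to the next table.
async function multiQuery<T>(
  tables: string[],
  queryOne: (table: string) => Promise<T[]>,
  timeoutMs = 15_000,
): Promise<Record<string, T[]>> {
  const results: Record<string, T[]> = {};
  for (const table of tables) {
    try {
      results[table] = await withTimeout<T[]>(queryOne(table), timeoutMs, []);
    } catch {
      results[table] = []; // per-table failure isolation
    }
  }
  return results;
}
```

Note that the race only stops waiting; it does not cancel the underlying request, so a real implementation would probably also pass an abort signal to the HTTP client.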
Sync process.on('exit') net for the orphan modes diagnosed in #459: exits
that bypass the async lifecycle shutdown (dev.ts signal handlers calling
process.exit, uncaught EADDRINUSE thrown at listen time) left the antfly
child holding 3738 across dev generations. 'exit' handlers must be
synchronous; ChildProcess.kill is.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Firefight session summary (machine-3, 2026-06-05/06)
Picked up the branch mid-firefight to chase the
Problem 1: "/api/search hangs" was actually the whole server deadlocking
The briefed hypothesis (CLIP cold-load >>10s at query time) was disproven by timed curls: antfly answered every direct query in <0.3s while
Root cause: the migration put the private instance's
The two briefed mitigations also shipped (
Problem 2 (orphans): diagnosed, filed #459, fixed in PR #460 (off main)
Full mechanism in #459: (a)
Problem 3
…x health
The health plugin's "Reindex All" looked like it hung forever. The POST itself completed; the button's follow-up fetchData() died inside /api/plugins/health/search-status, which counted documents per table via a matchAll QUERY. Queries against a table with an active embeddings backfill hang indefinitely at this pin (bakin#456 finding 10), so the whole health page froze for every post-reindex backfill window (~90 min for bakin_memory on CPU). Three fixes, live-verified:
- tables.stats now reads doc_count from the indexes GET (which never blocks) instead of running a query. search-status: hang -> 4ms during an active backfill.
- The single-table query() path gets the same 15s no-infinite-patience ceiling as multiQuery; it serves the per-plugin /search routes directly.
- indexes.list returns an ARRAY in the v0.2 SDK; Object.entries over it named every index "0"/"1" in health payloads and made rebuildIndexes drop nonexistent index names (silently breaking reindex?rebuild=true). Normalized both shapes; names come from config.name.
While there: backfill_state 'failed' (the bakin#456 findings-8/9 state on bakin_assets) now surfaces as an unhealthy index with a named error instead of showing green while semantic search returns nothing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
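Normalizing the two `indexes.list` shapes could be sketched as follows. `RawIndex`, `NamedIndex`, and the choice to skip unnamed array entries are assumptions for this illustration; the source only states that both shapes are normalized and that names come from `config.name`.

```typescript
// Hypothetical sketch: accept either an array (v0.2 SDK) or a name-keyed object.
interface RawIndex {
  config?: { name?: string };
  [key: string]: unknown;
}

interface NamedIndex {
  name: string;
  index: RawIndex;
}

function normalizeIndexes(raw: RawIndex[] | Record<string, RawIndex>): NamedIndex[] {
  const out: NamedIndex[] = [];
  if (Array.isArray(raw)) {
    for (const index of raw) {
      const name = index.config?.name;
      // Assumption: an array entry without a name is skipped rather than
      // being labeled with its position ("0", "1", ...).
      if (name !== undefined) out.push({ name, index });
    }
    return out;
  }
  for (const [key, index] of Object.entries(raw)) {
    out.push({ name: index.config?.name ?? key, index });
  }
  return out;
}
```

Calling `Object.entries` directly on the array form would yield the keys "0" and "1", which is the mislabeling described in the commit.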
Follow-up (
Suite 4690/0, lint + typecheck clean, pushed. 🤖 Generated with Claude Code
…DDRINUSE (#459)
Two dev-loop defects that minted orphaned antfly processes across generations (diagnosed live; full writeup in #459):
- scripts/dev.ts registered SIGINT/SIGTERM handlers before server.ts loads and called process.exit(0). Node runs signal listeners in registration order, so the lifecycle's async shutdown (the only thing that stops the antfly child) never executed. The dev handler now defers once the lifecycle owns shutdown (globalThis flag set by registerShutdownHandlers; checked without importing lifecycle so an early Ctrl-C during the build phase doesn't load the server module graph).
- server.ts had no 'error' listener on listen(): EADDRINUSE threw as an uncaught exception AFTER the full boot (antfly spawned, watcher running), bypassing cleanup. It now logs the port + lsof remediation, routes through the lifecycle shutdown via self-SIGTERM, and exits 1; the lifecycle's final exit honors a pre-set process.exitCode instead of stamping 0 over it.
Verified end-to-end with two isolated-BAKIN_HOME servers on one port: second exits 1 with the named remediation and "Shutdown complete"; first keeps serving and exits cleanly on SIGTERM.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e log stream
The raw "enrichment worker failed: UnsupportedEmbeddingProvider" error sent the operator debugging an embedder provider config that was fine. The real cause is upstream image enrichment rejecting every embedder at this pin (bakin#456 finding 8; lab-bisected). The log classifier now appends the explanation + issue pointer and demotes the line to warn: it repeats on every boot reconcile and asset write, and the health page already carries the red per-index state. Same treatment for the CLIP InputArityMismatch enrichment failure (finding 9).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
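A classifier of this kind might be sketched as a table of known upstream failures. The `LogLine` shape, the rule table, and the wording of the notes are invented for illustration; the two error names and finding numbers come from the commit message.

```typescript
// Hypothetical sketch: annotate and demote known upstream errors.
type Level = "debug" | "info" | "warn" | "error";

interface LogLine {
  level: Level;
  message: string;
}

const KNOWN_UPSTREAM_FAILURES: Array<{ pattern: RegExp; note: string }> = [
  {
    pattern: /UnsupportedEmbeddingProvider/,
    note: "upstream image enrichment issue, not a local provider config problem; see bakin#456 finding 8",
  },
  {
    pattern: /InputArityMismatch/,
    note: "upstream CLIP enrichment issue; see bakin#456 finding 9",
  },
];

function classifyLogLine(line: LogLine): LogLine {
  if (line.level !== "error") return line; // only errors are demoted
  for (const known of KNOWN_UPSTREAM_FAILURES) {
    if (known.pattern.test(line.message)) {
      return { level: "warn", message: `${line.message} (${known.note})` };
    }
  }
  return line;
}
```

Unmatched errors pass through unchanged, so the demotion cannot hide a failure that has no entry in the table.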
…nfig
Embedder settings entries may now carry antfly's documented EmbedderConfig pass-throughs: api_url (route that embedder's calls over HTTP to a named inference endpoint, e.g. the private instance's own /ai/v1) and multimodal (declare non-text support for models outside the built-in registry). Forwarded verbatim at table create; unset entries omit the keys, so there is no behavior change without explicit settings.
This is the Bakin half of running image search against a media-path-fixed antfly: visual embedder -> antflydb/clipclap + api_url loopback. Verified live end-to-end on a locally patched antfly main: assets enrich (text+ visual), and "a photo of a cat" via /api/search returns the actual pet portrait image (visual score 0.79). See #456 verification comments and #466 for the pin-bump plan.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
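The "forward verbatim, omit when unset" rule might look like the sketch below. The `EmbedderSettings` shape and the function name are hypothetical, and `multimodal` is typed as `unknown` because its value format is not given on this page.

```typescript
// Hypothetical sketch: optional pass-through keys only appear when explicitly set.
interface EmbedderSettings {
  provider: string;
  model: string;
  dimension: number;
  api_url?: string;
  multimodal?: unknown; // forwarded as-is; its shape is not specified here
}

function toTableEmbedderConfig(entry: EmbedderSettings): Record<string, unknown> {
  const config: Record<string, unknown> = {
    provider: entry.provider,
    model: entry.model,
    dimension: entry.dimension,
  };
  if (entry.api_url !== undefined) config["api_url"] = entry.api_url;
  if (entry.multimodal !== undefined) config["multimodal"] = entry.multimodal;
  return config;
}
```

An entry without the optional keys produces exactly the three base fields, so existing settings files yield the same table config as before.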
Summary
Replaces the Homebrew-based Antfly v0.1 install with direct binary download of the new Zig-based v0.2.0-rc.2, and upgrades the adapter to the new `/db/v1` protocol. The brew/xcode onboarding failure mode is gone entirely: `bakin install search` now needs no brew, no xcode CLT, no node, no python, no sudo.
Spec: `.claude/specs/antfly-zig-migration.md` · Plan: `.claude/specs/antfly-zig-migration-plan.md` · Upstream findings: #456
What changed
- `~/.antfly/bin/antfly`, with strict `--version` checking, verify-then-commit installs, wrong-version replacement, and a running-server guard. Upgrades = bump `pin.ts`, re-run install.
- `127.0.0.1:3738` with data under `~/.bakin/antfly/`, never the shared `~/.antfly/data`. Future antfly upgrades and Bakin uninstalls can no longer touch other projects' tables by construction. External servers are an explicit opt-in via `settings.url` (guest mode: connect-only).
- `@antfly/sdk` (file: dep until upstream publishes to npm; `vendor/README.md` has the swap-out), provider rename termite→antfly, field-driven query strategy, nested chunker config, explicit embedding dims, models via `antfly inference pull` into `~/.antfly/inference/models`.
- `~/.termite`, the pre-0.2 Bakin-managed server state, and brew binaries (suggestion-only; Bakin never runs brew). Settings URLs matching old defaults are auto-corrected (minimal-partial writes only).
- `setup.ts` 520→34 lines (composition over `pin`/`installer`/`models`/`legacy-cleanup`); `search.ts` split into `defaults`/`query-translation`/adapter; log machinery → `server-logs.ts` (now parses the zig server's JSON log lines).
Upstream workarounds (all annotated with #456, all live-verified)
`version:null` bug fixed
`TERMITE_PREFERRED_BACKEND=onnx` pin
`Xenova/clip-vit-base-patch32`
`_all` population fixed in swarm mode
Test plan
- `~/.termite`, covered by unit tests, exercisable anytime via `bakin install search`.
🤖 Generated with Claude Code