Skip to content

feat(search): migrate to antfly v0.2 (zig) — direct-download install, private instance, brew removal#457

Open
markhayden wants to merge 24 commits into
mainfrom
feat/antfly-zig-migration
Open

feat(search): migrate to antfly v0.2 (zig) — direct-download install, private instance, brew removal#457
markhayden wants to merge 24 commits into
mainfrom
feat/antfly-zig-migration

Conversation

@markhayden

Copy link
Copy Markdown
Owner

Summary

Replaces the Homebrew-based Antfly v0.1 install with direct binary download of the new Zig-based v0.2.0-rc.2, and upgrades the adapter to the new /db/v1 protocol. The brew/xcode onboarding failure mode is gone entirely — bakin install search now needs no brew, no xcode CLT, no node, no python, no sudo.

Spec: .claude/specs/antfly-zig-migration.md · Plan: .claude/specs/antfly-zig-migration-plan.md · Upstream findings: #456

What changed

  • Install: pinned, SHA256-verified tarball download from releases.antfly.io → ~/.antfly/bin/antfly, with strict --version checking, verify-then-commit installs, wrong-version replacement, and a running-server guard. Upgrades = bump pin.ts, re-run install.
  • Private instance: Bakin runs its own antfly on 127.0.0.1:3738 with data under ~/.bakin/antfly/ — never the shared ~/.antfly/data. Future antfly upgrades and Bakin uninstalls can no longer touch other projects' tables by construction. External servers are an explicit opt-in via settings.url (guest mode: connect-only).
  • Protocol: vendored new-protocol @antfly/sdk (file: dep until upstream publishes to npm — vendor/README.md has the swap-out), provider rename termite→antfly, field-driven query strategy, nested chunker config, explicit embedding dims, models via antfly inference pull into ~/.antfly/inference/models.
  • Legacy state: detection + optional consented disk-reclaim (interactive-only, default No, ALL-data warning) for ~/.termite, the pre-0.2 Bakin-managed server state, and brew binaries (suggestion-only — Bakin never runs brew). Settings URLs matching old defaults are auto-corrected (minimal-partial writes only).
  • Data safety: search SCHEMA_VERSION 2→4 (dims, then schemaless recreate), memory-plugin migration v4 with a fixed defer-when-search-down semantics (the old bump-anyway path would have silently hidden rows forever).
  • File decomposition: setup.ts 520→34 lines (composition over pin/installer/models/legacy-cleanup); search.ts split into defaults/query-translation/adapter; log machinery → server-logs.ts (now parses the zig server's JSON log lines).

Upstream workarounds (all annotated with #456, all live-verified)

Workaround Reverts when
Tables created schemaless create-time schema version:null bug fixed
Sequential single-query fan-out for multiQuery NDJSON multiquery endpoint fixed
TERMITE_PREFERRED_BACKEND=onnx pin Metal inference backend stabilizes
Reranking disabled by default mxbai rerank stops SIGABRTing the server
Visual embedder = Xenova/clip-vit-base-patch32 upstream blesses a multimodal ref
Bare-term FTS returns nothing (rrf rides the semantic leg) _all population fixed in swarm mode

Test plan

  • 4,635 tests green across 450 files; ~80 new/rewritten tests for installer (synthetic tarballs, real tar + spawn pipelines, no network), models, legacy cleanup, URL correction, server supervision, protocol shapes, memory migration.
  • Live-verified end-to-end on real hardware: real download + checksum + install, wrong-version replace, legacy detection against actual machine state, private-instance boot, schema migrations, reindex, semantic search returning correct ranked results, facets, doctor green, sustained-query stability.
  • Pre-merge five-axis review: merge-ready; its three polish items are in the final commit.
  • Outstanding (gated on a human at a TTY): interactive cleanup consent prompts against the real ~1.4GB ~/.termite — covered by unit tests, exercisable anytime via bakin install search.

🤖 Generated with Claude Code

markhayden and others added 17 commits June 5, 2026 16:00
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lation modules

Pure move, zero behavior change. defaults.ts owns AntflySettings +
DEFAULT_SETTINGS + mergeSettings; query-translation.ts owns the pure
Bakin⇄Antfly shape translation (query request building — consolidated
from the previously duplicated query()/multiQuery() blocks — table
config, response mapping). search.ts keeps the adapter class only.

Prepares the v0.2 protocol migration to land as small reviewable diffs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Inert: dependency unchanged until the protocol migration lands. Built
from antflydb/antfly@1d9e8c040 (tag v0.2.0-rc.2), ts/packages/sdk, via
npm install/build/pack. Temporary until upstream publishes to npm under
the next dist-tag; provenance + swap-out steps in vendor/README.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rovider, field-driven query strategy)

- @antfly/sdk -> vendored v0.2.0-rc.2 build (file: dep); the openapi-fetch
  CJS interop patch is deleted — upstream now bundles it inline.
- Provider rename termite -> antfly in embedder + reranker defaults
  (adapter + core settings).
- Default URL -> http://localhost:3738 (Bakin's private instance; the SDK
  owns the /db/v1 prefix, so no path suffix).
- QueryRequest: vector `indexes` only sent alongside semantic_search;
  RRF is implied by both search modes being present (v0.2 has no strategy
  field). Bakin's strategy setting maps to which fields are populated.
- Chunking now nests as ChunkerConfig (provider antfly, model fixed,
  text.target_tokens/overlap_tokens) instead of flat chunk_size/overlap.
- reranker.threshold removed from settings types and never sent — the
  v0.2 RerankerConfig has no such field (old default 0 was a no-op);
  regression test guards legacy settings.json files carrying one.

Checkpoint α: protocol switched; without the new binary installed the
adapter degrades to file-only mode. Reverting commits 1-3 restores the
old world losslessly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fly/readyz, richer availability health)

Bakin now runs its own antfly: swarm spawns bound to 127.0.0.1:3738
(health 3739) with --data-dir ~/.bakin/antfly (new getBakinPaths().antfly
entry) and --models-dir under the antfly home. The old --metadata-api
flag and ANTFLY_DATA_DIR env are gone in v0.2.

- Guest mode is now an explicit branch: a non-default settings.url means
  an externally managed server — connect only, never spawn, never touch
  its disk, no takeover restarts.
- Readiness via GET /antfly/readyz (replaces /api/v1/status); a
  readyz-404 + legacy-status-200 signature identifies pre-0.2 servers
  and the availability health check says so explicitly.
- Binary discovery: ANTFLY_PATH -> ~/.antfly/bin/antfly (brew candidate
  paths removed); new paths.ts owns antfly-home resolution (ANTFLY_HOME
  honored, doubling as the test seam).
- Log machinery extracted to server-logs.ts; the optional-model-registry
  demotion no longer requires Go-era termite/ callers (zig server).
  Final re-baseline against live output happens in the smoke task.
- antfly.availability health check now reports mode (private/external),
  URL, and a targeted remediation per failure shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…download (rip out brew)

The brew/xcode install path is gone. `bakin install search` now:
download the pinned release tarball from releases.antfly.io -> verify
SHA256 against the in-repo pin -> extract -> atomic rename into
~/.antfly/bin/antfly -> verify `antfly --version` matches the pin.
No brew, no xcode CLT, no node, no python, no sudo.

- pin.ts: single source of truth (version + per-platform checksums +
  URL/archive naming). Upgrades = bump the pin, re-run install.
- installer.ts: check() now version-checks against the pin (a stale
  binary is `error` with remediation, not silently accepted); install()
  handles wrong-version replacement, refuses to swap under a running
  server (`bakin stop` first), refuses when a wrong-version ANTFLY_PATH
  override would mask the managed install, fails loudly on checksum
  mismatch without writing anything, and reports unsupported platforms
  (darwin-x64 has no upstream zig build — Bakin doesn't ship it either).
- setup.ts: brew machinery deleted (findBrew, brew spawn, xcode error
  matching, compact brew status renderer); now composes installer +
  models. Models flow (termite pull) unchanged here — migrated next.
- Tests rebuilt high-fidelity: real tar extraction, real spawn of a
  scripted fake binary for --version, synthetic tarballs with real
  checksums; only fetch is mocked. No network, no child_process mocks.

Checkpoint β: fresh-machine onboarding is brew-free end to end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eat missing models as degraded

`antfly termite pull` is gone in v0.2 — models now come from
`antfly inference pull <owner/name>` into the consolidated
~/.antfly/inference/models/{owner}/{name}/ layout (no per-kind buckets,
~/.termite is dead). Same three models.

- New models.ts owns the prefetch component; termiteModelsRoot() is
  replaced by inferenceModelsRoot() (paths.ts) through the factory.
- Quick win: the v0.2 runtime lazy-downloads missing models on first
  use, so check() now reads "search still works; lazy-downloads on
  first use" with prefetch as the remediation — degraded, not broken.
  Skip/decline messages carry the same framing.
- Verification matches the new registry output: model_manifest.json
  presence + a real weight file (.onnx/.gguf/.safetensors), replacing
  the old files[]-with-sizes manifest contract that no longer exists.
- setup.ts is now a pure composition layer (installer + models).
- Models tests rebuilt real-fs style: the scripted fake binary's
  `inference pull` actually creates model dirs, so spawn -> verify runs
  end to end with zero module mocks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Legacy state is now pure housekeeping — the private instance means the
new world boots healthy whether or not the old dirs exist.

- legacy-cleanup.ts: detects ~/.termite, the pre-0.2 shared
  ~/.antfly/data, and brew-installed binaries. Deletions are per-item,
  interactive-only (a blanket --yes only reports), DEFAULT NO, and the
  shared-data prompt warns it removes ALL antfly data on this machine
  with a pointer at upstream backup/restore. The brew binary is only
  ever a printed `brew uninstall antfly` suggestion — Bakin never runs
  brew, even to clean up. Wired into install() after verify (both
  installed and noop paths) with outcomes in the install summary.
- search onboarding component auto-corrects settings.json URLs that
  exactly match a known pre-0.2 default (localhost/0.0.0.0/127.0.0.1
  :8080/api/v1 -> localhost:3738); deliberate non-default URLs are
  never rewritten.
- MEMORY_SCHEMA_VERSION 3 -> 4: the index moved to a fresh data dir,
  and without the bump the offset-based memory indexers would silently
  skip already-read bytes forever. The existing migration mechanism
  resets the table + clears offsets on first healthy boot.
- setup.ts is down to a 34-line composition layer (was 520 pre-B1).

Checkpoint γ: code-complete. Live smoke + docs remain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Live-smoke drift fixes (every item verified against the real server on
real hardware; upstream findings filed as #456):

Installer/supervision:
- Release tarball layout: binary at archive root + share/ (antfarm
  assets installed alongside bin/), not antfly/antfly.
- Readiness endpoint is /readyz at root (both ports), not /antfly/readyz.
- Spawn env pins TERMITE_PREFERRED_BACKEND=onnx: the Metal inference
  backend SIGABRTs the server (concurrent reranks AND single embeds).
- Log parser handles the zig server's JSON log lines (level mapping,
  field extraction) with key=value fallback; empty-table EmptySegment
  text-merge noise demoted to debug.

Protocol/config (each tagged with #456 in code):
- Dense embeddings indexes carry explicit `dimension` (server has no
  auto-probe at this RC): embedder settings gain a dimension field
  (BGE=384, CLIP=512), included in the embedder rebuild hash.
- No `schema` at table create — a create-time schema permanently breaks
  query parsing on the table. Semantic/filters/facets verified working
  schemaless; revisit via PUT /schema when fixed upstream.
- multiQuery fans out as SEQUENTIAL single global queries: the NDJSON
  multiquery endpoint rejects its own framing, and concurrency makes
  inference crashes more likely. Per-table failure isolation included.
- reranker disabled by default: invoking mxbai SIGABRTs the server on
  both Metal and onnx backends. Plumbing intact for one-flag re-enable.
- Visual embedder ref: openai/clip-vit-base-patch32 has no ONNX exports
  (pull fails NoModelFilesFound) -> Xenova/clip-vit-base-patch32.
- Models messaging reframed: no index-time lazy download at this RC and
  failed backfills do not self-heal — prefetch is strongly recommended,
  with `bakin reindex --rebuild` as the late-install remediation.

Migrations & data safety:
- search SCHEMA_VERSION 2->4: recreate tables with dimensions (3) and
  schemaless (4).
- memory migration no longer bumps its marker when search is down —
  bumping there recorded the migration as done without clearing offsets,
  permanently hiding rows once search returned.
- Settings URL auto-correction writes ONLY the url key: it previously
  persisted fully-merged settings, freezing every then-current default
  (reranker on, stale CLIP ref, dimension-less embedders) into
  settings.json as explicit overrides.
- Legacy detection widened: bakin-managed.yaml (written by the pre-0.2
  adapter) marks the antfly home as Bakin-managed, unlocking cleanup of
  the full old footprint (data/, store/, metadata/, the yaml); without
  the marker only data/ is offered.

CLI:
- `bakin search` renderer reads the /api/search shape ({id, table,
  fields}); it previously expected {key, _table, document} and printed
  "undefined" for every hit.

End-to-end verified live: install -> boot (127.0.0.1:3738, CPU
inference) -> reindex -> semantic search returns correct results
("rename beacon to bakin" top-hits the beacon cleanup task) -> facets ->
doctor green -> server survives sustained querying.

Closes the C1 live-smoke task. Upstream issue tracker: #456.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… architecture

- .claude/knowledge/search-system.md: install/runtime section (direct
  download, private instance on 3738, CPU inference pin, guest mode),
  v0.2 protocol notes (no strategy field, no create-time schema, dims
  required, sequential multiquery), reranker default-off callout, new
  model table (antfly provider, Xenova CLIP, dims, inference models
  root), upstream-limitation notes refined per the deep root-cause
  analysis (#456): _all-population (not term extraction) with the
  field-qualified interim option, any-write-heals model recovery,
  override-hygiene rule for updateSettings, SCHEMA_VERSION 4 history.
- .claude/knowledge/multimodal-search.md: inference-runtime refs,
  v0.2 vs v0.1-era upstream issue split (#456 vs #72), embedder
  example updated (provider/dimension/ONNX-bearing repos).
- docs start/operation.md: new "Upgrading search from a pre-0.2
  install" section — guided path, manual path, shared-data warning,
  skip-models repair.
- Generated settings reference re-synced (bun run docs:generate).
- CLAUDE.md: runtime-dir map gains ~/.bakin/antfly; search bullet
  reflects v0.2 private instance + #456.
- models.ts messaging refined per deep analysis: any write heals a
  failed embeddings backfill once models land, so remediation is plain
  `bakin reindex` (not --rebuild); CLIP ref note re: upstream-blessed
  antflydb/clipclap pending their answer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- New memory-migration tests: the defer-when-search-unavailable fix now
  has a regression guard (marker must NOT bump while the index is
  unreachable; retry on next boot actually migrates), plus the happy
  path (reset + offsets cleared + marker bumped) and the no-op path.
- Installer test asserts the exact release artifact URL (uname-style
  naming) and that share/ (antfarm assets) installs alongside bin/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dder override dims, tmp cleanup

From the pre-merge five-axis review:

- installer: --version verification now runs against the extracted
  binary in the temp dir BEFORE share/ and the binary are moved into
  place — a verification failure leaves the existing install untouched
  (true verify-then-commit, per spec 4.1).
- installer: the empty ~/.antfly/tmp parent is removed when the install
  workdir was its only occupant.
- mergeSettings: per-embedder entries deep-merge over their defaults so
  a partial override (legacy provider+model-only settings.json) keeps
  the default `dimension` — dropping it would 400 every table create on
  the v0.2 server. Regression test included.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion

# Conflicts:
#	src/core/search-migration.ts
…lock

The resourceSummary filter keeps entries with `responseEnd >= bootStart-5`,
and under bun --isolate the performance.now() epoch spans the whole suite
run. The seeded absolute `responseEnd: 100_000` fell out of the window once
the suite had been running >100s — exactly the full-suite-on-CI condition,
which is why this test failed CI deterministically while passing locally
and in isolation (and why the earlier waitFor-headroom bump didn't help).
Relative timestamps make the seed epoch-invariant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…what-we-bind URL

Field finding from the second machine: an orphaned antfly held
127.0.0.1:3738 in a wedged state — it answered /readyz with 200 (so boot
logged "started and healthy" while our own spawn silently failed to bind
behind it) but served garbage on /db/v1/*, and the client fell to
file-only with an inscrutable "Failed to parse JSON".

- Adoption is now gated on a strict probe: a pre-existing listener must
  return parseable JSON with a `health` field from /db/v1/status before
  we treat it as our server. Failing that, boot logs an actionable
  error (lsof one-liner + kill + restart) instead of silently degrading.
- Default URL moves to http://127.0.0.1:3738 — dial exactly what the
  server binds. `localhost` resolution is machine-dependent (::1
  listeners, proxy env vars); not this incident's cause, but the same
  failure class. mergeSettings normalizes the localhost spelling,
  isLocalDefaultUrl accepts both, and the onboarding URL corrector
  rewrites the briefly-current localhost:3738 value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ad of failing cryptically

A checkout that skips bun install still loads the old npm @antfly/sdk,
whose relative paths 404 into 'Failed to parse JSON' against a healthy
server (field-verified on machine 2). The adapter now probes for the
v0.2-only tables.scanAll before spawning anything and fails fast with
the actual remediation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Machine-2 field lesson: setup steps get skipped mid-firefight, and the
resulting state — indexing works, every semantic query dead — was only
diagnosable via `bakin check search-models`, which nobody runs while
confused. Two guardrails:

- The antfly.availability health check stays WARN while models are
  missing, with the install+reindex remediation inline — so the health
  dashboard and doctor surface it persistently instead of reporting a
  green "connected".
- `bakin reindex` pre-flights the models check and warns up front that
  the reindex will populate tables while semantic search stays dead.
  Advisory only; never blocks the reindex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
markhayden and others added 3 commits June 5, 2026 23:16
~/.bakin/antfly (the migration's --data-dir, spec decision 10) sits inside
the chokidar-watched content root. The antfly server churns WAL/segment
files there constantly; the resulting event storm deadlocked Bun natively
within seconds of boot — main thread, all Bun Pool workers, and the File
Watcher thread parked on os_unfair_lock, every HTTP request hanging
forever. This was the "UI lockup": the whole server was wedged, /api/search
was just where it got noticed.

Diagnosed live via thread sampling (sample(1) on the wedged process);
boot logs show the watcher ingesting antfly's full_text_index_v0 segment
deletions immediately before the logs stop dead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…dder warmup

Two resilience guards for the cross-table search path:

- multiQuery races each sequential table query against a hard 15s ceiling
  (BAKIN_ANTFLY_QUERY_TIMEOUT_MS overrides). The fan-out is sequential by
  design (bakin#456), so one wedged query stalled every remaining table and
  hung /api/search indefinitely. A timed-out table now returns empty with a
  warn; the rest still answer.

- After connecting, the adapter fires one throwaway /ai/v1 embed per unique
  antfly-provider embedder model (fire-and-forget, debug-logged failures),
  so cold ONNX model loads are paid at boot instead of inside a user's first
  semantic query. Warmup failures are expected for models the current pin
  cannot text-embed (CLIP: InputArityMismatch, see bakin#456) and stay
  debug-level.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sync process.on('exit') net for the orphan modes diagnosed in #459: exits
that bypass the async lifecycle shutdown (dev.ts signal handlers calling
process.exit, uncaught EADDRINUSE thrown at listen time) left the antfly
child holding 3738 across dev generations. 'exit' handlers must be
synchronous; ChildProcess.kill is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@markhayden

Copy link
Copy Markdown
Owner Author

Firefight session summary (machine-3, 2026-06-05/06)

Picked up the branch mid-firefight to chase the /api/search hang + UI lockups, the process-tree orphans, and the background UnsupportedEmbeddingProvider errors. All three are root-caused; fixes pushed here and to a separate PR. The branch gained 3 commits: dd856555, a5f43764, fa131847.

Problem 1 — "/api/search hangs" was actually the whole server deadlocking

The briefed hypothesis (CLIP cold-load >>10s at query time) was disproven by timed curls: antfly answered every direct query in <0.3s while /api/search hung forever — alongside /api/tasks, /, and everything else. Thread-sampling the wedged process showed the main thread, all 12 Bun Pool workers, and the File Watcher thread parked on os_unfair_lock — a native Bun deadlock, seconds after boot.

Root cause: the migration put the private instance's --data-dir at ~/.bakin/antfly/inside the chokidar-watched content root. antfly's WAL/segment churn floods the watcher and deadlocks the process. shouldIgnoreContentWatcherPath now ignores antfly/ (dd856555). Verified live: zero antfly watcher events, server responsive indefinitely, /api/search?q=cat → 200 with real hits.

The two briefed mitigations also shipped (a5f43764), and both promptly proved their worth live:

  • Per-table query timeout in multiQuery (15s, BAKIN_ANTFLY_QUERY_TIMEOUT_MS override): the residual slow leg turned out to be real — semantic queries against a table with an active embeddings backfill hang indefinitely (>60s for a limit-3 query; Antfly v0.2.0-rc.2 upstream issues: schema-at-create breaks queries, FTS dead in swarm mode, no index-time lazy model download #456 finding 10). After any reindex, bakin_memory (~7.9K docs, ~80 docs/min on ONNX/CPU ≈ 90 min) would have hung every search; now it costs that one table's 15s cap and returns everything else.
  • Boot-time embedder warmup: BGE warms at boot (Embedder warmed); CLIP warmup fails debug-only with the upstream InputArityMismatch (see below) — exactly the absorb-quietly behavior intended.

Problem 2 — orphans: diagnosed, filed #459, fixed in PR #460 (off main)

Full mechanism in #459: (a) scripts/dev.ts's signal handlers call process.exit(0) before the lifecycle's async shutdown can stop the antfly child; (b) unhandled EADDRINUSE crashes after full boot, bypassing cleanup; (c) when deadlocked (Problem 1), no JS handler runs at all → SIGKILL → orphans. (a)+(b) are fixed on fix/dev-loop-orphans (PR #460, e2e-verified). On this branch, fa131847 adds the adapter-side sync process.on('exit') net so the antfly child dies on any JS-level exit — verified live: SIGTERM the dev process → all processes gone, both ports free.

Problem 3 — UnsupportedEmbeddingProvider: upstream, root-caused behaviorally, posted as #456 findings 8 + 9 (+10)

Reproduced on a clean lab instance (temp data dir, provider antfly only) and bisected:

  • Finding 8: image enrichment via {{remoteMedia}} fails UnsupportedEmbeddingProvider regardless of model — even CLIP, which /ai/v1/models advertises with "inputs": ["text","image"]. The error name is misleading: it fires on image content, not provider config. One failing index poisons the whole table's enrichment worker — which is why bakin_assets' perfectly-configured assets_text is also stuck at backfill: failed/0 docs.
  • Finding 9: CLIP text embedding fails InputArityMismatch everywhere (enrichment, /ai/v1/embed, and query-time — any query listing assets_visual in indexes 400s; antfly's log even names it: public table query parse failed table=bakin_assets err=error.InputArityMismatch).
  • The watcher-event correlation was reverse causality: segment churn from the enrichment retries.

Net: bakin_assets semantic search is fully dead at this pin (multimodal path broken upstream at both index and query time); asset FTS still answers. No Bakin workaround exists — resilience shipped instead. Revisit checklist is in the #456 comment.

For the upstream antfly devs

Findings 8/9/10 above (lab repros included in the #456 comments), plus two minor ones: /ai/v1/models lists garbage duplicate entries whose id is a model path with an embedded \nbackend=native; constant div large tensor len=13369344 shape={ 8704, 1, 1536 } warns during enrichment retries.

Machine state as of this comment

One clean dev generation running with all fixes; suite green (4687 pass / 0 fail), typecheck + lint clean. bakin_memory backfill in progress (searches degrade gracefully meanwhile). Earlier commits from machine-2 (6d9e2560, a27b385c) were pulled in cleanly — no overlap with this work.

🤖 Generated with Claude Code

…x health

The health plugin's "Reindex All" looked like it hung forever. The POST
itself completed — the button's follow-up fetchData() died inside
/api/plugins/health/search-status, which counted documents per table via a
matchAll QUERY. Queries against a table with an active embeddings backfill
hang indefinitely at this pin (bakin#456 finding 10), so the whole health
page froze for every post-reindex backfill window (~90 min for
bakin_memory on CPU). Three fixes, live-verified:

- tables.stats now reads doc_count from the indexes GET (which never
  blocks) instead of running a query. search-status: hang -> 4ms during an
  active backfill.

- The single-table query() path gets the same 15s no-infinite-patience
  ceiling as multiQuery — it serves the per-plugin /search routes directly.

- indexes.list returns an ARRAY in the v0.2 SDK; Object.entries over it
  named every index "0"/"1" in health payloads and made rebuildIndexes
  drop nonexistent index names (silently breaking reindex?rebuild=true).
  Normalized both shapes; names come from config.name. While there:
  backfill_state 'failed' (the bakin#456 findings-8/9 state on
  bakin_assets) now surfaces as an unhealthy index with a named error
  instead of showing green while semantic search returns nothing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@markhayden

Copy link
Copy Markdown
Owner Author

Follow-up (9e7f9f9f): the UI "Reindex All" hang is fixed. The POST itself was completing — the button's follow-up fetchData() froze inside /api/plugins/health/search-status, whose per-table doc counts ran a matchAll query that hangs forever against any backfilling table (#456 finding 10). So the health page froze for every post-reindex backfill window.

  • tables.stats now reads doc_count from the indexes GET (never blocks): search-status hang → 4ms during an active backfill; full UI flow now POST ~1.3s + fetch ~10ms.
  • The single-table query() path got the same 15s ceiling as multiQuery (it serves the per-plugin /search routes).
  • Bonus bugs found in the payload while debugging: the v0.2 SDK's indexes.list returns an arrayObject.entries over it named every index "0"/"1" in health payloads and made rebuildIndexes drop nonexistent names (silently breaking reindex?rebuild=true). Fixed via shape normalization. And backfill_state: 'failed' (the findings-8/9 state on bakin_assets) now surfaces as an unhealthy index with a named error instead of green — live-verified: the health page now shows both assets indexes red with "embeddings backfill failed…".

Suite 4690/0, lint + typecheck clean, pushed.

🤖 Generated with Claude Code

markhayden and others added 3 commits June 6, 2026 14:40
…DDRINUSE (#459)

Two dev-loop defects that minted orphaned antfly processes across
generations (diagnosed live; full writeup in #459):

- scripts/dev.ts registered SIGINT/SIGTERM handlers before server.ts loads
  and called process.exit(0) — Node runs signal listeners in registration
  order, so the lifecycle's async shutdown (the only thing that stops the
  antfly child) never executed. The dev handler now defers once the
  lifecycle owns shutdown (globalThis flag set by registerShutdownHandlers;
  checked without importing lifecycle so an early Ctrl-C during the build
  phase doesn't load the server module graph).

- server.ts had no 'error' listener on listen(): EADDRINUSE threw as an
  uncaught exception AFTER the full boot (antfly spawned, watcher running),
  bypassing cleanup. It now logs the port + lsof remediation, routes
  through the lifecycle shutdown via self-SIGTERM, and exits 1 — the
  lifecycle's final exit honors a pre-set process.exitCode instead of
  stamping 0 over it.

Verified end-to-end with two isolated-BAKIN_HOME servers on one port:
second exits 1 with the named remediation and "Shutdown complete"; first
keeps serving and exits cleanly on SIGTERM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e log stream

The raw "enrichment worker failed: UnsupportedEmbeddingProvider" error sent
the operator debugging an embedder provider config that was fine — the real
cause is upstream image enrichment rejecting every embedder at this pin
(bakin#456 finding 8; lab-bisected). The log classifier now appends the
explanation + issue pointer and demotes the line to warn — it repeats on
every boot reconcile and asset write, and the health page already carries
the red per-index state. Same treatment for the CLIP InputArityMismatch
enrichment failure (finding 9).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nfig

Embedder settings entries may now carry antfly's documented EmbedderConfig
pass-throughs: api_url (route that embedder's calls over HTTP to a named
inference endpoint, e.g. the private instance's own /ai/v1) and multimodal
(declare non-text support for models outside the built-in registry).
Forwarded verbatim at table create; unset entries omit the keys — no
behavior change without explicit settings.

This is the Bakin half of running image search against a media-path-fixed
antfly: visual embedder -> antflydb/clipclap + api_url loopback. Verified
live end-to-end on a locally patched antfly main: assets enrich (text+
visual), and "a photo of a cat" via /api/search returns the actual pet
portrait image (visual score 0.79). See #456 verification comments and
#466 for the pin-bump plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant