
Semantic search branch #12

Open: Yugi-2 wants to merge 32 commits into main from testing/mxbai-tweaks


Yugi-2 (Collaborator) commented May 28, 2026

Pulling in the work from #11 and a few other adjustments.

yug and others added 30 commits May 14, 2026 22:41
- Rewrites main.py with one ES index per embedding model (openai-small-en,
  openai-small-multi, nomic, mxbai, embeddinggemma) for independent indexing
- Adds incremental indexing, multilingual support (openai-small-multi indexes
  all 180k Arabic + English docs), and lexical-only index mode
- Adds testbed search UI (templates/search.html) with mode/model toggle pills
  and side-by-side comparison vs lexical baseline
- Updates docker-compose.yml (port 5001, tei-gemma profile) and .env.sample

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional path that pre-computes mxbai-embed-large vectors via a
TEI-backed HF Inference Endpoint and ships them inline in the bulk
payload, so ES `semantic_text` skips its own inference call during
indexing. Query-time embedding still goes through the Ollama-bound ES
inference endpoint. Configured via HUGGING_FACE_KEY + HF_DEDICATED_URL;
unset either to fall back to Ollama for indexing.

Empty-text hadiths are filtered out so TEI doesn't reject whole batches.
RateLimiter lives in utils/ for reuse and easier unit testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three risks the code review surfaced:

1. urllib.error.URLError / socket.timeout were not caught, so a single
   transient network blip would kill an in-progress 48K-doc rebuild
   after potentially minutes of successful work. Now retried like 5xx.

2. Backoff `wait = max(parsed, floor, min(2**attempt, 30))` let a
   server-supplied Retry-After dominate without bound: a 503 with
   `Retry-After: 600` would stall a worker for 10 minutes per retry.
   Cap parsed at 60s before combining with floor and exp backoff.

3. uwsgi `harakiri = 35` would kill `/index` long before the ~9-minute
   embedding phase finishes. Added a per-route override
   (`route = ^/index harakiri:1800`) so search endpoints keep the
   strict 35s limit but admin-triggered rebuilds get the headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
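
The capped backoff from item 2 above could look roughly like the sketch below. The helper names, the `floor` default, and the handling of an unparseable header are assumptions for illustration and are not taken from the PR; only the `max(...)` expression and the 60s/30s caps come from the commit message. It also assumes Retry-After arrives as a number of seconds, not an HTTP date.

```python
RETRY_AFTER_CAP_S = 60   # cap on the server-supplied Retry-After (from the commit)
EXP_BACKOFF_CAP_S = 30   # cap on the exponential term (from the original expression)


def parse_retry_after(header_value):
    """Return the Retry-After delay in seconds, or 0.0 if missing or unparseable.

    Hypothetical helper: only the delta-seconds form is handled here.
    """
    try:
        return max(0.0, float(header_value))
    except (TypeError, ValueError):
        return 0.0


def backoff_wait(attempt, retry_after_header=None, floor=1.0):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    # Cap the server hint first, so it can no longer dominate without bound.
    parsed = min(parse_retry_after(retry_after_header), RETRY_AFTER_CAP_S)
    return max(parsed, floor, min(2 ** attempt, EXP_BACKOFF_CAP_S))
```

With this shape, `Retry-After: 600` yields a 60-second wait, and a missing header falls back to the larger of the floor and the exponential term.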
- Drop empty hadithText rows in the SQL SELECT so the rest of the pipeline
  doesn't have to guard against them. Removes the empty-text filter we
  previously added in _attach_semantic_field as a downstream bandaid.

- Pull the inline-chunk doc construction into _inline_chunk_doc so the
  rewrite loop is a one-line list comprehension. Hoists model_settings
  out of the per-doc loop (was rebuilt 48K+ times per rebuild).

- Iterate ThreadPoolExecutor futures with as_completed instead of
  submission order, so one slow batch can't idle the other workers.

- Simplify _attach_semantic_field to a single comprehension now that
  empty text is impossible by the time it runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
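
A minimal sketch of the `as_completed` pattern mentioned above, assuming a hypothetical `embed_fn` that turns one batch into one result. The function name, the `max_workers` default, and the index bookkeeping are illustrative and not taken from main.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_batches(batches, embed_fn, max_workers=4):
    """Run embed_fn over batches concurrently; return results in batch order."""
    results = [None] * len(batches)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its batch position so order can be restored.
        future_to_index = {
            pool.submit(embed_fn, batch): i for i, batch in enumerate(batches)
        }
        # Handle futures as they finish, not in submission order.
        for future in as_completed(future_to_index):
            # .result() re-raises any exception raised inside embed_fn.
            results[future_to_index[future]] = future.result()
    return results
```

Keeping the future-to-index map means downstream code still sees results in the original batch order even though they are collected out of order.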
- Rename MXBAI_ENABLED → SEMANTIC_ENABLED. The toggle is now a top-level
  module constant rather than a nested per-model field, since there's only
  one semantic model. Adding more models in the future means a new
  EMBEDDING_MODELS entry; the on/off switch stays where it is.

- EMBEDDING_MODELS becomes pure data with no env coupling. _ENABLED_MODELS
  is now the catalog gated on SEMANTIC_ENABLED, no dict-comp filter.

- README: emphasize that /index without model= builds both lexical and
  semantic by default (it always did, but the prose was burying it).
  Drop &model=mxbai from the default search example since it's the only
  enabled model — leave it only on the explicit-pin examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… order

_resolve_model_key was picking next(iter(_ENABLED_MODELS), None) when no
?model= was passed, which is fragile: adding a second model entry above
mxbai in the catalog dict would silently change the default for every
existing client. Set DEFAULT_SEMANTIC_MODEL = "mxbai" as a top-level
constant and read from that.

Also collapses the two-branch resolver into one — "key or default" then
membership check — since the default goes through the same validation
path as a user-supplied one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BULK_REQUEST_TIMEOUT = 300 if SEMANTIC_ENABLED else 60 was misleading:
the lexical bulk path inside _rebuild_index and _incremental_index already
selects via `timeout = BULK_REQUEST_TIMEOUT if model else 60`, so the
ternary's "else 60" branch was effectively dead when SEMANTIC_ENABLED
was false (only lexical reachable, but it takes the inline 60 anyway).

Replace with two named constants — LEXICAL_BULK_TIMEOUT_S and
SEMANTIC_BULK_TIMEOUT_S — and drop the now-dead `timeout or constant`
fallback in _bulk_index (both call sites always pass a value).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old shape had:

    SEARCH_MODES = ("lexical", "semantic")
    SEMANTIC_MODES = ("semantic",)

with a _resolve_mode that ended in `if mode in SEMANTIC_MODES: mode = "semantic"`
— a literal no-op since SEMANTIC_MODES is a one-element tuple of "semantic".
That second `if` was probably a leftover from when multiple semantic modes
collapsed to one canonical label.

Replace both tuples with LEXICAL_MODE / SEMANTIC_MODE string constants and
fold the resolver into one check: return SEMANTIC_MODE iff the input is
"semantic" AND SEMANTIC_ENABLED is on, else LEXICAL_MODE. Call sites that
did `mode in SEMANTIC_MODES` become `mode == SEMANTIC_MODE`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses a str-Enum mixin so SearchMode.SEMANTIC == "semantic" (and serializes as
"semantic" in the access log JSON without extra plumbing), while
_resolve_mode gets free validation: SearchMode((args.get("mode") or
"").lower()) raises ValueError on junk input which we catch and fall
back to LEXICAL. The previous string-constant version did its own
membership check; the enum constructor handles it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The handler previously split into two `if mode == SearchMode.SEMANTIC:`
blocks with the lexical-only `build_lexical` helper wedged between them,
and threaded model/search_index through both paths even though only the
semantic branch ever used the model fields.

Hoist the semantic branch to the top as an early return: resolve the
model, log, dispatch to _semantic_search. The lexical path that follows
doesn't need model_key, model, or search_index — it uses LEXICAL_INDEX
directly. Also drops `model = _ENABLED_MODELS.get(model_key) if
model_key else None` in favor of `_ENABLED_MODELS[model_key]`, since
_resolve_model_key already guarantees the key is in the dict whenever
err is None.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old shape decided what to build via a nested ternary that fell
through to _ENABLED_MODELS as the default branch — so an unknown
?model= value (typo, stale doc, fat-finger) was interpreted as "build
everything" instead of "user typed something wrong." Also the if-block
above it skipped lexical for the same input, so a typo'd
/index?model=mxBai would silently rebuild semantic but not lexical.

Restructure into an explicit four-arm if/elif/else: None / "lexical" /
known model / everything else → 400 with the valid set. Each case
sets `build_lexical` and `models_to_index` directly, so the reader
doesn't have to evaluate a nested ternary to know which branch ran.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
?model= overloaded "which target" with "which embedding model" and made
?model=lexical a category error (lexical isn't a model). Replace with
?targets=lexical,mxbai — comma-separated subset of valid build targets.

Behavior matrix:
  /index                       → build everything (lexical + every enabled)
  /index?targets=lexical       → lexical only
  /index?targets=mxbai         → that semantic model only
  /index?targets=lexical,mxbai → both
  /index?targets=              → 400 (empty list, probably a bug)
  /index?targets=<unknown>     → 400 with the valid set

Validation moved up before the SQL fetch so a typo doesn't wait through
a 30-second MySQL roundtrip just to be rejected. The build block
collapses from a four-arm if/elif/else to two lines: `if "lexical" in
targets` + a dict-comprehension filter on _ENABLED_MODELS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
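
The behavior matrix above can be sketched as a small parser plus the two-line build selection. `parse_targets`, `plan_build`, the `(targets, err)` return shape, and the error strings are hypothetical names and choices for illustration; an `err` value stands in for the 400 response, and the catalog is a placeholder.

```python
# Placeholder catalog: the real entries live in main.py.
_ENABLED_MODELS = {"mxbai": {}}
VALID_TARGETS = ("lexical", *_ENABLED_MODELS)


def parse_targets(raw):
    """Return (targets, err). `raw` is None when ?targets= is absent."""
    if raw is None:
        return set(VALID_TARGETS), None  # bare /index builds everything
    targets = {t.strip() for t in raw.split(",") if t.strip()}
    valid = ", ".join(VALID_TARGETS)
    if not targets:
        return None, f"empty targets list; valid targets: {valid}"
    unknown = targets - set(VALID_TARGETS)
    if unknown:
        return None, f"unknown targets {sorted(unknown)}; valid targets: {valid}"
    return targets, None


def plan_build(targets):
    """Translate a validated target set into what the build block needs."""
    build_lexical = "lexical" in targets
    models_to_index = {k: v for k, v in _ENABLED_MODELS.items() if k in targets}
    return build_lexical, models_to_index
```

Running `parse_targets` before any database work matches the commit's point that a typo should be rejected without waiting on the SQL fetch. Matching is case-sensitive in this sketch, so `mxBai` is rejected.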
After the recent refactors (SEMANTIC_ENABLED single toggle, ?targets=
on /index, ./:/code bind mount in dev compose) two README sections were
inaccurate:

- "Adding a model" still told you to add NEWMODEL_ENABLED=false to
  .env.sample (per-model env var; no longer how it works) and to call
  /index?model=newmodel (renamed to targets=). Rewrote to reflect the
  single SEMANTIC_ENABLED toggle, the targets= subset, and that the
  default for /search comes from DEFAULT_SEMANTIC_MODEL.

- "Batch evaluation" had three docker cp commands that were no-ops in
  the dev workflow — the script writes to /code/ inside the container
  which is the host repo root via the bind mount. Collapsed to one
  docker exec and a note about where the outputs land.

Spot-verified the rest of the README against current main.py /
docker-compose.* / uwsgi.ini — architecture diagram, env-config note,
build-the-indexes examples, prod section, search-mode docs, and API
endpoint table all match the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The line claimed mode=semantic "uses the only enabled embedding model
by default" — that described the older next(iter(_ENABLED_MODELS))
behavior. Current code uses DEFAULT_SEMANTIC_MODEL = "mxbai" as an
explicit constant: adding a second enabled model doesn't change which
one is the default. Reflect that in the prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kept the non-obvious why bits (HF pool 429s under TEI's stated limit,
batch×max_input vs max_batch_tokens, -1 disables throttling, ceiling
prevents pathological Retry-After). Cut the procedural narration
("all env-overridable", "16 × 512 = 8192", "1 hour idle" math).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>