Semantic search branch #12
Open
Yugi-2 wants to merge 32 commits into
git push
- Rewrites main.py with one ES index per embedding model (openai-small-en, openai-small-multi, nomic, mxbai, embeddinggemma) for independent indexing
- Adds incremental indexing, multilingual support (openai-small-multi indexes all 180k Arabic + English docs), and lexical-only index mode
- Adds testbed search UI (templates/search.html) with mode/model toggle pills and side-by-side comparison vs lexical baseline
- Updates docker-compose.yml (port 5001, tei-gemma profile) and .env.sample

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an optional path that pre-computes mxbai-embed-large vectors via a TEI-backed HF Inference Endpoint and ships them inline in the bulk payload, so ES `semantic_text` skips its own inference call during indexing. Query-time embedding still goes through the Ollama-bound ES inference endpoint. Configured via HUGGING_FACE_KEY + HF_DEDICATED_URL; unset either to fall back to Ollama for indexing. Empty-text hadiths are filtered out so TEI doesn't reject whole batches. RateLimiter lives in utils/ for reuse and easier unit testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
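The RateLimiter itself is not shown in this conversation. A minimal sketch of what a reusable, unit-testable limiter could look like is below; the class shape, the `clock`/`sleep` injection points, and the "rate <= 0 disables throttling" rule (modeled on the -1 convention mentioned later in this PR) are all assumptions, not the code in utils/.

```python
import threading
import time


class RateLimiter:
    """Spaces calls at least 1/rate seconds apart (hypothetical sketch).

    A rate <= 0 disables throttling entirely. clock and sleep are
    injectable so tests can run without real waiting.
    """

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate_per_second if rate_per_second > 0 else 0.0
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def acquire(self):
        """Block until the caller's slot arrives; return the seconds waited."""
        if self._interval == 0.0:
            return 0.0
        with self._lock:
            now = self._clock()
            wait = max(0.0, self._next_slot - now)
            # Reserve the following slot before releasing the lock.
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            self._sleep(wait)
        return wait
```

Sleeping outside the lock lets other worker threads reserve their own slots while one thread waits.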
Three risks the code review surfaced:

1. urllib.error.URLError / socket.timeout were not caught, so a single transient network blip would kill an in-progress 48K-doc rebuild after potentially minutes of successful work. Now retried like 5xx.
2. Backoff `wait = max(parsed, floor, min(2**attempt, 30))` let a server-supplied Retry-After dominate without bound: a 503 with `Retry-After: 600` would stall a worker for ten minutes per retry. Cap parsed at 60s before combining with floor and exp backoff.
3. uwsgi `harakiri = 35` would kill `/index` long before the ~9-minute embedding phase finishes. Added a per-route override (`route = ^/index harakiri:1800`) so search endpoints keep the strict 35s limit but admin-triggered rebuilds get the headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
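For reviewers, the first two fixes can be sketched roughly as follows. The function names and the constant names are hypothetical; only the formula, the 60s cap, and the exception types come from the commit message.

```python
import socket
import urllib.error

RETRY_AFTER_CAP_S = 60  # ceiling on a server-supplied Retry-After
EXP_BACKOFF_CAP_S = 30  # from min(2**attempt, 30) in the original formula


def backoff_wait(parsed_retry_after, floor, attempt):
    """Seconds to sleep before the next retry (hypothetical helper)."""
    capped = min(parsed_retry_after, RETRY_AFTER_CAP_S)
    return max(capped, floor, min(2 ** attempt, EXP_BACKOFF_CAP_S))


def is_retryable(exc):
    """Treat transient network failures like 5xx responses."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 500 <= exc.code < 600
    return isinstance(exc, (urllib.error.URLError, socket.timeout))
```

With the cap in place, `backoff_wait(600, 1, 2)` yields 60 instead of 600.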
- Drop empty hadithText rows in the SQL SELECT so the rest of the pipeline doesn't have to guard against them. Removes the empty-text filter we previously added in _attach_semantic_field as a downstream bandaid.
- Pull the inline-chunk doc construction into _inline_chunk_doc so the rewrite loop is a one-line list comprehension. Hoists model_settings out of the per-doc loop (was rebuilt 48K+ times per rebuild).
- Iterate ThreadPoolExecutor futures with as_completed instead of submission order, so one slow batch can't idle the other workers.
- Simplify _attach_semantic_field to a single comprehension now that empty text is impossible by the time it runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
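The as_completed change can be illustrated with a small sketch. `embed_batches` and `embed_fn` are hypothetical stand-ins for the real embedding call; the point is only that results are collected as each future finishes while the original batch order is still recoverable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_batches(batches, embed_fn, max_workers=4):
    """Run embed_fn over batches concurrently; return results in batch order."""
    results = [None] * len(batches)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its batch index so completion order
        # does not scramble the output.
        future_to_idx = {
            pool.submit(embed_fn, batch): i for i, batch in enumerate(batches)
        }
        for future in as_completed(future_to_idx):
            # result() re-raises any exception from the worker thread.
            results[future_to_idx[future]] = future.result()
    return results
```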
- Rename MXBAI_ENABLED → SEMANTIC_ENABLED. The toggle is now a top-level module constant rather than a nested per-model field, since there's only one semantic model. Adding more models in the future means a new EMBEDDING_MODELS entry; the on/off switch stays where it is.
- EMBEDDING_MODELS becomes pure data with no env coupling. _ENABLED_MODELS is now the catalog gated on SEMANTIC_ENABLED, no dict-comp filter.
- README: emphasize that /index without model= builds both lexical and semantic by default (it always did, but the prose was burying it). Drop &model=mxbai from the default search example since it's the only enabled model; leave it only on the explicit-pin examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
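A rough sketch of the resulting shape, assuming the catalog values and the `enabled_models` helper shown here (both hypothetical; the real module reads the toggle from the environment and assigns `_ENABLED_MODELS` at import time):

```python
# Pure data: no environment lookups inside the catalog.
# The per-model fields below are placeholders, not the real settings.
EMBEDDING_MODELS = {
    "mxbai": {"index": "example-mxbai-index"},
}


def enabled_models(semantic_enabled):
    """Return the whole catalog when semantic search is on, else nothing."""
    return EMBEDDING_MODELS if semantic_enabled else {}
```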
… order _resolve_model_key was picking next(iter(_ENABLED_MODELS), None) when no ?model= was passed, which is fragile: adding a second model entry above mxbai in the catalog dict would silently change the default for every existing client. Set DEFAULT_SEMANTIC_MODEL = "mxbai" as a top-level constant and read from that. Also collapses the two-branch resolver into one — "key or default" then membership check — since the default goes through the same validation path as a user-supplied one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BULK_REQUEST_TIMEOUT = 300 if SEMANTIC_ENABLED else 60 was misleading: the lexical bulk path inside _rebuild_index and _incremental_index already selects via `timeout = BULK_REQUEST_TIMEOUT if model else 60`, so the ternary's "else 60" branch was effectively dead when SEMANTIC_ENABLED was false (only lexical reachable, but it takes the inline 60 anyway). Replace with two named constants — LEXICAL_BULK_TIMEOUT_S and SEMANTIC_BULK_TIMEOUT_S — and drop the now-dead `timeout or constant` fallback in _bulk_index (both call sites always pass a value). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old shape had:
SEARCH_MODES = ("lexical", "semantic")
SEMANTIC_MODES = ("semantic",)
with a _resolve_mode that ended in `if mode in SEMANTIC_MODES: mode = "semantic"`
— a literal no-op since SEMANTIC_MODES is a one-element tuple of "semantic".
That second `if` was probably a leftover from when multiple semantic modes
collapsed to one canonical label.
Replace both tuples with LEXICAL_MODE / SEMANTIC_MODE string constants and
fold the resolver into one check: return SEMANTIC_MODE iff the input is
"semantic" AND SEMANTIC_ENABLED is on, else LEXICAL_MODE. Call sites that
did `mode in SEMANTIC_MODES` become `mode == SEMANTIC_MODE`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
str-Enum mixin so SearchMode.SEMANTIC == "semantic" (and serializes as
"semantic" in the access log JSON without extra plumbing), while
_resolve_mode gets free validation: SearchMode((args.get("mode") or
"").lower()) raises ValueError on junk input which we catch and fall
back to LEXICAL. The previous string-constant version did its own
membership check; the enum constructor handles it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The handler previously split into two `if mode == SearchMode.SEMANTIC:` blocks with the lexical-only `build_lexical` helper wedged between them, and threaded model/search_index through both paths even though only the semantic branch ever used the model fields. Hoist the semantic branch to the top as an early return: resolve the model, log, dispatch to _semantic_search. The lexical path that follows doesn't need model_key, model, or search_index — it uses LEXICAL_INDEX directly. Also drops `model = _ENABLED_MODELS.get(model_key) if model_key else None` in favor of `_ENABLED_MODELS[model_key]`, since _resolve_model_key already guarantees the key is in the dict whenever err is None. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old shape decided what to build via a nested ternary that fell through to _ENABLED_MODELS as the default branch — so an unknown ?model= value (typo, stale doc, fat-finger) was interpreted as "build everything" instead of "user typed something wrong." Also the if-block above it skipped lexical for the same input, so a typo'd /index?model=mxBai would silently rebuild semantic but not lexical. Restructure into an explicit four-arm if/elif/else: None / "lexical" / known model / everything else → 400 with the valid set. Each case sets `build_lexical` and `models_to_index` directly, so the reader doesn't have to evaluate a nested ternary to know which branch ran. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
?model= overloaded "which target" with "which embedding model" and made ?model=lexical a category error (lexical isn't a model). Replace with ?targets=lexical,mxbai, a comma-separated subset of valid build targets.

Behavior matrix:

    /index                        → build everything (lexical + every enabled)
    /index?targets=lexical        → lexical only
    /index?targets=mxbai          → that semantic model only
    /index?targets=lexical,mxbai  → both
    /index?targets=               → 400 (empty list, probably a bug)
    /index?targets=<unknown>      → 400 with the valid set

Validation moved up before the SQL fetch so a typo doesn't wait through a 30-second MySQL roundtrip just to be rejected. The build block collapses from a four-arm if/elif/else to two lines: `if "lexical" in targets` + a dict-comprehension filter on _ENABLED_MODELS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
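The behavior matrix can be sketched as a small parser. `parse_targets`, its `(targets, err)` return shape, and the error strings are hypothetical; case-sensitive matching is also an assumption, chosen because an earlier commit treats `mxBai` as a typo.

```python
LEXICAL_TARGET = "lexical"
_ENABLED_MODELS = {"mxbai": {}}  # placeholder catalog for the example


def parse_targets(raw):
    """Return (set of targets, None), or (None, error) for a 400 response.

    raw is None when ?targets= is absent, which means "build everything".
    """
    valid = [LEXICAL_TARGET, *_ENABLED_MODELS]
    if raw is None:
        return set(valid), None
    targets = {t.strip() for t in raw.split(",") if t.strip()}
    if not targets:
        return None, f"targets is empty; valid: {valid}"
    unknown = sorted(targets - set(valid))
    if unknown:
        return None, f"unknown targets {unknown}; valid: {valid}"
    return targets, None


def plan_build(targets):
    """The two-line build block: a lexical flag plus filtered model catalog."""
    build_lexical = LEXICAL_TARGET in targets
    models_to_index = {k: v for k, v in _ENABLED_MODELS.items() if k in targets}
    return build_lexical, models_to_index
```

Running the parser before any database work is what keeps a typo from paying for the SQL fetch.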
After the recent refactors (SEMANTIC_ENABLED single toggle, ?targets= on /index, ./:/code bind mount in dev compose) two README sections were inaccurate:

- "Adding a model" still told you to add NEWMODEL_ENABLED=false to .env.sample (per-model env var; no longer how it works) and to call /index?model=newmodel (renamed to targets=). Rewrote to reflect the single SEMANTIC_ENABLED toggle, the targets= subset, and that the default for /search comes from DEFAULT_SEMANTIC_MODEL.
- "Batch evaluation" had three docker cp commands that were no-ops in the dev workflow: the script writes to /code/ inside the container, which is the host repo root via the bind mount. Collapsed to one docker exec and a note about where the outputs land.

Spot-verified the rest of the README against current main.py / docker-compose.* / uwsgi.ini: architecture diagram, env-config note, build-the-indexes examples, prod section, search-mode docs, and API endpoint table all match the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The line claimed mode=semantic "uses the only enabled embedding model by default" — that described the older next(iter(_ENABLED_MODELS)) behavior. Current code uses DEFAULT_SEMANTIC_MODEL = "mxbai" as an explicit constant: adding a second enabled model doesn't change which one is the default. Reflect that in the prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kept the non-obvious why bits (HF pool 429s under TEI's stated limit,
batch×max_input vs max_batch_tokens, -1 disables throttling, ceiling
prevents pathological Retry-After). Cut the procedural narration
("all env-overridable", "16 × 512 = 8192", "1 hour idle" math).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
added 2 commits
May 28, 2026 13:34
Pulling in the work from #11 and a few other adjustments