Semantic search branche by Yugi-2 · Pull Request #12 · sunnah-com/search

Yugi-2 · 2026-05-28T02:08:34Z

Pulling in the work from #11 and a few other adjustments

git push

- Rewrites main.py with one ES index per embedding model (openai-small-en, openai-small-multi, nomic, mxbai, embeddinggemma) for independent indexing
- Adds incremental indexing, multilingual support (openai-small-multi indexes all 180k Arabic + English docs), and lexical-only index mode
- Adds testbed search UI (templates/search.html) with mode/model toggle pills and side-by-side comparison vs lexical baseline
- Updates docker-compose.yml (port 5001, tei-gemma profile) and .env.sample

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an optional path that pre-computes mxbai-embed-large vectors via a TEI-backed HF Inference Endpoint and ships them inline in the bulk payload, so ES `semantic_text` skips its own inference call during indexing. Query-time embedding still goes through the Ollama-bound ES inference endpoint. Configured via HUGGING_FACE_KEY + HF_DEDICATED_URL; unset either to fall back to Ollama for indexing. Empty-text hadiths are filtered out so TEI doesn't reject whole batches. RateLimiter lives in utils/ for reuse and easier unit testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three risks the code-review surfaced:

1. urllib.error.URLError / socket.timeout were not caught, so a single transient network blip would kill an in-progress 48K-doc rebuild after potentially minutes of successful work. Now retried like 5xx.
2. Backoff `wait = max(parsed, floor, min(2**attempt, 30))` let a server-supplied Retry-After dominate without bound: a 503 with `Retry-After: 600` would stall a worker for hours. Cap parsed at 60s before combining with floor and exp backoff.
3. uwsgi `harakiri = 35` would kill `/index` long before the ~9-minute embedding phase finishes. Added a per-route override (`route = ^/index harakiri:1800`) so search endpoints keep the strict 35s limit but admin-triggered rebuilds get the headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
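A minimal sketch of the first two fixes, assuming the formula quoted in the commit message. The helper names `is_retryable` and `backoff_wait`, the default floor of 1 second, and the decision to treat 429 as retryable are illustrative assumptions, not taken from main.py.

```python
import socket
import urllib.error
from typing import Optional

RETRY_AFTER_CAP_S = 60.0  # cap on server-supplied Retry-After (from the commit message)
EXP_BACKOFF_CAP_S = 30.0  # cap on the exponential term (from the quoted formula)


def is_retryable(exc: BaseException) -> bool:
    """Hypothetical helper: decide whether an error is worth retrying."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # Treating 429 as retryable is an assumption; 5xx follows the commit text.
        return exc.code == 429 or 500 <= exc.code < 600
    # Transient network failures are retried like 5xx.
    return isinstance(exc, (urllib.error.URLError, socket.timeout))


def backoff_wait(attempt: int, retry_after: Optional[float], floor: float = 1.0) -> float:
    """Hypothetical helper: seconds to sleep before retry number `attempt`."""
    # Clamp Retry-After so a pathological header cannot stall a worker.
    parsed = min(retry_after, RETRY_AFTER_CAP_S) if retry_after is not None else 0.0
    return max(parsed, floor, min(2 ** attempt, EXP_BACKOFF_CAP_S))
```

With these assumptions, a 503 carrying `Retry-After: 600` waits 60 seconds rather than 600, and a response without the header falls back to the larger of the floor and the capped exponential term.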

- Drop empty hadithText rows in the SQL SELECT so the rest of the pipeline doesn't have to guard against them. Removes the empty-text filter we previously added in _attach_semantic_field as a downstream bandaid.
- Pull the inline-chunk doc construction into _inline_chunk_doc so the rewrite loop is a one-line list comprehension. Hoists model_settings out of the per-doc loop (was rebuilt 48K+ times per rebuild).
- Iterate ThreadPoolExecutor futures with as_completed instead of submission order, so one slow batch can't idle the other workers.
- Simplify _attach_semantic_field to a single comprehension now that empty text is impossible by the time it runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
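The as_completed change can be sketched roughly as follows. `embed_batches` and `embed_fn` are hypothetical names standing in for whatever main.py actually calls; the point is only that results are consumed in completion order while the output keeps the original batch order.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def embed_batches(batches, embed_fn, max_workers=4):
    """Run embed_fn over every batch; return results in the original batch order.

    Hypothetical helper for illustration, not the function used in main.py.
    """
    results = [None] * len(batches)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which batch each future belongs to.
        future_to_index = {
            pool.submit(embed_fn, batch): i for i, batch in enumerate(batches)
        }
        # Handle futures as they finish instead of in submission order.
        for future in as_completed(future_to_index):
            # .result() re-raises any exception from the worker thread.
            results[future_to_index[future]] = future.result()
    return results
```

Keeping an index per future means the caller still gets vectors aligned with its input batches even though completion order is arbitrary.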

- Rename MXBAI_ENABLED → SEMANTIC_ENABLED. The toggle is now a top-level module constant rather than a nested per-model field, since there's only one semantic model. Adding more models in the future means a new EMBEDDING_MODELS entry; the on/off switch stays where it is.
- EMBEDDING_MODELS becomes pure data with no env coupling. _ENABLED_MODELS is now the catalog gated on SEMANTIC_ENABLED, no dict-comp filter.
- README: emphasize that /index without model= builds both lexical and semantic by default (it always did, but the prose was burying it). Drop &model=mxbai from the default search example since it's the only enabled model; leave it only on the explicit-pin examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… order

_resolve_model_key was picking next(iter(_ENABLED_MODELS), None) when no ?model= was passed, which is fragile: adding a second model entry above mxbai in the catalog dict would silently change the default for every existing client. Set DEFAULT_SEMANTIC_MODEL = "mxbai" as a top-level constant and read from that. Also collapses the two-branch resolver into one ("key or default", then membership check), since the default goes through the same validation path as a user-supplied one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BULK_REQUEST_TIMEOUT = 300 if SEMANTIC_ENABLED else 60 was misleading: the lexical bulk path inside _rebuild_index and _incremental_index already selects via `timeout = BULK_REQUEST_TIMEOUT if model else 60`, so the ternary's "else 60" branch was effectively dead when SEMANTIC_ENABLED was false (only lexical reachable, but it takes the inline 60 anyway). Replace with two named constants — LEXICAL_BULK_TIMEOUT_S and SEMANTIC_BULK_TIMEOUT_S — and drop the now-dead `timeout or constant` fallback in _bulk_index (both call sites always pass a value). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The old shape had:

    SEARCH_MODES = ("lexical", "semantic")
    SEMANTIC_MODES = ("semantic",)

with a _resolve_mode that ended in `if mode in SEMANTIC_MODES: mode = "semantic"`, a literal no-op since SEMANTIC_MODES is a one-element tuple of "semantic". That second `if` was probably a leftover from when multiple semantic modes collapsed to one canonical label. Replace both tuples with LEXICAL_MODE / SEMANTIC_MODE string constants and fold the resolver into one check: return SEMANTIC_MODE iff the input is "semantic" AND SEMANTIC_ENABLED is on, else LEXICAL_MODE. Call sites that did `mode in SEMANTIC_MODES` become `mode == SEMANTIC_MODE`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

str-Enum mixin so SearchMode.SEMANTIC == "semantic" (and serializes as "semantic" in the access log JSON without extra plumbing), while _resolve_mode gets free validation: SearchMode((args.get("mode") or "").lower()) raises ValueError on junk input which we catch and fall back to LEXICAL. The previous string-constant version did its own membership check; the enum constructor handles it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The handler previously split into two `if mode == SearchMode.SEMANTIC:` blocks with the lexical-only `build_lexical` helper wedged between them, and threaded model/search_index through both paths even though only the semantic branch ever used the model fields. Hoist the semantic branch to the top as an early return: resolve the model, log, dispatch to _semantic_search. The lexical path that follows doesn't need model_key, model, or search_index — it uses LEXICAL_INDEX directly. Also drops `model = _ENABLED_MODELS.get(model_key) if model_key else None` in favor of `_ENABLED_MODELS[model_key]`, since _resolve_model_key already guarantees the key is in the dict whenever err is None. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The old shape decided what to build via a nested ternary that fell through to _ENABLED_MODELS as the default branch — so an unknown ?model= value (typo, stale doc, fat-finger) was interpreted as "build everything" instead of "user typed something wrong." Also the if-block above it skipped lexical for the same input, so a typo'd /index?model=mxBai would silently rebuild semantic but not lexical. Restructure into an explicit four-arm if/elif/else: None / "lexical" / known model / everything else → 400 with the valid set. Each case sets `build_lexical` and `models_to_index` directly, so the reader doesn't have to evaluate a nested ternary to know which branch ran. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

?model= overloaded "which target" with "which embedding model" and made ?model=lexical a category error (lexical isn't a model). Replace with ?targets=lexical,mxbai, a comma-separated subset of valid build targets.

Behavior matrix:

    /index                        → build everything (lexical + every enabled)
    /index?targets=lexical        → lexical only
    /index?targets=mxbai          → that semantic model only
    /index?targets=lexical,mxbai  → both
    /index?targets=               → 400 (empty list, probably a bug)
    /index?targets=<unknown>      → 400 with the valid set

Validation moved up before the SQL fetch so a typo doesn't wait through a 30-second MySQL roundtrip just to be rejected. The build block collapses from a four-arm if/elif/else to two lines: `if "lexical" in targets` + a dict-comprehension filter on _ENABLED_MODELS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
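The behavior matrix can be sketched as a small parser plus the two-line build selection. `parse_targets`, its `(targets, err)` return shape, and the error strings are hypothetical; the caller is assumed to turn a non-None error into a 400 response before any SQL work starts.

```python
def parse_targets(raw, enabled_models):
    """Hypothetical helper: validate ?targets= and return (targets, err)."""
    valid = ["lexical", *enabled_models]
    if raw is None:
        # No ?targets= at all: build everything.
        return valid, None
    targets = [t.strip() for t in raw.split(",") if t.strip()]
    if not targets:
        # ?targets= present but empty: reject rather than guess.
        return None, f"empty targets; valid: {valid}"
    unknown = [t for t in targets if t not in valid]
    if unknown:
        # Matching is case-sensitive, so a typo such as "mxBai" lands here.
        return None, f"unknown targets {unknown}; valid: {valid}"
    return targets, None


def select_builds(targets, enabled_models):
    """The two-line build selection described in the commit message."""
    build_lexical = "lexical" in targets
    models_to_index = {k: v for k, v in enabled_models.items() if k in targets}
    return build_lexical, models_to_index
```

For example, `parse_targets("lexical,mxbai", {"mxbai": {}})` yields both targets, while `parse_targets("mxBai", {"mxbai": {}})` returns an error listing the valid set.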

After the recent refactors (SEMANTIC_ENABLED single toggle, ?targets= on /index, ./:/code bind mount in dev compose) two README sections were inaccurate:

- "Adding a model" still told you to add NEWMODEL_ENABLED=false to .env.sample (per-model env var; no longer how it works) and to call /index?model=newmodel (renamed to targets=). Rewrote to reflect the single SEMANTIC_ENABLED toggle, the targets= subset, and that the default for /search comes from DEFAULT_SEMANTIC_MODEL.
- "Batch evaluation" had three docker cp commands that were no-ops in the dev workflow: the script writes to /code/ inside the container, which is the host repo root via the bind mount. Collapsed to one docker exec and a note about where the outputs land.

Spot-verified the rest of the README against current main.py / docker-compose.* / uwsgi.ini: architecture diagram, env-config note, build-the-indexes examples, prod section, search-mode docs, and API endpoint table all match the code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The line claimed mode=semantic "uses the only enabled embedding model by default" — that described the older next(iter(_ENABLED_MODELS)) behavior. Current code uses DEFAULT_SEMANTIC_MODEL = "mxbai" as an explicit constant: adding a second enabled model doesn't change which one is the default. Reflect that in the prose. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Kept the non-obvious why bits (HF pool 429s under TEI's stated limit, batch×max_input vs max_batch_tokens, -1 disables throttling, ceiling prevents pathological Retry-After). Cut the procedural narration ("all env-overridable", "16 × 512 = 8192", "1 hour idle" math). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yug and others added 30 commits May 14, 2026 22:41

hybrid search

1760025

Back to openai

babe5e2

Semantic only mode

3a3630c

incremental indexing

9891a04

git push

cleanup

37a17b5

testing scripts, results, reports and documentation

6f3a8f9

prod notes

c6b22c8

more query analyses

d343c1b

mxbai toggle for prod PR

4f2b863

readme update

37b2ad3

PR cleanup

57fc307

more cleanup for prod

18fc59c

Revert dev port to 5000 and remove test results directory

1e9cd71

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

minor tweaks

11e36b6

revert gitignore changes, minor readme changes

a76776e

yug added 2 commits May 28, 2026 13:34

formatting

6eabebb

cleanup

b87db4e

Yugi-2 wants to merge 32 commits into main from testing/mxbai-tweaks
