Feature/query router #15

Open
codectified wants to merge 4 commits into sunnah-com:main from codectified:feature/query-router

Conversation

@codectified

Collaborator
  • Query router (_route_query): classifies every incoming query before any ES call.
    Arabic text → BM25 full corpus; quoted strings → phrase BM25; queries ending in a
    number → reference BM25 with hadithNumber/collection boosts; uppercase AND/OR/NOT
    → BM25 (where the operators are honored); everything else follows ?mode=.
    All four lexical rules take priority over ?mode=semantic. _meta.route in every
    response names the path taken.

  • Spam filter (_is_spam): rejects URLs, bare phone numbers, long tokens, repeat
    characters, and Indonesian WhatsApp business spam (WA 08xx) before routing. Returns
    a 400 with {"error": "invalid query"}.

  • Collection boosts on semantic: build_semantic_query now wraps kNN in the same
    function_score / COLLECTION_BOOSTS that lexical already used, so authoritative
    collections surface consistently across both modes.

  • ROUTER_LOG env var: when set to true, emits one structured JSON line per request
    (route, variant, overridden) to the access log — intended for a day of prod audit
    before going live, then turned off.

  • Removed three dev/research scripts (tests/analyze_queries.py,
    tests/batch_search.py, tests/fetch_exact_knn.py) that were never part of core.
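
The spam check and routing rules described above could look roughly like the sketch below. It is not the code in this PR: the regexes, the thresholds (`MAX_TOKEN_LEN`, the repeat count), the rule order, and the route names returned are all assumptions made for illustration. The actual detection code is documented in docs/query_router_design.md.

```python
import re

# Hypothetical patterns, chosen only to illustrate the rules in the description.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")          # any Arabic-block character
QUOTED_RE = re.compile(r'"[^"]+"')                  # a double-quoted string
TRAILING_NUMBER_RE = re.compile(r"\S\s+\d+$")       # text followed by a final number
BOOLEAN_RE = re.compile(r"\b(?:AND|OR|NOT)\b")      # case-sensitive: uppercase only

URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)
PHONE_RE = re.compile(r"^\+?[\d\s\-()]{8,}$")       # query is nothing but a phone number
WA_RE = re.compile(r"\bWA\s*08\d+", re.IGNORECASE)  # "WA 08xx" WhatsApp spam
REPEAT_RE = re.compile(r"(.)\1{5,}")                # same character six or more times
MAX_TOKEN_LEN = 40                                  # assumed cutoff for "long tokens"


def is_spam(query: str) -> bool:
    """Return True if the query matches one of the assumed spam patterns."""
    q = query.strip()
    if URL_RE.search(q) or PHONE_RE.match(q) or WA_RE.search(q):
        return True
    if REPEAT_RE.search(q):
        return True
    return any(len(token) > MAX_TOKEN_LEN for token in q.split())


def route_query(query: str, mode: str = "lexical") -> str:
    """Classify a query; the four lexical rules win over mode='semantic'."""
    q = query.strip()
    if ARABIC_RE.search(q):
        return "arabic"
    if QUOTED_RE.search(q):
        return "phrase"
    if TRAILING_NUMBER_RE.search(q):
        return "reference"
    if BOOLEAN_RE.search(q):
        return "boolean"
    # No lexical rule matched, so the caller's ?mode= decides.
    return "semantic" if mode == "semantic" else "lexical"
```

In this sketch a caller would run `is_spam` first and return the 400 response on a match, then pass the query to `route_query`. Because the boolean pattern is case-sensitive, a lowercase "and" in a natural-language query does not trigger the boolean route.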

New files

  • tests/test_query_router.py — 53 integration checks against a live Flask/ES stack
    (all routes, rank-1 accuracy for reference queries, _meta.route values,
    mode-override priority). Run with TEST_BASE=http://localhost:5000 python3 tests/test_query_router.py.

  • docs/query_router_design.md — detection code, flow diagram, per-route rationale,
    known limitations (collection synonym gap at query time), production rollout steps.
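
For readers who want a feel for what one integration check might involve, here is a minimal sketch. It assumes a `/search` endpoint that takes `q` and `mode` parameters and returns JSON with `_meta.route`; the endpoint path, the parameter name `q`, and the helper names are hypothetical and may not match tests/test_query_router.py.

```python
import json
import os
import urllib.parse
import urllib.request

# Same env var the PR's test script reads; the default mirrors the documented command.
TEST_BASE = os.environ.get("TEST_BASE", "http://localhost:5000")


def check_route(payload: dict, expected_route: str) -> bool:
    """Return True when the response's _meta.route equals the expected route."""
    return payload.get("_meta", {}).get("route") == expected_route


def fetch(query: str, mode: str = "lexical", path: str = "/search") -> dict:
    """GET the search endpoint and decode the JSON body (path and 'q' are assumed)."""
    params = urllib.parse.urlencode({"q": query, "mode": mode})
    url = f"{TEST_BASE}{path}?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Against a running stack, a mode-override check would then be something like `check_route(fetch("bukhari 1", mode="semantic"), "reference")`, where the route name `"reference"` is again an assumed value.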

Omar Ibrahim and others added 4 commits June 9, 2026 11:42
- _route_query(): classifies every query before ES — Arabic → BM25 full
  corpus, quoted → phrase BM25, number-ending → reference BM25,
  boolean operators → BM25; everything else follows ?mode=
- _is_spam(): filters URLs, phone numbers, long tokens, repeat chars,
  Indonesian WhatsApp spam (WA 08xx pattern) before routing
- build_semantic_query(): wraps kNN in function_score with
  COLLECTION_BOOSTS so authoritative collections surface the same way
  lexical does
- ROUTER_LOG env var: structured per-request logging of routing decisions
  for prod audit before going live
- Remove three research/dev scripts that were never part of core
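
To illustrate the build_semantic_query() change, the sketch below wraps a kNN clause in a function_score query with one weight function per collection. It assumes an Elasticsearch version that accepts `knn` as a query clause, and the field names (`embedding`, `collection`), the boost values, and the `score_mode`/`boost_mode` choices are placeholders, not the values used in this repository.

```python
from typing import Dict, List

# Illustrative weights only; the real COLLECTION_BOOSTS values are not shown in this PR.
COLLECTION_BOOSTS: Dict[str, float] = {"bukhari": 2.0, "muslim": 1.8}


def build_semantic_query_sketch(query_vector: List[float],
                                field: str = "embedding",
                                num_candidates: int = 100) -> dict:
    """Build a search body: kNN similarity scaled by per-collection weights."""
    # One filter+weight function per collection, as in a lexical function_score.
    functions = [
        {"filter": {"term": {"collection": name}}, "weight": weight}
        for name, weight in COLLECTION_BOOSTS.items()
    ]
    return {
        "query": {
            "function_score": {
                "query": {
                    "knn": {
                        "field": field,
                        "query_vector": query_vector,
                        "num_candidates": num_candidates,
                    }
                },
                "functions": functions,
                "score_mode": "max",        # take the largest matching weight
                "boost_mode": "multiply",   # scale the kNN score by that weight
            }
        }
    }
```

Sharing one boost table between the lexical and semantic builders is what keeps the two modes consistent: a collection's weight is changed in one place and both query types pick it up.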

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
53 checks covering all route branches: Arabic BM25, quoted phrase,
reference (number-ending) with rank-1 accuracy, boolean operators,
semantic passthrough, mode-override priority, and _meta.route values.
Runs against a live Flask/ES stack via TEST_BASE env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/query_router_design.md: new file — detection code, flow diagram,
  per-route rationale (phrase/Arabic/reference/boolean/semantic), known
  limitations (collection synonym gap), production rollout steps
- README: architecture diagram updated to reflect routing tree and
  english-lexical / english-<model> index split; add collection boosts
  table noting both lexical and semantic are covered; remove field
  mapping table (belongs in design doc); fix stale lexical_phrase refs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add spam filter step before query router
- Show "everything else" branching on ?mode= (lexical/semantic) instead
  of incorrectly hardcoding it to BM25
- Drop index alias names — they vary by deployment; just show search type
- Remove redundant Elasticsearch box that repeated what the router tree
  already showed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 participant