Feature/query router #15
Open
codectified wants to merge 4 commits into
- _route_query(): classifies every query before ES: Arabic → BM25 full corpus, quoted → phrase BM25, number-ending → reference BM25, boolean operators → BM25; everything else follows ?mode=
- _is_spam(): filters URLs, phone numbers, long tokens, repeat chars, Indonesian WhatsApp spam (WA 08xx pattern) before routing
- build_semantic_query(): wraps kNN in function_score with COLLECTION_BOOSTS so authoritative collections surface the same way lexical does
- ROUTER_LOG env var: structured per-request logging of routing decisions for prod audit before going live
- Remove three research/dev scripts that were never part of core

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
53 checks covering all route branches: Arabic BM25, quoted phrase, reference (number-ending) with rank-1 accuracy, boolean operators, semantic passthrough, mode-override priority, and _meta.route values. Runs against a live Flask/ES stack via TEST_BASE env var.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/query_router_design.md: new file with detection code, flow diagram, per-route rationale (phrase/Arabic/reference/boolean/semantic), known limitations (collection synonym gap), production rollout steps
- README: architecture diagram updated to reflect routing tree and english-lexical / english-<model> index split; add collection boosts table noting both lexical and semantic are covered; remove field mapping table (belongs in design doc); fix stale lexical_phrase refs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add spam filter step before query router
- Show "everything else" branching on ?mode= (lexical/semantic) instead of incorrectly hardcoding it to BM25
- Drop index alias names: they vary by deployment, so just show search type
- Remove redundant Elasticsearch box that repeated what the router tree already showed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Query router (`_route_query`): classifies every incoming query before any ES call. Arabic text → BM25 full corpus; quoted strings → phrase BM25; queries ending in a number → reference BM25 with `hadithNumber`/`collection` boosts; uppercase `AND`/`OR`/`NOT` → BM25 (operators work correctly); everything else follows `?mode=`. All four lexical rules override `?mode=semantic`. `_meta.route` in every response names the path taken.
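
For reviewers, the routing rules above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the regexes, the route names, the function name `route_query`, and the order in which the four lexical rules are checked are all assumptions.

```python
import re

# Hypothetical detection patterns; the PR's actual regexes may differ.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")      # any character in the Arabic block
QUOTED_RE = re.compile(r'"[^"]+"')              # a double-quoted phrase
REFERENCE_RE = re.compile(r"\d+\s*$")           # query ends in a number
BOOLEAN_RE = re.compile(r"\b(?:AND|OR|NOT)\b")  # uppercase operators only


def route_query(query: str, mode: str = "lexical") -> dict:
    """Return a routing decision: a route name and whether ?mode= was overridden."""
    if ARABIC_RE.search(query):
        route = "arabic"
    elif QUOTED_RE.search(query):
        route = "phrase"
    elif REFERENCE_RE.search(query):
        route = "reference"
    elif BOOLEAN_RE.search(query):
        route = "boolean"
    else:
        # No lexical rule matched, so the caller's ?mode= decides.
        return {"route": mode, "overridden": False}
    # A lexical rule matched; it overrides ?mode=semantic if that was requested.
    return {"route": route, "overridden": mode == "semantic"}
```

In this sketch, `route_query("bukhari 1", mode="semantic")` lands on the reference route and reports the override, while a lowercase `and` in a query does not trigger the boolean rule.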

Spam filter (`_is_spam`): rejects URLs, bare phone numbers, long tokens, repeat characters, and Indonesian WhatsApp business spam (`WA 08xx`) before routing. Returns a 400 with `{"error": "invalid query"}`.

Collection boosts on semantic: `build_semantic_query` now wraps kNN in the same `function_score`/`COLLECTION_BOOSTS` that lexical already used, so authoritative collections surface consistently across both modes.
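
As a rough picture of what wrapping kNN in `function_score` could look like, here is a minimal sketch of the request body. It assumes an Elasticsearch version that accepts `knn` as a query clause; the field names (`embedding`, `collection`), the boost values, the `num_candidates` default, and the `score_mode`/`boost_mode` choices are placeholders, not the values used in this PR.

```python
# Hypothetical boost table; the real COLLECTION_BOOSTS values are not shown here.
COLLECTION_BOOSTS = {"bukhari": 2.0, "muslim": 1.8}


def build_semantic_query(query_vector: list, num_candidates: int = 100) -> dict:
    """Wrap a kNN query in function_score so per-collection weights apply."""
    knn = {
        "knn": {
            "field": "embedding",          # assumed dense_vector field name
            "query_vector": query_vector,
            "num_candidates": num_candidates,
        }
    }
    # One weight function per collection, applied only to matching documents.
    functions = [
        {"filter": {"term": {"collection": name}}, "weight": weight}
        for name, weight in COLLECTION_BOOSTS.items()
    ]
    return {
        "query": {
            "function_score": {
                "query": knn,
                "functions": functions,
                "score_mode": "max",       # a document belongs to one collection
                "boost_mode": "multiply",  # scale the kNN score by the weight
            }
        }
    }
```

Reusing one boost table for both the lexical and the semantic builders is what keeps the two modes consistent; the sketch only shows the semantic side.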

`ROUTER_LOG` env var: when set to `true`, emits one structured JSON line per request (`route`, `variant`, `overridden`) to the access log. It is intended for a day of prod audit before going live, then turned off.

Removed three dev/research scripts (`tests/analyze_queries.py`, `tests/batch_search.py`, `tests/fetch_exact_knn.py`) that were never part of core.

New files

- `tests/test_query_router.py`: 53 integration checks against a live Flask/ES stack (all routes, rank-1 accuracy for reference queries, `_meta.route` values, mode-override priority). Run with `TEST_BASE=http://localhost:5000 python3 tests/test_query_router.py`.
- `docs/query_router_design.md`: detection code, flow diagram, per-route rationale, known limitations (collection synonym gap at query time), production rollout steps.