Switch from big OR clauses to solr terms queries for better performance #12874
Open
cdrini wants to merge 2 commits into
…og Solr filter
The reading-log page fetches a patron's full shelf (up to FILTER_BOOK_LIMIT
works) and filters Solr results to that set. The previous query built a
BooleanQuery with one clause per work key. Lucene's BooleanQuery handling
scales roughly as O(n log n) in the clause count and is subject to the
maxBooleanClauses check; at 20k+ keys it became the dominant cost of the
request.
{!terms f=key} is handled by Solr's terms query parser (TermsQuery below).
With the default method=termsFilter it sorts the keys once and resolves them
against the term dictionary as a single set query, so the cost is roughly
linear in the number of keys. It is also not subject to the
maxBooleanClauses limit, so key counts above the default of 1024 are safe
without raising -Dsolr.max.booleanClauses on the JVM.
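To make the two query shapes concrete, here is a minimal sketch of how each filter string could be built. The helper names are hypothetical, and the exact form of the old OR query (in particular, quoting each key) is an assumption; only the `{!terms f=key}` line mirrors the code in this PR.

```python
def build_or_filter(keys):
    """Old style (assumed shape): one BooleanQuery clause per key.

    Keys are quoted here because "/" is special in the Lucene query
    syntax; the original code may have escaped them differently.
    """
    quoted = " OR ".join(f'"{k}"' for k in keys)
    return f"key:({quoted})"


def build_terms_filter(keys):
    """New style: one {!terms} local-params query with comma-separated values."""
    return "{!terms f=key}" + ",".join(keys)


keys = ["/works/OL1W", "/works/OL2W", "/works/OL3W"]
print(build_or_filter(keys))
print(build_terms_filter(keys))
```

The OR form grows by one parsed clause per key, while the terms form stays a single query whose value list is split on commas by the parser.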
Empirical benchmark (local Solr 9.9.0, 198k docs, averaged over 5 runs):

  Keys     OR BooleanQuery   TermsQuery   Speedup
  ------   ---------------   ----------   -------
   5,000             13 ms         6 ms      2.2×
  10,000            193 ms         9 ms     21.4×
  20,000            422 ms        17 ms     24.8×
  30,000            352 ms        19 ms     18.5×

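The speedup column is simply the OR time divided by the TermsQuery time. A short sketch that recomputes it from the timings reported above (the `ROWS` name and `speedup` helper are illustrative, not part of the PR):

```python
# (keys, OR BooleanQuery ms, TermsQuery ms), copied from the benchmark table
ROWS = [
    (5_000, 13, 6),
    (10_000, 193, 9),
    (20_000, 422, 17),
    (30_000, 352, 19),
]


def speedup(or_ms, terms_ms):
    """Ratio of the OR-query time to the terms-query time."""
    return or_ms / terms_ms


for keys, or_ms, terms_ms in ROWS:
    print(f"{keys:>6,} keys: {speedup(or_ms, terms_ms):.1f}x")
```

The ratios round to the figures in the table (2.2, 21.4, 24.8, 18.5).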
Production runs ~200× more documents (40M). A 20k-key OR query would
likely exceed timeAllowed=10s there; TermsQuery is estimated at 100–300ms.
Work keys have the form "/works/OL{n}W" and never contain commas, so the
comma-delimited value list that {!terms} expects is safe. Results are
semantically identical.
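Because the comma is the value separator, a key containing one would silently split into two terms. A defensive sketch, assuming keys always match the "/works/OL{n}W" pattern described above; `WORK_KEY_RE` and `safe_terms_filter` are hypothetical names, not code from this PR:

```python
import re

# Assumed key shape, taken from the commit message: /works/OL<digits>W
WORK_KEY_RE = re.compile(r"/works/OL\d+W")


def safe_terms_filter(keys, field="key"):
    """Build a {!terms} filter, rejecting keys that could break the comma list."""
    bad = [k for k in keys if not WORK_KEY_RE.fullmatch(k)]
    if bad:
        raise ValueError(f"unexpected work keys: {bad[:3]}")
    return "{!terms f=" + field + "}" + ",".join(keys)


print(safe_terms_filter(["/works/OL1W", "/works/OL22W"]))
```

Whether such a check is worth its cost on a 20k+ key list is a judgment call; the PR itself relies on the key format alone.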
Trade-off: TermsQuery gives every match the same constant score, so it
cannot rank by individual term. The reading-log filter is sent as fq=
(a filter, which does not contribute to scoring), so there is no scoring
impact.
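A minimal sketch of where the filter sits in a request, assuming a plain Solr `/select` call: the patron's search stays in `q` and the shelf restriction goes in `fq`. The function name and parameter values are illustrative only.

```python
from urllib.parse import urlencode


def build_search_params(user_query, work_keys, rows=20):
    """Keep scoring in q; restrict to the patron's shelf with fq."""
    return {
        "q": user_query,
        "fq": "{!terms f=key}" + ",".join(work_keys),
        "rows": rows,
    }


params = build_search_params("tolkien", ["/works/OL1W", "/works/OL2W"])
# With tens of thousands of keys the encoded string is large, so it would
# normally be sent as a form-encoded POST body rather than in the URL.
body = urlencode(params)
print(len(body), "bytes")
```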
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR optimizes the reading-log Solr filtering path by replacing a large key:(... OR ...) BooleanQuery with Solr’s {!terms f=key} TermsQuery, reducing query rewrite overhead and avoiding maxBooleanClauses limits for users with large reading logs.
Changes:
- Switch reading-log work-key filtering from OR-chained clauses to a {!terms f=key} query string.
- Update inline comments to document the performance rationale and the requirement to keep this filter in fq.
Comment on lines +436 to +438

    # {!terms f=key} uses Solr's TermsQuery which is O(n) vs BooleanQuery's
    # O(n log n) rewriting at 20k+ clauses. Keys are "/works/OL{n}W" — no commas.
    filter_query = "{!terms f=key}" + ",".join(work_to_edition_keys)
Split out from #12699.
Technical
Testing
In testing, this appears to cut the load time for very large reading logs (around 30k works) from roughly 12s to roughly 8s, so it seems like a good improvement!
Screenshot
Stakeholders