Switch from big OR clauses to solr terms queries for better performance #12874

Open
cdrini wants to merge 2 commits into internetarchive:master from cdrini:perf/bookshelves-termsquery-reading-log-filter

Conversation

@cdrini cdrini commented Jun 8, 2026

Collaborator

Split off from #12699.

Claude output:
The reading-log page fetches a patron's full shelf (up to FILTER_BOOK_LIMIT works) and filters Solr results to that set. The previous query built a BooleanQuery with one clause per work key. Lucene rewrites BooleanQuery in O(n log n) and checks maxBooleanClauses; at 20k+ keys it became the dominant cost of the request.

{!terms f=key} compiles to a Solr TermsQuery which is O(n) — it builds a sorted term array and uses a binary search per candidate doc. It also bypasses the maxBooleanClauses limit entirely, making key counts above 1024 safe without requiring -Dsolr.max.booleanClauses on the JVM.
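To make the difference between the two filter shapes concrete, here is a minimal sketch that builds both strings from the same key list. The helper names are hypothetical, and the exact quoting of the old OR form is an assumption; only the `{!terms f=key}` form appears in this PR's diff.

```python
def build_or_filter(work_keys, field="key"):
    """Old shape (assumed quoting): one boolean clause per work key."""
    clauses = " OR ".join(f'"{key}"' for key in work_keys)
    return f"{field}:({clauses})"


def build_terms_filter(work_keys, field="key"):
    """New shape: a single terms query with comma-delimited values."""
    return "{!terms f=" + field + "}" + ",".join(work_keys)


# Illustrative keys only; real shelves can hold tens of thousands.
keys = ["/works/OL1W", "/works/OL2W", "/works/OL3W"]
print(build_or_filter(keys))
print(build_terms_filter(keys))
```

The OR form grows by one parsed clause per key, while the terms form hands Solr a single delimited value list, which is why it is not subject to the boolean clause limit.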

Empirical benchmark (local Solr 9.9.0, 198k docs, averaged over 5 runs):

  Keys      OR BooleanQuery   TermsQuery   Speedup
  ------    ---------------   ----------   -------
   5,000           13 ms          6 ms       2.2×
  10,000          193 ms          9 ms      21.4×
  20,000          422 ms         17 ms      24.8×
  30,000          352 ms         19 ms      18.5×
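The benchmark script itself is not included in this PR, so the following is only a sketch of how an "averaged over 5 runs" timing and the Speedup column could be reproduced. `average_ms` and `speedup` are hypothetical helpers, and `run_query` stands in for whatever callable actually sends the request to Solr.

```python
import time
from statistics import mean


def average_ms(run_query, runs=5):
    """Call run_query `runs` times and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def speedup(or_ms, terms_ms):
    """Ratio of the OR-query time to the terms-query time."""
    return or_ms / terms_ms


# Timings copied from the table above: key count -> (OR ms, terms ms).
reported = {5_000: (13, 6), 10_000: (193, 9), 20_000: (422, 17), 30_000: (352, 19)}
for key_count, (or_ms, terms_ms) in reported.items():
    print(f"{key_count:>6,} keys: {speedup(or_ms, terms_ms):.1f}x")
```

Recomputing the ratios from the reported millisecond values gives the same figures as the Speedup column.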

Production runs ~200× more documents (40M). A 20k-key OR query would likely exceed timeAllowed=10s there; TermsQuery is estimated at 100–300ms.

The key format is "/works/OL{n}W" — no commas, so comma-delimited TermsQuery values are safe. Results are semantically identical.
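Because the comma is the value delimiter, the safety of this approach rests on every key matching the stated format. A defensive check along these lines could make that assumption explicit; it is a sketch only, `join_work_keys` is a hypothetical helper, and the PR itself joins the keys directly without validation.

```python
import re

# Assumed shape of a work key, per the description: "/works/OL{n}W".
WORK_KEY_RE = re.compile(r"/works/OL\d+W")


def join_work_keys(work_keys):
    """Comma-join work keys, refusing any key that does not match the format."""
    bad = [key for key in work_keys if not WORK_KEY_RE.fullmatch(key)]
    if bad:
        raise ValueError(f"unexpected work key(s): {bad[:5]}")
    return ",".join(work_keys)


print(join_work_keys(["/works/OL1W", "/works/OL22W"]))
```

A key containing a comma would fail `fullmatch` and raise, rather than being silently split into two terms by Solr.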

Trade-off: TermsQuery does not support relevance scoring per term. The reading-log filter is used as an fq= (filter, not score), so there is no scoring impact.
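As an illustration of the filter-versus-score split, the sketch below places the user's search in `q` and the shelf restriction in `fq`. The parameter names `q`, `fq`, and `rows` are standard Solr request parameters; the helper name, the example query, and the row count are assumptions, not taken from the Open Library code.

```python
from urllib.parse import urlencode


def build_select_params(user_query, work_keys, rows=20):
    """Return Solr select params with the shelf restriction as a filter query."""
    return {
        "q": user_query,  # contributes to relevance scoring
        "fq": "{!terms f=key}" + ",".join(work_keys),  # restricts results, no scoring
        "rows": rows,
    }


# Hypothetical query and keys, for illustration only.
params = build_select_params("title:dune", ["/works/OL1W", "/works/OL2W"])
print(urlencode(params))
```

Keeping the terms query in `fq` means ranking is driven only by `q`, which matches the no-scoring-impact claim above.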

Technical

Testing

In testing, this does appear to make very large reading logs (30k) go from around 12s to around 8s, so it seems like a good improvement!

Screenshot

Stakeholders

Scott Barnes (sec audit) and others added 2 commits June 8, 2026 13:41
…og Solr filter

The reading-log page fetches a patron's full shelf (up to FILTER_BOOK_LIMIT
works) and filters Solr results to that set. The previous query built a
BooleanQuery with one clause per work key. Lucene rewrites BooleanQuery
in O(n log n) and checks maxBooleanClauses; at 20k+ keys it became the
dominant cost of the request.

{!terms f=key} compiles to a Solr TermsQuery which is O(n) — it builds a
sorted term array and uses a binary search per candidate doc. It also
bypasses the maxBooleanClauses limit entirely, making key counts above
1024 safe without requiring -Dsolr.max.booleanClauses on the JVM.

Empirical benchmark (local Solr 9.9.0, 198k docs, averaged over 5 runs):

  Keys      OR BooleanQuery   TermsQuery   Speedup
  ------    ---------------   ----------   -------
   5,000           13 ms          6 ms       2.2×
  10,000          193 ms          9 ms      21.4×
  20,000          422 ms         17 ms      24.8×
  30,000          352 ms         19 ms      18.5×

Production runs ~200× more documents (40M). A 20k-key OR query would
likely exceed timeAllowed=10s there; TermsQuery is estimated at 100–300ms.

The key format is "/works/OL{n}W" — no commas, so comma-delimited
TermsQuery values are safe. Results are semantically identical.

Trade-off: TermsQuery does not support relevance scoring per term. The
reading-log filter is used as an fq= (filter, not score), so there is
no scoring impact.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cdrini cdrini changed the title from "Switch from big OR clauses to solr terms queries for better profrmance" to "Switch from big OR clauses to solr terms queries for better performance" Jun 8, 2026
@cdrini cdrini marked this pull request as ready for review June 8, 2026 14:55
Copilot AI review requested due to automatic review settings June 8, 2026 14:55

Copilot AI left a comment

Contributor


Pull request overview

This PR optimizes the reading-log Solr filtering path by replacing a large key:(... OR ...) BooleanQuery with Solr’s {!terms f=key} TermsQuery, reducing query rewrite overhead and avoiding maxBooleanClauses limits for users with large reading logs.

Changes:

  • Switch reading-log work-key filtering from OR-chained clauses to a {!terms f=key} query string.
  • Update inline comments to document the performance rationale and the requirement to keep this filter in fq.

Comment on lines +436 to +438
# {!terms f=key} uses Solr's TermsQuery which is O(n) vs BooleanQuery's
# O(n log n) rewriting at 20k+ clauses. Keys are "/works/OL{n}W" — no commas.
filter_query = "{!terms f=key}" + ",".join(work_to_edition_keys)
@mekarpeles mekarpeles added the "Needs: Submitter Input" label (Waiting on input from the creator of the issue/pr [managed]) Jun 13, 2026