Switch from big OR clauses to solr terms queries for better performance by cdrini · Pull Request #12874 · internetarchive/openlibrary

cdrini · 2026-06-08T13:47:31Z

Pulled off from #12699.

Claude output:
The reading-log page fetches a patron's full shelf (up to FILTER_BOOK_LIMIT works) and filters Solr results to that set. The previous query built a BooleanQuery with one clause per work key. Lucene rewrites BooleanQuery in O(n log n) and checks maxBooleanClauses; at 20k+ keys it became the dominant cost of the request.

{!terms f=key} compiles to a Solr TermsQuery which is O(n) — it builds a sorted term array and uses a binary search per candidate doc. It also bypasses the maxBooleanClauses limit entirely, making key counts above 1024 safe without requiring -Dsolr.max.booleanClauses on the JVM.
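As a rough illustration of the two query shapes described above, the following Python sketch builds both strings from the same key list. The helper names `build_or_query` and `build_terms_query` are hypothetical, and the exact quoting of the old OR query is an assumption; only the `{!terms f=key}` form is taken from this PR.

```python
def build_or_query(keys):
    """Old shape (quoting is assumed): one Boolean clause per work key."""
    clauses = " OR ".join(f'"{key}"' for key in keys)
    return f"key:({clauses})"


def build_terms_query(keys):
    """New shape from this PR: a single comma-delimited terms query."""
    return "{!terms f=key}" + ",".join(keys)


# Example keys in the "/works/OL{n}W" format described in this PR.
keys = ["/works/OL1W", "/works/OL2W", "/works/OL3W"]
or_query = build_or_query(keys)
terms_query = build_terms_query(keys)
print(or_query)
print(terms_query)
```

The OR form grows by one clause per key, which is what runs into maxBooleanClauses; the terms form stays a single query regardless of how many keys are joined.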

Empirical benchmark (local Solr 9.9.0, 198k docs, averaged over 5 runs):

Keys     OR BooleanQuery   TermsQuery   Speedup
------   ---------------   ----------   -------
5,000    13 ms             6 ms         2.2×
10,000   193 ms            9 ms         21.4×
20,000   422 ms            17 ms        24.8×
30,000   352 ms            19 ms        18.5×

Production runs ~200× more documents (40M). A 20k-key OR query would likely exceed timeAllowed=10s there; TermsQuery is estimated at 100–300ms.
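The speedup column and the "~200×" scale factor can be re-derived from the numbers quoted above. This is only a sketch of the arithmetic; the constant and function names are illustrative, and the timings are copied from the benchmark rather than re-measured.

```python
# (keys, OR BooleanQuery ms, TermsQuery ms), copied from the benchmark above.
BENCHMARK = [
    (5_000, 13, 6),
    (10_000, 193, 9),
    (20_000, 422, 17),
    (30_000, 352, 19),
]


def speedup(or_ms, terms_ms):
    """Ratio of OR-query time to TermsQuery time, rounded to one decimal."""
    return round(or_ms / terms_ms, 1)


speedups = {keys: speedup(or_ms, terms_ms) for keys, or_ms, terms_ms in BENCHMARK}

# Document counts quoted in the PR description.
LOCAL_DOCS = 198_000
PRODUCTION_DOCS = 40_000_000
scale_factor = round(PRODUCTION_DOCS / LOCAL_DOCS)

print(speedups)
print(scale_factor)
```

The 30,000-key OR run is faster than the 20,000-key run in the benchmark, so the local timings are probably somewhat noisy; the sketch checks only that the ratios match the reported figures.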

The key format is "/works/OL{n}W" — no commas, so comma-delimited TermsQuery values are safe. Results are semantically identical.
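The comma-safety argument above depends on every key matching the stated format. A defensive variant could check that before joining; the sketch below is one possible way to do it, with `WORK_KEY_RE` and `join_work_keys` as hypothetical names that are not part of this PR.

```python
import re

# Assumed pattern for the "/works/OL{n}W" format described in this PR.
WORK_KEY_RE = re.compile(r"/works/OL\d+W")


def join_work_keys(keys):
    """Join keys with commas, rejecting anything outside the expected format."""
    bad = [key for key in keys if not WORK_KEY_RE.fullmatch(key)]
    if bad:
        raise ValueError(f"unexpected key format: {bad[:3]}")
    return ",".join(keys)


joined = join_work_keys(["/works/OL1W", "/works/OL2W"])
print(joined)
```

A key that already contains a comma fails the `fullmatch` check, so it cannot silently split into two terms.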

Trade-off: TermsQuery does not support relevance scoring per term. The reading-log filter is used as an fq= (filter, not score), so there is no scoring impact.
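To make the filter-versus-score distinction concrete, here is a minimal sketch of how the request parameters might be laid out, with the terms query in `fq` and the user's search in `q`. The function name `build_select_params`, the example query `title:dune`, and the `rows` value are assumptions for illustration, not the code in this PR.

```python
from urllib.parse import parse_qs, urlencode


def build_select_params(user_query, work_keys, rows=20):
    """Keep the shelf restriction in fq so it filters without affecting scores."""
    return {
        "q": user_query,  # scored part of the request
        "fq": "{!terms f=key}" + ",".join(work_keys),  # unscored filter
        "rows": rows,
    }


params = build_select_params("title:dune", ["/works/OL1W", "/works/OL2W"])
encoded = urlencode(params)
decoded = parse_qs(encoded)
print(decoded["fq"][0])
```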

Technical

Testing

In testing, this does appear to cut load time for very large reading logs (30k) from around 12s to around 8s, so it seems like a good improvement!

Screenshot

Stakeholders

…og Solr filter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR optimizes the reading-log Solr filtering path by replacing a large key:(... OR ...) BooleanQuery with Solr’s {!terms f=key} TermsQuery, reducing query rewrite overhead and avoiding maxBooleanClauses limits for users with large reading logs.

Changes:

Switch reading-log work-key filtering from OR-chained clauses to a {!terms f=key} query string.
Update inline comments to document the performance rationale and the requirement to keep this filter in fq.

+            # {!terms f=key} uses Solr's TermsQuery which is O(n) vs BooleanQuery's
+            # O(n log n) rewriting at 20k+ clauses. Keys are "/works/OL{n}W" — no commas.
+            filter_query = "{!terms f=key}" + ",".join(work_to_edition_keys)


Scott Barnes (sec audit) and others added 2 commits June 8, 2026 13:41

Apply code review feedback

ce2e20d

cdrini changed the title ~~Switch from big OR clauses to solr terms queries for better profrmance~~ Switch from big OR clauses to solr terms queries for better performance Jun 8, 2026

cdrini marked this pull request as ready for review June 8, 2026 14:55

Copilot AI review requested due to automatic review settings June 8, 2026 14:55

Copilot started reviewing on behalf of cdrini June 8, 2026 14:55

Copilot AI reviewed Jun 8, 2026

Comment thread openlibrary/core/bookshelves.py

Comment on lines +436 to +438

# {!terms f=key} uses Solr's TermsQuery which is O(n) vs BooleanQuery's
# O(n log n) rewriting at 20k+ clauses. Keys are "/works/OL{n}W" — no commas.
filter_query = "{!terms f=key}" + ",".join(work_to_edition_keys)

mekarpeles assigned cdrini and mekarpeles Jun 8, 2026

mekarpeles added the "Needs: Submitter Input" label (Waiting on input from the creator of the issue/pr [managed]) Jun 13, 2026

cdrini wants to merge 2 commits into internetarchive:master from cdrini:perf/bookshelves-termsquery-reading-log-filter

3 participants
