Add stemming, stopwords, and TF-IDF to search index #33

Merged
iwillspeak merged 4 commits into main from copilot/support-stemming-stopwords
Mar 7, 2026

Copilot AI commented Mar 6, 2026

Search only matched exact words, making compile/compiler/compilers three separate non-overlapping terms. Common words bloated the index and all terms were weighted by raw frequency with no cross-document normalisation.

Index build (src/search.rs)

  • Stopwords: 70-word sorted list filtered via binary_search; tokens < 3 chars are also dropped
  • Stemming: Snowball English (Porter 2) via rust-stemmers; stored key is the stem (compile → compil, compilers → compil)
  • TF-IDF: write_search_indices collects all pages, builds a per-term document-frequency map, pre-computes an IDF map (ln((1+N)/(1+df))), then writes TF × IDF weights; terms appearing in every page get IDF ≤ 0 and are omitted entirely
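
The weighting step can be sketched roughly as follows. This is not the code from src/search.rs: the helper names `idf` and `tf_idf_weights` are hypothetical, pages are assumed to arrive as maps of already-stemmed terms to raw counts, and TF is assumed to be the raw count. Only the IDF formula and the "omit terms with IDF ≤ 0" rule come from the description above.

```rust
use std::collections::HashMap;

/// Smoothed inverse document frequency, as given in the description:
/// ln((1 + N) / (1 + df)).
fn idf(total_docs: usize, doc_freq: usize) -> f64 {
    ((1.0 + total_docs as f64) / (1.0 + doc_freq as f64)).ln()
}

/// Hypothetical helper: turns per-page term counts into TF x IDF weights.
/// Terms whose IDF is not strictly positive are left out of the result.
fn tf_idf_weights(pages: &[HashMap<String, u32>]) -> Vec<HashMap<String, f64>> {
    // Document frequency: the number of pages each term appears in.
    let mut doc_freq: HashMap<&str, usize> = HashMap::new();
    for page in pages {
        for term in page.keys() {
            *doc_freq.entry(term.as_str()).or_insert(0) += 1;
        }
    }

    // Pre-compute the IDF once per term rather than once per page.
    let n = pages.len();
    let idf_map: HashMap<&str, f64> = doc_freq
        .iter()
        .map(|(term, df)| (*term, idf(n, *df)))
        .collect();

    // Assumption: TF is the raw count of the term on the page.
    pages
        .iter()
        .map(|page| {
            page.iter()
                .filter_map(|(term, tf)| {
                    let term_idf = idf_map[term.as_str()];
                    (term_idf > 0.0).then(|| (term.clone(), f64::from(*tf) * term_idf))
                })
                .collect()
        })
        .collect()
}
```

With two pages, a term present on both has df = N = 2, so its IDF is ln(3/3) = 0 and it is dropped; a term present on only one page gets ln(3/2) ≈ 0.405.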

Query side (assets/search.js)

Embeds a self-contained Porter 1 stemmer. Query tokens < 3 chars are dropped and the rest are stemmed before index lookup, keeping client and server in sync:

let terms = query.split(/\W+/)
    .map(term => term.trim().toLowerCase())
    .filter(term => term.length >= 3)
    .map(term => stemWord(term));   // "compilers" → "compil"

Tests

  • search.rs: new cases for stemming merging word forms, stopword exclusion, and short-token exclusion
  • doctree.rs: assertions updated to stemmed key forms ("distance" → "distanc", "orange" → "orang")
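
For illustration, the stopword and short-token cases might exercise a filter along these lines. The names `STOPWORDS`, `keep_token`, and `index_tokens` are hypothetical, the five-word list stands in for the PR's 70-word list (which is not reproduced here), and stemming is left out so the sketch needs only the standard library.

```rust
/// Illustrative stopword list. It must stay sorted for `binary_search`.
/// This is a stand-in, not the 70-word list from the PR.
const STOPWORDS: &[&str] = &["and", "are", "for", "the", "with"];

/// Hypothetical helper: true if a lowercased token should be indexed,
/// i.e. it has at least 3 characters and is not a stopword.
fn keep_token(token: &str) -> bool {
    token.chars().count() >= 3 && STOPWORDS.binary_search(&token).is_err()
}

/// Hypothetical helper: splits text on non-alphanumeric characters,
/// lowercases each piece, and drops short tokens and stopwords.
/// The real index build would stem the surviving tokens afterwards.
fn index_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .map(str::to_lowercase)
        .filter(|t| keep_token(t))
        .collect()
}
```

Under these assumptions, `index_tokens("The parser is fast and the lexer")` keeps only `parser`, `fast`, and `lexer`: "the" and "and" are stopwords, and "is" is shorter than three characters.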
Original prompt

This section details the original issue you should resolve

<issue_title>Support Stemming and Stopwords in Search</issue_title>
<issue_description>We should add in support for stemming and stopword removal. We should also consider switching from term frequency to some variant of TF-IDF for search too. This would normalise search term frequency across documents to hopefully filter out common words. We could also consider some kind of cutoff to prevent words common to all texts from being included in the index.

Originally posted by @iwillspeak in #20 (comment)

  • Stemming and stopword removal
  • TFIDF
  • Drop insignificant terms?</issue_description>

<agent_instructions>Investigate the options for implementing this. The current search only works on full words. We should perform some stemming so that compiler, compile, and compilers all match. For large sites, however, we don't want to bloat the search DB too much.

What options are there, and what seems like the best tradeoff for reasonably sized documentation sites? </agent_instructions>

Comments on the Issue (you are @copilot in this section)


Co-authored-by: iwillspeak <1004401+iwillspeak@users.noreply.github.com>
Copilot AI changed the title [WIP] Add support for stemming and stopword removal in search Add stemming, stopwords, and TF-IDF to search index Mar 6, 2026
@iwillspeak iwillspeak marked this pull request as ready for review March 7, 2026 07:46
@iwillspeak iwillspeak merged commit 49fa454 into main Mar 7, 2026
3 of 4 checks passed
@iwillspeak iwillspeak deleted the copilot/support-stemming-stopwords branch March 7, 2026 12:33