minhashlib is a fast, minimal MinHash library for string similarity and
near-duplicate detection, inspired by Jeffrey Ullman's Mining Massive Datasets.
Signatures are computed with a Numba-compiled permutation kernel over xxh3
k-shingles, which makes signature generation considerably faster than
datasketch's default configuration. For finding near-duplicates across many
documents without paying the O(n²) cost of all-pairs comparison, it ships a
locality-sensitive hashing (LSH) index.
pip install minhashlib
from minhashlib import MinHash
mh = MinHash(num_perm=128, k=3)
mh.compare("near-duplicate detection", "near duplicate detection") # -> ~0.9MinHash is stateless: compare computes signatures and discards them. When
comparing one document against many, compute each signature once and reuse it:
from minhashlib import MinHash, similarity
mh = MinHash(num_perm=128, k=3)
sigs = [mh.signature(doc) for doc in documents] # O(n) signature generation
similarity(sigs[0], sigs[1]) # cheap pairwise estimateMinHashLSH buckets signatures by bands so that only likely-similar documents
are ever compared. Configure it with a similarity threshold (bands/rows are
derived automatically) or set bands/rows explicitly.
from minhashlib import MinHash, MinHashLSH
mh = MinHash(num_perm=128, k=3)
lsh = MinHashLSH(threshold=0.8, num_perm=128) # or MinHashLSH(num_perm=128, bands=32, rows=4)
for key, text in corpus.items():
lsh.insert(key, mh.signature(text))
candidates = lsh.query(mh.signature(query_text)) # keys likely similar to query_textinsert, query, remove, in, and len are supported. The MinHash used
for the index and for queries must share the same num_perm, k, p, and
seed (the default seed makes signatures comparable across instances).
| Object | Purpose |
|---|---|
MinHash(p, k, num_perm, seed) |
Signature engine. signature(doc), compare(a, b). |
similarity(sig_a, sig_b) |
Estimate the Jaccard similarity of two signatures. |
MinHashLSH(threshold, num_perm, bands=None, rows=None) |
LSH index. insert, query, remove. |
minhashlib aims to be a fast, lean MinHash core plus LSH search. It does not
target the full feature surface of datasketch (weighted MinHash, HyperLogLog,
storage backends, etc.).
Contributions are welcome.