Skip to content

ssavutu/minhashlib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

minhashlib

minhashlib is a fast, minimal MinHash library for string similarity and near-duplicate detection, inspired by Jeffrey Ullman's Mining Massive Datasets.

Signatures are computed with a Numba-compiled permutation kernel over xxh3 k-shingles, which makes signature generation considerably faster than datasketch's default configuration. For finding near-duplicates across many documents without paying the O(n²) cost of all-pairs comparison, it ships a locality-sensitive hashing (LSH) index.

Installation

pip install minhashlib

Usage

Comparing two documents

from minhashlib import MinHash

mh = MinHash(num_perm=128, k=3)
mh.compare("near-duplicate detection", "near duplicate detection")  # -> ~0.9

MinHash is stateless: compare computes signatures and discards them. When comparing one document against many, compute each signature once and reuse it:

from minhashlib import MinHash, similarity

mh = MinHash(num_perm=128, k=3)
sigs = [mh.signature(doc) for doc in documents]   # O(n) signature generation
similarity(sigs[0], sigs[1])                       # cheap pairwise estimate

Near-duplicate search at scale (LSH)

MinHashLSH buckets signatures by bands so that only likely-similar documents are ever compared. Configure it with a similarity threshold (bands/rows are derived automatically) or set bands/rows explicitly.

from minhashlib import MinHash, MinHashLSH

mh = MinHash(num_perm=128, k=3)
lsh = MinHashLSH(threshold=0.8, num_perm=128)      # or MinHashLSH(num_perm=128, bands=32, rows=4)

for key, text in corpus.items():
    lsh.insert(key, mh.signature(text))

candidates = lsh.query(mh.signature(query_text))   # keys likely similar to query_text

insert, query, remove, in, and len are supported. The MinHash used for the index and for queries must share the same num_perm, k, p, and seed (the default seed makes signatures comparable across instances).

API

Object Purpose
MinHash(p, k, num_perm, seed) Signature engine. signature(doc), compare(a, b).
similarity(sig_a, sig_b) Estimate the Jaccard similarity of two signatures.
MinHashLSH(threshold, num_perm, bands=None, rows=None) LSH index. insert, query, remove.

Scope

minhashlib aims to be a fast, lean MinHash core plus LSH search. It does not target the full feature surface of datasketch (weighted MinHash, HyperLogLog, storage backends, etc.).

Contributing

Contributions are welcome.

About

A small set of utils for similarity checking, primarily via MinHashing.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages