minhashlib

minhashlib is a fast, minimal MinHash library for string similarity and near-duplicate detection, inspired by Jeffrey Ullman's Mining Massive Datasets.

Signatures are computed with a Numba-compiled permutation kernel over xxh3 k-shingles, which makes signature generation considerably faster than datasketch's default configuration. For finding near-duplicates across many documents without paying the O(n²) cost of all-pairs comparison, it ships a locality-sensitive hashing (LSH) index.

Installation

pip install minhashlib

Usage

Comparing two documents

from minhashlib import MinHash

mh = MinHash(num_perm=128, k=3)
mh.compare("near-duplicate detection", "near duplicate detection")  # -> ~0.9

MinHash is stateless: compare computes signatures and discards them. When comparing one document against many, compute each signature once and reuse it:

from minhashlib import MinHash, similarity

mh = MinHash(num_perm=128, k=3)
sigs = [mh.signature(doc) for doc in documents]   # O(n) signature generation
similarity(sigs[0], sigs[1])                       # cheap pairwise estimate

Near-duplicate search at scale (LSH)

MinHashLSH buckets signatures by bands so that only likely-similar documents are ever compared. Configure it with a similarity threshold (bands/rows are derived automatically) or set bands/rows explicitly.

from minhashlib import MinHash, MinHashLSH

mh = MinHash(num_perm=128, k=3)
lsh = MinHashLSH(threshold=0.8, num_perm=128)      # or MinHashLSH(num_perm=128, bands=32, rows=4)

for key, text in corpus.items():
    lsh.insert(key, mh.signature(text))

candidates = lsh.query(mh.signature(query_text))   # keys likely similar to query_text

insert, query, remove, in, and len are supported. The MinHash used for the index and for queries must share the same num_perm, k, p, and seed (the default seed makes signatures comparable across instances).

API

Object	Purpose
`MinHash(p, k, num_perm, seed)`	Signature engine. `signature(doc)`, `compare(a, b)`.
`similarity(sig_a, sig_b)`	Estimate the Jaccard similarity of two signatures.
`MinHashLSH(threshold, num_perm, bands=None, rows=None)`	LSH index. `insert`, `query`, `remove`.

Scope

minhashlib aims to be a fast, lean MinHash core plus LSH search. It does not target the full feature surface of datasketch (weighted MinHash, HyperLogLog, storage backends, etc.).

Contributing

Contributions are welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
benchmarks		benchmarks
dist		dist
minhashlib		minhashlib
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

minhashlib

Installation

Usage

Comparing two documents

Near-duplicate search at scale (LSH)

API

Scope

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

minhashlib

Installation

Usage

Comparing two documents

Near-duplicate search at scale (LSH)

API

Scope

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages